<h1>Introduction</h1>
<p>Hello all! In this notebook I'm going to analyze data on different products and implement 
    Machine Learning algorithms to predict the demand for each product.</p>
<h3>My main objectives in this project are:</h3>
<ul>
    <li>Applying exploratory data analysis and trying to get some insights about our dataset</li>
    <li>Getting data in better shape by transforming and feature engineering to help us in building better models</li>
    <li>Building and tuning an XGBRegressor to predict Demand</li>
</ul>

<h2>Importing Libraries</h2>
<p>Let's start by importing the packages we are going to need.</p>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import seaborn as sns

<h2>Meeting the data</h2>
<p>Let's open the data and see what we have.</p>

In [None]:
#Opening the data
originalTrain = pd.read_csv("../input/predict-demand/train.csv")
originalTest = pd.read_csv("../input/predict-demand/test.csv")

In [None]:
#Let's check the shapes of the data so we know what we are dealing with
originalTrain.shape, originalTest.shape

<p>The train dataframe has 7560 rows and the test dataframe has 1080 rows, both with 12 columns.</p>

<p>With that information we can already calculate the train/test split:</p>
<ul>
    <li>percentage_train_rows = 7560 * 100 / (7560 + 1080) = 87.5%</li>
    <li>percentage_test_rows = 100% - 87.5% = 12.5%</li>
</ul>
<p>So 7/8 of the dataset is train data and the remaining 1/8 is test data.</p>

<p>Now let's look at some of the rows.</p>
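<p>As a quick check of the split percentages above, the same arithmetic can be done in code. This is a minimal sketch with the row counts hard-coded from the reported shapes; with the dataframes loaded, <code>len(originalTrain)</code> and <code>len(originalTest)</code> could be used instead.</p>

```python
# Row counts taken from the shapes reported above
n_train, n_test = 7560, 1080

total = n_train + n_test
train_share = 100 * n_train / total
test_share = 100 * n_test / total

print(f"train: {train_share:.1f}%, test: {test_share:.1f}%")
```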

In [None]:
originalTrain.head()

In [None]:
originalTest.head()

In [None]:
originalTrain.describe()

In [None]:
originalTest.describe()

<ul>
    <li>Let's first make copies of the dataframes so the originals stay intact.</li>
    <li>The id column looks useless for modeling, so we can safely drop it from both. The target (quantity) is saved in a separate variable for later use.</li>
</ul>

In [None]:
train = originalTrain.copy()
test = originalTest.copy()

#Dropping unnecessary Id column.

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

#Dropping rows without quantity

train.dropna(axis=0, subset=['quantity'], inplace=True)
test.dropna(axis=0, subset=['quantity'],inplace=True)

#Backing up the target variables and building feature frames without them.
#drop() returns a new dataframe, so train and test keep their quantity column.
y_train = train['quantity']
X_train = train.drop(columns=["quantity"])

y_test = test['quantity']
X_test = test.drop(columns=["quantity"])

<h2>EDA</h2>
<p>Exploratory Data Analysis</p>

<p>We start with a basic correlation table. The upper triangle is masked because it simply mirrors the lower one. The table shows the linear relationships between the numerical features.</p>

In [None]:
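# Display numerical correlations between features.

sns.set(font_scale=1.2)
# Only numerical columns can be correlated, so select them explicitly
correlation_train = train.select_dtypes(include='number').corr()
# Hide the diagonal and the upper triangle, which mirror the lower one
mask = np.triu(np.ones_like(correlation_train, dtype=bool))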
plt.figure(figsize=(8, 8))
sns.heatmap(correlation_train,
            annot=True,
            fmt='.1f',
            cmap='coolwarm',
            square=True,
            mask=mask,
            linewidths=1,
            cbar=False)

plt.show()
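<p>To make the masking step concrete, here is a small sketch on a made-up frame (the column names <code>a</code>, <code>b</code> and <code>c</code> are illustrative and not from the dataset). <code>np.triu</code> applied to a boolean array of ones marks the diagonal and everything above it, and <code>sns.heatmap</code> leaves out the cells where the mask is True.</p>

```python
import numpy as np
import pandas as pd

# Toy data; the columns are hypothetical
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = toy.corr()

# True on and above the diagonal, False below it
mask = np.asarray(np.triu(np.ones_like(corr, dtype=bool)))
print(mask)
```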

<h4>Observations</h4>
<ul>
    <li>There is a negative correlation between quantity and price, indicating that quantity tends to decrease as price increases.</li>
</ul>

<h2>Missing Data</h2>
<ul>
    <li>Merge the datasets to see how many missing values there are and visualize them</li>
</ul>

In [None]:
features = pd.concat([X_train, X_test]).reset_index(drop=True)
#Let's check the shape of the combined features dataframe
print(features.shape)

In [None]:
def missing_percentage(df):
    """Return the count and percentage of missing values per column.

    Columns without missing values are left out.
    """
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total != 0]
    percent = total / len(df) * 100
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
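<p>To see what such a missing-value summary looks like, here is a sketch of the same computation on a made-up frame. The columns and values are invented for illustration and are not rows from the dataset.</p>

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; all values are hypothetical
toy = pd.DataFrame({
    "price": [1.5, np.nan, 2.0, np.nan],
    "city": ["A", "B", None, "A"],
    "pop": [10, 20, 30, 40],
})

# Count missing values, keep only columns that have any, largest first
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts != 0].sort_values(ascending=False)
missing_share = missing_counts / len(toy) * 100

summary = pd.concat([missing_counts, missing_share], axis=1, keys=["Total", "Percent"])
print(summary)
```

<p>Here <code>price</code> is missing in 2 of 4 rows and <code>city</code> in 1 of 4, while <code>pop</code> is complete and therefore does not appear in the summary.</p>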


In [None]:
#Checking 'NaN' values.

missing = missing_percentage(features)

fig, ax = plt.subplots(figsize=(20, 5))
sns.barplot(x=missing.index, y='Percent', data=missing, palette='Reds_r')
plt.xticks(rotation=90)

display(missing.T.style.background_gradient(cmap='Reds', axis=1))


<h2>Pipeline</h2>
<p>Steps:</p>
<ol>
    <li>Extract year, month and day from date so we can use them as numerical features</li>
    <li>Add Year, Month and Day columns to the dataset</li>
    <li>Eliminate date column from the dataset</li>
    <li>Impute missing values:
        <ol>
            <li>Fill the numerical columns (long, lat, price, pop and the derived year, month and day) with their mean values</li>
            <li>Fill the categorical columns (capacity, brand, shop, container, city) with their most-repeated values</li>
        </ol>
    </li>
    <li>One Hot Encode the categorical columns (capacity, brand, shop, container, city)</li>
    <li>Fit the model</li>
</ol>
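<p>The preprocessing steps above can be sketched on a tiny made-up frame before wrapping them in pipeline classes. This is only an illustration under assumptions: the rows are invented, and the derived day, month and year columns are treated as numerical and mean-filled.</p>

```python
import numpy as np
import pandas as pd

# Toy rows; values are hypothetical
toy = pd.DataFrame({
    "date": ["2012-01-31", "2012-02-29", "not a date", "2012-03-31"],
    "price": [1.0, np.nan, 3.0, 2.0],
    "brand": ["X", None, "X", "Y"],
})

# Split the date into numerical parts and drop the original column.
# Unparseable dates become NaT, so their parts become NaN.
dates = pd.to_datetime(toy["date"], errors="coerce")
toy = toy.assign(day=dates.dt.day, month=dates.dt.month, year=dates.dt.year)
toy = toy.drop(columns="date")

# Mean for numerical columns, most frequent value for the remaining ones
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# One-hot encode the categorical columns
encoded = pd.get_dummies(toy)
print(encoded)
```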

In [None]:
#Import the necessary packages to create the pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

In [None]:
#Import the base classes for custom transformers
from sklearn.base import BaseEstimator, TransformerMixin

#Define pre-processor classes
class DateProcessor(BaseEstimator, TransformerMixin):
    """Replace the date column with numerical day, month and year columns."""

    def fit(self, df, y=None):
        return self

    def transform(self, df):
        new_df = df.copy()
        #Unparseable dates become NaT, so their day/month/year become NaN
        new_df['date'] = pd.to_datetime(new_df['date'], errors="coerce")
        new_df['day'] = new_df['date'].dt.day
        new_df['month'] = new_df['date'].dt.month
        new_df['year'] = new_df['date'].dt.year

        return new_df.drop(columns='date')

class CategoricalProcessor(BaseEstimator, TransformerMixin):
    """Fill categorical columns with their mode and one-hot encode them."""

    def fit(self, df, y=None):
        #Learn the modes and the encoded columns from the data passed to fit,
        #so that train and test are transformed in exactly the same way
        cat_columns = df.select_dtypes(include=['object']).columns
        self.modes_ = {col: df[col].mode()[0] for col in cat_columns}
        self.columns_ = pd.get_dummies(df.fillna(self.modes_)).columns
        return self

    def transform(self, df):
        new_df = pd.get_dummies(df.fillna(self.modes_))
        #Categories unseen during fit are dropped, absent ones are filled with 0
        return new_df.reindex(columns=self.columns_, fill_value=0)

class NumericalProcessor(BaseEstimator, TransformerMixin):
    """Fill numerical columns with the mean learned during fit."""

    def fit(self, df, y=None):
        num_columns = df.select_dtypes(include='number').columns
        self.means_ = {col: df[col].mean() for col in num_columns}
        return self

    def transform(self, df):
        return df.fillna(self.means_)
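<p>One thing to keep in mind when calling <code>pd.get_dummies</code> inside a transformer: encoding two dataframes separately can give different columns when their categories differ, and a fitted model expects the same features it was trained on. A minimal sketch of aligning the encoded columns with <code>reindex</code>, using made-up city labels:</p>

```python
import pandas as pd

# Hypothetical category values: "B" appears only in train, "C" only in test
train_part = pd.DataFrame({"city": ["A", "B", "A"]})
test_part = pd.DataFrame({"city": ["A", "C"]})

train_dummies = pd.get_dummies(train_part)
test_dummies = pd.get_dummies(test_part)
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Force the test columns to match the train columns:
# missing columns are added as 0, unseen ones are dropped
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(aligned.columns))
```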

In [None]:
#Import the model and GridSearch for Hyperparameter Optimization
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

In [None]:
#Defining the pipeline
#The regressor gets its own name so it does not shadow the xgb module
xgb_model = xgb.XGBRegressor()
model_pipeline = Pipeline(steps=[
                                ('process_date', DateProcessor()),
                                ('num_process', NumericalProcessor()),
                                ('cat_process', CategoricalProcessor()),
                                ('XGBoost', xgb_model)
                                ])

In [None]:
param_xgb = {'XGBoost__nthread': [4], #with hyperthreading, xgboost may become slower
              'XGBoost__objective': ['reg:squarederror'], #current name of the deprecated 'reg:linear'
              'XGBoost__learning_rate': [.03, 0.05, .07], #the so-called `eta` value
              'XGBoost__max_depth': [5, 6, 7],
              'XGBoost__min_child_weight': [4],
              'XGBoost__verbosity': [0], #replaces the deprecated `silent` flag
              'XGBoost__subsample': [0.7],
              'XGBoost__colsample_bytree': [0.7],
              'XGBoost__tree_method': ['gpu_hist'], #requires a GPU
              'XGBoost__n_estimators': [500]}

In [None]:
grid_search_xgb = GridSearchCV(estimator = model_pipeline, param_grid = param_xgb, n_jobs = -1, verbose = 2, cv = 5)
grid_search_xgb.fit(X_train, y_train)
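<p>To get a feel for how much work this grid search does, the number of candidates can be counted with <code>ParameterGrid</code>. The sketch below copies only the parameters with more than one value, plus one fixed parameter, from the grid above. Single-valued entries do not add combinations.</p>

```python
from sklearn.model_selection import ParameterGrid

# Same varying values as in the grid above
grid = {
    "XGBoost__learning_rate": [0.03, 0.05, 0.07],
    "XGBoost__max_depth": [5, 6, 7],
    "XGBoost__n_estimators": [500],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 5  # cv=5 in the GridSearchCV call
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

<p>That is 9 candidates and 45 cross-validation fits, plus one final refit of the best candidate on the full training data (the <code>GridSearchCV</code> default).</p>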

In [None]:
#Let's look at the best parameters for our XGBRegressor
print("Best parameter (CV score=%0.3f):" % grid_search_xgb.best_score_)
print(grid_search_xgb.best_params_)

In [None]:
#R^2 score of the tuned model on the test data
grid_search_xgb.score(X_test, y_test)

In [None]:
#R^2 score on the train data, for comparison (a large gap to the test score suggests overfitting)
grid_search_xgb.score(X_train, y_train)
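<p>With no <code>scoring</code> argument, <code>score</code> on a regressor returns the coefficient of determination R². As a small illustration with made-up numbers (not results from this model), R² can be computed by hand and compared with <code>sklearn.metrics.r2_score</code>:</p>

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

<p>A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.</p>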