<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-required-libraries-and-set-configurations" data-toc-modified-id="Import-required-libraries-and-set-configurations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import required libraries and set configurations</a></span></li><li><span><a href="#Import-required-dataset" data-toc-modified-id="Import-required-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import required dataset</a></span></li><li><span><a href="#Train/Test-split" data-toc-modified-id="Train/Test-split-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train/Test split</a></span></li><li><span><a href="#Specify-X-and-y-for-train-and-test" data-toc-modified-id="Specify-X-and-y-for-train-and-test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Specify X and y for train and test</a></span></li><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature Selection</a></span></li><li><span><a href="#Scale-the-appropriate-columns" data-toc-modified-id="Scale-the-appropriate-columns-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Scale the appropriate columns</a></span></li><li><span><a href="#Model-Build" data-toc-modified-id="Model-Build-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Model Build</a></span></li></ul></div>

### Import required libraries and set configurations

In [1]:
# Import required libraries
import os
import numpy as np
import pandas as pd

In [2]:
from xgboost import XGBRegressor
from dask.distributed import Client
from dask_searchcv import GridSearchCV
from dask.diagnostics import ProgressBar
from sklearn.ensemble import RandomForestRegressor
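# Public import locations; sklearn.externals.joblib and sklearn.metrics.scorer
# were removed in newer scikit-learn releases
from joblib import parallel_backend
from sklearn.metrics import make_scorer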
from sklearn.feature_selection import VarianceThreshold

In [3]:
# Set configuration variables
DATA_OUTPUT = '../data/processed/'
client = Client()

In [4]:
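def mean_absolute_percentage_error(y_true, y_pred):
    """Return the mean absolute percentage error (MAPE), in percent."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    # Note: undefined when y_true contains zeros (division by zero)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100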
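
As a quick sanity check of the metric defined above, the following sketch evaluates it on made-up numbers (the arrays are illustrative and not taken from the dataset). It also illustrates why MAPE can become very large: the same absolute error weighs far more when the true value is small.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Both predictions are off by 10% of their true value
mape_even = mean_absolute_percentage_error([100, 200], [110, 180])

# Same absolute error of 10 units in each row, but the second true value is small
mape_small = mean_absolute_percentage_error([100, 2], [110, 12])

print(mape_even)
print(mape_small)
```

The second call returns a much larger value than the first, even though both rows miss by only 10 units.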

### Import required dataset

In [5]:
# Read the pickle file
df = pd.read_pickle(os.path.join(DATA_OUTPUT, "master.pickle"))

# Check the shape of the file for consistency
print("Shape of the dataframe is : ", df.shape)

Shape of the dataframe is :  (6798, 183)


### Train/Test split

The data covers April through September. April has only a handful of records (around 3-4), so we exclude them. The training set consists of the May-August data, and the test set is the September data.

In [6]:
# Split into train/test
train = df[df.index.str.contains("May|June|July|August")]
test = df[df.index.str.contains("September")]

# Print the dimensions of the train and test
print("Shape of the training data is : ", train.shape)
print("Shape of the testing data is : ", test.shape)

Shape of the training data is :  (6601, 183)
Shape of the testing data is :  (194, 183)
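
The shapes above imply that 6798 - 6601 - 194 = 3 rows fall in neither set, which matches the excluded April records. A minimal sketch of the same check on a toy frame is shown here; the index labels are hypothetical, since the real index format is only assumed to contain the month name as text.

```python
import pandas as pd

# Hypothetical index labels; the real index format is an assumption
toy = pd.DataFrame(
    {"Quantity": [5, 7, 3, 9, 4]},
    index=["April-01", "May-12", "July-03", "August-20", "September-02"],
)

# Regex alternation matches any of the listed month names
train_mask = toy.index.str.contains("May|June|July|August")
test_mask = toy.index.str.contains("September")

train_toy, test_toy = toy[train_mask], toy[test_mask]
dropped = toy[~(train_mask | test_mask)]

# No row may land in both sets
overlap = (train_mask & test_mask).any()
print(len(train_toy), len(test_toy), len(dropped), overlap)
```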


### Specify X and y for train and test

In [7]:
# Specify target
target_cols = "Quantity"

In [8]:
# For train
X_train = train.copy().drop(target_cols, axis=1)
y_train = train[target_cols].tolist()

# For test
X_test = test.copy().drop(target_cols, axis=1)
y_test = test[target_cols].tolist()

### Feature Selection

We keep only the features whose training-set variance exceeds the threshold set in the code below (0.05). Note that this is an absolute variance, not a percentage.

In [9]:
# Select the threshold
fs = VarianceThreshold(threshold=.05)

# Select features (copy so the later column assignment does not act on a slice)
fs.fit(X_train)
selected_cols = X_train.columns[fs.get_support(indices=True)]
X_train = X_train[selected_cols].copy()
X_test = X_test[selected_cols].copy()
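
To make the effect of the threshold concrete, here is a small sketch on a made-up frame (the column names and values are illustrative only). `VarianceThreshold` computes the population variance of each column and keeps those above the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "a": [0, 0, 0, 1],          # variance 0.1875
    "b": [1, 1, 1, 1],          # variance 0
    "c": [0.0, 0.1, 0.0, 0.1],  # variance 0.0025
})

fs_toy = VarianceThreshold(threshold=0.05)
fs_toy.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.loc[:, fs_toy.get_support()]
print(list(kept.columns))
```

Only column `a` clears the 0.05 threshold here; the constant column and the near-constant column are dropped.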

### Scale the appropriate columns

We standardize **NominalofPulsa** using the mean and standard deviation of the training data.

In [10]:
# Compute the mean and standard deviation of NominalofPulsa on the train set
# and scale both train and test with those statistics
mean = X_train['NominalofPulsa'].mean()
std = X_train['NominalofPulsa'].std()
X_train['NominalofPulsa'] = (X_train['NominalofPulsa'] - mean) / std
X_test['NominalofPulsa'] = (X_test['NominalofPulsa'] - mean) / std
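
The following sketch repeats the scaling step on made-up values (the numbers are hypothetical) to show what it guarantees: the training column ends up with mean 0 and unit sample standard deviation, while test values are expressed in the same units and may fall well outside the training range.

```python
import pandas as pd

# Hypothetical values, for illustration only
train_toy = pd.DataFrame({"NominalofPulsa": [5.0, 10.0, 20.0, 25.0]})
test_toy = pd.DataFrame({"NominalofPulsa": [50.0]})

mean = train_toy["NominalofPulsa"].mean()
std = train_toy["NominalofPulsa"].std()  # pandas default: sample std (ddof=1)

train_scaled = (train_toy["NominalofPulsa"] - mean) / std
test_scaled = (test_toy["NominalofPulsa"] - mean) / std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.iloc[0])
```

One design note: pandas `.std()` uses the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two approaches give slightly different scaled values.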

### Model Build

We tune the hyperparameters with 5-fold cross-validation. For efficiency, we run the grid search with dask, which avoids some of the overhead incurred by scikit-learn's own grid search.

In [11]:
# Specify parameters to tune on
params = dict(n_estimators=[50, 100, 200], 
              max_depth=[3, 6, 10],
              gamma=[0.1, 0.2, 0.3])

# Initialize the model
model = XGBRegressor(random_state=1,
                     n_jobs=-1,
                     subsample=0.70,
                     colsample_bytree=0.7,
                     eta=0.05)

# Initialize the grid
grid = GridSearchCV(model, 
                    params, 
                    n_jobs=-1,
                    cv=5)

# Fit the grid
with ProgressBar():
    grid.fit(X_train, y_train)

[########################################] | 100% Completed |  1min 37.9s
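
The grid above does not pass a `scoring` argument, so candidates are presumably ranked by the estimator's default score (R² for scikit-learn style regressors), while the notebook evaluates with MAPE. The sketch below shows one way to align the two, using scikit-learn's own `GridSearchCV`, a `RandomForestRegressor`, and synthetic data so that it runs on its own. All of these are stand-ins chosen for illustration, not the setup used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Synthetic data with strictly positive targets (MAPE divides by y_true)
rng = np.random.RandomState(1)
X_demo = rng.rand(120, 4)
y_demo = 10 + 5 * X_demo[:, 0] + rng.rand(120)

# greater_is_better=False makes the search minimize MAPE (scores are negated)
mape_scorer = make_scorer(mean_absolute_percentage_error,
                          greater_is_better=False)

demo_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=1),
    {"max_depth": [2, 4]},
    scoring=mape_scorer,
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the negated MAPE, so flip the sign for reporting
print(demo_grid.best_params_, -demo_grid.best_score_)
```

Whether the dask-based `GridSearchCV` used in this notebook accepts the same `scoring` argument is an assumption that should be checked against its documentation before applying this here.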


In [12]:
# Select the best estimator found by the grid search
model = grid.best_estimator_

In [13]:
# Training predictions
train_pred = model.predict(X_train)

# Testing predictions
test_pred = model.predict(X_test)

In [14]:
# MAPE Score
print('MAPE score for Train data is : ',
      mean_absolute_percentage_error(y_train, train_pred))
print('MAPE score for Test data is : ',
      mean_absolute_percentage_error(y_test, test_pred))

MAPE score for Train data is :  29.570597000075878
MAPE score for Test data is :  378.4300524671593
