### Introduction

This notebook tries tree-based models because the dataset consists mostly of categorical variables, which tree-based models typically handle well.

**Models**
1. Decision Tree
2. Random Forest
3. ExtraTreesRegressor
4. AdaBoost
5. Gradient Boosting
6. Voting Regressor
7. Light Gradient Boosting
8. Extreme Gradient Boosting

**Workflow**
1. Remove outliers
2. Encode categorical variables and add date features
3. Train Models

**Improvements**
1. Add Hyperparameter tuning
2. The RMSE of the models increased after removing outliers at the bottom level. The outlier removal may be too aggressive and may be discarding genuine data points, which would explain the higher RMSE. Instead, I tried removing outliers at the top level and got results in between keeping all outliers and removing them at the bottom level.
3. Add visualizations to check the fit.

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import r2_score, mean_squared_error

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

### Load and Process Data

In [2]:
dataset_path = '../data/DS_ML Coding Challenge Dataset.xlsx'
train_dataset = pd.read_excel(dataset_path, sheet_name='Training Dataset')
test_dataset = pd.read_excel(dataset_path, sheet_name='Test Dataset')

### Remove Outliers from Training Data at Bottom Level

In [3]:
'''
# Renaming columns
train_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
train_dataset.columns = [column_name.replace(' ','') for column_name in train_dataset.columns]

# Renaming columns
test_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
test_dataset.columns = [column_name.replace(' ','') for column_name in test_dataset.columns]

# Creating Combined Column
initial_column = 'ProductName'
prev_column = initial_column
column_order = ['AreaCode','Manufacturer','SourcingChannel','ProductSize','ProductType']

for column in column_order:
    initial_column = initial_column + '_' + column
    train_dataset[initial_column] = train_dataset[prev_column].map(str) + '_' + train_dataset[column]
    prev_column = initial_column

# Combined column name
column_name = 'ProductName_AreaCode_Manufacturer_SourcingChannel_ProductSize_ProductType'

# Grouping Products by CombinedKey
gb = train_dataset.groupby([column_name])
groups = [gb.get_group(group_name) for group_name in gb.groups]

new_train_dataset = pd.DataFrame()
for group in groups:
    df = group[[column_name,'SourcingCost','MonthofSourcing']].reset_index(drop=True)
    
    # Removing Outliers using Inter Quartile Range
    Q1 = np.percentile(df['SourcingCost'], 25, interpolation = 'midpoint') 
    Q3 = np.percentile(df['SourcingCost'], 75, interpolation = 'midpoint') 
    IQR = Q3 - Q1 
    old_shape = df.shape
    upper = np.where(df['SourcingCost'] > (Q3+1.5*IQR))
    lower = np.where(df['SourcingCost'] < (Q1-1.5*IQR))
    df.drop(upper[0], axis=0, inplace = True)
    df.drop(lower[0], axis=0, inplace = True)
    #print("Removed Outliers: ", old_shape[0]-df.shape[0])
    
    # Append to new dataframe
    new_train_dataset = new_train_dataset.append(df)
    
new_train_dataset[column_name.split('_')] = new_train_dataset[column_name].str.split('_',expand=True)
new_train_dataset.drop([column_name], axis=1, inplace=True)
'''


### Remove Outliers from Product at Top Level

In [4]:

# Renaming columns
train_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
train_dataset.columns = [column_name.replace(' ','') for column_name in train_dataset.columns]

# Renaming columns
test_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
test_dataset.columns = [column_name.replace(' ','') for column_name in test_dataset.columns]

cleaned_groups = []
top_level = ['NTM1','NTM2','NTM3']
for level in top_level:
    df = train_dataset[train_dataset.ProductName==level].copy().reset_index(drop=True)
    # Removing Outliers using Inter Quartile Range
    # (np.percentile's `interpolation` argument was renamed to `method` in NumPy 1.22)
    Q1 = np.percentile(df['SourcingCost'], 25, method='midpoint')
    Q3 = np.percentile(df['SourcingCost'], 75, method='midpoint')
    IQR = Q3 - Q1
    # Keep only the rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    inside = df['SourcingCost'].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    #print("Removed Outliers: ", (~inside).sum())
    cleaned_groups.append(df[inside])

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
new_train_dataset = pd.concat(cleaned_groups)
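
The per-product IQR filter above can also be written without an explicit loop. The following is a minimal sketch, not the code that produced the results in this notebook: the helper name `remove_iqr_outliers` and the toy values are assumptions made for illustration, while the column names follow the dataset.

```python
import pandas as pd


def remove_iqr_outliers(frame, group_col, value_col, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] of their group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto the rows of that group
    q1 = grouped.transform(lambda s: s.quantile(0.25, interpolation="midpoint"))
    q3 = grouped.transform(lambda s: s.quantile(0.75, interpolation="midpoint"))
    iqr = q3 - q1
    keep = frame[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep].reset_index(drop=True)


# Toy data (hypothetical values): one obvious outlier per product
toy = pd.DataFrame({
    "ProductName": ["NTM1"] * 6 + ["NTM2"] * 6,
    "SourcingCost": [10, 11, 12, 13, 14, 500, 100, 101, 102, 103, 104, -900],
})
cleaned = remove_iqr_outliers(toy, "ProductName", "SourcingCost")
print(len(toy), "->", len(cleaned))
```

Passing a list of columns as `group_col` would apply the same filter at a finer level, which is one way to compare top-level and bottom-level outlier removal with a single function.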


### Create Features

In [5]:
def preprocess_data(dataset):
    '''
    Returns X and y after converting categorical variables to one-hot encoding and creating time features
    '''
    # Work on a copy so the caller's DataFrame is not modified
    dataset = dataset.copy()

    # Creating time features
    sourcing_month = pd.DatetimeIndex(dataset['MonthofSourcing'])
    dataset['Year'] = sourcing_month.year
    dataset['Month'] = sourcing_month.month

    # Creating one-hot-encoding for categorical variables
    # (each column's own name is used as the prefix by default)
    categorical_columns = ['ProductName', 'Manufacturer', 'AreaCode',
                           'SourcingChannel', 'ProductSize', 'ProductType']
    dataset = pd.get_dummies(dataset, columns=categorical_columns, drop_first=True)

    # Creating X and y
    X = dataset.drop(['MonthofSourcing','SourcingCost'], axis=1).values
    y = dataset['SourcingCost'].values

    return X, y

In [6]:
X_train, y_train = preprocess_data(new_train_dataset)
X_test, y_test = preprocess_data(test_dataset)

### Tree Based Models

In [7]:
regressors = [DecisionTreeRegressor(), RandomForestRegressor(), ExtraTreesRegressor(), AdaBoostRegressor(),
              GradientBoostingRegressor()]

model_metrics = {}
for reg in regressors:
    print('Started Training', reg.__class__.__name__)
    trained_model = reg.fit(X_train, y_train)
    y_pred = trained_model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    model_metrics[reg.__class__.__name__] = [rmse, r2]
    print('Ended Training', reg.__class__.__name__)

Started Training DecisionTreeRegressor
Ended Training DecisionTreeRegressor
Started Training RandomForestRegressor
Ended Training RandomForestRegressor
Started Training ExtraTreesRegressor
Ended Training ExtraTreesRegressor
Started Training AdaBoostRegressor
Ended Training AdaBoostRegressor
Started Training GradientBoostingRegressor
Ended Training GradientBoostingRegressor


In [8]:
model_metrics

{'DecisionTreeRegressor': [34.12919305031901, 0.5706524836546772],
 'RandomForestRegressor': [34.1144924600023, 0.5710222728031951],
 'ExtraTreesRegressor': [34.024829092084374, 0.5732742799678298],
 'AdaBoostRegressor': [38.25561832984171, 0.46055475802549506],
 'GradientBoostingRegressor': [33.99121520896884, 0.5741170070894932]}

### Voting Regressor

In [9]:
# (name, estimator) pairs for every regressor except the single DecisionTreeRegressor
k = [(r.__class__.__name__, r) for r in regressors[1:]]

In [10]:
vr = VotingRegressor(k)
vr.fit(X_train, y_train)
y_pred = vr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(vr.__class__.__name__ , ':',rmse, r2)

VotingRegressor : 31.334522814414328 0.6380879146973715
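
With its default (uniform) weights, scikit-learn's `VotingRegressor` predicts the plain average of its base estimators' predictions. The sketch below checks this on synthetic data; the data, the estimator choice and the settings are assumptions for illustration and are unrelated to the sourcing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
]
voter = VotingRegressor(estimators).fit(X, y)

# voter.estimators_ holds the fitted clones; averaging their predictions by hand
# should reproduce voter.predict
manual_mean = np.mean([est.predict(X) for est in voter.estimators_], axis=0)
print(np.allclose(manual_mean, voter.predict(X)))
```

Averaging tends to cancel out part of the individual models' errors when those errors are not strongly correlated, which may be why the ensemble scores better here than any single member.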


### Light GBM

In [11]:
lgbm = LGBMRegressor()
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(lgbm.__class__.__name__ , ':',rmse, r2)

LGBMRegressor : 33.16664421051016 0.5945288311357748


### XGBoost

In [12]:
xgbm = XGBRegressor()
xgbm.fit(X_train, y_train)
y_pred = xgbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(xgbm.__class__.__name__ , ':',rmse, r2)

XGBRegressor : 33.956663530227814 0.5749823773601317


## Conclusion

I tried four classes of models during the experiment:

**GROUPED TIME SERIES APPROACH** `DID NOT WORK`
1. RMSE of Bottom Up Approach: 376.75748084933656
2. Top Down Approach - Not used due to wrong assumption
3. Middle Out Approach - Not used due to lack of hierarchy

This approach performed worst because it started from the wrong assumption that we had to forecast the monthly sum of sourcing cost. After realising the mistake, I forecast the monthly mean instead, which gave the result above. In addition, the bottom-up approach used for grouped time series relies on auto-ARIMA to fit the models. After outlier removal, the data at the bottom level (i.e. the individual product level) became stationary, so ARIMA models were too complex for the data and performed poorly.

**DEEP LEARNING** `NOTEBOOK NOT INCLUDED`

LSTMs are used to model long-range dependencies and are therefore suitable for time series forecasting. I ran into problems with this approach because of the categorical columns. I read about feeding categorical variables in as an auxiliary input, but was not able to try it due to time constraints.

**TREE BASED MODELS** `NOTEBOOK INCLUDED`
1. RMSE of DecisionTreeRegressor: 34.12919305031901
2. RMSE of RandomForestRegressor: 34.1144924600023
3. RMSE of ExtraTreesRegressor: 34.024829092084374
4. RMSE of AdaBoostRegressor: 38.25561832984171
5. RMSE of GradientBoostingRegressor: 33.99121520896884
6. RMSE of VotingRegressor: 31.334522814414328
7. RMSE of LGBMRegressor: 33.16664421051016
8. RMSE of XGBRegressor: 33.956663530227814

Tree-based models were chosen because they perform well with categorical variables.
Outliers were removed using IQR and categorical variables were one-hot encoded.
For outlier removal, there were three choices. Not removing outliers gave the best results. Removing outliers at the top product level (i.e. NTM1, NTM2) gave intermediate results, and removing outliers at the bottom unit level (i.e. NTM1_A10_X1_DIRECT_Large_Powder) gave the worst results.

I would have tuned the hyperparameters of the models but was unable to do so due to time constraints. Tuning would likely improve performance (i.e. lower the RMSE).
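
A minimal sketch of what that tuning could look like, using scikit-learn's `RandomizedSearchCV` on synthetic data. The parameter grid, the number of iterations and the data are assumptions chosen only to keep the example small; they are not tuned values for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train (illustrative only)
X, y = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=0)

# Hypothetical search space
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled configurations
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, best_rmse)
```

Because the rows here are ordered by month of sourcing, a time-aware splitter such as `TimeSeriesSplit` may be a better choice for `cv` than the default folds.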


**SIMPLE TIME SERIES BASED MODELS** `NOTEBOOK INCLUDED`

Models Used
1. RMSE of Simple Moving Average: 37.291999530367285
2. RMSE of Exponential Weighted Average: 36.49481290429226
3. RMSE of Simple Exponential Smoothing: 37.283745121556244
4. RMSE of Additive Exponential Smoothing: 36.173433451689036
5. RMSE of Holt's Method: 36.173433451689036

Models Not Used due to Complexity or lack of Seasonality
1. ARIMA
2. SARIMAX
3. Holt-Winters method

Outliers were removed at the bottom unit level (i.e. NTM1_A10_X1_DIRECT_Large_Powder) using IQR. Stationarity was then tested using the Augmented Dickey-Fuller test, after which these simple models were fitted. Fitting more complex models such as ARIMA with auto_arima from the pmdarima library resulted in ARIMA(0,0,0), which suggests that such models are too complex for the data: each series has at most 11 monthly data points, and some have fewer.
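
To illustrate the two simplest models in the list, the sketch below computes a one-step-ahead forecast with a simple moving average and with an exponentially weighted average using pandas only. The series values, the 3-month window and the smoothing factor `alpha` are hypothetical; the time series notebook may use different settings.

```python
import pandas as pd

# Hypothetical monthly mean sourcing cost for one product (11 months)
monthly_cost = pd.Series(
    [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 100.0, 97.0, 102.0, 101.0, 99.0]
)

# Simple moving average: the forecast is the mean of the last 3 observed months
sma_forecast = monthly_cost.tail(3).mean()

# Exponentially weighted average: recent months get more weight;
# alpha=0.3 is an assumed smoothing factor
ewm_forecast = monthly_cost.ewm(alpha=0.3, adjust=False).mean().iloc[-1]

print(sma_forecast, ewm_forecast)
```

Both forecasts are flat, which suits short, stationary series like these, where there is too little history to estimate a trend or seasonality.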

**FINAL CHOICE**
Of all these models, the Voting Regressor performed best. However, I would go for a simpler time series model such as Exponential Smoothing, because it is easier to understand and to diagnose if problems arise in the future. Given more time, I would like to experiment with LSTMs on this data.

**NOTE**
1. IQR -> Inter Quartile Range
2. All notebooks have more detailed explanations inside them. Please look them through.
3. Code would have been better organised given more time.