### Introduction

This notebook tries tree-based models because the dataset consists largely of categorical variables, which tree-based models tend to handle well.

**Models**
1. Decision Tree
2. Random Forest
3. Extra Trees
4. AdaBoost
5. Gradient Boosting
6. Voting Regressor
7. LightGBM
8. XGBoost

**Workflow**
1. Remove outliers
2. Encode categorical variables and add date features
3. Train Models

**Improvements**
1. Add hyperparameter tuning.
2. The RMSE of the models increased after removing outliers. The outlier removal may be too aggressive and may be discarding valid data points, which would explain the higher RMSE. Try other outlier removal methods or thresholds.
3. Add visualizations to check the fit.

In [3]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import r2_score, mean_squared_error

#from lightgbm import LGBMRegressor
#from xgboost import XGBRegressor

### Load and Process Data

In [4]:
dataset_path = '../data/DS_ML Coding Challenge Dataset.xlsx'
train_dataset = pd.read_excel(dataset_path, sheet_name='Training Dataset')
test_dataset = pd.read_excel(dataset_path, sheet_name='Test Dataset')

### Remove Outliers from Training Data

In [5]:
# Training data: rename 'ProductType' to 'ProductName', then remove spaces from all column names
train_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
train_dataset.columns = [column_name.replace(' ','') for column_name in train_dataset.columns]

# Test data: apply the same renaming so both datasets share column names
test_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
test_dataset.columns = [column_name.replace(' ','') for column_name in test_dataset.columns]

# Creating a combined key column by joining the categorical columns with '_'
key_columns = ['ProductName','AreaCode','Manufacturer','SourcingChannel','ProductSize','ProductType']

# Combined column name:
# 'ProductName_AreaCode_Manufacturer_SourcingChannel_ProductSize_ProductType'
column_name = '_'.join(key_columns)
train_dataset[column_name] = train_dataset[key_columns].astype(str).agg('_'.join, axis=1)

# Grouping products by the combined key and filtering each group separately
filtered_groups = []
for _, group in train_dataset.groupby(column_name):
    df = group[[column_name,'SourcingCost','MonthofSourcing']].reset_index(drop=True)

    # Removing outliers using the interquartile range (IQR) rule
    Q1 = df['SourcingCost'].quantile(0.25, interpolation='midpoint')
    Q3 = df['SourcingCost'].quantile(0.75, interpolation='midpoint')
    IQR = Q3 - Q1

    # Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    within_bounds = df['SourcingCost'].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    #print("Removed Outliers: ", (~within_bounds).sum())
    filtered_groups.append(df[within_bounds])

# DataFrame.append was removed in pandas 2.0, so concatenate all groups once
new_train_dataset = pd.concat(filtered_groups, ignore_index=True)
    
new_train_dataset[column_name.split('_')] = new_train_dataset[column_name].str.split('_',expand=True)
new_train_dataset.drop([column_name], axis=1, inplace=True)
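
The per-group IQR filter above can also be written without an explicit loop. The following is a minimal sketch on made-up toy data, not the notebook's dataset; the helper name `drop_iqr_outliers`, the `key` column, and the parameter `k` are hypothetical names introduced for illustration. Exposing `k` makes it easy to try a looser threshold (for example `k=3.0`) if the default rule removes too many valid points.

```python
import pandas as pd

def drop_iqr_outliers(df, group_col, value_col, k=1.5):
    """Keep rows whose value lies inside [Q1 - k*IQR, Q3 + k*IQR] of its own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back to every row of that group
    q1 = grouped.transform(lambda s: s.quantile(0.25, interpolation='midpoint'))
    q3 = grouped.transform(lambda s: s.quantile(0.75, interpolation='midpoint'))
    iqr = q3 - q1
    keep = df[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Toy data (invented for this example): group "A" contains one obvious outlier
toy = pd.DataFrame({
    'key': ['A'] * 6 + ['B'] * 4,
    'SourcingCost': [10, 11, 12, 11, 10, 500, 50, 52, 51, 49],
})
cleaned = drop_iqr_outliers(toy, 'key', 'SourcingCost')
print(len(toy), '->', len(cleaned))
```

Because the filter works on the original columns, this variant would not need the combined key to be split back into separate columns afterwards.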

### Create Features

In [6]:
def preprocess_data(dataset):
    '''
    Returns X and y after creating Year/Month features and one-hot encoding the categorical variables
    '''
    # Work on a copy so the caller's DataFrame is left unchanged
    dataset = dataset.copy()

    # Creating time features
    dataset['Year'] = pd.DatetimeIndex(dataset['MonthofSourcing']).year
    dataset['Month'] = pd.DatetimeIndex(dataset['MonthofSourcing']).month

    # Creating one-hot-encoding for categorical variables (first level of each is dropped)
    categorical_columns = ['ProductName', 'Manufacturer', 'AreaCode',
                           'SourcingChannel', 'ProductSize', 'ProductType']
    dataset = pd.get_dummies(dataset, columns=categorical_columns, drop_first=True)

    # Creating X and y
    X = dataset.drop(['MonthofSourcing','SourcingCost'], axis=1).values
    y = dataset['SourcingCost'].values

    return X, y

In [7]:
X_train, y_train = preprocess_data(new_train_dataset)
X_test, y_test = preprocess_data(test_dataset)
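
Encoding the training and test sets separately only yields matching feature matrices when both contain exactly the same category levels. If a level appears in only one of them, the dummy columns differ, and `drop_first=True` may drop a different reference level in each set. A minimal sketch of one way to guard against this, using made-up toy frames (the values `A1` to `A4` are invented), is to reindex the encoded test frame onto the training columns:

```python
import pandas as pd

# Toy frames (invented): 'A4' appears only in test, 'A1' and 'A3' only in train
train = pd.DataFrame({'AreaCode': ['A1', 'A2', 'A3'], 'Year': [2020, 2020, 2021]})
test = pd.DataFrame({'AreaCode': ['A2', 'A4'], 'Year': [2021, 2021]})

train_encoded = pd.get_dummies(train, columns=['AreaCode'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['AreaCode'], dtype=int)

# Align test to the training layout: columns missing in test are filled with 0,
# and columns for levels unseen during training are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```

Applying this idea here would mean returning DataFrames from `preprocess_data` and aligning them before calling `.values`.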

### Tree Based Models

In [8]:
regressors = [DecisionTreeRegressor(), RandomForestRegressor(), ExtraTreesRegressor(),
              AdaBoostRegressor(), GradientBoostingRegressor()]

# Maps model name -> [RMSE, R^2], both computed on the test set
model_metrics = {}
for reg in regressors:
    print('Started Training', reg.__class__.__name__)
    trained_model = reg.fit(X_train, y_train)
    y_pred = trained_model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    model_metrics[reg.__class__.__name__] = [rmse, r2]
    print('Ended Training', reg.__class__.__name__)

Started Training DecisionTreeRegressor
Ended Training DecisionTreeRegressor
Started Training RandomForestRegressor
Ended Training RandomForestRegressor
Started Training ExtraTreesRegressor
Ended Training ExtraTreesRegressor
Started Training AdaBoostRegressor
Ended Training AdaBoostRegressor
Started Training GradientBoostingRegressor
Ended Training GradientBoostingRegressor


In [9]:
model_metrics

{'DecisionTreeRegressor': [36.740695189343626, 0.5024329024702691],
 'RandomForestRegressor': [36.73549639053813, 0.5025737037374551],
 'ExtraTreesRegressor': [36.64166663085347, 0.5051115098302945],
 'AdaBoostRegressor': [39.74390240779491, 0.41776550698904313],
 'GradientBoostingRegressor': [33.47924402152275, 0.5868495779485705]}
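
All of the models above use default hyperparameters, and the introduction lists tuning as a pending improvement. The following is a minimal sketch of what that could look like with scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_regression`, not the sourcing dataset, and the grid values are arbitrary choices for illustration rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption: not the notebook's data)
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

# Small illustrative grid; the values are arbitrary starting points
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a regular RMSE
print(best_params, best_rmse)
```

Cross-validating on the training data keeps the test set out of model selection, so the reported test RMSE remains an honest estimate.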

### Voting Regressor

In [10]:
# Ensemble every regressor except the single decision tree (VotingRegressor fits clones of them)
k = [(r.__class__.__name__, r) for r in regressors[1:]]

In [11]:
vr = VotingRegressor(k)
vr.fit(X_train, y_train)
y_pred = vr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(vr.__class__.__name__ , ':',rmse, r2)

VotingRegressor : 31.568090657221372 0.6326724135199246
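
With no weights given, `VotingRegressor` predicts the plain average of its fitted members' predictions. A minimal sketch on synthetic data (assumed here purely for illustration, with two members instead of the four used above) makes that explicit:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)

# Synthetic data (assumption: stands in for the notebook's features and target)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voter = VotingRegressor([
    ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
    ('gb', GradientBoostingRegressor(random_state=0)),
])
voter.fit(X, y)

# Average the fitted members' predictions by hand
member_predictions = np.column_stack([est.predict(X) for est in voter.estimators_])
manual_average = member_predictions.mean(axis=1)
ensemble_prediction = voter.predict(X)

print(np.allclose(manual_average, ensemble_prediction))
```

Averaging tends to help when the members make different kinds of errors, which may be why the ensemble scores better here than any single model in `model_metrics`.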


### Light GBM

In [None]:
from lightgbm import LGBMRegressor  # requires the lightgbm package

lgbm = LGBMRegressor()
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(lgbm.__class__.__name__ , ':',rmse, r2)

### XGBoost

In [None]:
from xgboost import XGBRegressor  # requires the xgboost package

xgbm = XGBRegressor()
xgbm.fit(X_train, y_train)
y_pred = xgbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(xgbm.__class__.__name__ , ':',rmse, r2)