### Introduction

This notebook applies traditional time series models to the dataset, fitting one series per product (`ProductID`).

**Models**
1. Simple Moving Average
2. Exponential Weighted Average
3. Simple Exponential Smoothing
4. Additive Exponential Smoothing
5. Holt's method

**Models Tried But Not Used**
1. ARIMA - Instead of choosing the orders from ACF and PACF plots, which is prone to human error, we used pmdarima's **auto_arima** to search over candidate models. It selected ARIMA(0,0,0), a constant-mean model with no autoregressive or moving-average terms, which indicates that ARIMA is too complex for this data.
2. SARIMAX - Not tried, as there is no seasonal component.
3. Holt-Winters' method - Not tried, as there is no seasonal component.
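
To illustrate what an ARIMA(0,0,0) result means: with a constant term the model is y_t = c + e_t, so its forecast is just the fitted constant. The minimal sketch below uses made-up numbers (not taken from the dataset) and plain least squares rather than pmdarima. It shows that the estimate of that constant is the sample mean, i.e. the model has found no time structure to exploit.

```python
import numpy as np

# Hypothetical monthly costs for one product (illustrative values only)
y = np.array([101.0, 99.5, 100.2, 100.8, 99.9, 100.6])

# ARIMA(0,0,0) with a constant: y_t = c + e_t.
# Fit c by least squares against a column of ones.
X = np.ones((len(y), 1))
c_hat = np.linalg.lstsq(X, y, rcond=None)[0][0]

# The one-step-ahead forecast is the constant itself
forecast = c_hat
print(forecast, y.mean())
```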

**Workflow**
1. Removed outliers using the Interquartile Range (IQR).
2. Tested for stationarity using the Augmented Dickey-Fuller test. The test returned NaN p-values for some series. These series were inspected visually and found to be stationary, so the function was modified to report NaN p-values as stationary.
3. Fitted the models listed above and forecast one step ahead.
4. Finally, calculated RMSE on the test data.
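
As a rough illustration of step 1, the sketch below filters a single series with a boolean mask. The helper name `remove_iqr_outliers` and the sample values are assumptions made for this example. The quartiles use midpoint interpolation, as in the training loop.

```python
import pandas as pd

def remove_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25, interpolation="midpoint")
    q3 = series.quantile(0.75, interpolation="midpoint")
    iqr = q3 - q1
    # `between` is inclusive, so points exactly on a bound are kept
    mask = series.between(q1 - k * iqr, q3 + k * iqr)
    return series[mask]

# Made-up costs with one obvious outlier
costs = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 50.0])
cleaned = remove_iqr_outliers(costs)
print(cleaned.tolist())
```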

**Improvements**
1. Add visualizations to check the fit

In [7]:
import pandas as pd
import numpy as np

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.holtwinters import Holt
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

### Augmented Dickey Fuller Test

In [8]:
def adf_test(df):
    """
    Takes in a time series and prints whether it is stationary or non-stationary
    """
    pvalue = adfuller(df, autolag='AIC')[1]
    
    # print('p-value', pvalue)
    # Visual inspection for data with p value of NaN
    '''
    if np.isnan(pvalue):
        df.plot(figsize=(15,5))
    '''
    
    # Series with NaN p-values were inspected visually and found to be
    # stationary, so NaN is treated as stationary here
    if pvalue <= 0.05 or np.isnan(pvalue):
        print("Data is stationary")
    else:
        print("Data is non-stationary")
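
Treating NaN as stationary folds the visually inspected cases into the ordinary ones. One possible alternative, sketched here with a hypothetical helper `classify_pvalue` that is not part of the notebook, keeps NaN as a separate label so those series can be listed and checked by hand.

```python
import math

def classify_pvalue(pvalue: float, alpha: float = 0.05) -> str:
    """Map an ADF p-value to a label, keeping NaN as its own case."""
    if math.isnan(pvalue):
        # The test could not produce a p-value; flag the series for a manual look
        return "undetermined"
    return "stationary" if pvalue <= alpha else "non-stationary"

labels = [classify_pvalue(p) for p in (0.01, 0.20, float("nan"))]
print(labels)
```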

### Load Dataset

In [9]:
dataset_path = '../data/DS_ML Coding Challenge Dataset.xlsx'
train_dataset = pd.read_excel(dataset_path, sheet_name='Training Dataset')
test_dataset = pd.read_excel(dataset_path, sheet_name='Test Dataset')

# Renaming columns
train_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
train_dataset.columns = [column_name.replace(' ','') for column_name in train_dataset.columns]

# Renaming columns
test_dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
test_dataset.columns = [column_name.replace(' ','') for column_name in test_dataset.columns]

# Adding ProductID column to uniquely identify each sourced unit
train_dataset['ProductID'] = train_dataset['ProductName'].map(str) + train_dataset['Manufacturer'] + \
                             train_dataset['AreaCode'] + train_dataset['SourcingChannel'] + \
                             train_dataset['ProductSize'] + train_dataset['ProductType']

test_dataset['ProductID'] = test_dataset['ProductName'].map(str) + test_dataset['Manufacturer'] + \
                            test_dataset['AreaCode'] + test_dataset['SourcingChannel'] + \
                            test_dataset['ProductSize'] + test_dataset['ProductType'] 
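
The ID above concatenates the fields with no separator. A possible variant, sketched on a made-up two-row frame, joins them with an underscore so the parts stay readable. The column values below are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Illustrative rows; values are assumptions, not taken from the dataset
df = pd.DataFrame({
    "ProductName": ["NTM1", "NTM2"],
    "Manufacturer": ["X1", "X2"],
    "AreaCode": ["A10", "A5"],
    "SourcingChannel": ["DIRECT", "RETAIL"],
    "ProductSize": ["Large", "Small"],
    "ProductType": ["Powder", "Liquid"],
})

# Same column order as the notebook's ProductID
id_columns = ["ProductName", "Manufacturer", "AreaCode",
              "SourcingChannel", "ProductSize", "ProductType"]

# Cast every part to str, then join the parts of each row with "_"
df["ProductID"] = df[id_columns].astype(str).agg("_".join, axis=1)
print(df["ProductID"].tolist())
```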

### Train Models

In [10]:
# Grouping Products by ProductID
gb = train_dataset.groupby('ProductID')
groups = [group for _, group in gb]

# Creating new dataframe for storing predictions
predictions = test_dataset[['ProductID','SourcingCost']].copy()

In [11]:
for group in groups:
    PID = group.ProductID.unique()[0]
    df = group[['SourcingCost','MonthofSourcing']].reset_index(drop=True)
    
    # Removing Outliers using Inter Quartile Range
    # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)
    Q1 = np.percentile(df['SourcingCost'], 25, method='midpoint')
    Q3 = np.percentile(df['SourcingCost'], 75, method='midpoint')
    IQR = Q3 - Q1 
    old_shape = df.shape
    upper = np.where(df['SourcingCost'] > (Q3+1.5*IQR))
    lower = np.where(df['SourcingCost'] < (Q1-1.5*IQR))
    df.drop(upper[0], axis=0, inplace = True)
    df.drop(lower[0], axis=0, inplace = True)
    #print("Removed Outliers: ", old_shape[0]-df.shape[0])
    
    # Using Augmented Dickey Fuller Test for checking stationarity
    #adf_test(df['SourcingCost'])
    
    df = df.groupby('MonthofSourcing').mean()
    #df.index.freq='MS'

    df['SMA2'] = df['SourcingCost'].rolling(window=2).mean()
    df['EWMA3'] = df['SourcingCost'].ewm(span=3,adjust=False).mean()

    # Simple Exponential Smoothing Model
    sim_exp_model = SimpleExpSmoothing(df['SourcingCost']).fit(optimized=True)
    y_pred_sim_exp = sim_exp_model.forecast(1).values[0]

    # Additive Exponential Smoothing Model
    exp_add_model = ExponentialSmoothing(df['SourcingCost'],trend='add').fit(optimized=True)
    y_pred_exp_add = exp_add_model.forecast(1).values[0]

    # Holt's Model (additive trend by default, so it matches the model above)
    holt_model = Holt(df['SourcingCost']).fit(optimized=True)
    y_pred_holt = holt_model.forecast(1).values[0]

    # Simple Moving Average
    y_pred_sma = df.loc[df.index[-1], "SMA2"]
    
    # Exponential Weighted Average
    y_pred_ewma = df.loc[df.index[-1], "EWMA3"]

    # Adding to predictions
    predictions.loc[predictions.ProductID==PID,'SMA2']=y_pred_sma
    predictions.loc[predictions.ProductID==PID,'EWMA3']=y_pred_ewma
    predictions.loc[predictions.ProductID==PID,'SES']=y_pred_sim_exp
    predictions.loc[predictions.ProductID==PID,'ESA']=y_pred_exp_add
    predictions.loc[predictions.ProductID==PID,'Holt']=y_pred_holt
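
To make the two simplest forecasts concrete, the sketch below reproduces `SMA2` and `EWMA3` on a made-up four-point series. It also writes the EWMA out as a recursion, since `ewm(span=3, adjust=False)` corresponds to a smoothing factor of alpha = 2 / (span + 1) = 0.5.

```python
import pandas as pd

# Made-up monthly means for one product
y = pd.Series([10.0, 12.0, 14.0, 15.0])

# SMA2 forecast: mean of the last two observations
sma2 = y.rolling(window=2).mean().iloc[-1]

# EWMA3 forecast from pandas
ewma3 = y.ewm(span=3, adjust=False).mean().iloc[-1]

# The same EWMA written out as a recursion with alpha = 2 / (span + 1)
alpha = 2 / (3 + 1)
level = y.iloc[0]
for value in y.iloc[1:]:
    level = alpha * value + (1 - alpha) * level

print(sma2, ewma3, level)
```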

### Test Models

In [12]:
# Calculating RMSE for each Model
rmse_sma2 = np.sqrt(mean_squared_error(predictions['SourcingCost'].values,predictions['SMA2'].values))
rmse_ewma3 = np.sqrt(mean_squared_error(predictions['SourcingCost'].values,predictions['EWMA3'].values))
rmse_ses = np.sqrt(mean_squared_error(predictions['SourcingCost'].values,predictions['SES'].values))
rmse_esa = np.sqrt(mean_squared_error(predictions['SourcingCost'].values,predictions['ESA'].values))
rmse_holt = np.sqrt(mean_squared_error(predictions['SourcingCost'].values,predictions['Holt'].values))

# Printing RMSE
print("RMSE of Simple Moving Average:",rmse_sma2)
print("RMSE of Exponential Weighted Average:",rmse_ewma3)
print("RMSE of Simple Exponential Smoothing:",rmse_ses)
print("RMSE of Additive Exponential Smoothing:",rmse_esa)
print("RMSE of Holt's Method:",rmse_holt)

RMSE of Simple Moving Average: 37.291999530367285
RMSE of Exponential Weighted Average: 36.49481290429226
RMSE of Simple Exponential Smoothing: 37.283745121556244
RMSE of Additive Exponential Smoothing: 36.173433451689036
RMSE of Holt's Method: 36.173433451689036
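
The RMSE cell repeats the same expression once per model. A more compact variant, sketched here with a small helper and a toy frame, is shown below. The column names mirror `predictions`, the values are made up, and `rmse` is a hypothetical helper rather than a library function.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted) -> float:
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy frame; the values are illustrative only
toy = pd.DataFrame({
    "SourcingCost": [100.0, 110.0, 120.0],
    "SMA2": [100.0, 110.0, 126.0],
    "Holt": [103.0, 107.0, 123.0],
})

# One RMSE per model column
scores = {name: rmse(toy["SourcingCost"], toy[name]) for name in ("SMA2", "Holt")}
print(scores)
```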


## Conclusion

I tried four classes of models during the experiment:

**GROUPED TIME SERIES APPROACH** `DID NOT WORK`
1. RMSE of Bottom Up Approach: 376.75748084933656
2. Top Down Approach - Not used due to wrong assumption
3. Middle Out Approach - Not used due to lack of hierarchy

This approach performed worst. It was initially based on the wrong assumption that we had to forecast the monthly sum of Sourcing Cost. After realising this, I forecast the monthly mean instead, which gave the result above. In addition, the bottom-up approach used for grouped time series relies on auto_arima to fit the models. After outlier removal, the data at the bottom level (i.e. the individual product level) became stationary, so ARIMA models were too complex for the data and performed poorly.
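
The difference between the two targets can be seen on a small made-up example: a monthly sum grows with the number of records in the month, while the monthly mean stays on the scale of a single record. The rows below are illustrative assumptions, not data from the challenge.

```python
import pandas as pd

# Made-up records for one product: several purchases per month
records = pd.DataFrame({
    "MonthofSourcing": ["2020-07", "2020-07", "2020-07", "2020-08", "2020-08"],
    "SourcingCost": [100.0, 104.0, 96.0, 110.0, 90.0],
})

# Aggregate the same records both ways
monthly = records.groupby("MonthofSourcing")["SourcingCost"].agg(["sum", "mean"])
print(monthly)
```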

**DEEP LEARNING** `NOTEBOOK NOT INCLUDED`

LSTMs model long-range dependencies and are therefore suitable for time series forecasting. I ran into problems with this approach because of the categorical columns. I read about feeding categorical variables in as an auxiliary input but, due to time constraints, was not able to implement it.

**TREE BASED MODELS** `NOTEBOOK INCLUDED`
1. RMSE of DecisionTreeRegressor: 34.12919305031901
2. RMSE of RandomForestRegressor: 34.1144924600023
3. RMSE of ExtraTreesRegressor: 34.024829092084374
4. RMSE of AdaBoostRegressor: 38.25561832984171
5. RMSE of GradientBoostingRegressor: 33.99121520896884
6. RMSE of VotingRegressor: 31.334522814414328
7. RMSE of LGBMRegressor: 33.16664421051016
8. RMSE of XGBRegressor: 33.956663530227814

Tree-based models were chosen as they perform well with categorical variables.
Outliers were removed using IQR and categorical variables were one-hot encoded.
For outlier removal, there were three choices. Not removing outliers gave the best results. Removing outliers at the top product level (e.g. NTM1, NTM2) gave intermediate results, and removing outliers at the bottom unit level (e.g. NTM1_A10_X1_DIRECT_Large_Powder) gave the worst results.
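
The level at which outliers are removed only changes which rows share a set of quartiles. A minimal sketch of that idea follows. The helper `drop_outliers_by_group` and the toy values are assumptions for illustration, and it uses pandas' default linear quantile interpolation rather than the midpoint rule.

```python
import pandas as pd

def drop_outliers_by_group(df, group_cols, value_col="SourcingCost", k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], with quartiles computed per group."""
    grouped = df.groupby(group_cols)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = df[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# One top-level product with two bottom-level units on different price scales
toy = pd.DataFrame({
    "ProductName": ["NTM1"] * 12,
    "ProductID": ["NTM1_a"] * 6 + ["NTM1_b"] * 6,
    "SourcingCost": [10.0, 11.0, 12.0, 11.0, 10.0, 30.0,
                     50.0, 51.0, 52.0, 51.0, 50.0, 49.0],
})

top_level = drop_outliers_by_group(toy, "ProductName")
bottom_level = drop_outliers_by_group(toy, "ProductID")
print(len(toy), len(top_level), len(bottom_level))
```

In this toy case the value 30.0 is removed only at the bottom level, because pooling both units at the top level widens the IQR enough to keep it.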

I would have tuned the hyperparameters of the models but was unable to do so due to time constraints. Tuning would likely improve performance (i.e. decrease RMSE).


**SIMPLE TIME SERIES BASED MODELS** `NOTEBOOK INCLUDED`

Models Used
1. RMSE of Simple Moving Average: 37.291999530367285
2. RMSE of Exponential Weighted Average: 36.49481290429226
3. RMSE of Simple Exponential Smoothing: 37.283745121556244
4. RMSE of Additive Exponential Smoothing: 36.173433451689036
5. RMSE of Holt's Method: 36.173433451689036

Models Not Used due to Complexity or lack of Seasonality
1. ARIMA
2. SARIMAX
3. Holt-Winters' method

Outliers were removed at the bottom unit level (e.g. NTM1_A10_X1_DIRECT_Large_Powder) using IQR. Stationarity was then tested using the Augmented Dickey-Fuller test, after which these simple models were fitted. Fitting more complex models such as ARIMA with pmdarima's auto_arima resulted in ARIMA(0,0,0), which means ARIMA is too complex for this data: each series has at most 11 monthly data points, and some have fewer.
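
Since series length is the limiting factor, it can help to count the months available per product before choosing a model. The sketch below does this on made-up rows. Only the column names follow the notebook, and `MIN_MONTHS` is a hypothetical threshold, not a value used in the experiments.

```python
import pandas as pd

# Made-up training rows; only the column names follow the notebook
train = pd.DataFrame({
    "ProductID": ["P1"] * 5 + ["P2"] * 2,
    "MonthofSourcing": ["2020-07", "2020-08", "2020-09", "2020-09", "2020-10",
                        "2020-07", "2020-08"],
})

# Number of distinct months available for each product
months_per_product = train.groupby("ProductID")["MonthofSourcing"].nunique()

# Hypothetical threshold below which only the simplest models are attempted
MIN_MONTHS = 4
short_series = months_per_product[months_per_product < MIN_MONTHS].index.tolist()
print(months_per_product.to_dict(), short_series)
```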

**FINAL CHOICE**
Of all these models, the Voting Regressor performed best. However, I would go for a simpler time series model such as Exponential Smoothing, because it is easy to understand and problems are easier to diagnose if they arise in future. Given more time, I would like to experiment with LSTMs on this data.

**NOTE**
1. IQR -> Interquartile Range
2. All notebooks have more detailed explanations inside them. Please look them through.
3. Code would have been better organised given more time.