# Advanced Machine Learning Application - Assignment 2
### Rohan Rocky Britto - Student ID: 24610990

## Data Import and Preparation

Importing required packages

In [1]:
import pandas as pd
import numpy as np
from joblib import dump
from prophet import Prophet
from sklearn.metrics import mean_absolute_error, mean_squared_error



Importing the custom evaluate_model function, which is saved in the project's src folder

In [2]:
import sys
sys.path.append('../../src')
from functions import evaluate_model
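The source of `evaluate_model` is not shown in this notebook. Judging only from the output it prints further down, it reports MAE and RMSE for the training and validation sets. A minimal sketch of such a helper, under that assumption, could look like the following; the name `evaluate_model_sketch` and the returned dictionary are hypothetical and not part of the project code.

```python
import numpy as np


def evaluate_model_sketch(train_target, train_preds, validation_target, validation_preds):
    """Hypothetical stand-in for functions.evaluate_model (actual source not shown)."""

    def mae(y_true, y_pred):
        # Mean of the absolute differences
        diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
        return float(np.mean(np.abs(diff)))

    def rmse(y_true, y_pred):
        # Square root of the mean squared difference
        diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
        return float(np.sqrt(np.mean(diff ** 2)))

    scores = {
        'train_mae': mae(train_target, train_preds),
        'validation_mae': mae(validation_target, validation_preds),
        'train_rmse': rmse(train_target, train_preds),
        'validation_rmse': rmse(validation_target, validation_preds),
    }
    print('The Mean Absolute Error for training set is ', scores['train_mae'])
    print('The Mean Absolute Error for validation set is ', scores['validation_mae'])
    print('The Root Mean Squared Error for training set is ', scores['train_rmse'])
    print('The Root Mean Squared Error for validation set is ', scores['validation_rmse'])
    return scores


# Toy values only, to show the call shape
demo_scores = evaluate_model_sketch([10, 20, 30], [12, 18, 33], [40, 50], [44, 47])
```

Returning the scores as well as printing them is a design choice of this sketch; it makes the numbers easier to compare across models later.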

Reading the training, validation and test files

In [3]:
df_train = pd.read_csv('../../data/processed/train_processed.csv')
df_validation = pd.read_csv('../../data/processed/validation_processed.csv')
df_test = pd.read_csv('../../data/processed/test_processed.csv')



Grouping the rows by date to get the total sales revenue for each day

In [4]:
df_train_grouped = df_train[['date', 'sale_revenue']].groupby(['date'], as_index=False).sum()
df_validation_grouped = df_validation[['date', 'sale_revenue']].groupby(['date'], as_index=False).sum()
df_test_grouped = df_test[['date', 'sale_revenue']].groupby(['date'], as_index=False).sum()
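To illustrate what the grouping step does, here is a small sketch on made-up data. The `item_id` column and all values are assumptions for illustration; only `date` and `sale_revenue` are known from the notebook.

```python
import pandas as pd

# Toy item-level sales: two rows on the first day, one on the second
toy_sales = pd.DataFrame({
    'date': ['2011-01-29', '2011-01-29', '2011-01-30'],
    'item_id': ['A', 'B', 'A'],  # hypothetical column, dropped before grouping
    'sale_revenue': [10.5, 4.5, 7.0],
})

# Same pattern as the cell above: keep date and revenue, then sum per date
toy_daily = toy_sales[['date', 'sale_revenue']].groupby(['date'], as_index=False).sum()
print(toy_daily)
```

With `as_index=False`, `date` stays a regular column, which is convenient because the frame is later renamed to the two-column layout Prophet expects.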

Converting the date column to datetime and sale_revenue to int (note that astype(int) drops any fractional part of the revenue rather than rounding it)

In [5]:
df_train_grouped['date'] = pd.to_datetime(df_train_grouped['date'])
df_train_grouped['sale_revenue'] = df_train_grouped['sale_revenue'].astype(int)

df_validation_grouped['date'] = pd.to_datetime(df_validation_grouped['date'])
df_validation_grouped['sale_revenue'] = df_validation_grouped['sale_revenue'].astype(int)

df_test_grouped['date'] = pd.to_datetime(df_test_grouped['date'])
df_test_grouped['sale_revenue'] = df_test_grouped['sale_revenue'].astype(int)
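The integer conversion above truncates decimals, and on some platforms `int` maps to a 32-bit type (the `info()` output below shows `int32`). The sketch below, on made-up values, contrasts truncation with rounding to an explicit 64-bit type. Whether rounding is preferable here is a judgment call, not something the notebook states.

```python
import pandas as pd

toy_revenue = pd.Series([81650.99, 78970.49])

# astype on floats truncates towards zero
truncated = toy_revenue.astype('int64')

# Rounding first keeps the nearest whole value; 'int64' fixes the width explicitly
rounded = toy_revenue.round().astype('int64')

print(truncated.tolist(), rounded.tolist())
```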

Checking the structure and the first few rows of the grouped training set

In [6]:
df_train_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1071 entries, 0 to 1070
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1071 non-null   datetime64[ns]
 1   sale_revenue  1071 non-null   int32         
dtypes: datetime64[ns](1), int32(1)
memory usage: 12.7 KB


In [7]:
df_train_grouped.head()

| | date | sale_revenue |
|---|---|---|
| 0 | 2011-01-29 | 81650 |
| 1 | 2011-01-30 | 78970 |
| 2 | 2011-01-31 | 57706 |
| 3 | 2011-02-01 | 60761 |
| 4 | 2011-02-02 | 46959 |


Renaming the columns of the train, validation and test sets to ds and y, as required by the Prophet model

In [8]:
df_train_grouped.columns = ['ds', 'y']
df_validation_grouped.columns = ['ds', 'y']
df_test_grouped.columns = ['ds', 'y']

Storing the target values in a separate variable

In [9]:
train_target = df_train_grouped['y']
validation_target = df_validation_grouped['y']
test_target = df_test_grouped['y']

### Baseline model

Creating a baseline that always predicts the mean daily sales revenue of the training set, as a reference point for comparing model performance

In [10]:
mean_value = train_target.mean()
base_preds = np.full((len(train_target), 1), mean_value)

In [34]:
print('The Mean Absolute Error for the baseline model is ', mean_absolute_error(train_target, base_preds))
# squared=False returns the square root of the MSE, so this value is the RMSE
# (recent scikit-learn releases provide root_mean_squared_error for this instead)
print('The Root Mean Squared Error for the baseline model is ', mean_squared_error(train_target, base_preds, squared=False))

The Mean Absolute Error for the baseline model is  15218.63201053842
The Root Mean Squared Error for the baseline model is  19756.530946780458
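The baseline above is scored on the training set only. A like-for-like comparison with the validation metrics reported later would score the same constant forecast against validation targets. The sketch below shows one way to do that with NumPy alone; `mean_baseline_scores` is a hypothetical helper and the inputs are toy numbers, not project data.

```python
import numpy as np


def mean_baseline_scores(train_y, eval_y):
    """Score a constant forecast (the training mean) against eval_y."""
    train_y = np.asarray(train_y, dtype=float)
    eval_y = np.asarray(eval_y, dtype=float)
    # Every prediction is the mean of the training targets
    baseline_preds = np.full(eval_y.shape, train_y.mean())
    errors = eval_y - baseline_preds
    return {
        'mae': float(np.mean(np.abs(errors))),
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
    }


# Toy values; in the notebook this would be called with train_target and validation_target
toy_scores = mean_baseline_scores([10, 20, 30], [25, 35])
print(toy_scores)
```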


### Prophet model without event data

Building a Prophet (Facebook Prophet) model with default settings

In [12]:
prop = Prophet()

In [13]:
prop.fit(df_train_grouped)

22:22:57 - cmdstanpy - INFO - Chain [1] start processing
22:22:58 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x24562a03710>

In [14]:
train_preds = prop.predict(df_train_grouped[['ds']])['yhat']

In [15]:
validation_preds = prop.predict(df_validation_grouped[['ds']])['yhat']

In [16]:
evaluate_model(train_target, train_preds, validation_target, validation_preds)

The Mean Absolute Error for training set is  6280.557595565194
The Mean Absolute Error for validation set is  8028.299534720642
The Root Mean Squared Error for training set is  9631.75791947105
The Root Mean Squared Error for validation set is  12248.39281334324


In [17]:
df_train_grouped['prop_pred'] = prop.predict(df_train_grouped[['ds']])['yhat']

In [18]:
df_validation_grouped['prop_pred'] = prop.predict(df_validation_grouped[['ds']])['yhat']

### Prophet model with holiday/event data

Passing the holiday/event data to the model to check if there is an improvement in performance

In [19]:
df_holiday = pd.read_csv('../../data/raw/calendar_events.csv')

In [20]:
df_holiday = df_holiday[['date', 'event_name']]

In [21]:
df_holiday.columns = ['ds', 'holiday']
df_holiday['ds'] = pd.to_datetime(df_holiday['ds'])
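Prophet reads holidays from a frame with `ds` and `holiday` columns, and optionally `lower_window` and `upper_window` to extend an event's effect to neighbouring days. The sketch below prepares such a frame from made-up events. The event names, the missing-name row, the duplicate row and the one-day upper window are all assumptions for illustration; the real contents of `calendar_events.csv` are not shown here.

```python
import pandas as pd

# Toy stand-in for calendar_events.csv
toy_events = pd.DataFrame({
    'date': ['2011-02-06', '2011-02-14', '2011-02-14', '2011-02-15'],
    'event_name': ['SuperBowl', 'ValentinesDay', 'ValentinesDay', None],
})

toy_holidays = (
    toy_events.rename(columns={'date': 'ds', 'event_name': 'holiday'})
    .dropna(subset=['holiday'])   # drop days without an event name
    .drop_duplicates()            # keep one row per (date, event)
    .assign(ds=lambda frame: pd.to_datetime(frame['ds']))
    .reset_index(drop=True)
)

# Optional columns: let each event also affect the following day
toy_holidays['lower_window'] = 0
toy_holidays['upper_window'] = 1

print(toy_holidays)
```

A frame shaped like this can be passed as the `holidays` argument in the same way as `df_holiday` in the next cell.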

In [22]:
prop_hol = Prophet(holidays=df_holiday)

In [23]:
prop_hol.fit(df_train_grouped)

22:22:59 - cmdstanpy - INFO - Chain [1] start processing
22:22:59 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x24564154ad0>

In [24]:
train_preds = prop_hol.predict(df_train_grouped[['ds']])['yhat']

In [25]:
validation_preds = prop_hol.predict(df_validation_grouped[['ds']])['yhat']

In [26]:
evaluate_model(train_target, train_preds, validation_target, validation_preds)

The Mean Absolute Error for training set is  5548.515861379955
The Mean Absolute Error for validation set is  7856.410096578735
The Root Mean Squared Error for training set is  7204.706030049829
The Root Mean Squared Error for validation set is  11400.73895079924


In [27]:
df_train_grouped['prop_hol_pred'] = prop_hol.predict(df_train_grouped[['ds']])['yhat']

In [28]:
df_validation_grouped['prop_hol_pred'] = prop_hol.predict(df_validation_grouped[['ds']])['yhat']

In [29]:
dump(prop_hol, '../../models/forecasting/prop_hol.joblib')

['../../models/forecasting/prop_hol.joblib']

The model with holiday/event data performs better on the validation set than the model without it (MAE 7,856.4 vs 8,028.3; RMSE 11,400.7 vs 12,248.4). Hence, we will use this model to evaluate its performance on test data and also view some sample predictions vs actuals.
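To put the difference between the two Prophet models in relative terms, the validation metrics printed by the two `evaluate_model` calls can be compared directly. The helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(before, after):
    """Percentage reduction in an error metric (positive means 'after' is better)."""
    return (before - after) / before * 100.0


# Validation metrics copied from the evaluate_model outputs above
val_mae_gain = relative_improvement(8028.299534720642, 7856.410096578735)
val_rmse_gain = relative_improvement(12248.39281334324, 11400.73895079924)

print(f'Validation MAE improvement: {val_mae_gain:.2f}%')
print(f'Validation RMSE improvement: {val_rmse_gain:.2f}%')
```

The validation gain is roughly 2% in MAE and 7% in RMSE, so the holiday regressors seem to help mostly on the larger errors.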

In [30]:
df_train_grouped.sample(10)

| | ds | y | prop_pred | prop_hol_pred |
|---|---|---|---|---|
| 422 | 2012-03-26 | 76432 | 82533.628412 | 81952.331022 |
| 892 | 2013-07-09 | 99078 | 91339.082073 | 96725.526423 |
| 490 | 2012-06-02 | 117199 | 108726.891295 | 108587.13832 |
| 759 | 2013-02-26 | 76897 | 89988.714749 | 90281.264433 |
| 25 | 2011-02-23 | 55070 | 57696.435793 | 58731.084424 |
| 896 | 2013-07-13 | 122543 | 120680.400734 | 119685.617208 |
| 252 | 2011-10-08 | 92848 | 93931.088591 | 93572.628776 |
| 818 | 2013-04-26 | 87257 | 97514.776894 | 94663.075257 |
| 1025 | 2013-11-19 | 78798 | 84226.494204 | 85724.220825 |
| 389 | 2012-02-22 | 67501 | 75946.647498 | 78910.986994 |


In [31]:
df_validation_grouped.sample(10)

| | ds | y | prop_pred | prop_hol_pred |
|---|---|---|---|---|
| 434 | 2015-03-14 | 137480 | 129469.339212 | 127272.860229 |
| 285 | 2014-10-16 | 92675 | 94406.001226 | 93533.906572 |
| 36 | 2014-02-09 | 137581 | 124218.610817 | 124489.808654 |
| 144 | 2014-05-28 | 81881 | 92358.643103 | 92380.825065 |
| 438 | 2015-03-18 | 93405 | 98968.0932 | 98005.591485 |
| 230 | 2014-08-22 | 104528 | 110565.634941 | 108317.019134 |
| 354 | 2014-12-24 | 90657 | 89697.317891 | 97423.655055 |
| 77 | 2014-03-22 | 117496 | 123230.823805 | 122439.269866 |
| 208 | 2014-07-31 | 89807 | 95249.421819 | 94878.622061 |
| 421 | 2015-03-01 | 148438 | 131797.993597 | 129604.667974 |


In [35]:
df_test_grouped['prop_hol_pred'] = prop_hol.predict(df_test_grouped[['ds']])['yhat']

In [33]:
df_test_grouped.sample(10)

| | ds | y | prop_hol_pred |
|---|---|---|---|
| 131 | 2015-08-28 | 106949 | 90963.677599 |
| 129 | 2015-08-26 | 96770 | 91406.664486 |
| 283 | 2016-01-27 | 99298 | 94509.215184 |
| 25 | 2015-05-14 | 102147 | 89788.777461 |
| 182 | 2015-10-18 | 155720 | 125605.908706 |
| 53 | 2015-06-11 | 106235 | 95131.766644 |
| 368 | 2016-04-21 | 110410 | 95265.116381 |
| 78 | 2015-07-06 | 124382 | 122931.194501 |
| 274 | 2016-01-18 | 121003 | 125785.468905 |
| 142 | 2015-09-08 | 107219 | 103324.037388 |
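The notebook shows sample test predictions but does not print error metrics for the test set. A sketch of how they could be computed from a frame with `y` and `prop_hol_pred` columns follows. It uses only three rows copied from the sample above, so the resulting numbers illustrate the calculation and are not the test-set score.

```python
import numpy as np
import pandas as pd

# Three rows copied from the sampled test output; NOT the full test set
sample_test = pd.DataFrame({
    'y': [106949, 96770, 99298],
    'prop_hol_pred': [90963.677599, 91406.664486, 94509.215184],
})

sample_errors = sample_test['y'] - sample_test['prop_hol_pred']
sample_mae = float(sample_errors.abs().mean())
sample_rmse = float(np.sqrt((sample_errors ** 2).mean()))

print('MAE on the three sample rows: ', sample_mae)
print('RMSE on the three sample rows: ', sample_rmse)
```

Applied to the full `df_test_grouped` frame, the same two lines would give metrics comparable to the training and validation figures reported earlier.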


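**Conclusion:** The Prophet model with holiday/event data performs much better than the baseline model (training MAE of 5,548.5 against 15,218.6 for the baseline, and validation MAE of 7,856.4) and could be a candidate for deployment in production. However, the gap between training and validation error (RMSE 7,204.7 vs 11,400.7) indicates overfitting, which we will be working on reducing.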
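One common way to approach the overfitting mentioned above is to tighten Prophet's regularisation priors and compare validation error across settings. The sketch below only builds a candidate grid; the specific values are assumptions chosen for illustration, and this is not a procedure the notebook itself carries out. `changepoint_prior_scale`, `seasonality_prior_scale` and `holidays_prior_scale` are Prophet constructor arguments, with smaller values giving stronger regularisation.

```python
from itertools import product

# Illustrative values only; Prophet's defaults are 0.05, 10.0 and 10.0 respectively
param_grid = {
    'changepoint_prior_scale': [0.001, 0.01, 0.05],
    'seasonality_prior_scale': [0.1, 1.0, 10.0],
    'holidays_prior_scale': [0.1, 1.0, 10.0],
}

# One dict per combination of settings
candidate_params = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

print(len(candidate_params), 'candidate settings')
print(candidate_params[0])

# Each candidate could then be fitted and scored on the validation set, e.g.
#   model = Prophet(holidays=df_holiday, **params).fit(training_frame)
# where training_frame holds only the ds and y columns.
```

Choosing the setting with the lowest validation RMSE, and only then checking the test set once, keeps the test data out of the tuning loop.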