<a href="https://colab.research.google.com/github/sundar911/retail_analytics/blob/main/demand_forecasting__final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
df_stores = pd.read_csv('../input/retaildataset/stores data-set.csv')
df_features = pd.read_csv('../input/retaildataset/Features data set.csv', parse_dates = ['Date'])
df_sales = pd.read_csv('../input/retaildataset/sales data-set.csv', parse_dates = ['Date'])

## Cleaning and preprocessing

In [None]:
df_sales.head()

In [None]:
df_sales.Date.value_counts()

In [None]:
df_sales.info()

In [None]:
df_features.head()

In [None]:
df_features.isna().sum()

In [None]:
df_features.Unemployment.plot();

In [None]:
df_features.CPI.plot();

In [None]:
df_features[df_features.Store == 20].CPI.plot();

In [None]:
df_features[df_features.Store == 40].CPI.plot();

In [None]:
df_features[df_features.CPI.isna()]

The markdown columns have too many missing values to impute sensibly. CPI and Unemployment, however, are missing only about 5% of their values, so we impute those by interpolating within each store.

In [None]:
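# Interpolate CPI and Unemployment store by store so values never leak across stores;
# the markdown columns are left alone here, as decided above.
cols = ['CPI', 'Unemployment']
df_features[cols] = df_features.groupby('Store')[cols].transform(lambda s: s.interpolate())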
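
The per-store interpolation can be tried on a toy frame first. This is a minimal sketch: the `toy` frame, its values and the `CPI_filled` column are made up for illustration and are not part of the dataset. It shows that gaps are filled linearly inside each store and that a trailing gap takes the last known value.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two stores, each with gaps in CPI.
toy = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2, 2, 2],
    'CPI': [210.0, np.nan, 212.0, np.nan, 130.0, np.nan, np.nan, 133.0],
})

# Interpolate inside each store so store 1's values never influence store 2.
toy['CPI_filled'] = toy.groupby('Store')['CPI'].transform(lambda s: s.interpolate())
print(toy)
```

Grouping by `Store` matters because a plain `interpolate()` over the whole frame would blend the last value of one store into the first gap of the next.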

In [None]:
df_features[df_features.Store == 20].CPI.plot();

In [None]:
df_features.isna().sum()

In [None]:
df_features[df_features.columns[4:9]] = df_features[df_features.columns[4:9]].fillna(0)

In [None]:
df_features.isna().sum()

In [None]:
df_stores.head()

Merge features and sales on their common columns (Date, Store and IsHoliday) with a right join, so that only rows present in sales are kept. Sales has data until 2012 while features has data until 2013, so we train the model on the data up to 2012 and forecast 2013.

In [None]:
df_all_1 = df_features.merge(df_sales, 'right', on = ['Date', 'Store', 'IsHoliday'])
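
To see what the right join does, here is a small sketch on made-up frames. The `features` and `sales` frames and all their values are hypothetical; only the column names and the `merge` call mirror the notebook. The 2013 features row has no matching sales row, so it does not appear in the result.

```python
import pandas as pd

# Hypothetical features: one row per store and date, extending into 2013.
features = pd.DataFrame({
    'Store': [1, 1, 1],
    'Date': pd.to_datetime(['2012-10-19', '2012-10-26', '2013-01-04']),
    'IsHoliday': [False, False, False],
    'CPI': [223.0, 223.1, 223.5],
})

# Hypothetical sales: one row per store, department and date, ending in 2012.
sales = pd.DataFrame({
    'Store': [1, 1, 1, 1],
    'Dept': [1, 2, 1, 2],
    'Date': pd.to_datetime(['2012-10-19', '2012-10-19', '2012-10-26', '2012-10-26']),
    'IsHoliday': [False, False, False, False],
    'Weekly_Sales': [24000.0, 45000.0, 27000.0, 43000.0],
})

# Right join: keep every sales row and attach the matching features.
merged = features.merge(sales, how='right', on=['Date', 'Store', 'IsHoliday'])
print(merged)
```

Each features row is repeated once per department, which is why the merged frame has as many rows as `sales`.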

In [None]:
df_all_1

Merging sales+features with stores 

In [None]:
df_all = df_all_1.merge(df_stores, 'left', on = 'Store')

In [None]:
df_all = df_all.sort_values('Date')

In [None]:
df_all.reset_index(inplace = True)

In [None]:
df_all.drop(['index'], axis = 1, inplace = True)

In [None]:
df_all.head()

In [None]:
df_all.describe()

In [None]:
df_all.info()

In [None]:
df_all_copy = df_all.copy()

In [None]:
mapping_dict = {'IsHoliday':{True:1, False:0}}
df_all.replace(mapping_dict, inplace=True)
mapping_dict_1 = {'Type':{'A':3, 'B':2, 'C':1}}
df_all.replace(mapping_dict_1, inplace=True)

In [None]:
df_all.info()

In [None]:
df_all.head()

# EDA

### Setting up for some time series analysis

In [None]:
df_by_date = df_all.groupby('Date', as_index=False).agg({'Temperature': 'mean',
                                                        'Fuel_Price': 'mean',
                                                        'CPI': 'mean',
                                                        'Unemployment': 'mean', 
                                                        'Weekly_Sales': 'sum',
                                                        'IsHoliday': 'mean'})

In [None]:
df_by_date.Date = pd.to_datetime(df_by_date.Date, errors='coerce')
df_by_date.set_index('Date', inplace=True)

In [None]:
df_by_date.head()

Resample to a weekly frequency and backfill the empty weeks this introduces, since the data above has no regular frequency.

In [None]:
df_by_date_new = df_by_date.resample('W').mean().bfill()

In [None]:
df_by_date_new[0:10]
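
The effect of weekly resampling followed by a backfill can be seen on a tiny series. This is a sketch with made-up observations (`obs` and its values are hypothetical): one week has no record, so resampling creates an empty bin that the backfill then fills with the next available value.

```python
import pandas as pd

# Hypothetical irregular observations: no record falls in the week ending 2010-02-21.
obs = pd.Series(
    [100.0, 110.0, 130.0],
    index=pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-26']),
)

weekly = obs.resample('W').mean()  # weekly bins labelled by their closing Sunday
filled = weekly.bfill()            # empty bins take the next available value
print(pd.DataFrame({'weekly': weekly, 'filled': filled}))
```

Backfilled weeks are copies rather than observations, so it is worth counting them with `weekly.isna().sum()` before trusting anything fitted on the filled series.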

### Decomposing

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

`seasonal_decompose` splits the time series into an estimated trend component, an estimated seasonal component and an estimated residual. Plotting these alongside the original data shows which components influence the observed values the most.

In [None]:
multi_plot = seasonal_decompose(df_by_date_new['Weekly_Sales'], model = 'add', extrapolate_trend='freq')

plt.figure(figsize=(20,5))
multi_plot.observed.plot(title = 'weekly sales')

plt.figure(figsize=(20,5))
multi_plot.trend.plot(title = 'trend')

plt.figure(figsize=(20,5))
multi_plot.seasonal.plot(title = 'seasonal')

plt.figure(figsize=(20,5))
multi_plot.resid.plot(title = 'residual');

As the plots show, the series is strongly influenced by the seasonal component.
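
The idea behind an additive decomposition (observed = trend + seasonal + residual) can be sketched with pandas alone. This is a simplified illustration on a synthetic series, not the exact procedure `seasonal_decompose` uses; the `period`, `pattern` and series values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

period = 5                                          # odd, so a plain centered window works
pattern = np.array([10.0, -5.0, -10.0, 0.0, 5.0])   # seasonal effect, sums to zero
t = np.arange(30)
observed = pd.Series(100 + 2.0 * t + np.tile(pattern, 6))

# Trend: centered moving average over one full period (NaN at both edges).
trend = observed.rolling(period, center=True).mean()

# Seasonal: average detrended value at each position within the period.
detrended = observed - trend
seasonal = detrended.groupby(t % period).transform('mean')

# Residual: whatever trend and seasonal leave unexplained.
resid = observed - trend - seasonal
print(seasonal.head(period).tolist())
```

Because the synthetic series is exactly linear plus a repeating pattern, the recovered seasonal component matches `pattern` and the residual is zero wherever the trend is defined.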

### Correlations

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df_by_date_new.corr('spearman'), annot = True);

Fuel_Price and CPI show a strong positive correlation, while Unemployment is strongly negatively correlated with both Fuel_Price and CPI. Surprisingly, the unemployment rate does not seem to affect weekly sales much, at least not directly.
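
Spearman correlation measures monotonic rather than linear association: it is the Pearson correlation computed on ranks. The sketch below checks this on synthetic data (the `toy` frame and its columns are made up for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy = pd.DataFrame({
    'x': x,
    'monotone': np.exp(x),          # nonlinear but strictly increasing in x
    'noise': rng.normal(size=50),
})

spearman = toy.corr(method='spearman')
pearson_on_ranks = toy.rank().corr(method='pearson')
print(spearman.round(3))
```

The `x` and `monotone` columns get a Spearman coefficient of 1 even though their relationship is not linear, which is why this method suits variables such as CPI and fuel price that may move together without being proportional.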

### Holiday weeks

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(data = df_by_date, x = 'IsHoliday', y = 'Weekly_Sales');

Holiday weeks do not guarantee higher weekly sales, but sales are often higher in those weeks.

### Analysis by store

In [None]:
df_by_store = df_all.groupby('Store').agg({'Temperature': 'mean',
                                           'Fuel_Price': 'mean',
                                           'CPI': 'mean',
                                           'Unemployment': 'mean', 
                                           'Weekly_Sales': 'sum',
                                           'IsHoliday': 'mean',
                                           'Type': 'max'})

In [None]:
df_by_store.describe()

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(data = df_by_store, x = 'Type', y = 'Weekly_Sales')
plt.tight_layout()

### Net sales (monthly)

In [None]:
monthly_sales = df_all.groupby(df_all.Date.dt.month).agg({'Weekly_Sales':'sum'})
plt.figure(figsize = (15,8))
sns.barplot(x = monthly_sales.index, y = monthly_sales.Weekly_Sales);

### Departments

In [None]:
df_by_dept = df_all.groupby('Dept', as_index=False).agg({'Weekly_Sales':'sum'})

In [None]:
df_by_dept

In [None]:
df_by_dept.sort_values(by = 'Weekly_Sales', ascending = False, inplace = True)

In [None]:
df_by_dept.reset_index(drop=True, inplace=True)

In [None]:
df_by_dept

The best and worst performing departments can be seen above.

In [None]:
sns.barplot(y='Weekly_Sales', x='Dept', data=df_by_dept[:5]);

In [None]:
sns.barplot(y='Weekly_Sales', x='Dept', data=df_by_dept[-5:]);

# Forecasting using the Holt-Winters Model

Exponential smoothing smooths time series data by giving observations weights that decrease exponentially with their age, unlike a simple moving average, which gives equal weight to every observation in its window.
Holt-Winters (triple) exponential smoothing applies exponential smoothing three times: to the level, to the trend and to the seasonal component. This makes it suitable for series that show both trend and seasonality.
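
The exponentially decreasing weights are easiest to see in simple (single) exponential smoothing, the building block that Holt-Winters repeats for level, trend and seasonality. The sketch below is illustrative only: the function name, the smoothing factor `alpha = 0.5` and the input values are assumptions, and it is not the routine statsmodels uses internally.

```python
import numpy as np

def simple_exponential_smoothing(values, alpha):
    """Return the smoothed level after each observation (initial level = first value)."""
    level = values[0]
    levels = [level]
    for y in values[1:]:
        # New level: blend of the latest observation and the previous level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return np.array(levels)

alpha = 0.5
# Weight given to the observation k steps back: alpha * (1 - alpha)**k
weights = alpha * (1 - alpha) ** np.arange(5)
print(weights)

y = np.array([10.0, 20.0, 30.0])
smoothed = simple_exponential_smoothing(y, alpha)
print(smoothed)
```

A larger `alpha` makes the weights decay faster, so the smoothed series reacts more quickly to recent observations.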

### Train and test on 2012 data to determine accuracy

In [None]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [None]:
fit_model = ExponentialSmoothing(df_by_date_new['Weekly_Sales'][:120],
                                 trend = 'add',
                                 seasonal = 'add',
                                 seasonal_periods = 52).fit()

prediction = fit_model.forecast(34)
prediction

In [None]:
plt.figure(figsize=(20,10))
plt.plot(df_by_date_new.index[120:], prediction, label = 'predicted')
plt.plot(df_by_date_new.index[120:], df_by_date_new.Weekly_Sales[120:], label = 'actual')
plt.legend();

In [None]:
def mean_absolute_percentage_error(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes y_true contains no zeros."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape = mean_absolute_percentage_error(df_by_date_new.Weekly_Sales[120:], prediction)
print(f"Mean Absolute Percentage Error = {mape:.2f}%")
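
A small worked example makes the metric concrete. The `actual` and `forecast` values below are made up; the function is repeated so the block runs on its own.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

# Per-point absolute errors are 10%, 10% and 0%, so the mean is 20/3 percent.
mape = mean_absolute_percentage_error(actual, forecast)
print(f"{mape:.2f}%")
```

Since each error is divided by the actual value, weeks with small sales weigh as much as weeks with large sales, and an actual value of zero would make the metric undefined.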

### Forecasting 2013 sales

In [None]:
fit_model = ExponentialSmoothing(df_by_date_new['Weekly_Sales'][:-2],
                                 trend = 'add',
                                 seasonal = 'add',
                                 seasonal_periods = 52).fit()

future_prediction = fit_model.forecast(56)
future_prediction

In [None]:
plt.figure(figsize=(20, 10))
plt.plot(df_by_date_new.index, df_by_date_new.Weekly_Sales)
plt.plot(future_prediction, '--')
plt.legend(['2010-2012 actual', '2013 forecast'])