# M5 Forecasting - Accuracy
This is the **fifth iteration** of the **Makridakis competition**, organized by the **Makridakis Open Forecasting Centre (MOFC)** at the **University of Nicosia**.

<h2>Business objective:</h2>

> To **forecast unit sales figures** of retail goods for the next **28 days**, based on historical unit sales time series data made available by Walmart.

<h2>Data information:</h2>

>  The M5 dataset consists of the following files:
1. **calendar.csv** - Contains information about the day of the week and any special events for the dates on which the products are sold.
2. **sales_train_validation.csv** - Contains the historical daily unit sales data per product and store [d_1 - d_1913].
3. **sell_prices.csv** - Contains information about the price of the products sold per store and date.
4. **sales_train_evaluation.csv** - Contains the same sales data extended to [d_1 - d_1941]; the additional 28 days are the labels used for the public leaderboard.


> The **historical sales data** we are working with comprises sales figures for **3049 unique products** sold across **10 stores** in **3 states**. In every store, the products fall into **3 product categories**, which are further divided into **7 departments** in total.

The aggregation hierarchy in sales data is shown below.
<img src='https://miro.medium.com/max/3000/1*lPCqY7i6GRRx_eirsSov-Q.jpeg'>









<h2>Evaluation metrics used:</h2>

**r2_score** to evaluate the performance of our model.

**Why?**

* **r2_score** compares the **variance around the mean of the data** with the **variance around the model fitted to the data**, and reports the difference as a fraction of the former. It is easy to calculate and gives a good idea of how well the model fits.

* If r2_score is 0.60, the variance around the **fitted model** is 60% lower than the variance around the **mean**, i.e. the model explains 60% of the variation in the data, which indicates a reasonably good fit.
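
The definition above can be checked by hand. The following is a minimal sketch using made-up numbers (the arrays `y_true` and `y_pred` are illustrative and not taken from the M5 data) that computes the score from the two sums of squares.

```python
import numpy as np

# Illustrative values only; not taken from the M5 data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 7.0, 9.0])

# Squared deviation of the data around its own mean.
ss_total = np.sum((y_true - y_true.mean()) ** 2)
# Squared deviation of the data around the model's predictions.
ss_residual = np.sum((y_true - y_pred) ** 2)

# Fraction of the variation around the mean that the model removes.
r2 = 1.0 - ss_residual / ss_total
print(round(r2, 2))
```

Calling `sklearn.metrics.r2_score(y_true, y_pred)` on the same arrays should give the same value.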


**MAPE(Mean Absolute Percentage Error)** to evaluate our predictions.

**Why?**
* Unlike other evaluation metrics such as MAE and RMSE, MAPE is scale independent, i.e. it does not depend on the range of the values being predicted.
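
To illustrate the scale independence mentioned above, here is a small sketch with made-up numbers. The helper names `mape` and `mae` are our own, and the toy series are not M5 data. Multiplying a series and its predictions by 100 changes MAE by the same factor but leaves MAPE unchanged.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; undefined if any actual is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values only.
small_true, small_pred = [10, 20, 40], [11, 18, 44]
# The same series on a scale 100 times larger.
large_true = [v * 100 for v in small_true]
large_pred = [v * 100 for v in small_pred]

print(mape(small_true, small_pred), mape(large_true, large_pred))
print(mae(small_true, small_pred), mae(large_true, large_pred))
```

Note that the division by the actual value means MAPE cannot be computed for days with zero sales, so in this sketch it is assumed that all actual values are nonzero.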

**MSE(mean_squared_error)** can be used as a loss function.

**Why?**
* Unlike MAE, whose gradient has the same magnitude for any nonzero error, the gradient of **MSE** is proportional to the error, so the weight updates become smaller as the error approaches 0.
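
The difference in gradients can be seen on a handful of made-up error values. This is only a sketch of the per-sample derivatives with respect to the prediction, not a training loop.

```python
import numpy as np

# Illustrative errors (prediction minus target), shrinking towards 0.
errors = np.array([4.0, 2.0, 1.0, 0.5, 0.1])

# Derivative of error**2 with respect to the prediction: 2 * error.
mse_grad = 2.0 * errors
# Derivative of |error| with respect to the prediction: sign(error).
mae_grad = np.sign(errors)

for e, g_mse, g_mae in zip(errors, mse_grad, mae_grad):
    print(f"error={e:>4}  mse_grad={g_mse:>4}  mae_grad={g_mae:>4}")
```

The squared-error gradient shrinks with the error, while the absolute-error gradient stays at 1 however close the prediction gets.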

In [None]:
%config InlineBackend.figure_formats = ['svg']
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm import tqdm
warnings.filterwarnings("ignore")




In [None]:
sales_path='../input/m5-forecasting-accuracy/sales_train_evaluation.csv'
price_path='../input/m5-forecasting-accuracy/sell_prices.csv'
calendar_path='../input/m5-forecasting-accuracy/calendar.csv'

In [None]:
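def reduce_usage_mem(df):
    # Downcast each numeric column to the smallest dtype that can hold its values,
    # so large values cannot silently overflow a fixed int16/float16 cast.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col]=pd.to_numeric(df[col],downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col]=pd.to_numeric(df[col],downcast='float')
    return df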

### Historical sales data

In [None]:
sales_df=reduce_usage_mem(pd.read_csv(sales_path))
sales_df.head()

### Calendar data

In [None]:
calendar_df=reduce_usage_mem(pd.read_csv(calendar_path)[:1941])
calendar_df

### Product price data

In [None]:
price_df=reduce_usage_mem(pd.read_csv(price_path))
price_df.head()

### State wise aggregate sales  


> Aggregate sales of all stores per state from day 1 to day 1941







In [None]:
sales_ca=sales_df[sales_df['state_id']=='CA'].loc[:,'d_1':].sum(axis=0)
sales_wi=sales_df[sales_df['state_id']=='WI'].loc[:,'d_1':].sum(axis=0)
sales_tx=sales_df[sales_df['state_id']=='TX'].loc[:,'d_1':].sum(axis=0)

state_sales_df=pd.DataFrame({'sales_ca':list(sales_ca),
                             'sales_wi':list(sales_wi),
                             'sales_tx':list(sales_tx),
                             'date':list(calendar_df['date']),
                             'weekday':calendar_df['weekday'],
                              'wday':calendar_df['wday'],
                              'month':calendar_df['month'],
                             'event_type_1':calendar_df['event_type_1'],
                             'event_type_2':calendar_df['event_type_2']})


fig=go.Figure()
fig.add_trace(go.Scatter(y=state_sales_df['sales_ca'],x=state_sales_df['date'],mode='lines',name='California(CA)'))
fig.add_trace(go.Scatter(y=state_sales_df['sales_wi'],x=state_sales_df['date'],mode='lines',name='Wisconsin(WI)'))
fig.add_trace(go.Scatter(y=state_sales_df['sales_tx'],x=state_sales_df['date'],mode='lines',name='Texas(TX)'))
fig.update_layout(title='Aggregate sales per state',
                  xaxis_title='year',
                  yaxis_title='sales')



<h4>Key finding(s):</h4>

* On zooming in, almost the same seasonal pattern can be observed in all 3 states, with a major dip in sales on 25 December (Christmas), as the stores are closed on that day.

* An anomaly can be noticed in the aggregate sales pattern of **Texas**: on June 15, 2015 there is an unusual spike in sales that does not appear in the other states.







### State wise sales trend
Aggregate sales trend per state

In [None]:
# Newer statsmodels releases expect `period` instead of the older `freq` argument
state_ca_trend=seasonal_decompose(state_sales_df['sales_ca'],period=180).trend
state_wi_trend=seasonal_decompose(state_sales_df['sales_wi'],period=180).trend
state_tx_trend=seasonal_decompose(state_sales_df['sales_tx'],period=180).trend


fig=go.Figure()
fig.add_trace(go.Scatter(y=state_ca_trend,x=state_sales_df['date'],mode='lines',name='California(CA)'))
fig.add_trace(go.Scatter(y=state_wi_trend,x=state_sales_df['date'],mode='lines',name='Wisconsin(WI)'))
fig.add_trace(go.Scatter(y=state_tx_trend,x=state_sales_df['date'],mode='lines',name='Texas(TX)'))
fig.update_layout(title='Sales trend per state',
                  xaxis_title='year',
                  yaxis_title='sales')

<h4>Key finding(s)</h4>



*   Walmart stores in both **California** and **Wisconsin** show an **upward trend** in **average aggregate sales**, with **Wisconsin** stores growing more consistently than those in **California**.
*   Walmart stores in **Texas** show a **flat trend** in sales after July 2012.



### Zero vs non zero elements in sales

In [None]:
df=sales_df.iloc[:,6:]
full_sales_array=df.to_numpy().ravel()
# Count every distinct unit sales value in a single pass instead of rescanning the frame per value
x,counts=np.unique(full_sales_array,return_counts=True)
y=counts/len(full_sales_array)
zero_count=int(np.count_nonzero(full_sales_array==0))
nonzero_count=int(np.count_nonzero(full_sales_array))

fig=go.Figure()
fig.add_trace(go.Bar(x=[zero_count],y=['zero unit sales'],orientation='h',name=str(zero_count)))
fig.add_trace(go.Bar(x=[nonzero_count],y=['nonzero unit sales'],orientation='h',name=str(nonzero_count)))
fig.update_layout(width=700,
                  height=300,
                  xaxis_title='Count')

fig1=px.bar(x=list(x),y=y,title='Probability density of unit sales figures')
fig1.update_layout(xaxis_title='unit sales figures',
                  yaxis_title='probability')

fig.show()
fig1.show()

**Key finding(s):**

* The number of zero unit sales figures is more than double the number of nonzero ones.
* A large mass at zero combined with a right-skewed tail of positive values suggests that unit sales may follow a Tweedie distribution.
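
To show what such a distribution looks like, the sketch below simulates a compound Poisson-gamma variable, which is the form the Tweedie family takes when its power parameter lies between 1 and 2. All parameter values (`lam`, `gamma_shape`, `gamma_scale`) are arbitrary choices for illustration and are not fitted to the M5 data. Real unit sales are integer counts, so this is only an approximation of their shape.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Arbitrary illustrative parameters, not estimated from the sales data.
lam = 0.4          # mean number of purchase events per day
gamma_shape = 2.0  # shape of the size of a single event
gamma_scale = 1.5  # scale of the size of a single event

# Number of events per day; a day with zero events has zero sales.
n_events = rng.poisson(lam=lam, size=n_samples)

# The sum of n independent Gamma(shape, scale) draws is Gamma(n * shape, scale).
totals = np.zeros(n_samples)
positive = n_events > 0
totals[positive] = rng.gamma(shape=gamma_shape * n_events[positive], scale=gamma_scale)

zero_share = float(np.mean(totals == 0))
print(f"share of zeros: {zero_share:.3f}")
print(f"mean of positive values: {totals[positive].mean():.3f}")
```

With these parameters roughly two thirds of the simulated days are zero, and the remaining days form a right-skewed positive tail, which is qualitatively the pattern seen in the bar charts above.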

### Aggregate weekly average sales per state


In [None]:
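# Group by wday first so the bars follow calendar order rather than alphabetical weekday names;
# the columns are selected with a list because tuple-style selection is not accepted by recent pandas
sales_states_weekly=state_sales_df.groupby(by=['wday','weekday'],
                                           as_index=False)[['sales_ca',
                                                            'sales_wi',
                                                            'sales_tx']].agg('mean')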

fig=make_subplots(rows=1,cols=3)
fig.append_trace(go.Bar(x=sales_states_weekly['weekday'],y=sales_states_weekly['sales_ca'],name='California(CA)'),row=1,col=1)
fig.append_trace(go.Bar(x=sales_states_weekly['weekday'],y=sales_states_weekly['sales_wi'],name='Wisconsin(WI)'),row=1,col=2)
fig.append_trace(go.Bar(x=sales_states_weekly['weekday'],y=sales_states_weekly['sales_tx'],name='Texas(TX)'),row=1,col=3)
fig.update_layout(
                  title='Walmart weekly sales',
                  xaxis_title='weekday',
                  yaxis_title='Average sale')

<h4>Key finding(s):</h4>

* In all three states, average sales are noticeably higher on **weekends** (Saturday and Sunday) than on weekdays.

* This pattern shows that **day of week** can be an important factor to be considered to predict unit sales.   
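
One way to expose this pattern to a model is to derive day-of-week features from the date. The sketch below uses a hypothetical toy frame (`toy`, with made-up sales values) rather than the M5 tables, and the column names `day_name` and `is_weekend` are our own.

```python
import pandas as pd

# Hypothetical toy data: two weeks of daily totals, starting on a Saturday.
toy = pd.DataFrame({
    'date': pd.date_range('2016-01-02', periods=14, freq='D'),
    'sales': [120, 130, 90, 85, 88, 92, 100] * 2,
})

# Day-of-week features derived from the date column.
toy['day_name'] = toy['date'].dt.day_name()
toy['is_weekend'] = toy['date'].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

weekend_mean = toy.loc[toy['is_weekend'], 'sales'].mean()
weekday_mean = toy.loc[~toy['is_weekend'], 'sales'].mean()
print(weekend_mean, weekday_mean)
```

In the M5 data the `weekday` and `wday` columns of `calendar.csv` already carry this information, so the derivation from the date is only needed when those columns are not available.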

### Aggregate monthly average sales per state

In [None]:
# Columns are selected with a list; tuple-style selection is not accepted by recent pandas
sales_states_monthly=state_sales_df.groupby(by=['month'],
                                           as_index=False
                                           )[['sales_ca',
                                              'sales_wi',
                                              'sales_tx']].agg('mean')

sales_states_monthly['month_name']=['Jan','Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec']
fig=make_subplots(rows=1,cols=3)
fig.append_trace(go.Scatter(x=sales_states_monthly['month_name'],y=sales_states_monthly['sales_ca'],mode='lines',name='California(CA)'),row=1,col=1)
fig.append_trace(go.Scatter(x=sales_states_monthly['month_name'],y=sales_states_monthly['sales_wi'],mode='lines',name='Wisconsin(WI)'),row=1,col=2)
fig.append_trace(go.Scatter(x=sales_states_monthly['month_name'],y=sales_states_monthly['sales_tx'],mode='lines',name='Texas(TX)'),row=1,col=3)
fig.update_layout(
                  title='Walmart monthly sales',
                  xaxis_title='month',
                  yaxis_title='Average sale')

<h4>Key finding(s):</h4>

* The average sales of **California** and **Texas** are highest in **August**, whereas **Wisconsin** shows its highest average sales in **February**.
* All 3 states show a major dip in average sales in the month of **May**.
* Average sales in all states vary considerably from month to month. This pattern shows us that **month of year** is an important factor to predict unit sales.

### Walmart sales autocorrelation plot
To learn more about autocorrelation plots, see https://www.dummies.com/programming/big-data/data-science/autocorrelation-plots-graphical-technique-for-statistical-data/

In [None]:
walmart_sales=sales_df.loc[:,'d_1':].sum(axis=0)
plot_acf(walmart_sales,lags=60,title='Autocorrelation sales')
plt.ylabel('correlation')
plt.xlabel('time lag')
plt.show()

<h4>Key finding(s):</h4>

* Our **time series (unit sales)** shows high correlation with its own past values at every 6th, 7th, and 8th time lag.
* The correlation is even higher at the 7th and 28th time lags.
* Using past observations at the time lags that show higher correlation as features (lag features) can also help in making accurate sales predictions.
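
Lag features of this kind can be built with `shift`. The sketch below is a minimal illustration on a hypothetical long-format frame (`toy`, with two made-up item ids and synthetic sales), not the pipeline used in this notebook. Grouping by `id` keeps one item's history from leaking into another's.

```python
import pandas as pd

# Hypothetical long-format data: 30 days for each of two made-up items.
toy = pd.DataFrame({
    'id': ['item_a'] * 30 + ['item_b'] * 30,
    'sales': list(range(30)) + list(range(100, 130)),
})

# For each item, the sales observed 7 and 28 days earlier.
for lag in (7, 28):
    toy[f'sales_lag_{lag}'] = toy.groupby('id')['sales'].shift(lag)

print(toy.iloc[[7, 28, 37, 58]])
```

The first `lag` rows of every item have no earlier observation and are left as `NaN`.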

### Effect of events on sales per category

In [None]:
total_sales_hobbies=sales_df[sales_df['cat_id']=='HOBBIES'].loc[:,'d_1':].sum(axis=0)
total_sales_household=sales_df[sales_df['cat_id']=='HOUSEHOLD'].loc[:,'d_1':].sum(axis=0)
total_sales_foods=sales_df[sales_df['cat_id']=='FOODS'].loc[:,'d_1':].sum(axis=0)

category_sales_df=pd.DataFrame({'sales_foods':list(total_sales_foods),
                             'sales_hobbies':list(total_sales_hobbies),
                             'sales_household':list(total_sales_household),
                             'date':list(calendar_df['date']),
                              'event_type_1':calendar_df['event_type_1'],
                             'event_type_2':calendar_df['event_type_2']},
                               )
no_event_sales_category=category_sales_df.iloc[:,:3][category_sales_df['event_type_1'].isna()].mean()
cat_sales_on_events=category_sales_df.groupby(['event_type_1'],as_index=False)[['sales_foods',
                                                           'sales_hobbies',
                                                           'sales_household']].agg('mean')

# Difference between the event-day average and the no-event-day average, per category
cat_sales_on_events['sales_foods_diff']=cat_sales_on_events['sales_foods']-no_event_sales_category['sales_foods']
cat_sales_on_events['sales_hobbies_diff']=cat_sales_on_events['sales_hobbies']-no_event_sales_category['sales_hobbies']
cat_sales_on_events['sales_household_diff']=cat_sales_on_events['sales_household']-no_event_sales_category['sales_household']
fig=go.Figure()
fig.add_trace(go.Bar(x=cat_sales_on_events['sales_foods_diff'],y=cat_sales_on_events['event_type_1'],orientation='h',name='foods'))
fig.add_trace(go.Bar(x=cat_sales_on_events['sales_hobbies_diff'],y=cat_sales_on_events['event_type_1'],orientation='h',name='hobbies'))
fig.add_trace(go.Bar(x=cat_sales_on_events['sales_household_diff'],y=cat_sales_on_events['event_type_1'],orientation='h',name='household'))
fig.update_layout(width=700,
                  height=500,
                  title='Effect of events on sales per category',
                 yaxis_title='Event type',
                 xaxis_title='Deviation from average sales on non-event days')

<h4>Key finding(s):</h4>

* On **national events**, average sales drop significantly for all categories.
* Average sales of the **FOODS** category increase on **sporting events**.
* This plot shows that calendar events can be an important factor to predict unit sales.
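
Calendar events can be turned into model inputs in several ways. The sketch below shows one simple option on a hypothetical four-day calendar (`toy_calendar`): a binary event flag plus one indicator column per event type. The column names `is_event` and `event_*` are our own choice.

```python
import pandas as pd

# Hypothetical miniature calendar; None marks a day without an event.
toy_calendar = pd.DataFrame({
    'd': ['d_1', 'd_2', 'd_3', 'd_4'],
    'event_type_1': [None, 'Sporting', None, 'National'],
})

# 1 if any event falls on the day, 0 otherwise.
toy_calendar['is_event'] = toy_calendar['event_type_1'].notna().astype(int)

# One indicator column per event type; days without an event get all zeros.
event_dummies = pd.get_dummies(toy_calendar['event_type_1'], prefix='event', dtype=int)

features = pd.concat([toy_calendar[['d', 'is_event']], event_dummies], axis=1)
print(features)
```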

### Effect of SNAP days on sales per state per category

In [None]:
# Average daily sales per state and category; numeric_only skips the string id columns
state_cat_sales=pd.melt(sales_df.groupby(['state_id','cat_id'],as_index=False).mean(numeric_only=True),
                        id_vars=['state_id','cat_id'],var_name='d',value_name='sales')

state_cat_sales=state_cat_sales.merge(calendar_df[['snap_CA','snap_TX','snap_WI','d']],on='d')

# Mean sales per category, split by whether the day is a SNAP day in that state
snap_CA_sales=state_cat_sales[state_cat_sales['state_id']=='CA'].groupby(['cat_id','snap_CA'],as_index=False)['sales'].mean()
snap_WI_sales=state_cat_sales[state_cat_sales['state_id']=='WI'].groupby(['cat_id','snap_WI'],as_index=False)['sales'].mean()
snap_TX_sales=state_cat_sales[state_cat_sales['state_id']=='TX'].groupby(['cat_id','snap_TX'],as_index=False)['sales'].mean()

fig,axes=plt.subplots(1,3,figsize=(12,5))
snap_CA_sales.pivot(index='cat_id',columns='snap_CA',values='sales').plot(kind='bar',ax=axes[0],ylabel='avg sales',title='CA')
snap_WI_sales.pivot(index='cat_id',columns='snap_WI',values='sales').plot(kind='bar',ax=axes[1],title='WI')
snap_TX_sales.pivot(index='cat_id',columns='snap_TX',values='sales').plot(kind='bar',ax=axes[2],title='TX')


<h4>Key finding(s):</h4>

* Average sales of the **FOODS** category increase on **SNAP days** in all states.
* Taking SNAP days into account can be beneficial when predicting unit sales.

### Sales trend vs price trend per department

In [None]:
price_df_1=price_df.merge(sales_df[['item_id','cat_id','dept_id']].drop_duplicates(),on='item_id').reset_index(drop=True)
del price_df
# Average only the price column; the string id columns cannot be averaged
price_df_final=price_df_1.groupby(['wm_yr_wk','cat_id','dept_id'],as_index=False)['sell_price'].mean()
del price_df_1
price_df_final=price_df_final[price_df_final['wm_yr_wk']<=11613]


sales_df_1=sales_df.melt(id_vars=['id','item_id','dept_id','cat_id','store_id','state_id'],
              var_name='d',value_name='sales')
del sales_df
sales_df_2=sales_df_1.merge(calendar_df[['d','date','wm_yr_wk']].drop_duplicates(),on='d')
del sales_df_1
# Average only the sales column; the string id and date columns cannot be averaged
sales_df_final=sales_df_2.groupby(['wm_yr_wk','cat_id','dept_id'],as_index=False)['sales'].mean()


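# Align weekly sales with weekly prices on the grouping keys rather than on row position
price_df_final=price_df_final.merge(sales_df_final[['wm_yr_wk','cat_id','dept_id','sales']],
                                    on=['wm_yr_wk','cat_id','dept_id'],how='left')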

for dept in price_df_final['dept_id'].unique():
    dept_df=price_df_final[price_df_final['dept_id']==dept]
    fig=make_subplots(rows=1,cols=2)
    # Newer statsmodels releases expect `period` instead of the older `freq` argument
    sales_trend=seasonal_decompose(dept_df['sales'],period=30).trend
    price_trend=seasonal_decompose(dept_df['sell_price'],period=30).trend

    fig.append_trace(go.Scatter(x=dept_df['wm_yr_wk'],y=sales_trend,mode='lines',
                                name=dept+' weekly average sales trend'),row=1,col=1)

    fig.append_trace(go.Scatter(x=dept_df['wm_yr_wk'],y=price_trend,mode='lines',
                                name=dept+' weekly average sell price trend'),row=1,col=2)

    fig.update_layout(xaxis_title='week_id')
    fig.show()

<h4>Key finding(s):</h4>

* The **FOODS_1, HOBBIES_2, and HOUSEHOLD_2** departments show a downward sales trend when their prices rise.

* The above plots show that a change in price can affect unit sales.