# M5 Competition

The Makridakis Competitions, as quoted from Wikipedia, "are a series of open competitions organized by teams led by forecasting researcher Spyros Makridakis and intended to evaluate and compare the accuracy of different forecasting methods". We Kagglers are fortunate enough to compete in not one but two M5 competitions, hosted on the Kaggle platform for the first time since the series' inception in 1982.

The 2 competitions are Accuracy and Uncertainty:

* The accuracy competition will be evaluated on Weighted Root Mean Squared Scaled Error (WRMSSE)
* The uncertainty competition will be evaluated on Weighted Scaled Pinball Loss (WSPL)
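
For reference, the per-series RMSSE and its weighted aggregate can be written as below. This is a sketch of the definition as implemented by the `rmsse` helper later in this notebook, where $n$ is the number of historical observations, $h$ is the forecast horizon (28 days here), $\hat{y}_t$ is the forecast, and $w_i$ is the weight the competition assigns to series $i$ (how the weights are computed is not covered here).

$$
\mathrm{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(y_t-\hat{y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(y_t-y_{t-1}\right)^2}},
\qquad
\mathrm{WRMSSE} = \sum_{i} w_i\,\mathrm{RMSSE}_i
$$

The denominator is the mean squared error of a one-step naive forecast on the history, so an RMSSE below 1 means the forecast beats that naive benchmark on this scale.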

Task: This notebook covers the Accuracy competition, whose goal is to forecast Walmart sales 28 days into the future, based on hierarchical sales data from 3 different states (California, Texas, and Wisconsin).
__________________________________________
### Things to Note
I do not do an in-depth analysis of features in this notebook. There are many good examples of that, which I have linked below. These notebooks in particular helped me become more acquainted with time-series analysis:
* [@Headsortails](https://www.kaggle.com/headsortails): [Back to (predict) the Future - Interactive M5 EDA](https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda)
* [@Robikscube](https://www.kaggle.com/robikscube): [M5 Forecasting - Starter Data Exploration](https://www.kaggle.com/robikscube/m5-forecasting-starter-data-exploration)
* [@Tarunpaparaju](https://www.kaggle.com/tarunpaparaju): [M5 Competition: EDA + Models](https://www.kaggle.com/tarunpaparaju/m5-competition-eda-models)
* [@leonzz](https://www.kaggle.com/leonzz): [M5-forecasting-arima](https://www.kaggle.com/leonzz/M5-forecasting-arima)

*I may not iterate much on this notebook (in comparison to some of the other excellent notebooks) over the remainder of the competition, since my focus will be on the competition itself rather than on notebooks. However, over time I may revisit this for practice.*

Links for in-depth description: 
* 

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import plotly.graph_objects as go 
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

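# NOTE: statsmodels.tsa.arima_model was removed in statsmodels 0.13; this notebook assumes an older release
from statsmodels.tsa.arima_model import ARIMA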
from statsmodels.tsa.stattools import adfuller

In [None]:
# Reading data
DATA_DIR= "../input/m5-forecasting-accuracy/"
CALENDAR= DATA_DIR + "calendar.csv"
# SALES_TRAIN_VALID= DATA_DIR + "sales_train_validation.csv"
SAMPLE_SUB= DATA_DIR + "sample_submission.csv"
SELL_PRICES= DATA_DIR + "sell_prices.csv"
FULL_TRAIN_DF= DATA_DIR + "sales_train_evaluation.csv"

calendar= pd.read_csv(CALENDAR)
# stv= pd.read_csv(SALES_TRAIN_VALID)
sub= pd.read_csv(SAMPLE_SUB)
sell_prices= pd.read_csv(SELL_PRICES)
full_df= pd.read_csv(FULL_TRAIN_DF)

print(f"Calendar Dataframe shape: {calendar.shape}")
# print(f"Sales Train Validation Dataframe shape: {stv.shape}")
print(f"Submission Dataframe shape: {sub.shape}")
print(f"Sell Prices Dataframe shape: {sell_prices.shape}")
print(f"Full training Dataframe shape: {full_df.shape}")

In [None]:
# https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/134072
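def rmsse(y_true, y_pred, y_hist):
    """Root Mean Squared Scaled Error, scaled by the history's one-step naive error."""
    # convert to arrays so pandas index alignment cannot distort the differences
    y_true, y_pred, y_hist= (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_hist))
    h, n= len(y_true), len(y_hist)
    error= np.sum((y_true - y_pred)**2)
    deviation= (1/(n-1)) * np.sum(np.diff(y_hist)**2)
    return np.sqrt((1/h) * (error/deviation))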

In [None]:
# names of the daily sales columns (d_1, d_2, ...)
d_cols = full_df.columns[full_df.columns.str.startswith("d_")]

# total daily sales per store, transposed so that rows are days and columns are stores
df= full_df.groupby("store_id")[list(d_cols)].sum().T
df.head()

In [None]:
df.shape

### Stationarity 
Does the data have constant mean, variance and no seasonality? 

There are many ways to check for stationarity, the two I will be using are: 
1. Visualization - Check for any obvious trends or seasonality 
2. Statistical Test - Check if the expectations for stationarity are met

The statistical test I will be doing is the augmented [Dickey-Fuller Test](https://en.wikipedia.org/wiki/Dickey%E2%80%93Fuller_test), whose null hypothesis is that the series has a unit root (i.e. is non-stationary).
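
To make the idea behind the test concrete, here is a minimal NumPy sketch of the plain (non-augmented) Dickey-Fuller regression on synthetic data. It is not what this notebook runs: the notebook uses `adfuller` from statsmodels, which also adds lagged difference terms. The helper name `dickey_fuller_tstat` and the toy series are assumptions for illustration, and the critical value is the approximate asymptotic 5% value for a regression with a constant and no trend.

```python
import numpy as np

def dickey_fuller_tstat(y):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # left-hand side: first differences
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)     # OLS estimates [alpha, gamma]
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)      # stationary by construction
random_walk = np.cumsum(white_noise)    # has a unit root by construction

CRITICAL_5PCT = -2.86  # approximate asymptotic 5% critical value (constant, no trend)
toy_series = {
    "white noise": white_noise,
    "random walk": random_walk,
    "differenced walk": np.diff(random_walk),
}
for name, series in toy_series.items():
    stat = dickey_fuller_tstat(series)
    print(f"{name}: t = {stat:.2f}, reject unit root: {stat < CRITICAL_5PCT}")
```

A statistic below the critical value rejects the unit-root null, which is the same decision the p-value comparison makes in the `ad_fuller` helper further down.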

In [None]:
# adding calendar.csv
df= df.reset_index().rename(columns= {"index": "d"}).merge(calendar, on="d", how= "left", validate="1:1")

In [None]:
df.head()

In [None]:
# collect the store columns, whose names are all uppercase (e.g. CA_1)
stores= [col for col in df.columns if col.isupper()]
stores

In [None]:
# plotting sales over time figure
fig = go.Figure(data= [{
        "x":df.date ,
        "y": df[col],
        "name": col} for col in stores])

fig.update_layout(
    title="Total sales per store",
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

In [None]:
# total sales per weekday over the last 7 days; select the store columns before summing
df_weekend= df[-7:].groupby("weekday")[stores].sum()
# plotting the figure
fig = go.Figure(data= [{
        "x":df_weekend.index,
        "y": df_weekend[col],
        "name": col} for col in df_weekend])

fig.update_layout(
    title="Total sales per store by day of the week (last week of data)",
    xaxis_title="Day of the week",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

There seems to be a weekly pattern in this data: Saturday and Sunday show an increase in sales across all stores.
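
Since the plot above covers only the final seven days, one way to check whether a weekend uplift holds more generally is to average sales per weekday over every week. The sketch below does this on synthetic data: the `toy` frame, its values, and the built-in weekend uplift are made up for illustration and are not the M5 data.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales (an assumption for illustration): 8 full weeks with a weekend uplift.
dates = pd.date_range("2021-01-04", periods=56, freq="D")  # starts on a Monday
rng = np.random.default_rng(0)
is_weekend = np.asarray(dates.dayofweek >= 5)               # Saturday=5, Sunday=6
toy = pd.DataFrame({
    "date": dates,
    "weekday": dates.day_name(),
    "units": 100 + 30 * is_weekend + rng.normal(0, 5, size=len(dates)),
})

# Average units per weekday across all weeks, shown in calendar order
# (groupby alone would sort the weekday names alphabetically).
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_mean = toy.groupby("weekday")["units"].mean().reindex(order)
print(weekday_mean.round(1))
```

The same `groupby("weekday")` call applied to the full merged frame, rather than to its last seven rows, would give the equivalent view for the real stores.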

In [None]:
# plotting the figure
df_monthly= df.groupby("month")[stores].sum()

fig = go.Figure(data= [{
        "x":df_monthly.index,
        "y": df_monthly[col],
        "name": col} for col in df_monthly])

fig.update_layout(
    title="Monthly sales per store",
    xaxis_title="Month",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

We can also observe that, for each store, total sales are highest in the earlier months of the year and taper off towards the end of the year. Because these totals are summed over every year in the data, months that appear in more years will naturally show larger totals.

Inspecting the plots can give some insight into whether the data is stationary.

However, when it's not so clear, there are statistical tests that can be done to make things clearer.

In [None]:
# Augmented Dickey-Fuller statistical test
def ad_fuller(timeseries: pd.DataFrame, significance_level= 0.05):
    """Run the ADF test on every column and split the results by stationarity."""
    non_stationary_cols= []
    stationary_cols= []

    for col in timeseries.columns:
        # test the column of the frame that was passed in, not the global df
        dftest= adfuller(timeseries[col], autolag="AIC")
        # a p-value at or below the significance level rejects the unit-root null
        is_stationary= bool(dftest[1] <= significance_level)
        result= {col: {"Test Statistic": dftest[0],
                       "p-value": dftest[1],
                       "# Lags": dftest[2],
                       "# Observations": dftest[3],
                       "Critical Values": dftest[4],
                       "Stationary": is_stationary}}
        (stationary_cols if is_stationary else non_stationary_cols).append(result)
    return non_stationary_cols, stationary_cols
            

In [None]:
non_stationary_cols, stationary_cols= ad_fuller(df[stores])

len(non_stationary_cols), len(stationary_cols)

In [None]:
non_stationary_cols[0]

Since we failed to reject the null hypothesis (that the series has a unit root), we must difference our data to make it stationary. But first, let's see what this looks like in a plot.
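
As a small illustration of what differencing does, the sketch below takes the first difference of a toy series (the values are made up) and shows that a cumulative sum rebuilds the original levels when the leading `NaN` is filled with the first observation.

```python
import numpy as np
import pandas as pd

# Toy series with an upward drift (values are made up for illustration).
sales = pd.Series([100.0, 104.0, 103.0, 110.0, 115.0, 113.0, 121.0])

# First difference; the leading NaN is replaced with the first observation
# so that the original levels can be rebuilt later.
diffed = sales.diff().fillna(sales)

# A cumulative sum inverts the differencing exactly.
rebuilt = diffed.cumsum()

print(diffed.tolist())              # [100.0, 4.0, -1.0, 7.0, 5.0, -2.0, 8.0]
print(np.allclose(rebuilt, sales))  # True
```

This is why forecasts made on a differenced series have to be cumulatively summed before they can be compared with the original sales.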

In [None]:
rolling_mean= df["CA_1"].rolling(window=28, center=False).mean()
rolling_std= df["CA_1"].rolling(window=28, center=False).std() 

fig= go.Figure(data=
               [go.Scatter(x= df["date"],
                           y= df["CA_1"],
                           name= "original", 
                           showlegend=True,
                           marker=dict(color="blue"))])
fig.add_traces([
    go.Scatter(x= df["date"],
                         y=rolling_mean,
                         name= "rolling mean",
                         showlegend= True, 
                         marker=dict(color="red")),
    go.Scatter(x= df["date"],
                         y=rolling_std,
                         name= "rolling std",
                         showlegend= True, 
                         marker=dict(color="black"))])
fig.update_layout(
    title="Store CA_1 Total Sales",
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)
fig.show()

In [None]:
# first-difference the series to make it stationary (the leading NaN is filled with the first value)
df["lag-1_CA_1"]= df["CA_1"].diff().fillna(df["CA_1"])

# visualizing stationary data
rolling_mean= df["lag-1_CA_1"].rolling(window=28, center=False).mean()
rolling_std= df["lag-1_CA_1"].rolling(window=28, center=False).std() 

fig= go.Figure(data=
               [go.Scatter(x= df["date"],
                           y= df["lag-1_CA_1"],
                           name= "first difference", 
                           showlegend=True,
                           marker=dict(color="blue"))])
fig.add_traces([
    go.Scatter(x= df["date"],
                         y=rolling_mean,
                         name= "rolling mean",
                         showlegend= True, 
                         marker=dict(color="red")),
    go.Scatter(x= df["date"],
                         y=rolling_std,
                         name= "rolling std",
                         showlegend= True, 
                         marker=dict(color="black"))])
fig.update_layout(
    title="Store first difference CA_1 Total Sales",
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)
fig.show()

In [None]:
# adding new col to stores
stores.append("lag-1_CA_1")

In [None]:
stores

In [None]:
# check for stationarity (our new col is the only stationary col)
_, stationary= ad_fuller(df[stores])
stationary

Let's look at the ACF and PACF plots to determine the order for the model.

In [None]:
_, ax= plt.subplots(1, 2, figsize= (10,8))
plot_acf(df["lag-1_CA_1"], lags=10, ax=ax[0])
plot_pacf(df["lag-1_CA_1"], lags=10, ax=ax[1])
plt.show()

By mere inspection of the PACF you can determine how many AR terms you need to use to explain the autocorrelation pattern in a time series: if the partial autocorrelation is significant at lag k and not significant at any higher order lags--i.e., if the PACF "cuts off" at lag k--then this suggests that you should try fitting an autoregressive model of order k.

Source: [Identifying the orders of AR and MA terms in an ARIMA model](https://people.duke.edu/~rnau/411arim3.htm)
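
The "cut-off" behaviour can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not the method used in this notebook: it simulates an AR(2) process with made-up coefficients and estimates the partial autocorrelation at lag k as the last coefficient of an AR(k) model fitted by ordinary least squares. The helper name `pacf_ols` is hypothetical, and statsmodels' `plot_pacf` may use a different estimator by default.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """Partial autocorrelation at lags 1..max_lag: last coefficient of an OLS-fitted AR(k)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    out = []
    for k in range(1, max_lag + 1):
        # Columns are y[t-1], ..., y[t-k]; the regression target is y[t].
        X = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coef[-1])
    return np.array(out)

# Simulate an AR(2) process: y[t] = 0.5*y[t-1] + 0.3*y[t-2] + noise (coefficients are made up).
rng = np.random.default_rng(0)
n = 5000
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + eps[t]

pacf = pacf_ols(y, max_lag=6)
band = 1.96 / np.sqrt(n)  # rough 95% significance band
for lag, value in enumerate(pacf, start=1):
    print(f"lag {lag}: pacf = {value:.3f}, significant: {abs(value) > band}")
```

For this process the first two lags stand well outside the band while the higher lags stay close to zero, which is the pattern that would suggest trying an autoregressive model of order 2.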

In [None]:
# the series is already first-differenced, so d=0 here avoids differencing it a second time
model= ARIMA(df["lag-1_CA_1"], order=(8,0,0))
results= model.fit(disp=-1)

fig= go.Figure(data=
               [go.Scatter(x= df["date"],
                           y= df["lag-1_CA_1"],
                           name= "original", 
                           showlegend=True,
                           marker=dict(color="blue"))])
fig.add_trace(
    go.Scatter(x= df["date"],
               y=results.fittedvalues,
               name= "fitted values",
               showlegend= True, 
               marker=dict(color="red")))
fig.update_layout(
    title="Fitted values",
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)
fig.show()


In [None]:
# a closer look
_, ax= plt.subplots(figsize=(12,8))
results.plot_predict(1799, 1940, dynamic=False, ax=ax)
plt.show()

In [None]:
compare_df= pd.DataFrame({"actual": df["CA_1"],
                          "predictions": pd.Series(results.fittedvalues.cumsum(), copy=True),
                          "d": df["d"]}).set_index("d")
compare_df.loc["d_1", "predictions"]= 0

In [None]:
fig= go.Figure(data=
               [go.Scatter(x= compare_df.index[-90:],
                           y= compare_df.iloc[-90:, 0],
                           name= "actual", 
                           showlegend=True,
                           marker=dict(color="blue"))])
fig.add_traces([
                go.Scatter(x= compare_df.index[-90:],
                           y=compare_df.iloc[-90:, 1],
                           name= "predictions",
                           showlegend= True, 
                           marker=dict(color="red"))])
fig.update_layout(
    title="Actual vs Predicted; RMSE %.5f" % np.sqrt(sum((compare_df["actual"] - compare_df["predictions"])**2)/len(compare_df)),
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)
fig.show()