# Introduction

This notebook was made in collaboration with: [Adil El Hakouni](https://www.kaggle.com/adilelhakouni)

Hi everyone and welcome to our very first notebook ^^

First of all, I would like to highlight that this competition involves many time series, as the image below shows:


In [None]:
from IPython.display import Image
Image("/kaggle/input/image1/Annotation 2020-05-17 135651.png")

* There are many levels of aggregation, for a total of 42,840 time series.
* These series do not all behave the same way, as the first part of the notebook shows. I therefore chose to model the top-level series, which has the fewest zero values, with a **univariate** statistical model (one that does not include external variables), as the title indicates.
* This notebook is a general guide to multi-seasonal time series forecasting using a **SARIMAX** model.

# Some visualizations

The notebook doesn't contain a full exploratory analysis, but a few visualisations are essential to understand the nature of the data we are dealing with.

### Importing necessary libraries

In [None]:
import warnings
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools as itr
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

plt.style.use('ggplot')

### The data

In [None]:
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')

stv.head()

This is a sample of the historical sales data in the sales_train_validation dataset.

* For each product in each store and department, we are given the number of units sold for days d_1 to d_1913.
* Days d_1914 - d_1941 represent the validation rows, which we will predict in stage 1.
* Days d_1942 - d_1969 represent the evaluation rows, which we will predict for the final submission.

### Unit sales for a random product in a random store 

In [None]:
# select the daily sales columns (named d_1, d_2, ...)
d_cols = [c for c in stv.columns if c.startswith('d_')]

stv.loc[stv['id'] == 'FOODS_2_317_TX_3_validation'][d_cols].T \
    .plot(figsize=(20, 5),
          title='FOODS_2_317_TX_3 sales by "d" number')
plt.legend('')
plt.show()

The first thing to note is the large number of days on which 0 items were sold. Furthermore, this series doesn't show any clear trend or seasonality, so it is not the best choice for illustrating our approach.

Our approach is based on a simple univariate model that ignores the clear dependencies on daily prices, SNAP days and event days.

### Unit sales of all products, aggregated for all stores/states

In [None]:
all_data = stv[d_cols] \
    .sum(axis=0) \
    .T \
    .reset_index()

all_data.columns = ['d','sales']

print(all_data)

In [None]:
all_data.plot(figsize=(20, 5),
          title='unit sales of all products by "d" number')
plt.legend('')
plt.show()

This time series is better suited to the SARIMAX model we want to use.

However, the outliers clearly visible in the plot (such as Christmas Day, when the stores are closed) could harm our model.

calendar.csv provides data about the events that occurred during both the training period and the test period. Let's look at a sample:

In [None]:
cal.head()

Here are the distinct event names that appear in the calendar:

In [None]:
print(cal['event_name_1'].unique())
print(cal['event_name_2'].unique())

# Cleaning the data

We will treat only the event days as outliers. This is a strong assumption and far from an exhaustive selection of outliers, since SNAP days and low-price days may be outliers too. Each event day's value is replaced with the mean of the previous and the next day.

### Merging the stv dataset and the calendar dataset

In [None]:
# Merge the calendar onto the aggregated sales, one calendar row per day

all_data_merged = all_data.merge(cal, how='left', on='d', validate='1:1')
all_data_merged.head()

### Replacing the events' values

In [None]:
# y is the sales series we work with from now on (cast to float so averages can be stored)
y = all_data_merged.set_index('date')['sales'].astype(float)

# Flag the days that have either event_name_1 or event_name_2
is_event = (all_data_merged['event_name_1'].notna()
            | all_data_merged['event_name_2'].notna()).to_numpy()

# Replace each event day with the mean of its two neighbours.
# The first and last days are skipped because they lack one neighbour.
for i in range(1, len(y) - 1):
    if is_event[i]:
        y.iloc[i] = (y.iloc[i-1] + y.iloc[i+1]) / 2

y.plot(figsize=(20, 5),
          title='cleaned unit sales of all products by day')
plt.legend('')
plt.show()
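As a side note, the same idea can be written without an explicit loop. The sketch below is an alternative, not the method used above: it hides the flagged days and fills them by linear interpolation. The names `toy_sales` and `toy_is_event` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy daily sales with one obvious event-day outlier (hypothetical values)
toy_sales = pd.Series([10.0, 12.0, 100.0, 14.0, 16.0])
toy_is_event = np.array([False, False, True, False, False])

# Hide the event days, then fill the gaps by linear interpolation.
# For an isolated event day this equals the mean of its two neighbours.
toy_cleaned = toy_sales.mask(toy_is_event).interpolate(method='linear')

print(toy_cleaned.tolist())
```

One difference with the loop: when several event days follow each other, interpolation fills the whole run along a straight line between the nearest non-event days, whereas the loop reuses values it has already replaced.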

# Selecting seasonalities

We expect this final series to have multiple seasonalities. It is logical that sales follow a weekly pattern (people tend to buy more on weekends), a monthly pattern (depending on when most people receive their salary) and a yearly pattern (summer purchasing differs from winter).

Detecting these seasonalities is a very important step in building our model, so let's make sure that the periods we have mentioned really exist.

### Weekly seasonality

In [None]:
y.iloc[-30:].plot(figsize=(20, 5),
          title='cleaned unit sales of all products, last 30 days')
plt.legend('')
plt.show()

This is a month of data, and it shows a roughly sinusoidal pattern with about four cycles, which corresponds to the weekly seasonality.

### Monthly seasonality

In [None]:
y_month = y.reset_index()
y_month['date'] = pd.to_datetime(y_month['date'])
y_month = y_month.set_index('date')
y_month = y_month.resample('W').mean()

y_month.iloc[-49:,].plot(figsize=(20, 5),
          title='cleaned unit sales of all products by week')
plt.legend('')
plt.show()

Here again, sales appear to peak at the beginning of each month and then dip, which suggests a monthly seasonality.

### Yearly seasonality

In [None]:
y_year = y.reset_index()
y_year['date'] = pd.to_datetime(y_year['date'])
y_year = y_year.set_index('date')
y_year = y_year.resample('M').mean()

y_year.iloc[-40:,].plot(figsize=(20, 5),
          title='cleaned unit sales of all products by month')
plt.legend('')
plt.show()

Bingo! There is a clear increase in sales during summer compared to winter (disregarding the trend).
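As a complement to the plots above, a periodogram can hint at the dominant period numerically. The sketch below is not part of the original analysis: `dominant_period` is a hypothetical helper, and it is run on a synthetic series with a known 7-day cycle rather than on the sales data.

```python
import numpy as np

def dominant_period(values):
    """Return the period (in samples) with the largest spectral power."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                        # remove the constant level
    power = np.abs(np.fft.rfft(x)) ** 2     # raw periodogram
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # cycles per day
    k = np.argmax(power[1:]) + 1            # skip the zero frequency
    return 1.0 / freqs[k]

# Synthetic series: constant level plus a weekly sine wave
t = np.arange(700)
toy = 100 + 10 * np.sin(2 * np.pi * t / 7)
period = dominant_period(toy)
print(round(period, 3))
```

On real data the trend should be removed first, otherwise the lowest frequencies tend to dominate the spectrum.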

# Removing trend and seasonalities

Starting from an additive model hypothesis [1], the idea is to fit a linear regression on Fourier terms, one cosine and sine pair for each of the seasonalities identified above [2].

This lets us extract the trend and the seasonality components, leaving only the noise to be modeled with SARIMAX. See the formulas below:

$$y_{sales} = Constant + Trend + Seasonality + Noise \space \space \space [1]$$

$$y_{sales} = b_{0} + b_{1}t + b_{2}\cos(\frac{2\pi t}{365}) + b_{3}\sin(\frac{2\pi t}{365}) + b_{4}\cos(\frac{2\pi t}{30}) + b_{5}\sin(\frac{2\pi t}{30}) + b_{6}\cos(\frac{2\pi t}{7}) + b_{7}\sin(\frac{2\pi t}{7}) + \epsilon \space \space \space[2]$$


The performance of the linear regression model will be evaluated using the **coefficient of determination**, denoted *R²*. It tells us what share of the variation in y_sales is explained by the trend and the periodic functions we used as regressors.
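To make formula [2] and *R²* concrete, here is a minimal sketch on synthetic data. It uses NumPy least squares rather than scikit-learn, and `fourier_design`, `r_squared` and the toy series are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def fourier_design(t, periods):
    """Design matrix: intercept, linear trend, then a cos/sin pair per period."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t]
    for p in periods:
        cols.append(np.cos(2 * np.pi * t / p))
        cols.append(np.sin(2 * np.pi * t / p))
    return np.column_stack(cols)

def r_squared(y, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic series with a trend, a weekly and a yearly component, plus noise
rng = np.random.default_rng(0)
t = np.arange(730)
toy = (50 + 0.1 * t
       + 5 * np.sin(2 * np.pi * t / 7)
       + 3 * np.cos(2 * np.pi * t / 365)
       + rng.normal(0, 1, t.size))

X = fourier_design(t, periods=(365, 30, 7))
beta, *_ = np.linalg.lstsq(X, toy, rcond=None)  # ordinary least squares
r2 = r_squared(toy, X @ beta)
print('R² on the toy series:', round(r2, 3))
```

The columns of `X` follow the same order as the coefficients b0 to b7 in formula [2].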

In [None]:
y_sales = y.reset_index().drop(['date'],axis=1)

#Time scale
predic1 = range(1913)

#Applying the Fourier series to the time scale
predic_annual_cos = list(map(lambda x: math.cos(2*math.pi*x/365), predic1))
predic_annual_sin = list(map(lambda x: math.sin(2*math.pi*x/365), predic1))

predic_month_cos = list(map(lambda x: math.cos(2*math.pi*x/30), predic1))
predic_month_sin = list(map(lambda x: math.sin(2*math.pi*x/30), predic1))

predic_week_cos = list(map(lambda x: math.cos(2*math.pi*x/7), predic1))
predic_week_sin = list(map(lambda x: math.sin(2*math.pi*x/7), predic1))

#assembling the regressors
reg = pd.DataFrame(list(zip(predic1, predic_annual_cos, predic_annual_sin, predic_month_cos, predic_month_sin, predic_week_cos, predic_week_sin)), 
               columns =['predic1', 'predic_annual_cos', 'predic_annual_sin', 'predic_month_cos', 'predic_month_sin', 'predic_week_cos', 'predic_week_sin']) 

#Model
model = LinearRegression().fit(reg, y_sales)

#The estimated parameters
r2 = model.score(reg, y_sales)
print('coefficient of determination:', r2)

The closer *R²* is to 1, the better the regression fits the data. Here almost 70% of the variance is explained by our model, which suggests that the seasonalities we included are relevant to the series.

Let's extract the trend and the seasonalities estimated and visualize them:

In [None]:
trend = model.intercept_ + model.coef_[0][0]*np.array(predic1)
seas_annual = model.coef_[0][1]*np.array(predic_annual_cos) + model.coef_[0][2]*np.array(predic_annual_sin)
seas_month = model.coef_[0][3]*np.array(predic_month_cos) + model.coef_[0][4]*np.array(predic_month_sin)
seas_week = model.coef_[0][5]*np.array(predic_week_cos) + model.coef_[0][6]*np.array(predic_week_sin)

trend_seas = trend + seas_annual + seas_month + seas_week

ax = pd.DataFrame(trend_seas, columns=['trend+seasonalities']).plot(figsize=(20,8))
y_sales.plot(ax=ax,alpha=0.7)

Visually, this is not bad. Now that we have estimated the series' trend and seasonality, let's subtract them from the original series so that only the noise is left to model.

In [None]:
y_adjusted = np.array(list(y_sales['sales'])) - trend_seas
y_adjusted = pd.DataFrame(y_adjusted, columns=['noise'])
y_adjusted.plot(figsize=(20,8))

A SARIMAX model will then be used on this final dataset to predict the noise in the sales series, before adding the trend and seasonalities back to obtain the actual forecasts.

### Training and test datasets

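We will test our model 28 steps ahead, matching the forecast horizon of the submission instructions. Note that the regression above was fitted on all 1,913 days, including these last 28, so this test is somewhat optimistic.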

In [None]:
y_train = y_adjusted.iloc[:-28,]
y_test = y_adjusted.iloc[-28:,]

# SARIMAX model

The Seasonal Autoregressive Integrated Moving Average model, or SARIMA, is an approach for modeling univariate time series that may contain trend and seasonal components (the X in SARIMAX refers to optional exogenous regressors, which we do not use here). Its hyperparameters control the structure of the model:
* order: a tuple of p, d and q parameters for the non-seasonal (trend) part of the model
* seasonal order: a tuple of P, D, Q and s parameters for the seasonal part of the model

*p*, *d* and *q* are the autoregressive order, the number of differencing steps and the moving-average order, respectively. *P*, *D* and *Q* are their seasonal counterparts, and *s* is the period of the seasonality we want to include.
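To illustrate what the differencing order *d* does, here is a small sketch with toy arrays (not the sales data): one round of differencing turns a linear trend into a constant, and two rounds do the same for a quadratic trend.

```python
import numpy as np

t = np.arange(10)

# d = 1: first differences of a linear trend are constant
linear = 3 + 2 * t
first_diff = np.diff(linear, n=1)

# d = 2: second differences of a quadratic trend are constant
quadratic = t ** 2
second_diff = np.diff(quadratic, n=2)

print(first_diff)
print(second_diff)
```

Each round of differencing shortens the series by one observation.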

### Tuning the model

In our case, since we have already removed the trend and the seasonalities, the parameters P, D, Q and s are set to zero (I could have used an ARIMA model; I'm just used to the SARIMAX one).

To choose suitable values for the remaining hyperparameters, we run a grid search with the following code (the selection criterion is the [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion)):

In [None]:
# p and q take any value between 0 and 5, d between 0 and 2
p = q = range(0, 6)
d = [0, 1, 2]

# Generate all combinations of (p, d, q) triplets
pdq = list(itr.product(p, d, q))

# The seasonal part is switched off: trend and seasonalities were already removed
seasonal_pdq = [(0, 0, 0, 0)]

warnings.filterwarnings("ignore") # specify to ignore warning messages
minimum = float('inf') # lowest AIC found so far
param_ideal = None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y_train,
                                            order=param,
                                            seasonal_order=param_seasonal)

            results = mod.fit()
        except Exception:
            # some orders cannot be fitted; report and skip them
            print('ARIMA{} - fit failed'.format(param))
            continue

        if results.aic < minimum:
            minimum = results.aic
            param_ideal = param

        print('ARIMA{} - AIC:{}'.format(param, results.aic))

print('And the result is ARIMA{} - AIC:{}'.format(param_ideal, minimum))
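For intuition about the selection criterion: the AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood, so a more complex model is only preferred if it improves the fit enough. The numbers in this sketch are invented for illustration and do not come from the grid search.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: the larger model has a slightly better log-likelihood
small_model = aic(log_likelihood=-1000.0, n_params=3)
large_model = aic(log_likelihood=-999.5, n_params=9)

print(small_model, large_model)
```

Here the small model wins, because the tiny gain in likelihood does not offset the six extra parameters.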

### Evaluating the model

Now that we have set all the model's hyperparameters, let's evaluate its performance.

In [None]:
mod = sm.tsa.statespace.SARIMAX(y_train,
                                order=(3, 1, 5),
                                seasonal_order=(0, 0, 0, 0))
results = mod.fit()

print(results.summary().tables[1])

results.plot_diagnostics(figsize=(15, 12))
plt.show()

The standardized residuals are the scaled errors between the model's predictions and the observed values. Two criteria need to be checked for our SARIMAX model to be adequate:
* The residuals should follow a Gaussian distribution centered on zero (which is almost the case, judging by the histogram plus estimated density plot and the Q-Q plot of theoretical quantiles)
* The residuals should be uncorrelated; any remaining correlation would mean that there is still significant information left to use in computing forecasts (the correlogram confirms that this holds)
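The correlogram check can be reproduced by hand. The following sketch computes the sample autocorrelation of a series and compares it with the usual approximate 95% band of ±1.96/√n for white noise. `sample_acf` is a hypothetical helper, and the inputs are simulated, not the model's residuals.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# White noise: autocorrelations should mostly stay inside the band
rng = np.random.default_rng(42)
white = rng.normal(size=2000)
acf_white = sample_acf(white, max_lag=10)
bound = 1.96 / np.sqrt(white.size)
print('lags outside the band:', int(np.sum(np.abs(acf_white) > bound)))

# A weekly sine wave: strong autocorrelation at lag 7
periodic = np.sin(2 * np.pi * np.arange(700) / 7)
acf_periodic = sample_acf(periodic, max_lag=7)
print('lag-7 autocorrelation:', round(acf_periodic[-1], 2))
```

If residuals showed a spike like the one at lag 7 in the second series, it would indicate leftover weekly structure that the model has not captured.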

### Testing the model

In [None]:
pred = results.get_forecast(steps=28)

ax = y_test.plot(figsize=(20, 10))

# rename the series so the legend label does not depend on the name statsmodels gives it
pred.predicted_mean.rename('forecast').to_frame().plot(ax=ax)

ax.fill_between(range(1885,1913), pred.conf_int()['lower noise'], pred.conf_int()['upper noise'], color='b', alpha=.04)

plt.legend()
plt.show()

Let's now visualize the actual forecasts by adding back the trend and the seasonality we extracted.

In [None]:
ax = pd.DataFrame(np.array(list(y_test['noise']))+trend_seas[1885:],index = range(1885,1913), columns=['sales']).plot(figsize=(20, 10))

(pred.predicted_mean + trend_seas[1885:]).rename('forecast').to_frame().plot(ax=ax)

ax.fill_between(range(1885,1913), pred.conf_int()['lower noise']+trend_seas[1885:], pred.conf_int()['upper noise']+trend_seas[1885:], color='b', alpha=.04)

plt.legend()
plt.show()

The result is satisfying: our model follows the original series closely, although with some sizeable gaps at the peaks. This is also visible in the next plot, which covers both the training and the test periods:

In [None]:
pred_all = results.predict(start=0, end=1912)

ax = y_sales.plot(figsize=(20, 10))

(pred_all + trend_seas).rename('model').to_frame().plot(ax=ax)

plt.legend()
plt.show()
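The visual comparison can be complemented with a numeric error over the 28 test days. This is a sketch only: `rmse` and `mae` are hypothetical helpers, and the arrays are toy values rather than the notebook's results.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(actual, forecast):
    """Mean absolute error."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(err)))

# Toy values standing in for observed sales and forecasts
toy_actual = np.array([100.0, 110.0, 120.0, 130.0])
toy_forecast = np.array([102.0, 108.0, 123.0, 126.0])

toy_rmse = rmse(toy_actual, toy_forecast)
toy_mae = mae(toy_actual, toy_forecast)
print(round(toy_rmse, 3), toy_mae)
```

In this notebook, the same helpers could be applied to `y_test['noise']` and `pred.predicted_mean`, since adding the same trend and seasonality to both sides leaves the errors unchanged.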

# Conclusion

Given all the strong assumptions made throughout the notebook, the final forecasts look good and describe the series fairly accurately. Possible improvements to this method include:

* Detecting more seasonal periods and including more Fourier terms in the linear regression model
* Using the seasonal components of the SARIMAX model if any seasonality is left after the extraction phase
* Including the available external variables in order to improve accuracy
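As an illustration of the first idea, one common way to enrich the regression is to add higher harmonics for each period, which lets the seasonal shape depart from a pure sine wave. The helper `fourier_terms` below is a hypothetical sketch and was not used in this notebook.

```python
import numpy as np

def fourier_terms(t, period, n_harmonics):
    """Cos/sin pairs for harmonics 1..n_harmonics of the given period."""
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * t / period))
        cols.append(np.sin(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Three harmonics of the weekly cycle over the 1,913 training days
t = np.arange(1913)
weekly = fourier_terms(t, period=7, n_harmonics=3)
print(weekly.shape)
```

For daily data and a period of 7, three harmonics are the most that can be told apart, since higher ones coincide with lower ones at integer time steps.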

Please feel free to share your opinion about this notebook or point out any inconsistency in the approach.
Thank you for reading.

### Some external resources:

* [Linear Regression in Python](https://realpython.com/linear-regression-in-python/) by Mirko Stojiljkovic

* [A Gentle Introduction to SARIMA for Time Series Forecasting in Python](https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/) by Jason Brownlee