## Table of contents:

* Introduction
* Preparation
* Data overview
* Basic exploration
* Timeseries
* Simple TS models
    


![](https://i.imgur.com/vnJHx1k.png)

## Introduction

This is an EDA notebook for the **Tabular Playground Series - Jan 2022** competition. The main goal of the notebook is to walk through the basic steps needed to get some insight into the data and to build a simple predictive model using Python (and to ensure that I haven't forgotten Python).

This is a supervised machine learning problem which is evaluated on the [symmetric mean absolute percentage error](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error): 
$$
SMAPE=\frac{100\%}{n}\sum_{t=1}^{n}\frac{|F_t-A_t|}{\frac{|A_t|+|F_t|}{2}},
$$
where $A_t$ is the actual value and $F_t$ is the forecast value.
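
As a minimal sketch, the metric can be written in a few lines of NumPy. The function name `smape` and the handling of the case $A_t = F_t = 0$ (counted as zero error) are my own assumptions, not something taken from the competition code.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    numerator = np.abs(forecast - actual)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2
    # Assumption: a term with A_t = F_t = 0 contributes 0 instead of NaN.
    ratio = np.divide(numerator, denominator,
                      out=np.zeros_like(numerator), where=denominator != 0)
    return 100 * ratio.mean()

print(smape([100, 200], [110, 180]))  # about 10.03
```

Note that the metric is bounded: a single term can contribute at most 200%, which happens when exactly one of $A_t$ and $F_t$ is zero.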

## Preparation

In [None]:
!pip install pmdarima

In [None]:
# Basic packages
import numpy as np
import pandas as pd
import statsmodels as sm
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Statistical packages
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
import pmdarima as pm

In [None]:
# Settings
warnings.filterwarnings("ignore")
mp.rcParams['figure.dpi'] = 80
sns.set(rc={'figure.figsize':(10, 7)})
sns.set(style="ticks")
sns.set_palette("Paired")

In [None]:
!ls ../input/*

In [None]:
# Load data
path = '../input/tabular-playground-series-jan-2022/'
tr = pd.read_csv(path+'train.csv')
te = pd.read_csv(path+'test.csv')
sub = pd.read_csv(path+'sample_submission.csv')

## Data overview

In [None]:
print(f'Train set shape: {tr.shape}\nTest set shape: {te.shape}')

In [None]:
tr.info()

In [None]:
tr.sample(5, random_state=1)

In [None]:
tr.date = pd.to_datetime(tr.date, format='%Y-%m-%d')
te.date = pd.to_datetime(te.date, format='%Y-%m-%d')

In [None]:
cats = tr.select_dtypes(include='object').columns.tolist()
print(*cats)

We have 6 columns in total. Apart from the **row_id** identifier, there are:

* a **date** column
* 3 categorical columns (**country**, **store**, **product**)
* a numerical column **num_sold**, which is the target variable to predict

Let's look at the unique values of the categorical features:

In [None]:
tr[cats].apply(np.unique)

In [None]:
sub.head()

We have to ensure that the id columns in the test and submission files are identical:

In [None]:
all(te.row_id == sub.row_id)

In [None]:
tr.isna().sum()

In [None]:
te.isna().sum()

There are no NA cells in the data sets.

## Basic exploration

In [None]:
tr.groupby('country')['num_sold'].sum()

In [None]:
g = sns.boxplot(x='country', y='num_sold', hue='store', data=tr, 
                linewidth=1, flierprops = dict(markersize = 0.5))
g.set(ylabel=None);

In [None]:
tr.groupby('store')['num_sold'].sum()

In [None]:
g = sns.boxplot(x='store', y='num_sold', hue='product', data=tr, 
                linewidth=1, flierprops = dict(markersize = 0.5))
g.set(ylabel=None);

In [None]:
tr.groupby('product')['num_sold'].sum()

In [None]:
g = sns.boxplot(x='country', y='num_sold', hue='product', data=tr, 
                linewidth=1, flierprops = dict(markersize = 0.5))
g.set(ylabel=None);

From the data above we can conclude:

* the most popular product is Kaggle Hat
* KaggleRama has more sales than KaggleMart
* in general they sell more stuff in Norway than in any other country.
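
Raw totals are easier to compare when expressed as shares of the overall sales. Below is a small sketch on a toy frame; the numbers are made up for illustration and the helper name `sales_share` is hypothetical, only the column names mirror the training set.

```python
import pandas as pd

# Toy frame with made-up numbers; column names mirror the training set.
toy = pd.DataFrame({
    'product': ['Kaggle Hat', 'Kaggle Mug', 'Kaggle Hat'],
    'num_sold': [50, 30, 20],
})

def sales_share(df, by):
    """Fraction of total num_sold per category, largest first."""
    totals = df.groupby(by)['num_sold'].sum()
    return (totals / totals.sum()).sort_values(ascending=False)

shares = sales_share(toy, 'product')
print(shares)
```

On the real data the same helper could be called as `sales_share(tr, 'country')`, `sales_share(tr, 'store')` or `sales_share(tr, 'product')`.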

## Timeseries

In [None]:
tr['day'] = tr.date.dt.day
# Series.dt.week was removed in pandas 2.0; isocalendar() provides the ISO week
tr['week'] = tr.date.dt.isocalendar().week
tr['month'] = tr.date.dt.month
tr['year'] = tr.date.dt.year
tr['year_month'] = tr.date.dt.strftime('%Y-%m')

In [None]:
g = sns.lineplot(x='year_month', y='num_sold', data=tr)
g.set(xlabel=None, ylabel=None);
g.xaxis.set_major_locator(mp.ticker.MultipleLocator(5));
plt.xticks(rotation=30);

In [None]:
g = sns.lineplot(x='year_month', y='num_sold', hue='country', data=tr)
g.set(xlabel=None, ylabel=None);
g.xaxis.set_major_locator(mp.ticker.MultipleLocator(5));
plt.xticks(rotation=30);

In [None]:
g = sns.lineplot(x='year_month', y='num_sold', hue='store', data=tr)
g.set(xlabel=None, ylabel=None);
g.xaxis.set_major_locator(mp.ticker.MultipleLocator(5));
plt.xticks(rotation=30);

In [None]:
g = sns.lineplot(x='year_month', y='num_sold', hue='product', data=tr)
g.set(xlabel=None, ylabel=None);
g.xaxis.set_major_locator(mp.ticker.MultipleLocator(5));
plt.xticks(rotation=30);

These are really nice time series. They have everything - trends, seasonality etc. Let's decompose them.

In [None]:
g = sns.relplot(x='month', y='num_sold', data=tr, 
                col='year', hue='country', style='store',
                kind='line', linewidth=2, zorder=5,
                col_wrap=2, height=5, aspect=1, legend=True)

Every year the series show the same seasonal pattern for each country and store, with some decline in autumn. The shapes of all the series look identical up to some constant. We need to carry out some statistical tests for stationarity. There is no point in testing all the data at once: we can partition the data set by country, store, product or a combination of them, test one series, and apply the same conclusions to the others.
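
One way to check the "identical up to some constant" impression is to rescale every series by its own mean and compare the results. The sketch below does this on toy data, assuming the constant is a multiplicative factor; the values are made up and only the column names mirror the training set.

```python
import numpy as np
import pandas as pd

# Toy data: two series with the same monthly shape, one scaled by a constant.
shape = np.array([1.0, 1.2, 0.9, 1.5])
toy = pd.DataFrame({
    'store': ['KaggleRama'] * 4 + ['KaggleMart'] * 4,
    'month': [1, 2, 3, 4] * 2,
    'num_sold': np.concatenate([300 * shape, 100 * shape]),
})

# Divide every observation by the mean of its own series.
group_mean = toy.groupby('store')['num_sold'].transform('mean')
toy['scaled'] = toy['num_sold'] / group_mean

# One column per series; identical columns mean identical shapes.
wide = toy.pivot(index='month', columns='store', values='scaled')
print(wide)
```

If the rescaled columns nearly coincide on the real data, conclusions drawn from one series are more likely to carry over to the rest.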

## Simple TS models

In [None]:
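def check_stationarity(series):
    """Run the augmented Dickey-Fuller test and print a verdict at the 5% level."""
    # adfuller returns (statistic, p-value, used lags, n obs, critical values, ...)
    result = adfuller(series.values)

    print(f'ADF Statistic: {result[0]:.5f}')
    print(f'p-value: {result[1]:.5f}')
    print('Critical Values:')

    for key, value in result[4].items():
        print(f'\t{key}: {value:.3f}')

    # Reject the unit-root null when the p-value is small and the statistic
    # lies below the 5% critical value.
    if result[1] <= 0.05 and result[0] < result[4]['5%']:
        print("Stationary")
    else:
        print("Non-stationary")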

Let's explore a selected time series:

In [None]:
ts = tr.query('country=="Norway" & store=="KaggleRama" & product=="Kaggle Mug"')[['date', 'num_sold']]
ts.set_index('date', inplace=True, drop=True)
check_stationarity(ts)
ts.plot();

In [None]:
pm.plot_acf(ts, alpha=0.05, lags=50)
pm.plot_pacf(ts, alpha=0.05, lags=50)

What we have here is a non-stationary time series. It looks like an AR process, possibly with a seasonal component. We might try something like a SARIMA model.

Below we use the **auto_arima()** function, which returns the best ARIMA model according to its AIC score. Under the hood it runs unit root tests, minimises the AICc and fits the models by MLE. It searches over the possible models within the constraints provided. Among these are (P, D, Q), which bound the seasonal orders, and (p, d, q), which bound the search space for the non-seasonal ARIMA part.
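
For reference, the standard textbook definitions of the two criteria mentioned above are (these are the general formulas, not something specific to this notebook):
$$
AIC = 2k - 2\ln\hat{L}, \qquad AICc = AIC + \frac{2k(k+1)}{n-k-1},
$$
where $k$ is the number of estimated parameters, $\hat{L}$ is the maximised likelihood and $n$ is the number of observations. Lower values are better, and the correction term in AICc vanishes as $n$ grows.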

In [None]:
m_aarima = pm.auto_arima(ts, 
                        start_p=0, start_q=0,
                        max_p=2, max_q=2, max_d=1, 
                        seasonal=True, m=14,
                        start_P=0, start_Q=0, 
                        max_P=2, max_Q=1, max_D=1, 
                        trace=True,
                        error_action='ignore',  
                        suppress_warnings=True, 
                        stepwise=True)
print(m_aarima.aic())

In [None]:
ts_te = te.query('country=="Norway" & store=="KaggleRama" & product=="Kaggle Mug"')['date']
h = (ts_te.max() - ts_te.min()).days+1
pred, ci = m_aarima.predict(h, return_conf_int=True, alpha=0.05)

So, let's plot our predictions:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.plot(ts, color='blue', label='Training Data')
ax.plot(ts_te, pred, color='green', label='Predicted Values')
ax.set_xlabel('Dates')
ax.set_ylabel('Number sold')
conf_int = np.asarray(ci)
ax.fill_between(ts_te, conf_int[:, 0], conf_int[:, 1], alpha=0.9, color='orange', label="Confidence Intervals")
ax.legend(loc='upper left');

Pretty useless predictions, aren't they? The model successfully captured the properties of the last spike and based its predictions on it. This demonstrates that we should double-check the predictions of any model.

The key parameter of the **auto_arima** model is **m**, the period for seasonal differencing. You might have already noticed those spikes at the beginning of each year. This means that for this data set the period for seasonal differencing should be around 365. Unfortunately, we don't have enough computational resources here to run **auto_arima(m=365)**. That's why I'm going to choose the model manually by AIC score. Even so, I have to clip the time series in order to fit the model in memory.
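
To see why the choice of **m** matters, here is a small sketch of seasonal differencing on a synthetic series. The period of 7 and the pattern values are made up for illustration. Differencing at the true period removes the repeating pattern, while differencing at a wrong lag leaves it in place.

```python
import numpy as np

m = 7                                    # hypothetical seasonal period
t = np.arange(4 * m)
pattern = np.array([5, 3, 0, -2, -4, -3, 1])
series = 0.5 * t + np.tile(pattern, 4)   # linear trend + repeating pattern

# Seasonal differencing at the true period: y_t - y_{t-m}
seasonal_diff = series[m:] - series[:-m]

# Differencing at a wrong lag does not cancel the pattern
wrong_diff = series[3:] - series[:-3]

print(seasonal_diff)
print(wrong_diff)
```

With the correct lag only the constant trend increment remains; with the wrong lag the output still oscillates, which is roughly the situation the first model was in.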

In [None]:
ts = tr.query('country=="Norway" & store=="KaggleRama" & product=="Kaggle Mug" & date>"2017-12-01"')[['date', 'num_sold']]
ts.set_index('date', inplace=True, drop=True)
check_stationarity(ts)
ts.plot();

In [None]:
m_arima = pm.arima.ARIMA(order=(3, 1, 0), seasonal_order=(1, 0, 0, 365))
m_arima.fit(ts)
print(m_arima.aic())  

In [None]:
pred, ci = m_arima.predict(h, return_conf_int=True, alpha=0.05)

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.plot(ts, color='blue', label='Training Data')
ax.plot(ts_te, pred, color='green', label='Predicted Values')
ax.set_xlabel('Dates')
ax.set_ylabel('Number sold')
conf_int = np.asarray(ci)
ax.fill_between(ts_te, conf_int[:, 0], conf_int[:, 1], alpha=0.9, color='orange', label="Confidence Intervals")
ax.legend(loc='upper left');

These predictions look much better. But what about the other countries, stores and products? With this approach we would have to build another 17 models, which is not an appealing prospect.
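
The repetition itself can at least be wrapped in a loop over the groups. The sketch below shows the idea on toy data. Note that it swaps in a seasonal-naive placeholder forecaster rather than the ARIMA model above, and that `seasonal_naive` and `forecast_all` are hypothetical helper names.

```python
import pandas as pd

def seasonal_naive(history, horizon, period):
    """Placeholder forecaster: repeat the last `period` observations."""
    last_cycle = history.iloc[-period:].to_numpy()
    return [last_cycle[i % period] for i in range(horizon)]

def forecast_all(df, keys, horizon, period):
    """Produce one forecast per combination of the columns in `keys`."""
    forecasts = {}
    for combo, group in df.sort_values('date').groupby(keys):
        forecasts[combo] = seasonal_naive(group['num_sold'], horizon, period)
    return forecasts

# Toy frame with made-up values; column names mirror the training set.
toy = pd.DataFrame({
    'date': list(pd.date_range('2018-01-01', periods=4)) * 2,
    'store': ['KaggleMart'] * 4 + ['KaggleRama'] * 4,
    'product': ['Kaggle Mug'] * 8,
    'num_sold': [1, 2, 1, 2, 10, 20, 10, 20],
})

result = forecast_all(toy, ['store', 'product'], horizon=3, period=2)
print(result)
```

Replacing `seasonal_naive` with a function that fits a model per group would follow the same structure, at the cost of one fit per series.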