<a id = "contents"></a>
# Table of Contents


[Intro](#intro)

[Quick look at the data](#quick_look)

-->[sales_train](#quick_look_sales_train)

-->[items](#quick_look_items)

-->[item_categories](#quick_look_item_categories)

-->[shops](#quick_look_shops)

-->[test](#quick_look_test)

-->[sample_submission](#quick_look_sample_submission)

[Dealing with the time variable](#time_variable)

[EDA](#EDA)

-->[Total sales per month](#sales_per_month)

-->[Total sales per week](#sales_per_week)

-->[Total sales per month and shop](#sales_per_month_shop)

-->[Total sales per month and category](#sales_per_month_category)

[Time Series Analysis](#ts_analysis)

-->[Autocorrelation and partial autocorrelation (ACF and PACF)](#autocorrelation)

-->[Decomposing the TS](#decomposing)

-->[Stationarity Test | Dickey-Fuller](#stationarity)

-->[Remove Non-Stationarity effects](#remove_stationarity)


[Brief housekeeping](#housekeeping)

-->[Useful functions](#useful)

-->[Validation dataframe](#validationdf)

[Possible approaches](#approaches)

-->[Hierarchical Time Series](#hierarchical)


[Approach 1: ARIMA](#approach1)

-->[ARIMA - Quick look](#arima_quicklook)

-->[ARIMA - Modelling](#arima_modelling)

-->[ARIMA - Forecasting](#arima_prediction)


[Approach 2: Prophet](#approach2)

-->[Prophet - Data preprocessing](#prophet_prepro)

-->[Prophet - Quick look](#prophet_quicklook)

-->[Prophet - Modelling](#prophet_model)

-->[Prophet - Forecasting](#prophet_prediction)


[Approach 3: XGBOOST](#approach3)

-->[Convert to supervised learning](#xgboost)

-->[XGBOOST - Data preprocessing](#xgboost_prepro)

-->[XGBOOST - Modelling](#xgboost_model)

-->[XGBOOST - Forecasting](#xgboost_prediction)


[Results and conclusions](#conclusions)

<a id = "intro"></a>
# Introduction
[Back to Table of Contents](#contents)

The purpose of this notebook is to illustrate a way to approach a time-series forecasting problem. We will analyze the available time-series data and discuss a few algorithms that will do the job. The data we will be using is from the <b>["Predict Future Sales"](https://www.kaggle.com/c/competitive-data-science-predict-future-sales)</b> Kaggle competition. We are given historical sales data for a number of shops and products of a large Russian software company, and the task is to forecast the total amount of products sold in every shop for the test set.

<a id = "quick_look"></a>
# Quick look at the data
[Back to Table of Contents](#contents)

Let's import all the necessary libraries and read the csv files:

In [1]:
from platform import python_version
print(python_version())

In [1]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt 
import seaborn as sns 

from math import sqrt
from sklearn.metrics import mean_squared_error
from matplotlib.pylab import rcParams


# TIME SERIES
from statsmodels.tsa.seasonal import seasonal_decompose
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
from pandas import Timestamp
from datetime import datetime

#MODELLING
import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from fbprophet import Prophet
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import LabelEncoder
# settings
import os
import gc
import warnings
warnings.filterwarnings("ignore")

<b>Data description in the kaggle competition:</b>

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

<b>File descriptions</b>

    sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
    test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
    sample_submission.csv - a sample submission file in the correct format.
    items.csv - supplemental information about the items/products.
    item_categories.csv  - supplemental information about the items categories.
    shops.csv- supplemental information about the shops.

<b>Data fields</b>

    ID - an Id that represents a (Shop, Item) tuple within the test set
    shop_id - unique identifier of a shop
    item_id - unique identifier of a product
    item_category_id - unique identifier of item category
    item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
    item_price - current price of an item
    date - date in format dd/mm/yyyy
    date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
    item_name - name of item
    shop_name - name of shop
    item_category_name - name of item category


Let's import the data:

In [1]:
folder='/kaggle/input/competitive-data-science-predict-future-sales/'

df_cats=pd.read_csv(folder+'item_categories.csv')
df_items=pd.read_csv(folder+'items.csv')
df_sales=pd.read_csv(folder+'sales_train.csv')
df_shops=pd.read_csv(folder+'shops.csv')
df_test=pd.read_csv(folder+'test.csv')

#pickles
folder_pickles='.../competitive-data-science-predict-future-sales/pickles/'

#Submission
df_sub=pd.read_csv(folder+'sample_submission.csv')
sub_folder='.../competitive-data-science-predict-future-sales/submissions/'

Let's define the following function to get a quick summary of a dataframe:

In [1]:
def overview(df):
    print('SHAPE:\n',df.shape)
    print('COLUMN NAMES:\n', df.columns.tolist())
    print('UNIQUE VALUES PER COLUMN:\n', df.nunique())
    print('COLUMNS WITH MISSING DATA:\n',df.isnull().sum())
    print('SAMPLE:\n',df.sample(10))
    print('INFO:')
    df.info()  # info() prints its report itself and returns None

<a id = "quick_look_sales_train"></a>
### sales_train
[Back to Table of Contents](#contents)

In [1]:
df_sales.head()

In [1]:
overview(df_sales)

In [1]:
sorted(list(df_sales["item_cnt_day"].unique()))[:20]

In [1]:
sorted(list(df_sales["item_price"].unique()))[:20]

In [1]:
df_sales[df_sales["item_price"]<0]

In [1]:
mean = df_sales[(df_sales["shop_id"] == 32) & (df_sales["item_id"] == 2973) & (df_sales["date_block_num"] == 4) & (df_sales["item_price"] > 0)]["item_price"].mean()
df_sales.loc[df_sales.item_price < 0, 'item_price'] = mean

In [1]:
# =============================================================================
# CLEANING DF_SALES
# =============================================================================

# Якутск Орджоникидзе, 56
df_sales.loc[df_sales.shop_id == 0, 'shop_id'] = 57
df_test.loc[df_test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
df_sales.loc[df_sales.shop_id == 1, 'shop_id'] = 58
df_test.loc[df_test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
df_sales.loc[df_sales.shop_id == 10, 'shop_id'] = 11
df_test.loc[df_test.shop_id == 10, 'shop_id'] = 11

#create attribute revenue
df_sales['revenue']=df_sales['item_price']*df_sales['item_cnt_day']


Let's look for outliers:

In [1]:
plt.figure(figsize = (10,4))
sns.boxplot(x = df_sales["item_cnt_day"])

In [1]:
df_sales[df_sales['item_cnt_day']>800]

In [1]:
plt.figure(figsize = (10,4))
sns.boxplot(x = df_sales["item_price"])

In [1]:
df_sales[df_sales['item_price']>100000]

In [1]:
df_sales = df_sales[(df_sales.item_price < 300000 )& (df_sales.item_cnt_day < 1000)]

<a id = "quick_look_items"></a>
### items
[Back to Table of Contents](#contents)

In [1]:
df_items.head()

In [1]:
overview(df_items)

In [1]:
pd.options.display.max_rows =100

gb_item_cat=df_items.groupby('item_category_id').agg({'item_id':'count'})
gb_item_cat.sort_values('item_id',ascending=False,inplace=True)
gb_item_cat

<a id = "quick_look_item_categories"></a>
### item_categories
[Back to Table of Contents](#contents)

In [1]:
df_cats.head()

In [1]:
overview(df_cats)

In [1]:
df_cats

The category names seem to be composed of a type and a subtype separated by a "-". We split them into two new features and label-encode each one:

In [1]:
# =============================================================================
# CLEANING DF_CATS
# =============================================================================

df_cats['cat_type'] = df_cats['item_category_name'].str.split('-').map(lambda x: x[0])
df_cats['cat_subtype'] = df_cats['item_category_name'].str.split('-').map(lambda x: x[1] if len(x)>1 else x[0])
df_cats['cat_type_id']=LabelEncoder().fit_transform(df_cats['cat_type'])
df_cats['cat_subtype_id']=LabelEncoder().fit_transform(df_cats['cat_subtype'])

df_cats=df_cats[['item_category_id','cat_type_id','cat_subtype_id']]
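
To make the split easier to follow, here is a minimal sketch on a toy frame. The category names are made up for illustration, and `pd.factorize` is swapped in for `LabelEncoder` only to keep the example dependency-free (it numbers labels by first appearance rather than alphabetically).

```python
import pandas as pd

# Hypothetical category names in the "type - subtype" layout described above.
cats = pd.DataFrame({"item_category_name": ["Books - Comics", "Books - Audiobooks", "Tickets"]})

parts = cats["item_category_name"].str.split("-")
cats["cat_type"] = parts.map(lambda x: x[0].strip())
# Names without a "-" reuse the type as their subtype.
cats["cat_subtype"] = parts.map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())

# Integer codes in order of first appearance (stand-in for LabelEncoder).
cats["cat_type_id"] = pd.factorize(cats["cat_type"])[0]
print(cats)
```

Stripping the whitespace around the "-" is an addition of this sketch; without it, "Books " and "Books" would be encoded as different types.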

<a id = "quick_look_shops"></a>
### shops
[Back to Table of Contents](#contents)

In [1]:
df_shops.head()

In [1]:
overview(df_shops)

In [1]:
df_shops.sort_values('shop_name',ascending=False)

The name of the city seems to be the first word of the shop name. We see some typos like "!Якутск" instead of "Якутск" and some shop names that seem to be duplicates: "Жуковский ул. Чкалова 39м²" and "Жуковский ул. Чкалова 39м?". The duplicated shop ids were already merged in the df_sales cleaning cell above; the city typo is fixed below.

In [1]:
# =============================================================================
# CLEANING DF_SHOP
# =============================================================================

df_shops['city'] = df_shops['shop_name'].str.split(' ').map(lambda x: x[0])
df_shops.loc[df_shops.city=='!Якутск','city']='Якутск'
df_shops['city_id']=LabelEncoder().fit_transform(df_shops['city'])


df_shops['cat_tienda'] = df_shops['shop_name'].str.split(' ').map(lambda x: x[1])
category = []
for cat in df_shops.cat_tienda.unique():
    if len(df_shops[df_shops.cat_tienda == cat]) >= 5:
        category.append(cat)
df_shops.cat_tienda = df_shops.cat_tienda.apply( lambda x: x if (x in category) else "other" )

df_shops['shop_cat']=LabelEncoder().fit_transform(df_shops['cat_tienda'])

df_shops=df_shops[['city','shop_id','city_id','shop_cat']]
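
The two steps of the cell above (city extraction and grouping rare shop types under "other") can be sketched in a vectorised way. The three shop names are the ones quoted in this notebook; the `types` series and the threshold of 2 are made up for the example.

```python
import pandas as pd

shops = pd.DataFrame({"shop_name": [
    "!Якутск Орджоникидзе, 56",
    'Якутск ТЦ "Центральный"',
    "Жуковский ул. Чкалова 39м²",
]})

# City = first word of the name, with the stray "!" removed.
shops["city"] = shops["shop_name"].str.split(" ").str[0].str.lstrip("!")

# Hypothetical shop types: keep the frequent ones, bucket the rest as "other".
types = pd.Series(["ТЦ", "ТЦ", "ТРЦ", "ТЦ", "ТК"])
counts = types.value_counts()
frequent = counts[counts >= 2].index
bucketed = types.where(types.isin(frequent), "other")

print(shops)
print(bucketed.tolist())
```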

<a id = "quick_look_test"></a>
### test
[Back to Table of Contents](#contents)

In [1]:
df_test.head()

In [1]:
overview(df_test)

<a id = "quick_look_sample_submission"></a>
### sample_submission
[Back to Table of Contents](#contents)

In [1]:
df_sub.head()

In [1]:
overview(df_sub)

In [1]:
df_sub.drop('item_cnt_month',axis=1,inplace=True)  

<a id = "time_variable"></a>
# Dealing with the time variable
[Back to Table of Contents](#contents)

We have the date_block_num variable, which gives us a monthly segmentation of the ts. We will reformat the date column anyway in order to be able to resample the data daily, weekly or yearly if necessary:

In [1]:
df_sales["date"] = pd.to_datetime(df_sales["date"], format = "%d.%m.%Y")
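# display only: set_index returns a new frame, so 'date' stays a regular column for the cells below
df_sales.set_index('date')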

In [1]:
df_sales.info()
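
As a quick sanity check of the date handling, the sketch below parses a few made-up rows in the same "dd.mm.yyyy" layout, rebuilds date_block_num from the parsed date and aggregates per calendar month. Grouping by `to_period("M")` is just one possible way to get monthly totals.

```python
import pandas as pd

# Toy rows in the same date layout as sales_train.csv (values are made up).
toy = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013", "30.10.2015"],
    "item_cnt_day": [1, 2, 3, 4],
})
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# date_block_num counts months since January 2013 (January 2013 is 0).
toy["date_block_num"] = (toy["date"].dt.year - 2013) * 12 + toy["date"].dt.month - 1

# Monthly totals, one row per calendar month that has sales.
monthly = toy.groupby(toy["date"].dt.to_period("M"))["item_cnt_day"].sum()
print(toy)
print(monthly)
```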

<a id = "EDA"></a>
# EDA
[Back to Table of Contents](#contents)

Let's join sales, shops, items and categories into a single dataframe:

In [1]:
full_df=pd.merge(df_sales,df_shops,on=['shop_id'],how='left')
full_df=pd.merge(full_df,df_items,on=['item_id'],how='left')
full_df=pd.merge(full_df,df_cats,on=['item_category_id'],how='left')
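
A left join like the ones above should never change the number of sales rows. One way to guard against accidental duplication is the `validate` argument of `pd.merge`; the sketch below uses small made-up frames.

```python
import pandas as pd

sales = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10], "item_cnt_day": [1, 2, 3]})
items = pd.DataFrame({"item_id": [10, 11], "item_category_id": [40, 55]})

# validate="many_to_one" raises a MergeError if item_id is duplicated in `items`,
# which would otherwise silently multiply the sales rows.
joined = pd.merge(sales, items, on="item_id", how="left", validate="many_to_one")
print(joined)
```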

<a id = "sales_per_month"></a>
### Total sales per month
[Back to Table of Contents](#contents)

In [1]:
ts_M = full_df[["date", "item_cnt_day"]].set_index("date").resample("M").sum()

plt.figure(figsize = (10, 6))
plt.plot(ts_M, color = "blue", label = "Monthly sales",marker='.')
plt.title("Monthly sales")
plt.legend();

We can already see that the total sales decrease over time and that there is an obvious yearly seasonal pattern with high spikes around Christmas time.

<a id = "sales_per_week"></a>
### Total sales per week
[Back to Table of Contents](#contents)

In [1]:
ts_W = full_df[["date", "item_cnt_day"]].set_index("date").resample("W").sum()

plt.figure(figsize = (10, 6))
plt.plot(ts_W, color = "blue", label = "Weekly sales")
plt.title("Weekly sales")
plt.legend();

<a id = "sales_per_month_shop"></a>
### Total sales per month and shop
[Back to Table of Contents](#contents)

In [1]:
fig,ax = plt.subplots(nrows=5, ncols=2,sharex=True, sharey=True,figsize=(15,15))
fig.suptitle('Sales per Shop', fontsize=30)
ts_shop = full_df.groupby(['shop_id','date_block_num'])['item_cnt_day'].sum().reset_index()
shop_count=0
for i in range(5):
    for j in range(2):
        for z in range(6):
            ax[i,j].plot(ts_shop[ts_shop['shop_id']==shop_count]['date_block_num'],ts_shop[ts_shop['shop_id']==shop_count]['item_cnt_day'],alpha=.5,label='shop '+str(shop_count))
            shop_count += 1
            ax[i,j].legend(loc='best')

for axis in ax.flat:
    axis.set(xlabel='months', ylabel='Sales')
    axis.label_outer()

<a id = "sales_per_month_category"></a>
### Total sales per month and category
[Back to Table of Contents](#contents)

In [1]:
items_x_cat=full_df.groupby('item_category_id').agg({'item_cnt_day':'sum'})
items_x_cat.reset_index(inplace=True)

In [1]:
items_x_cat=items_x_cat.sort_values('item_cnt_day',ascending=False)
items_x_cat_top=items_x_cat[0:15].copy()  # copy to avoid pandas' SettingWithCopyWarning
items_x_cat_top['item_category_id']=items_x_cat_top['item_category_id'].astype(object)
items_x_cat_top=items_x_cat_top.reset_index(drop=True)

In [1]:
barplot=sns.barplot(y='item_cnt_day',x='item_category_id',palette='GnBu_d',data=items_x_cat_top,order=items_x_cat_top.sort_values('item_cnt_day',ascending=False).item_category_id)
barplot.set(xlabel="item category", ylabel = "Sales")

In [1]:
fig,ax = plt.subplots(nrows=7, ncols=2,sharex=True, sharey=True,figsize=(20,20))
fig.suptitle('Sales per product category', fontsize=30)
ts_cat = full_df.groupby(['item_category_id','date_block_num'])['item_cnt_day'].sum().reset_index()
cat_count=0
for i in range(7):
    for j in range(2):
        for z in range(6):
            ax[i,j].plot(ts_cat[ts_cat['item_category_id']==cat_count]['date_block_num'],ts_cat[ts_cat['item_category_id']==cat_count]['item_cnt_day'],alpha=.5,label='cat '+str(cat_count))
            cat_count += 1
            ax[i,j].legend(loc='best')

for axis in ax.flat:
    axis.set(xlabel='months', ylabel='Sales')
    axis.label_outer()

<a id = "ts_analysis"></a>
# Time Series Analysis
[Back to Table of Contents](#contents)

In this section we are going to take a look at the main properties of the time series. We are going to check the stationarity of the series, decompose it into its essential components and check how strong the relationship is between an observation and the observations at prior time steps, called lags.

Some of the algorithms that we will be using require the time series to be stationary, so we will have to transform it beforehand.

<a id = "autocorrelation"></a>
### Autocorrelation and partial autocorrelation (ACF and PACF)
[Back to Table of Contents](#contents)

This technique consists of finding how correlated a time series is with itself at prior time steps. We compare the observations at time t with the observations at t-1, t-2, etc. In this particular problem, it will allow us to know how correlated the number of sales in a month is to the number of sales the previous month, two months before, and so on. There are two effects to take into account: the direct effect and the indirect effect.

Let's imagine we want to check the correlation between the observations at time t and at time t-2. The direct effect is the influence of t-2 on t itself. The indirect effect is the part that travels through the intermediate step: t-2 is correlated with t-1, and t-1 is in turn correlated with t.

<b>Autocorrelation:</b> Takes into account direct and indirect effects.

<b>Partial autocorrelation:</b> Takes into account only the direct effects.
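
A small simulation can make the difference concrete. In the sketch below (synthetic data, not the sales series) each value depends only on the previous one, so lag 2 has no direct effect. The autocorrelation at lag 2 is still clearly positive because of the indirect path through lag 1, while the partial autocorrelation at lag 2 is close to zero. The helper `autocorr` is written for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
phi, n = 0.8, 5000

# AR(1) process: x[t] depends directly on x[t-1] only.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

def autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    centered = series - series.mean()
    return np.dot(centered[lag:], centered[:-lag]) / np.dot(centered, centered)

r1, r2 = autocorr(x, 1), autocorr(x, 2)

# Partial autocorrelation at lag 2: what is left after removing the lag-1 path.
pacf2 = (r2 - r1 ** 2) / (1 - r1 ** 2)

print(f"ACF lag 1: {r1:.3f}  ACF lag 2: {r2:.3f}  PACF lag 2: {pacf2:.3f}")
```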

In [1]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize = (20,6), dpi = 80)
plot_acf(ts_M, ax = ax1, lags = 20)
plot_pacf(ts_M, ax = ax2, lags = 20);

The blue area represents the confidence interval, set to 95% by default. Lags with values outside of this area are likely to be significantly correlated. We can see that there is a positive correlation with the first 6 lags and that <b>it is significant for the second and third bars of the plot (lags 1 and 2, since the first bar is lag 0 and always equals 1), which means that the sales from the previous two months have a significant correlation with the sales of the present month.</b>

<a id = "decomposing"></a>
### Decomposing the TS
[Back to Table of Contents](#contents)

Here we are going to decompose the ts into its essential components: the general trend, the seasonal pattern and the residuals.

In [1]:
res=seasonal_decompose(ts_M.values,freq=12,model='additive')
fig=res.plot()

The trend shows us that sales are going down. The seasonal graph removes the trend from the ts and shows us high seasonal spikes around Christmas time. The residuals show us what is left of the ts when you remove the trend and the seasonality effects, so hopefully the residuals are small, since we don't want other effects, aside from trend and seasonality, to explain the ts.
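
To see what an additive decomposition does under the hood, here is a hand-rolled sketch on a synthetic monthly series. It is a simplification written for illustration and is not the statsmodels implementation: the trend is a centred moving average over one year, the seasonal part is the average detrended value per calendar position, and the residual is whatever remains.

```python
import numpy as np

period = 12
t = np.arange(48)
rng = np.random.default_rng(0)
# Synthetic series: downward trend + yearly pattern + small noise.
y = 100 - 0.5 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.3, t.size)

# Trend: centred 2x12 moving average (half weight on the two end points).
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend = np.full(t.size, np.nan)
trend[half:-half] = np.convolve(y, weights, mode="valid")

# Seasonal: mean detrended value per position in the year, centred to sum to zero.
detrended = y - trend
seasonal_means = np.array([np.nanmean(detrended[m::period]) for m in range(period)])
seasonal_means -= seasonal_means.mean()
seasonal = np.tile(seasonal_means, t.size // period)

# Residual: what neither trend nor seasonality explains.
residual = y - trend - seasonal
valid = ~np.isnan(trend)
print("largest residual:", np.abs(residual[valid]).max())
```

The first and last six months have no trend estimate, which is why the decomposition plots usually start and end later than the raw series.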

<a id = "stationarity"></a>
### Stationarity Test | Dickey-Fuller
[Back to Table of Contents](#contents)

The conditions we need to look at to determine if a ts is stationary are:

 <b>-The mean is constant</b>
 
 <b>-The standard deviation is constant</b>
 
 <b>-There is no seasonality</b>

Some forecasting machine learning methods and statistical modeling methods require the ts to be stationary in order to be able to use them, that is without the effects of a trend, seasonality, volatility and other time-dependent structures. We have already seen that the ts we are working with here is not stationary. It has a downward trend and high seasonality. The Dickey-Fuller test gives us a number to measure how far off our ts is from being stationary.

The null hypothesis of the test is that the time series is not stationary (it has a unit root) and has time-dependent structure of some kind. The alternative hypothesis is that the time series is stationary. We use the p-value to interpret the result. If the p-value > 0.05, we fail to reject h0 and treat the data as non-stationary. If the p-value <= 0.05, we reject h0 and conclude that the ts is stationary.
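
The idea behind the test can be sketched with plain numpy. This is the basic Dickey-Fuller regression with a constant and no lagged differences, so it is not the same computation as `adfuller` with `autolag='AIC'`. The threshold of -2.86 is the approximate large-sample 5% critical value for this constant-only case, and `dickey_fuller_stat` is a helper written for this example.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + e[t]."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
noise = rng.normal(size=500)
white_noise = noise              # stationary by construction
random_walk = np.cumsum(noise)   # has a unit root by construction

CRITICAL_5PCT = -2.86  # approximate value, constant-only case

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject h0 (stationary)" if stat < CRITICAL_5PCT else "fail to reject h0"
    print(f"{name}: statistic = {stat:.2f} -> {verdict}")
```

A very negative statistic means the series is pulled back strongly towards its mean, which is evidence against a unit root.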

In [1]:
def test_stationarity(timeseries):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['ADF Statistic:','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

In [1]:
test_stationarity(ts_M)

No surprise here, p-value > 0.05. <b>Ts is non-stationary</b>

<a id = "remove_stationarity"></a>
### Remove Non-Stationarity effects
[Back to Table of Contents](#contents)

Let's take a look at the different things we need to do to make a time series stationary:

In [1]:
def normalize(ts):
    avg,dev=ts.mean(), ts.std()
    ts=(ts-avg)/dev
    return ts

def remove_seasonality(ts):
    ts= ts-ts.shift(12)
    ts=ts.dropna()
    return ts

def remove_trend(ts):
    ts= ts.diff(1).dropna()
    return ts
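
To check what these helpers do, the sketch below applies the same two operations to a synthetic series made of a linear trend plus a pattern that repeats every 12 steps (all values are made up). The lag-12 difference removes the repeating pattern and leaves a constant, and one more first difference removes that constant as well.

```python
import numpy as np
import pandas as pd

t = np.arange(36)
# Pattern that repeats exactly every 12 steps.
seasonal_pattern = np.tile([5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3], 3)
ts = pd.Series(200.0 - 2.0 * t + seasonal_pattern)

seasonal_diff = (ts - ts.shift(12)).dropna()   # same operation as remove_seasonality
both = seasonal_diff.diff(1).dropna()          # remove_trend applied on top

print(seasonal_diff.unique())  # one constant value: 12 steps of slope -2
print(both.unique())           # only zeros remain
```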

In [1]:
# Normalize TS
ts_norm=normalize(ts_M)
ts_norm.plot()
plt.xlabel('months')
plt.ylabel('Total sales')

In [1]:
# Remove seasonality
ts_est=remove_seasonality(ts_M)
ts_est.plot()
test_stationarity(ts_est)

In [1]:
#Remove trend
ts_trend=remove_trend(ts_M)
ts_trend.plot()
plt.xlabel('months')
plt.ylabel('Total sales')
test_stationarity(ts_trend)

<a id = "housekeeping"></a>
# Brief housekeeping
[Back to Table of Contents](#contents)

<a id = "useful"></a>
### Useful functions
[Back to Table of Contents](#contents)

In [1]:
def create_pred_list(X,Y):
    lista_shops=X['shop_id'].tolist()
    preds=[]
    for shop in range(0,len(Y)):
        preds.append(np.array([lista_shops[shop],Y[shop]]))
    return preds

def submission_df(df_forc_items):
    df_final=pd.merge(df_sub,df_forc_items,how='left',on=['ID'])
    df_final.fillna(0,inplace=True)
    return df_final

def evaluation(df_pred):
    df_eval=pd.merge(df_valid,df_pred, on=['ID'],how='inner')
    error = sqrt(mean_squared_error(df_eval['obs'].values, df_eval['pred'].values))
    return error

In order to disaggregate the shop forecasts using the <b>top-down approach</b> and give every item the proper proportion, we are going to build the following function to calculate the weight corresponding to each item in each store. <b>We calculate the weights based on the sales of the last 3 months</b> (months_weights=3 in the calls further down):

In [1]:
def middleout_forecasting(predictions,months_weights): 
    month=33-months_weights
    df_forecast=pd.DataFrame(predictions,columns=['shop_id','forecast_shop'])
    
    #we calculate the number of sales per item in each store over the last "months_weights" months
    sales_gb_item=df_sales[df_sales['date_block_num']>month].groupby(['shop_id','item_id'])["item_cnt_day"].sum()
    sales_gb_item=pd.DataFrame(sales_gb_item)
    sales_gb_item.reset_index(inplace=True)
    sales_gb_item.rename(columns={'item_cnt_day':'item_sales'},inplace=True)
    
    #we calculate the number of sales per store over the last "months_weights" months
    sales_gb_shop=df_sales[df_sales['date_block_num']>month].groupby(['shop_id'])["item_cnt_day"].sum()
    sales_gb_shop=pd.DataFrame(sales_gb_shop)
    sales_gb_shop.reset_index(inplace=True)
    sales_gb_shop.rename(columns={'item_cnt_day':'shop_sales'},inplace=True)
    
    #we calculate the proportion of the sales for each item in every shop over the last "months_weights" months
    sales_gb_full=pd.merge(sales_gb_item,sales_gb_shop,how='left',on=['shop_id'])
    sales_gb_full=pd.merge(sales_gb_full,df_test,how='left',on=['shop_id','item_id'])
    sales_gb_full['weights']=sales_gb_full['item_sales']/sales_gb_full['shop_sales']
    sales_gb_full.drop(['item_sales','shop_sales'],axis=1,inplace=True)
    
    #we calculate the forecast for each item in every store
    df_calc=pd.merge(sales_gb_full,df_forecast,how='left',on=['shop_id'])
    df_calc['item_cnt_month']=(df_calc['weights']*df_calc['forecast_shop']).clip(0,20) #clip the result to submit
    df_calc.drop(['shop_id','item_id','weights','forecast_shop'],axis=1,inplace=True)
    
    return df_calc
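
The core of the function is a proportional split. Here is a minimal sketch with made-up numbers: each item receives the share of the shop forecast that matches its share of the shop's recent sales, and the result is clipped to [0, 20] as in the function above. The column name `item_forecast` is introduced only for this example.

```python
import pandas as pd

# Hypothetical recent sales per (shop, item) and hypothetical shop-level forecasts.
recent = pd.DataFrame({
    "shop_id": [1, 1, 1, 2, 2],
    "item_id": [10, 11, 12, 10, 13],
    "item_sales": [30, 15, 5, 8, 2],
})
shop_forecast = pd.DataFrame({"shop_id": [1, 2], "forecast_shop": [100.0, 20.0]})

# Weight of each item = its share of the shop's recent sales.
shop_totals = recent.groupby("shop_id")["item_sales"].transform("sum")
recent["weights"] = recent["item_sales"] / shop_totals

out = recent.merge(shop_forecast, on="shop_id", how="left")
out["item_forecast"] = out["weights"] * out["forecast_shop"]
out["item_cnt_month"] = out["item_forecast"].clip(0, 20)
print(out)
```

Before clipping, the item forecasts of a shop add up to the shop forecast. After clipping this is no longer guaranteed, which is the price of matching the competition's target range.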

<a id = "validationdf"></a>
### Validation df
[Back to Table of Contents](#contents)

Before we begin modelling, we put together a df to validate the results of the models. We will use the last month (October 2015, date_block_num=33) for validation. This df has the same layout as the submission file: ['ID','obs'].

In [1]:
#df validation, month 33
df_valid=full_df[full_df['date_block_num']==33].groupby(['shop_id','item_id'])["item_cnt_day"].sum().clip(0,20).to_frame()
df_valid.reset_index(inplace=True)

df_valid=pd.merge(df_valid,df_test, on=['shop_id','item_id'],how='left').sort_values('ID')
df_valid.drop(['item_id','shop_id'],axis=1,inplace=True)
df_valid.rename(columns={'item_cnt_day':'obs'},inplace=True)
df_valid=df_valid[['ID','obs']]
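
The `evaluation` function defined earlier compares these clipped observations with the predictions using the root mean squared error. On a few made-up values the computation looks like this:

```python
import numpy as np

# Made-up observed and predicted monthly counts, clipped to [0, 20] like df_valid.
obs = np.array([0, 3, 25, 1]).clip(0, 20)
pred = np.array([1.0, 3.0, 18.0, -0.5]).clip(0, 20)

# Errors are -1, 0, 2 and 1, so the mean squared error is 6 / 4 = 1.5.
rmse = np.sqrt(np.mean((obs - pred) ** 2))
print(round(rmse, 4))
```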

<a id = "approaches"></a>
# Possible approaches
[Back to Table of Contents](#contents)

Up to this point, we have extracted some interesting new features from the raw data, we merged it all together and we know the essential characteristics of the time series we are dealing with. 

In this section we are going to explore the underlying intuition behind some common statistical methods for time series forecasting.

<a id = "hierarchical"></a>
### Hierarchical Time Series
[Back to Table of Contents](#contents)

Our ts has 3 levels of hierarchy: total sales, sales per shop, and sales per shop and item. In this exercise we are asked to make predictions for the lowest hierarchical level, sales per shop and item. This poses a problem since we are working on a personal laptop with not enough computing power to model at the item level. Therefore, we have to work around this issue. We found this [paper published by Rob Hyndman](https://robjhyndman.com/publications/hierarchical-tourism/) with an interesting approach. Without going into too much detail, the idea of the technique is the following:

<b>-Bottom-up approach:</b> Model and predict for the lowest level of the hierarchy and then sum these predictions to produce forecast for the upper levels.

<b>-Top-down approach:</b> Model and predict for the top level of the hierarchy and then disaggregate these down the hierarchy.

<b>-Middle-out approach:</b> It combines bottom-up and top-down. It first generates forecasts for a middle level and then applies top-down and bottom-up to generate the forecasts for the lower and upper levels.

<b>We are going to use the Middle-out approach to model and predict at the shop level and then disaggregate to generate the predictions at the shop-item level.</b>
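
The three levels are tied together by simple sums, which is what makes moving up and down the hierarchy possible. A tiny sketch of the bottom-up direction with made-up shop-item forecasts:

```python
import pandas as pd

# Hypothetical forecasts at the lowest level (shop, item).
bottom = pd.DataFrame({
    "shop_id": [1, 1, 2, 2],
    "item_id": [10, 11, 10, 12],
    "forecast": [4.0, 6.0, 3.0, 7.0],
})

# Bottom-up: summing the lowest level gives the shop level, then the total.
per_shop = bottom.groupby("shop_id")["forecast"].sum()
total = per_shop.sum()

print(per_shop)
print(total)
```

Going the other way (from shop level to shop-item level) needs a rule for splitting each shop's number, which is what the weights in the housekeeping section provide.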

<a id = "approach1"></a>
# Approach 1: ARIMA
[Back to Table of Contents](#contents)

ARIMA stands for <b>AutoRegressive Integrated Moving Average</b>. There are 3 components in this model and each one corresponds to a parameter in the ARIMA model implementation in Python, ARIMA(p,d,q):

<b>-AR:</b> Autoregression. Takes into account the strength of the relationship between an observation and its previous observation at different lags. It corresponds to the <b>p</b> parameter or the lag order, it is the number of lags included in the model.

<b>-I:</b> Integration. Subtracts from each observation the observation at the previous timestamp. Useful to make the series stationary. It corresponds to the <b>d</b> parameter or the degree of differencing, it is the number of times we difference the observations.

<b>-MA:</b> Moving Average. Takes into account the relationship between an observation and the forecast errors (residuals) at previous timestamps. It corresponds to the <b>q</b> parameter or the order of the moving average, it is the number of lagged forecast errors included in the model.

We can try to make some of the parameters 0 to make AR models (d=q=0) or MA models (p=d=0) or ARMA models (d=0)...
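
To make the AR and I parts tangible, here is a rough sketch of an ARIMA(1,1,0)-style forecast done by hand on synthetic data: difference once, fit one autoregressive coefficient by least squares, then undo the differencing. It only illustrates the idea and is not how statsmodels estimates the model.

```python
import numpy as np

rng = np.random.default_rng(7)
phi_true, n = 0.6, 2000

# Synthetic data: the differences follow an AR(1), the level is their running sum.
d = np.zeros(n)
for t in range(1, n):
    d[t] = phi_true * d[t - 1] + rng.normal()
y = np.cumsum(d)

# I (d=1): difference the series once.
dy = np.diff(y)

# AR (p=1): regress each difference on the previous one.
X = np.column_stack([np.ones(len(dy) - 1), dy[:-1]])
(intercept, phi_hat), *_ = np.linalg.lstsq(X, dy[1:], rcond=None)

# One-step forecast: predict the next difference and add it back to the last level.
next_diff = intercept + phi_hat * dy[-1]
forecast = y[-1] + next_diff

print(f"estimated phi: {phi_hat:.3f}, next value forecast: {forecast:.2f}")
```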

<a id = "arima_quicklook"></a>
### ARIMA - Quick look
[Back to Table of Contents](#contents)

Let's first take a look at the results of applying ARIMA to the global monthly sales time series. 

Since we have seen previously that the time series is not stationary and requires differencing, 1 would be a good value for the parameter d (the seasonality is removed beforehand with remove_seasonality). And since the autocorrelation was significant for the first 2 or 3 lags, we can set p to 3.

In [1]:
ts_arima=remove_seasonality(ts_M)

In [1]:
model = ARIMA(ts_arima, order=(3,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())

# plot residual errors
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
residuals.plot(kind='kde')
print(residuals.describe())

<a id = "arima_modelling"></a>
### ARIMA - Modelling
[Back to Table of Contents](#contents)

In [1]:
# SHOP LEVEL
ts_shop=df_sales.groupby(["date_block_num",'shop_id'])["item_cnt_day"].sum()
ts_shop=ts_shop.unstack(level=1)
ts_shop=ts_shop.fillna(0)

In [1]:
closed_shops=(ts_shop[33:]==0).all()
closed_shops=np.array(closed_shops.index[closed_shops==True])

In [1]:
def train_evaluation_arima(ts,i):
    predictions = list()
    
    X=ts[i].values
    train, test = X[0:33], X[33:]
    history = [x for x in train]
    print('-----------------------shop %d--------------------------' % i)
    if (i in closed_shops or sum(history)==0):
        for t in range(len(test)):
            predictions.append(0)
    else:
        for t in range(len(test)):
            model = ARIMA(history, order=(2,1,0))
            model_fit = model.fit(disp=0)
            output = model_fit.predict(1, len(history)+1, typ='levels')
            yhat = output[-1]
            predictions.append(yhat)
            obs = test[t]
            history.append(obs)
            
            print('predicted=%f, expected=%f' % (yhat, obs))
            error = sqrt(mean_squared_error(test, predictions))
            print('Test RMSE: %.3f' % error)
            
            plt.plot(history,label='obs')
            plt.plot(output, label='pred',ls='--')
            plt.xticks(range(0,len(history)+1))
            plt.legend()
            plt.title('shop '+str(i))
            plt.show()
        
    predictions_series=pd.Series(predictions)
    return np.append(np.array(i),predictions_series.values.transpose())
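
The function above follows a walk-forward scheme: predict one step, then reveal the true value and add it to the history. Stripped of the model and the plotting, the pattern looks like the sketch below. `walk_forward` and the naive `last_value` forecaster are names made up for this example; any function that maps a history to the next value could be plugged in.

```python
def walk_forward(series, n_test, forecaster):
    """One-step-ahead forecasts for the last n_test points of the series."""
    history = list(series[:-n_test])
    predictions = []
    for actual in series[-n_test:]:
        predictions.append(forecaster(history))
        history.append(actual)  # the true value is revealed only after predicting it
    return predictions

def last_value(history):
    """Naive forecaster: the next value equals the most recent one."""
    return history[-1]

sales = [10, 12, 11, 15, 14, 13]
preds = walk_forward(sales, n_test=3, forecaster=last_value)
print(preds)  # → [11, 15, 14]
```

With a single validation month, as in this notebook, the loop runs only once per shop.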

In [1]:
lista_pred_arima=[]
for i in ts_shop.columns.tolist():
    lista_pred_arima.append(train_evaluation_arima(ts_shop,i))

In [1]:
df_pred_arima=middleout_forecasting(lista_pred_arima,3)
df_pred_arima=submission_df(df_pred_arima)
df_pred_arima.rename(columns={'item_cnt_month':'pred'},inplace=True)

Error evaluation:

In [1]:
error_arima = evaluation(df_pred_arima)
print('Test RMSE: %.3f' % error_arima)

<a id = "arima_prediction"></a>
### ARIMA - Forecasting
[Back to Table of Contents](#contents)

In [1]:
def forecast_arima_shops(ts,i):
    predictions = list()
    train=ts[i].values
    
    if (i in closed_shops or sum(train)==0):
        predictions.append(0)
    else:
        model = ARIMA(train, order=(2,1,0))
        model_fit = model.fit(disp=0)
        output = model_fit.predict(1, len(train)+1, typ='levels')
        yhat = output[-1]
        predictions.append(yhat)

        plt.plot(train,label='obs')
        plt.plot(output, label='pred',ls='--')
        plt.xticks(range(0,len(train)+1))
        plt.legend()
        plt.title('shop '+str(i))
        plt.show()
        
    predictions_series=pd.Series(predictions)
    return np.append(np.array(i),predictions_series.values.transpose())

In [1]:
lista_pred_arima_test=[]
for i in ts_shop.columns.tolist():
    lista_pred_arima_test.append(forecast_arima_shops(ts_shop,i))

In [1]:
df_forecast_items=middleout_forecasting(lista_pred_arima_test,3)

In [1]:
df_final=submission_df(df_forecast_items)
df_final.to_csv(sub_folder+'submission_arima.csv', index=False)

<a id = "approach2"></a>
# Approach 2: Prophet
[Back to Table of Contents](#contents)

We will explore this new forecasting procedure developed by Facebook. It is relatively simple to use and could be an interesting approach.

It seems to work reasonably well on messy data, it is robust to outliers and missing data and it is fast.

<a id = "prophet_prepro"></a>
### Prophet - Data preprocessing
[Back to Table of Contents](#contents)

In [1]:
ts_prophet=full_df.groupby(["date_block_num"])["item_cnt_day"].sum()

ts_prophet.index=pd.date_range(start = '2013-01-01',end='2015-10-01', freq = 'MS')
ts_prophet=ts_prophet.reset_index()
ts_prophet.columns=['ds','y']
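
Prophet expects a dataframe with exactly two columns: `ds` (the date) and `y` (the value). The sketch below builds such a frame from placeholder values and checks that the month-start range used above really has 34 entries, one per date_block_num from 0 to 33.

```python
import pandas as pd

# Placeholder monthly totals, one per date_block_num (0..33).
monthly_sales = pd.Series(range(34), name="item_cnt_day")

frame = monthly_sales.to_frame()
# Month-start dates from January 2013 to October 2015.
frame.index = pd.date_range(start="2013-01-01", end="2015-10-01", freq="MS")
frame = frame.reset_index()
frame.columns = ["ds", "y"]

print(frame.head())
print(len(frame))
```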

<a id = "prophet_quicklook"></a>
### Prophet - Quick look
[Back to Table of Contents](#contents)

In [1]:
model = Prophet(yearly_seasonality=True)
model.fit(ts_prophet)
# predict four months into the future; 'MS' (month start) is the frequency
future = model.make_future_dataframe(periods = 4, freq = 'MS',include_history=True)  
# now lets make the forecasts
forecast = model.predict(future)
#Let's plot the forecast
model.plot(forecast)
#decompose the forecast
model.plot_components(forecast)

<a id = "prophet_model"></a>
### Prophet - Modelling
[Back to Table of Contents](#contents)

In [1]:
ts_prophet_shop=full_df.groupby(["date",'shop_id'])["item_cnt_day"].sum()
ts_prophet_shop=ts_prophet_shop.unstack(level=1)
ts_prophet_shop=ts_prophet_shop.fillna(0)
ts_prophet_shop = ts_prophet_shop.resample("M").sum()

In [1]:
def train_evaluation_Prophet(ts,i):
    predictions = list()
    ts=pd.DataFrame(ts[i]).reset_index()
    ts.columns=['ds','y']
    
    train = ts.loc[ts['ds'] != datetime(2015, 10, 31)]
    test = ts.loc[ts['ds'] == datetime(2015, 10, 31)]
    print('-----------------------shop %d--------------------------' % i)
    
    if (i in closed_shops or sum(train['y'].values)==0):
        predictions.append(0)
    else:
        model = Prophet(yearly_seasonality=True) 
        model.fit(train)
        future = model.make_future_dataframe(periods = 1, freq = 'M',include_history=True)  # month-end, matching the resample("M") index and the 2015-10-31 test date
        forecast = model.predict(future) 
        output = forecast['yhat']
        yhat = output.values[-1]
        predictions.append(yhat)

        print('predicted=%f, expected=%f' % (yhat, test['y'].values[0]))
        error = sqrt(mean_squared_error(test['y'], predictions))
        print('Test RMSE: %.3f' % error)

#         model.plot(forecast)
        plt.plot(ts['y'],label='obs')
        plt.plot(output, label='pred',ls='--')
        plt.xticks(range(0,len(train)+1))
        plt.legend()
        plt.title('shop '+str(i))
        plt.show()
        
    predictions_series=pd.Series(predictions)
    return np.append(np.array(i),predictions_series.values.transpose())
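
The error computed for each shop is the square root of the mean squared error, that is, the RMSE. A minimal numpy sketch with made-up numbers (the helper name rmse is ours and is not part of the notebook):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# With a single hold-out month, the RMSE reduces to the absolute error.
single_month = rmse([1200.0], [1150.0])
print(single_month)

# Made-up values for three observations.
three_obs = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(three_obs)
```

Since each shop is evaluated on one month only, the per-shop figure is simply the absolute difference between the prediction and the observed value.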

In [1]:
lista_pred_prophet=[]
for i in ts_prophet_shop.columns.tolist():
    lista_pred_prophet.append(train_evaluation_Prophet(ts_prophet_shop,i))

In [1]:
df_pred_prophet=middleout_forecasting(lista_pred_prophet,3)
df_pred_prophet=submission_df(df_pred_prophet)
df_pred_prophet.rename(columns={'item_cnt_month':'pred'},inplace=True)

Error evaluation:

In [1]:
error_prophet = evaluation(df_pred_prophet)
print('Test MSE: %.3f' % error_prophet)

<a id = "prophet_prediction"></a>
### Prophet - Forecasting
[Back to Table of Contents](#contents)

In [1]:
def forecast_Prophet(ts,i):
    predictions = list()
    train=pd.DataFrame(ts[i]).reset_index()
    train.columns=['ds','y']
    
    if (i in closed_shops or sum(train['y'])==0):
        predictions.append(0)
    else:
        model = Prophet(yearly_seasonality=True) 
        model.fit(train)
        future = model.make_future_dataframe(periods = 1, freq = 'MS',include_history=True)  
        output = model.predict(future)['yhat']
        yhat = output.values[-1]
        predictions.append(yhat)

        plt.plot(train['y'],label='obs')
        plt.plot(output, label='pred',ls='--')
        plt.xticks(range(0,len(train)+1))
        plt.legend()
        plt.title('shop '+str(i))
        plt.show()
        
    predictions_series=pd.Series(predictions)
    return np.append(np.array(i),predictions_series.values.transpose())

In [1]:
lista_pred_test_prophet=[]
for i in ts_prophet_shop.columns.tolist():
    lista_pred_test_prophet.append(forecast_Prophet(ts_prophet_shop,i))

In [1]:
df_forecast_prophet=middleout_forecasting(lista_pred_test_prophet,3)

In [1]:
df_forecast_prophet=submission_df(df_forecast_prophet)
df_forecast_prophet.to_csv(sub_folder+'submission_prophet.csv', index=False)

<a id = "approach3"></a>
# Approach 3: XGBOOST
[Back to Table of Contents](#contents)

<a id = "xgboost"></a>
### Convert to supervised learning
[Back to Table of Contents](#contents)

Forecasting problems can be converted into supervised learning problems. This allows us to use machine learning methods to make predictions.

Before we can do so, the time series must be transformed: we need to create features that tell the model how the target variable depends on previous observations. The features we are going to create are therefore "lag features", which carry the value of a given metric at earlier time steps.
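
As a minimal illustration of the idea, assuming a single toy series with made-up values, each row of the supervised dataset pairs the target with the values seen one and two steps earlier:

```python
import pandas as pd

# Toy monthly series (made-up values).
sales = pd.Series([10, 12, 15, 11, 18, 20], name="y")

# Each row holds the target y and the values observed 1 and 2 steps earlier.
frame = pd.DataFrame({
    "y_lag_2": sales.shift(2),
    "y_lag_1": sales.shift(1),
    "y": sales,
}).dropna()  # the first rows have no history and are dropped

X = frame[["y_lag_2", "y_lag_1"]]  # features
y = frame["y"]                     # target
print(frame)
```

Any regression model can then be trained on X and y as if the rows were independent samples.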

<a id = "xgboost_prepro"></a>
### XGBOOST - Data preprocessing
[Back to Table of Contents](#contents)

We are going to build a matrix with every possible combination of date and shop (their Cartesian product). This would make more sense at the item level, where the matrix would hold all possible shop-item combinations. Since that is how this problem should ideally be tackled, I will keep this structure for academic purposes.

In [1]:
min_date = full_df["date"].min()
max_date_sales = full_df["date"].max()
max_date_test = datetime(2015, 11, 30)

date_range = pd.date_range(min_date, max_date_test, freq = "M")

shops_cartesian=sorted(df_test['shop_id'].unique().tolist())

cartesian_product=pd.MultiIndex.from_product([date_range,shops_cartesian],names=['date','shop_id'])

In [1]:
#We prepare the following df to fill the matrix
df_dates_max = full_df.groupby(['date_block_num']).agg({'date':'max'})
df_dates_max.reset_index(inplace=True)

gb_sales = full_df.groupby(['date_block_num',"shop_id"]).agg({'item_cnt_day':'sum','revenue':'sum'})
gb_sales.reset_index(inplace=True)
gb_sales = gb_sales.rename(columns={'item_cnt_day': 'item_cnt_month'})

gb_sales=pd.merge(gb_sales, df_dates_max,on='date_block_num',how='left')

In [1]:
matrix_test=df_test.copy()
matrix_test["date_block_num"] = 34
matrix_test["date_block_num"] = matrix_test["date_block_num"].astype(np.int8)
matrix_test["shop_id"] = matrix_test.shop_id.astype(np.int8)
matrix_test.drop(['ID','item_id'],axis=1,inplace=True)

In [1]:
# We fill the matrix
matrix = pd.DataFrame(index = cartesian_product).reset_index()

matrix=pd.merge(matrix,gb_sales,on=['date','shop_id'],how='left')
matrix=pd.merge(matrix,matrix_test,on=['date_block_num','shop_id'],how='left')
matrix.loc[matrix['date'] == datetime(2015, 11, 30), 'date_block_num'] = 34
matrix.fillna(0, inplace = True)
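
The effect of building the full grid first and merging the sales afterwards can be seen on a toy example. The frame below is hypothetical (made-up values, and keyed on date_block_num rather than on date for brevity): a shop with no recorded sales in a month still gets a row, with an explicit zero.

```python
import pandas as pd

# Hypothetical observed sales: shop 2 sold nothing in month 1, so it has no row.
observed = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_cnt_month": [5, 3, 7],
})

# Every (month, shop) pair, whether or not a sale was recorded.
grid = pd.MultiIndex.from_product(
    [[0, 1], [1, 2]], names=["date_block_num", "shop_id"]
)
toy_matrix = pd.DataFrame(index=grid).reset_index()

# The left merge keeps all pairs; pairs without sales become explicit zeros.
toy_matrix = toy_matrix.merge(observed, on=["date_block_num", "shop_id"], how="left")
toy_matrix["item_cnt_month"] = toy_matrix["item_cnt_month"].fillna(0)
print(toy_matrix)
```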

In [1]:
# Let's get the shop features we created earlier
matrix=pd.merge(matrix,df_shops,on=['shop_id'],how='left')

Write and read pickle

In [1]:
# matrix.to_pickle(folder_pickles+"matrix.pkl")
# matrix = pd.read_pickle(folder_pickles+"matrix.pkl")

Now we are going to create the lag features using the following function:

In [1]:
def lag_feature( df,lags, cols):
    for col in cols:
        print(col)
        tmp = df[["date_block_num", "shop_id",col ]]
        for i in lags:
            tmp_aux = tmp.copy()
            tmp_aux.columns = ["date_block_num", "shop_id", col + "_lag_"+str(i)]
            tmp_aux.date_block_num += i
            df = pd.merge(df, tmp_aux, on=['date_block_num','shop_id'], how='left')
    return df
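
The trick used by this function is to shift the month key forward and join the result back onto the original frame. A small sketch on a toy frame (made-up values, one lag only) makes the mechanics visible:

```python
import pandas as pd

# Toy frame: two shops over three months (made-up values).
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "shop_id":        [1, 1, 1, 2, 2, 2],
    "item_cnt_month": [5, 7, 6, 1, 0, 4],
})

# Lag 1: relabel every row as belonging to the following month, then join back.
shifted = toy.copy()
shifted["date_block_num"] += 1
shifted = shifted.rename(columns={"item_cnt_month": "item_cnt_month_lag_1"})

lagged = toy.merge(shifted, on=["date_block_num", "shop_id"], how="left")
print(lagged)
```

Month 0 has no predecessor, so its lag is NaN. The same happens for the first i months with a lag of i.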

In [1]:
# LAG FEATURES

#feature 1
matrix_in=matrix
matrix_feat1 = lag_feature(matrix_in, [1,2,3], ["item_cnt_month"])

#feature 2
df_group=matrix.groupby(['date_block_num','shop_id','city_id']).agg({'item_cnt_month':['sum','mean']})
df_group.columns = ["date_shop_city_sum",'date_shop_city_mean']
df_group.reset_index(inplace=True)
matrix_in=pd.merge(matrix,df_group,on=['date_block_num','shop_id','city_id'],how='left')

matrix_feat2 = lag_feature(matrix_in, [1,2,3], ["date_shop_city_sum",'date_shop_city_mean'])


# feature 3
df_group=matrix.groupby(['date_block_num','shop_cat']).agg({'item_cnt_month':['sum','mean']})
df_group.columns = ["date_shop_cat_sum",'date_shop_cat_mean']
df_group.reset_index(inplace=True)
matrix_in=pd.merge(matrix,df_group,on=['date_block_num','shop_cat'],how='left')

matrix_feat3 = lag_feature(matrix_in, [1,2,3], ["date_shop_cat_sum",'date_shop_cat_mean'])

Let's put together all the features we have:

In [1]:
def merge_features( df,cols):
    cols_join=['date_block_num','shop_id']
    cols_feats=df.iloc[:,-cols:].columns.tolist()
    df_aux=df[cols_join+cols_feats]
    df_feats=pd.merge(matrix_feats, df_aux, on=cols_join, how='left')
    return df_feats

matrix_feats=matrix.copy()
matrix_feats=merge_features(matrix_feat1, 3)
matrix_feats=merge_features(matrix_feat2, 8)
matrix_feats=merge_features(matrix_feat3, 8)

Let's add more time features that might be interesting:

In [1]:
matrix_feats["year"] = matrix_feats["date"].dt.year
matrix_feats["month"] = matrix_feats["date"].dt.month
matrix_feats["days_in_month"] = matrix_feats["date"].dt.days_in_month
matrix_feats["quarter_start"] = matrix_feats["date"].dt.is_quarter_start
matrix_feats["quarter_end"] = matrix_feats["date"].dt.is_quarter_end

We include two variables: the number of holidays in the current month and the number of holidays in the next month.

In [1]:
holidays_next_month = {
    12:8,
    1:1,
    2:1,
    3:0,
    4:2,
    5:1,
    6:0,
    7:0,
    8:0,
    9:0,
    10:1,
    11:0
}

holidays_this_month = {
    1:8,
    2:1,
    3:1,
    4:0,
    5:2,
    6:1,
    7:0,
    8:0,
    9:0,
    10:0,
    11:1,
    12:0
}

matrix_feats["holidays_next_month"] = matrix_feats["month"].map(holidays_next_month)
matrix_feats["holidays_this_month"] = matrix_feats["month"].map(holidays_this_month)
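
The two dictionaries are the same table read with a one-month offset. A short sketch (the name derived_next_month is ours) derives one from the other, which may be easier than keeping two hand-typed tables in sync:

```python
# Holiday counts per calendar month, copied from the cell above.
holidays_this_month = {1: 8, 2: 1, 3: 1, 4: 0, 5: 2, 6: 1,
                       7: 0, 8: 0, 9: 0, 10: 0, 11: 1, 12: 0}

# Month m looks ahead to month m + 1, with December wrapping around to January.
derived_next_month = {m: holidays_this_month[m % 12 + 1] for m in range(1, 13)}

print(derived_next_month)
```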

We borrow these great attributes from the Kaggle community: the population and income of each city.

In [1]:
city_population = {
'Якутск':307911, 
'Адыгея':141970,
'Балашиха':450771, 
'Волжский':326055, 
'Вологда':313012, 
'Воронеж':1047549,
'Выездная':1228680, 
'Жуковский':107560, 
'Интернет-магазин':1228680, 
'Казань':1257391, 
'Калуга':341892,
'Коломна':140129,
'Красноярск':1083865, 
'Курск':452976, 
'Москва':12678079,
'Мытищи':205397, 
'Н.Новгород':1252236,
'Новосибирск':1602915 , 
'Омск':1178391, 
'РостовНаДону':1125299, 
'СПб':5398064, 
'Самара':1156659,
'СергиевПосад':104579, 
'Сургут':373940, 
'Томск':572740, 
'Тюмень':744554, 
'Уфа':1115560, 
'Химки':244668,
'Цифровой':1228680, 
'Чехов':70548, 
'Ярославль':608353
}

city_income = {
'Якутск':70969, 
'Адыгея':28842,
'Балашиха':54122, 
'Волжский':31666, 
'Вологда':38201, 
'Воронеж':32504,
'Выездная':46158, 
'Жуковский':54122, 
'Интернет-магазин':46158, 
'Казань':36139, 
'Калуга':39776,
'Коломна':54122,
'Красноярск':48831, 
'Курск':31391, 
'Москва':91368,
'Мытищи':54122, 
'Н.Новгород':31210,
'Новосибирск':37014 , 
'Омск':34294, 
'РостовНаДону':32067, 
'СПб':61536, 
'Самара':35218,
'СергиевПосад':54122, 
'Сургут':73780, 
'Томск':43235, 
'Тюмень':72227, 
'Уфа':35257, 
'Химки':54122,
'Цифровой':46158, 
'Чехов':54122, 
'Ярославль':34675
}

matrix_feats["city_population"] = matrix_feats["city"].map(city_population)
matrix_feats["city_income"] = matrix_feats["city"].map(city_income)
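
Series.map returns NaN for any key that is missing from the dictionary, so a quick check for unmapped cities can be worthwhile after a step like this. A minimal sketch with a deliberately incomplete toy lookup (the variable names are ours):

```python
import pandas as pd

# Hypothetical lookup covering only two cities.
lookup = {"Москва": 12678079, "Казань": 1257391}

cities = pd.Series(["Москва", "Казань", "Якутск"], name="city")
mapped = cities.map(lookup)

# Keys that are missing from the dictionary come back as NaN.
unmapped = cities[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every city received a value.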

In [1]:
#Pickle the matrix with all the features ready
# matrix_feats.to_pickle(folder_pickles+"matrix_feats.pkl")
# matrix_feats = pd.read_pickle(folder_pickles+"matrix_feats.pkl")

<a id = "xgboost_model"></a>
### XGBOOST - Modelling
[Back to Table of Contents](#contents)

Before training, let's drop the variables that will not be available for the test month (date_block_num = 34). Every variable used for training must also be known for the test month, so we keep only the ones that do not vary over time (we assume city population and income are constant) and the lag features we created, which carry the information about change over time.

In [1]:
cols_to_drop = [  
'revenue', 
"city",
"date_shop_city_sum",
"date_shop_city_mean", 
"date_shop_cat_sum",
"date_shop_cat_mean",   
]

In [1]:
# XGBoost regressor: drop the earliest months, where the lag features are incomplete
matrix_feats = matrix_feats[matrix_feats["date_block_num"] > 3]
matrix_feats.drop(cols_to_drop, inplace = True, axis = 1)

In [1]:
data = matrix_feats.copy()
data.drop('date',axis=1,inplace=True)

In [1]:
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test  = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

We use GridSearchCV to search the hyperparameter grid with cross-validation:

In [1]:
xgb_ft=xgb.XGBRegressor()

parameters = {'nthread':[4],
              'objective':['reg:squarederror'],  # 'reg:linear' is the deprecated name of this objective
              'learning_rate': [0.01,0.05,0.07], 
              'max_depth': [2,5,10],
              'min_child_weight': [0.5,1.0,1.5],
              'verbosity': [0],  # replaces the deprecated 'silent' flag
              'subsample': [0.2,0.7,0.9],
              'colsample_bytree': [0.7],
              'n_estimators': [100,1500,3000]}

model = GridSearchCV(xgb_ft,
                        parameters,
                        cv = 2,
                        n_jobs = 5,
                        verbose=True)

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 10)
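
A plain 2-fold split does not guarantee that the training months precede the validation months. One possible alternative, shown here only as a sketch and not as what this notebook does, is an expanding-window split keyed on date_block_num. GridSearchCV accepts an iterable of (train, test) index arrays through its cv argument. The helper name expanding_window_splits and its parameter n_valid_months are hypothetical:

```python
import numpy as np

def expanding_window_splits(date_block_num, n_valid_months=2):
    """Yield (train_idx, valid_idx) pairs in which all training rows come
    from months before the validation month. Indices are positional."""
    blocks = np.asarray(date_block_num)
    months = np.unique(blocks)  # sorted unique month numbers
    for valid_month in months[-n_valid_months:]:
        train_idx = np.flatnonzero(blocks < valid_month)
        valid_idx = np.flatnonzero(blocks == valid_month)
        yield train_idx, valid_idx

# Toy month column: two rows (shops) per month, months 4 to 7.
toy_blocks = [4, 4, 5, 5, 6, 6, 7, 7]
splits = list(expanding_window_splits(toy_blocks))
for train_idx, valid_idx in splits:
    print(train_idx, valid_idx)
```

Under these assumptions, the list could be passed as cv = list(expanding_window_splits(X_train["date_block_num"])) in place of cv = 2.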


In [1]:
estimators=pd.DataFrame({'importance':model.best_estimator_.feature_importances_.tolist(),
                         'features':X_train.columns.tolist()})

In [1]:
fig, ax = plt.subplots(1,1,figsize=(10,14))
# draw on the axes created above and keep `ax` pointing at them
sns.barplot(x="importance", y="features", data=estimators, ax=ax)
ax.set_title('Feature importance')

<a id = "xgboost_prediction"></a>
### XGBOOST - Forecasting
[Back to Table of Contents](#contents)

In [1]:
Y_pred = model.predict(X_valid)

In [1]:
lista_valid_xgb=create_pred_list(X_valid,Y_pred)
df_pred_xgb=middleout_forecasting(lista_valid_xgb,3)
df_pred_xgb=submission_df(df_pred_xgb)
df_pred_xgb_sub=df_pred_xgb.copy()
df_pred_xgb.rename(columns={'item_cnt_month':'pred'},inplace=True)

Error evaluation:

In [1]:
error_xgb = evaluation(df_pred_xgb)
print('Test MSE: %.3f' % error_xgb)

Submission file

In [1]:
#df_pred_xgb_sub.to_csv(sub_folder+'submission_xgb.csv', index=False)

<a id = "conclusions"></a>
# Results and conclusions
[Back to Table of Contents](#contents)

After exploring these approaches, the submissions on Kaggle gave the following results:
    
      ARIMA: 1.07697
    PROPHET: 1.12637
        XGB: 1.09260
    

With enough computing power we would not need the TOP-DOWN method and could forecast directly at the lowest hierarchical level. In that case, it would be a very good idea to explore XGB with more lag features at the item level. Exploring SARIMAX with yearly seasonality would also be an interesting approach. Prophet is quick and very easy to use, which makes it a great option when a fast solution is needed.