# Introduction

The goal of this competition is to predict 3 months of sales for 50 different items at 10 different stores, using the previous 5 years of sales data.
The evaluation metric is SMAPE (symmetric mean absolute percentage error):

SMAPE = $\frac{100\%}{n}\displaystyle\sum_{t=1}^{n} \frac{|F_t-A_t|}{(|A_t|+|F_t|)/2}$

where $A_t$ is the actual value and $F_t$ is the forecast at time $t$.
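
As a quick check of the formula, here is a small worked example on made-up numbers. The helper name `smape_example` is only for illustration (the notebook defines its own `smape` function further down), and the three actual/forecast pairs are invented.

```python
import numpy as np

def smape_example(actual, forecast):
    """SMAPE in percent, written term by term as in the formula above."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    terms = np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2)
    return 100 / len(actual) * np.sum(terms)

# Invented values, not taken from the competition data
actual = np.array([10.0, 20.0, 30.0])
forecast = np.array([12.0, 18.0, 30.0])

# Terms: 2/11, 2/19 and 0, so SMAPE = 100/3 * (2/11 + 2/19)
value = smape_example(actual, forecast)
print(round(value, 4))  # 9.5694

# A forecast of zero against a positive actual gives the maximum of 200%
print(smape_example([10.0], [0.0]))  # 200.0
```

Note that a term is undefined when both the actual and the forecast are zero, so this sketch assumes at least one of them is non-zero for every $t$.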

My goal is to achieve this using a GAM (generalized additive model), so I will try pyGAM from:

https://pygam.readthedocs.io/en/latest/

# Loading libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
      
from scipy.stats import skew,norm,zscore

from statsmodels.tsa.seasonal import seasonal_decompose

from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit, GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, make_scorer, mean_squared_log_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
warnings.filterwarnings('ignore')
sns.set_theme()

#### Installing pygam

In [None]:
!pip install pygam
from pygam import GAM, f, l, s, te

# Importing the dataset

In [None]:
orig_train = pd.read_csv("/kaggle/input/demand-forecasting-kernels-only/train.csv", parse_dates=['date'])

In [None]:
orig_test = pd.read_csv("/kaggle/input/demand-forecasting-kernels-only/test.csv", parse_dates=['date'])

In [None]:
orig_train.info()

In [None]:
orig_train.sample(10)

In [None]:
orig_test.info()

In [None]:
orig_test.sample(10)

# Creating main date features

In [None]:
def add_features (orig_df):
    
    df = orig_df.copy()
    
    df = df.set_index('date').sort_index()
        
    # Time features
    df['year'] = df.index.year.astype('int')
    df['quarter'] = df.index.quarter.astype('int')
    df['month'] = df.index.month.astype('int')
    df['day'] = df.index.day.astype('int')
    df['dayofweek'] = df.index.day_of_week.astype('int')
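    # DatetimeIndex.week was removed in pandas 2.0; isocalendar().week is the replacement
    df['weekofyear'] = df.index.isocalendar().week.astype('int').to_numpy()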
    df['isweekend'] = df.dayofweek.apply(lambda x: 1 if x in (5,6) else 0)
    df['issunday'] = df.dayofweek.apply(lambda x: 1 if x==6 else 0)
    df['issaturday'] = df.dayofweek.apply(lambda x: 1 if x==5 else 0)
    df['daysinmonth'] = df.index.days_in_month.astype('int')
    
    return df

In [None]:
train = add_features(orig_train)
test = add_features(orig_test)

# Setting main dates

In [None]:
date = {}
date['date_start_train'] = '2013-01-01'
date['date_end_train'] = '2017-12-31'
date['date_start_test'] = '2018-01-01'
date['date_end_test'] = '2018-03-31'
date['date_start_fore'] = '2013-01-01'

diff_train = (pd.Timestamp(date['date_end_train']) - pd.Timestamp(date['date_start_fore'])).days
diff_test = (pd.Timestamp(date['date_end_test']) - pd.Timestamp(date['date_start_fore'])).days

# Stores

In [None]:
fig, axes = plt.subplots(10, 1, figsize=(15, 10*3))

for i, ax in enumerate(axes.flat, start=1):
    # Mean daily sales over all items of store i (stores are numbered from 1)
    store_sales = orig_train[orig_train.store==i].groupby(by='date')['sales'].mean()
    sns.lineplot(ax=ax, x=store_sales.index.values, y=store_sales)
    ax.set_title(f'store_n_{i}')
    
fig.tight_layout()

# Items

In [None]:
fig, axes = plt.subplots(50, 1, figsize=(15, 50*3))

for i, ax in enumerate(axes.flat, start=1):
    # Mean daily sales over all stores of item i (items are numbered from 1)
    item_sales = orig_train[orig_train.item==i].groupby(by='date')['sales'].mean()
    sns.lineplot(ax=ax, x=item_sales.index.values, y=item_sales)
    ax.set_title(f'item_n_{i}')
    
fig.tight_layout()

# Trend and seasonality

In [None]:
result = seasonal_decompose(orig_train.groupby(by='date')['sales'].mean(), model='multiplicative', period=365)
fig = result.plot()
fig.set_size_inches((14,12))

In [None]:
result = seasonal_decompose(orig_train.groupby(by='date')['sales'].mean(), model='multiplicative', period=90)
fig = result.plot()
fig.set_size_inches((14,12))

In [None]:
fig = plt.figure(figsize=(15,6))
train.loc['2017-10':].groupby(by='date')['sales'].mean().plot()

# The model

The data is quite clean and regular, so I think the trend and seasonality patterns can be captured using three features: year, month and day of the week.
I will create a simple function to test the performance of the model on hold-out test periods of 180, 120 and 90 days.

The function will iterate over each store and each item, measuring SMAPE on the training and test samples.
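
The hold-out scheme can be sketched as follows: the last `n` days of a series are kept for testing and everything before them is used for training. This is a minimal illustration on a dummy daily series, and `last_n_days_split` is a hypothetical helper, not the notebook's `split_func`. It should give the same partition as `train_test_split(..., test_size=n, shuffle=False)` when `n` is an integer.

```python
import pandas as pd

# Dummy daily series covering the training period (values are placeholders)
dates = pd.date_range("2013-01-01", "2017-12-31", freq="D")
series = pd.Series(range(len(dates)), index=dates)

def last_n_days_split(s, n):
    """Hypothetical helper: keep the last n observations as the test sample."""
    return s.iloc[:-n], s.iloc[-n:]

for n in (180, 120, 90):
    train_part, test_part = last_n_days_split(series, n)
    print(n, len(train_part), test_part.index.min().date(), test_part.index.max().date())
```

Because the split is chronological, no future observation leaks into the training sample.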

To give more weight to recent years, I decided to apply an exponentially weighted cost function to the model, implemented as sample weights that decay with the age of each observation.
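
To show what these weights look like, the sketch below evaluates $w = e^{-(2018 - \text{year})/10}$ for each training year. The decay rate 1/10 and the reference year 2018 are the values used in the model cell of this notebook; the variable names are only for illustration.

```python
import numpy as np

years = np.array([2013, 2014, 2015, 2016, 2017])
reference_year = 2018  # first year of the forecast period
decay = 1 / 10         # decay rate used in the model cell

# Older years get exponentially smaller weights
year_weights = np.exp(-decay * (reference_year - years))

for yr, w in zip(years, year_weights):
    print(yr, round(float(w), 4))
# 2013 0.6065
# 2014 0.6703
# 2015 0.7408
# 2016 0.8187
# 2017 0.9048
```

With this decay rate the oldest year still counts for about two thirds as much as the most recent one, so the weighting is fairly mild.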

I will use a GAM model:

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) +...+ f_n(x_n)$

where:
- g is the link function;

- $f_i$ are smooth functions built using penalized B-splines;

- the $x_i$ will be: year, month and day of the week.
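
To make the formula concrete, here is a minimal sketch of how the additive predictor is turned into an expected value under an identity link and under a log link. The effect values are made up for illustration and are not fitted results, and the function names are hypothetical. In pyGAM the $f_i$ are estimated from the data with splines, whereas here they are plain lookup tables.

```python
import numpy as np

# Made-up effects on the link scale (assumptions, not fitted values)
beta_0 = 3.0
f_year = {2016: -0.05, 2017: 0.00}
f_month = {1: -0.30, 7: 0.25}
f_dayofweek = {0: -0.15, 6: 0.20}

def linear_predictor(year, month, dayofweek):
    """beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3)"""
    return beta_0 + f_year[year] + f_month[month] + f_dayofweek[dayofweek]

eta = linear_predictor(2017, 7, 6)  # 3.0 + 0.00 + 0.25 + 0.20

# Identity link: g(mu) = mu, so the effects add up directly
mu_identity = eta

# Log link: g(mu) = log(mu), so mu = exp(eta) and the effects multiply
mu_log = np.exp(eta)
product = np.exp(beta_0) * np.exp(f_year[2017]) * np.exp(f_month[7]) * np.exp(f_dayofweek[6])

print(round(mu_identity, 2), np.isclose(mu_log, product))
```

Under the log link each feature scales the expected sales by a factor, which suits seasonality that grows with the overall sales level.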

First I will fit a linear GAM, with an identity link and a normal distribution, to look at the residual plot and understand the behavior of the model.


After running the linear model with default settings, the residual plot showed clear nonlinearity, so I decided to use the logarithm as the link function, while keeping the default distribution.
As the patterns were really simple, I could reduce the number of splines, their order and the penalty of the model; with so few splines the risk of overfitting stays low.
Running the new model, I detected some days where the residuals were quite high, so I decided to apply an additional penalty to those days by lowering their sample weight.
For simplicity, I measured the residuals on the total daily sales instead of considering each store and item separately.
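
One way to find such days is sketched below on synthetic daily totals: standardize the residuals and flag the days whose z-score is large in absolute value. The data, the threshold of 3 and the variable names are all assumptions made for this illustration. The notebook itself lists the outlier dates by hand in the performance function.

```python
import numpy as np

n_days = 200
# Synthetic daily totals: a flat level with a small deterministic wiggle
actual_daily = 1000 + 20 * np.sin(np.arange(n_days))
predicted_daily = np.full(n_days, 1000.0)

# Inject two artificial spikes to play the role of unusual days
actual_daily[[50, 120]] += 400

# Standardized residuals (same as scipy.stats.zscore with its default ddof=0)
residuals = actual_daily - predicted_daily
z = (residuals - residuals.mean()) / residuals.std()

threshold = 3.0  # assumed cut-off, not taken from the notebook
flagged = np.flatnonzero(np.abs(z) > threshold)
print(flagged)  # [ 50 120]

# Flagged days get weight 0, all other days weight 1
outlier_weights = np.where(np.abs(z) > threshold, 0, 1)
```

In the model cell these 0/1 weights are averaged with the year weights, so a flagged day is down-weighted rather than removed entirely.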

In [None]:
df = pd.concat([train, test.drop(columns='id')], axis=0)

#### Creating SMAPE evaluation

In [None]:
def smape(A, F):
    # A: actual values, F: forecasts; returns SMAPE in percent
    return 100/len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))

#### Creating a function to divide training and test sample

In [None]:
def split_func (orig_df, X, y, end_date, test_size):
    
    idx_train, idx_test = train_test_split(orig_df.index, test_size=test_size, shuffle=False)
    X_train, X_test = X.loc[idx_train, :], X.loc[idx_test, :]
    y_train, y_test = y.loc[idx_train], y.loc[idx_test]
    
    return X_train, y_train, X_test, y_test

#### Performance function

In [None]:
def gam_func (orig_df, end_df, n):
    
    # Work on a copy so that the caller's frame is not modified
    df = orig_df.loc[:end_df,:].copy()
    
    # Outliers: days with large residuals, flagged with 1 (0 otherwise)
    fil = ['2014-06-15', '2014-06-21', '2015-03-20', '2015-04-08' ,'2016-04-02']
    
    df['outliers'] = df.index.isin(fil).astype(int)
    
    # Empty arrays
    if end_df <= date['date_end_train']:
        y_tr = np.empty((diff_train-n+1))
        y_te = np.empty((n))
        pred_train_y = np.empty((diff_train-n+1))
        pred_test_y = np.empty((n))
    else:
        y_tr = np.empty((diff_test-n+1))
        y_te = np.empty((n))
        pred_train_y = np.empty((diff_test-n+1))
        pred_test_y = np.empty((n))
    
    # Iterating over stores and items
    for i in df.store.unique():
        for j in df.item.unique():
            y = df.loc[(df.item==j) & (df.store==i),:].sales
            X = df.loc[(df.item==j) & (df.store==i),:].drop(columns=['store','item','sales'])

            X_train, y_train, X_test, y_test = split_func(y, X, y, end_df, n)
            
            # Exponentially weighted cost function
            y_weights = X_train.year.apply(lambda x: np.exp((-1/10)*(2018 - x)))
            
            # Penalty on outliers
            o_weights = X_train.outliers.apply(lambda x: 0 if x==1 else 1)
            
            # Total weights
            weights = (y_weights+o_weights)/2
            
            # Features:
            # year:0, quarter:1, month:2, day:3, dayofweek:4, weekofyear:5,
            #isweekend:6, issunday:7, issaturday:8, daysinmonth:9
            
            # The model
            model = GAM(s(0,n_splines=5,spline_order=1)+
                        s(2,n_splines=12,spline_order=1)+
                        s(4,n_splines=7,spline_order=1),
                        link='log', lam=0)
            
            model.fit(X_train.values, y_train.values, weights=weights)
                
            gam_pred_train_y = model.predict(X_train) 
            gam_pred_test_y = model.predict(X_test)
            
            # Store and item SMAPE
            if end_df <= date['date_end_train']:
                print(f'SMAPE_train st_{i},item_{j}: ', np.round(smape(y_train.clip(0.0), gam_pred_train_y.clip(0.0)), 4), f'SMAPE_test st_{i},item_{j}: ', np.round(smape(y_test.clip(0.0), gam_pred_test_y.clip(0.0)), 4))
            else:
                print(f'SMAPE_train st_{i},item_{j}: ', np.round(smape(y_train.clip(0.0), gam_pred_train_y.clip(0.0)), 4))
    
            y_tr = np.vstack([y_tr, y_train.values])
            y_te = np.vstack([y_te, y_test.values])
            pred_train_y = np.vstack([pred_train_y, gam_pred_train_y])
            pred_test_y = np.vstack([pred_test_y, gam_pred_test_y])

    iterables = [df.store.unique(), df.item.unique()]
    index = pd.MultiIndex.from_product(iterables, names=['store', 'item'])
    
    y_tr = pd.DataFrame(y_tr[1:,:], index=index, columns=pd.date_range(start=date['date_start_fore'], end=(pd.to_datetime(end_df)-pd.DateOffset(n)))).T
    y_te = pd.DataFrame(y_te[1:,:], index=index, columns=pd.date_range(start=(pd.to_datetime(end_df)-pd.DateOffset(n-1)), end=end_df)).T
    pred_train_y = pd.DataFrame(pred_train_y[1:,:], index=index, columns=pd.date_range(start=date['date_start_fore'], end=(pd.to_datetime(end_df)-pd.DateOffset(n)))).T
    pred_test_y = pd.DataFrame(pred_test_y[1:,:], index=index, columns=pd.date_range(start=(pd.to_datetime(end_df)-pd.DateOffset(n-1)), end=end_df)).T
    
    # Total SMAPE
    if end_df <= date['date_end_train']:
        print('SMAPE_train tot: ', np.round(smape(y_tr.clip(0.0), pred_train_y.clip(0.0)).mean(), 4), 'SMAPE_test tot: ', np.round(smape(y_te.clip(0.0), pred_test_y.clip(0.0)).mean(), 4))
    else:
        print(f'SMAPE_train tot: ', np.round(smape(y_tr.clip(0.0), pred_train_y.clip(0.0)).mean(), 4)) 
   
    y_tr = y_tr.stack(['store', 'item'])
    y_te = y_te.stack(['store', 'item'])
    y = pd.concat([y_tr, y_te])
    
    pred_train_y = pred_train_y.stack(['store', 'item'])
    pred_test_y = pred_test_y.stack(['store', 'item'])
    pred_y = pd.concat([pd.Series(pred_train_y).apply(lambda x: 0 if x<0 else x), pd.Series(pred_test_y).apply(lambda x: 0 if x<0 else x)])
    
    # Some plots
    if end_df <= date['date_end_train']:

        fig, axes = plt.subplots(2, 2, figsize=(15,10))
        y.loc['2017':,:].reset_index().groupby(by='level_0')[0].sum().plot(ax=axes[0,0], color="red")
        pred_y.loc['2017':,:].reset_index().groupby(by='level_0')[0].sum().plot(ax=axes[0,0], color="orange")
    
        y.loc['2017-12':,:].reset_index().groupby(by='level_0')[0].sum().plot(ax=axes[0,1], color="red")
        pred_y.loc['2017-12':,:].reset_index().groupby(by='level_0')[0].sum().plot(ax=axes[0,1], color="orange")
        
        (y.loc['2017-01':,:].reset_index().groupby(by='level_0')[0].sum() - pred_y.loc['2017-01':,:].reset_index().groupby(by='level_0')[0].sum()).plot(ax=axes[1,1])
        
        sns.residplot(ax=axes[1,0], x=pred_y.reset_index().groupby(by='level_0')[0].sum(), 
                      y=zscore(y.reset_index().groupby(by='level_0')[0].sum() - pred_y.reset_index().groupby(by='level_0')[0].sum()), 
                      lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})  
        
        fig.tight_layout()
        res = zscore(y.reset_index().groupby(by='level_0')[0].sum() - pred_y.reset_index().groupby(by='level_0')[0].sum())
        
        return res 
    else:
        return pred_test_y, pred_y, y
    

In [None]:
#X = gam_func(df, date['date_end_train'], 180)

In [None]:
#X = gam_func(df, date['date_end_train'], 120)

In [None]:
X = gam_func(df, date['date_end_train'], 90)

In [None]:
%%time
gam_pred_test_y, gam_pred_tot_y, orig_y = gam_func(df, date['date_end_test'], 90)

In [None]:
y = gam_pred_tot_y.reset_index().rename(columns={'level_0':'date', 0:'sales'}).set_index('date')

In [None]:
sub = test.merge(y, on=['date','store','item']).loc[:,['id','sales']]

In [None]:
sub['sales'] = sub['sales'].apply(lambda x: 0 if x<0 else x) 

In [None]:
sub = sub.sort_values(by='id').reset_index(drop=True)

In [None]:
sub = np.round(sub)

# Final result

In [None]:
sub

In [None]:
sub.to_csv('submission.csv', index=False)