# Table of Contents
1. [Introduction](#Introduction)
2. [Data Loading And Utilities](#Utils)
3. [EDA](#EDA)
4. [Features Extraction](#FE)
5. [Prediction Methods](#PM)

    5.1 [Mean](#mean)
    
    5.2 [Random Forest Regressor](#RandomForest)
    
    5.3 [LSTM](#LSTM)

    5.4 [ARIMA](#ARIMA)
    
    5.5 [SARIMAX](#SARIMAX)











# Introduction <a class="anchor" id="Introduction"></a>
* This notebook aims to perform an EDA of the provided dataset and to find the features in the data that can help make a decision about the next Kaggle store.
* If you find it useful, please upvote to keep me motivated for more additions to it. 




# DATA access and other utilities<a class="anchor" id="Utils"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df_trn = pd.read_csv("/kaggle/input/tabular-playground-series-jan-2022/train.csv")
df_tst = pd.read_csv("/kaggle/input/tabular-playground-series-jan-2022/test.csv")
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Let's try to understand the dataset provided for this competition. The training dataset has ~26K entries and 6 columns, including:
* date: the date on which a given product was sold.
* country: the country in which the product was sold.
* store: the store that sold the product.
* product: the product that was sold.
* num_sold: how many pieces of a given product were sold on a given day, in a given country, by a given store.

In [None]:
df_trn

## The test dataset is as follows:
* We have 6569 entries in the test dataset. 
* Note that the num_sold column is missing: this is the number we need to predict.

### Plan of action:
* I am going to start with a couple of basic models as a reference, and then use more advanced models to see how much I can improve the prediction of the total sales.

In [None]:
df_tst

# EDA<a class="anchor" id="EDA"></a>



### Let's explore the training dataset
1. There are three unique products: ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']
2. There are two stores.

In [None]:
print (df_trn["product"].unique())
print (df_trn["country"].unique())

In [None]:
df_trn["country"].unique()

In [None]:
len(df_trn)

In [None]:
store_list=df_trn["store"].unique()
product_list=df_trn["product"].unique()
country_list=df_trn["country"].unique()


In [None]:
df_trn.groupby("country").num_sold.sum()


* This function groups the dataframe by the column passed as an argument and returns a dictionary. Each key is a unique value of that column, and the corresponding value is the dataframe containing the matching rows.

In [None]:
def getGroupedDataFrames(df, columnName):
    # Map each unique value of `columnName` to the sub-dataframe of matching rows
    df_dictionary={}
    for key, group in df.groupby(columnName):
        df_dictionary[key] = group
    return df_dictionary
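
A minimal usage sketch of this grouping idea on a toy dataframe. The data is invented for illustration, and `group_to_dict` is a hypothetical stand-in that follows the same idea as `getGroupedDataFrames`, so the block can run on its own:

```python
import pandas as pd

def group_to_dict(df, column_name):
    # Hypothetical stand-in for getGroupedDataFrames:
    # one dictionary entry per unique value of `column_name`.
    return {key: group for key, group in df.groupby(column_name)}

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "store": ["KaggleMart", "KaggleRama", "KaggleMart", "KaggleRama"],
    "num_sold": [10, 20, 30, 40],
})

toy_by_store = group_to_dict(toy, "store")
print(sorted(toy_by_store))                          # the store names are the keys
print(toy_by_store["KaggleMart"]["num_sold"].sum())  # total for one store
```

Each dictionary value is an ordinary dataframe, so it can be grouped again by another column, which is how the nested store/country structure below is built.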

* Let's separate the dataframe by store and by country, so in total we now have 2x3=6 dataframes, saved in a single object. 

In [None]:
df_grp_store = getGroupedDataFrames(df_trn,"store")
df_country={}
for istore in store_list:
    df_country[istore] = getGroupedDataFrames(df_grp_store[istore],"country")



* Let's check the overall sales from each store in each country for each product. 
* This is not very helpful for seeing the evolution over time, but it does tell which store sold more and in which product category. 

In [None]:
for istore in store_list:
    for icountry in country_list:
        print (istore, icountry)
        print (df_country[istore][icountry].groupby("product").num_sold.sum())
        print ("------")

* To better see the evolution of sales of these products in each store and each country, we need to look at the time series of these sales. 
* Let's plot them for each combination using seaborn. 

In [None]:
import seaborn as sns
sns.set(rc = {'figure.figsize':(20,8)})

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleMart"]["Sweden"],
#            ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleMart"]["Finland"],ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleMart"]["Norway"], ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleRama"]["Sweden"],ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleRama"]["Finland"],ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_country["KaggleRama"]["Norway"],ci=None)

In [None]:
#sns.lineplot(x="date", y="num_sold",hue="product",
#             data=df_grp_store["KaggleRama"],ci=None)


#### As a next step, let's check the distribution of these sales per year and compare them per product, per country, and per store. 
* This will give insight into the seasonality in the data. 
* Basically, I plot the data in seasonal form, i.e. against the day of the year. As can be seen from the previous time series plots, the overall trend is similar in each year. 

In [None]:
df=df_country["KaggleRama"]["Sweden"].copy()  # work on a copy so that new columns can be added without a SettingWithCopyWarning
df["year"] = pd.to_datetime(df.date).dt.year
df["day"] = pd.to_datetime(df.date).dt.day_of_year
df["day"]

#df["year"]


## Features in the seasonal plot: 
As can be seen in the time series plots for the 4 years, there is some seasonality in the data. These features are clearly visible in the seasonal plot. Note that the distribution shown here is only for KaggleRama::Sweden.
1. There are TWO major periods (and one minor, longer one) when sales of the products increase sharply. 
 - The first is around New Year. 
 - The second is in April, close to Easter, which is celebrated in the EU. 
 - The third is a broad increase in sales in May-June, likely due to the beginning of summer and the summer vacation. 
2. In addition to the festival peaks, there is one more feature in the sales of each product.
 - Small peaks occur at regular intervals, which at first look seem to be weekly. There are roughly 50-55 such peaks, so they seem to correspond to weekends. And why not, people do enjoy shopping with family over the weekend. 
 
 
 When we want to predict the sales of these products, we need to keep the following in mind: 
 1. End of year / New Year sales explosion 
 2. Easter blast 
 3. Summer leisure 
 4. Weekend fun with family and friends 
 5. The three products have completely different sales in each country, and their rise in sales is also different. This implies we must treat them independently when making predictions. The same is true for each country. 
  - Once the predictions for each product and country are made, we must sum them to get the total number, instead of predicting the total sales in one go. 
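
One way to check the weekend effect numerically is to average sales per weekday. The sketch below uses a synthetic series (base level and weekend uplift are invented), not the competition data; on the real data, the same `groupby` on a weekday column would show whether Saturdays and Sundays stand out:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales (invented): base level 100, 150 on Saturdays and Sundays
dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
toy = pd.DataFrame({"date": dates})
toy["weekday"] = toy["date"].dt.weekday              # Monday=0 ... Sunday=6
toy["num_sold"] = np.where(toy["weekday"] > 4, 150, 100)

# Average sales per weekday: a weekend effect shows up as higher values for 5 and 6
weekday_mean = toy.groupby("weekday")["num_sold"].mean()
print(weekday_mean)
```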
  
 

In [None]:
sns.lineplot(data=df, 
             x='day', 
             y='num_sold', 
             hue='year', 
             style="product",
             legend='full',ci=None)


## Rolling Average 
Let's check the rolling average using the dataframe.rolling function. I am trying rolling averages with various windows, say 5, 7 and 8 days, to see the features in the time series data. Let's start with the Kaggle Hat data.

* The 5-day and 8-day rolling averages do not smooth out the weekly fluctuations. 
* However, with the 7-day rolling average the distribution is much smoother and the weekly fluctuations are gone, because every 7-day window covers exactly one full week. 
* The situation is similar for the remaining products, countries and stores. 

Let's see how we can use all this information. 
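
The effect of the window length can be reproduced on a synthetic series with a purely weekly pattern. The numbers below are invented for illustration; only `Series.rolling(...).mean()` is the same call as in the notebook:

```python
import numpy as np
import pandas as pd

# Synthetic series (invented): 10 full weeks with a fixed weekly pattern
weekly_pattern = [100, 100, 100, 100, 100, 150, 150]
sales = pd.Series(np.tile(weekly_pattern, 10), dtype=float)

spread = {}
for window in (5, 7, 8):
    smoothed = sales.rolling(window).mean().dropna()
    # Spread between the largest and smallest smoothed value:
    # zero means the weekly fluctuation has been averaged out completely
    spread[window] = round(smoothed.max() - smoothed.min(), 2)

print(spread)
```

Only the 7-day window gives a flat line here, since each window then contains the same five weekdays and two weekend days.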

In [None]:
## get the Kaggle Hat dataframe instead of all 3 products
df_kaggle_hat = df[(df["product"]=="Kaggle Hat")].copy()  # copy so that columns can be added safely below
df_kaggle_hat

In [None]:
df_kaggle_hat["avg5"] = df_kaggle_hat.num_sold.rolling(5).mean()
df_kaggle_hat["avg7"] = df_kaggle_hat.num_sold.rolling(7).mean()
df_kaggle_hat["avg8"] = df_kaggle_hat.num_sold.rolling(8).mean()
df_kaggle_hat

In [None]:
sns.lineplot(data=df_kaggle_hat, 
             x='day', 
             y='avg7', 
             hue='year', 
             legend='full',ci=None)



In [None]:
sns.lineplot(data=df_kaggle_hat, 
             x='day', 
             y='avg5', 
             hue='year', 
             legend='full',ci=None)



In [None]:
sns.lineplot(data=df_kaggle_hat, 
             x='day', 
             y='avg8', 
             hue='year', 
             legend='full',ci=None)



I am going to try some of the trivial methods to forecast the sales for the next year and compare them. 
1. The first method tried is the average method: it assumes that future sales will equal the average of the collected time series data. Quite straightforward!! 

# Features Extraction<a class="anchor" id="FE"></a>



## Categorical data 
There are a few features which are given in categorical form, e.g. country, product, store. It will make life easier if I encode these variables as numbers. 

In [None]:
country_map={"Finland":0,
             "Norway":1,
             "Sweden":2}

product_map={"Kaggle Mug":0,
             "Kaggle Hat":1,
             "Kaggle Sticker":2}
store_map={"KaggleRama":0,
           "KaggleMart":1}

def CategorialToQuantitative(df):
    df.replace(country_map,inplace=True)
    df.replace(store_map,inplace=True)
    df.replace(product_map,inplace=True)
    return df


In [None]:
df_trn = CategorialToQuantitative(df_trn)
df_tst = CategorialToQuantitative(df_tst)
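
An alternative sketch of the same encoding with `Series.map`, shown on toy data (invented for illustration). This is not the notebook's method, just a variant that encodes one column at a time:

```python
import pandas as pd

# Same mapping values as in the notebook
country_map = {"Finland": 0, "Norway": 1, "Sweden": 2}

# Toy data, invented for illustration only
toy = pd.DataFrame({"country": ["Sweden", "Finland", "Norway", "Sweden"]})

# Series.map touches only the selected column and yields NaN for labels
# missing from the mapping, which makes unexpected categories easy to spot
toy["country_code"] = toy["country"].map(country_map)
print(toy)
```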

## Extract time information from provided date 

In [None]:
def gettimeFeatures(df):
    dates = pd.to_datetime(df.date)  # parse the date column once and reuse it
    df["day_of_year"]=dates.dt.day_of_year
    df["day_of_month"] = dates.dt.day
    df["week"]=dates.dt.isocalendar().week
    df["quarter"]=dates.dt.quarter
    df["month"]=dates.dt.month
    df["year"]=dates.dt.year
    df["weekd"]=dates.dt.weekday
    df["weekend"]=(dates.dt.weekday>4).astype(int) ## weekday ranges from 0 (Monday) to 6 (Sunday)
    
    ## Easter (date: day of year)
    #2015: 5 April: 95
    #2016: 27 March: 87 (leap year)
    #2017: 16 April: 106
    #2018: 1 April: 91
    #2019: 21 April: 111 
    ## I still look for an easier way to do this, instead of hardcoding it 
    df.loc[df.month>-1,"easter"]=0 ## setting default 
    df.loc[ ( ( (df.year==2018) & ( abs(df.day_of_year-91)<10) ) | 
              ( (df.year==2017) & ( abs(df.day_of_year-106)<10)) | 
              ( (df.year==2016) & ( abs(df.day_of_year-87)<10) ) |
              ( (df.year==2015) & ( abs(df.day_of_year-95)<10) ) |
              ( (df.year==2019) & ( abs(df.day_of_year-111)<10) )
            ),"easter"]=1

    ## Year End 
    df.loc[df.month>0,"year_end"]=0 ## setting default
    df.loc[ ( (  (df.month==12) & (df.day_of_month>22) ) |
                 (  (df.month==1) &  (df.day_of_month <8) ) ),"year_end"]=1

    ## Summer 
    df.loc[df.day_of_year>-1,"summer"]=0 ## ## setting default
    df.loc[ ( (df.day_of_year>125) & (df.day_of_year<190) ),"summer"] =1



    ## we don't need date anymore 
    df.drop(["date"],axis=1,inplace=True)
    return df 

In [None]:
df_trn = gettimeFeatures(df_trn)
df_tst = gettimeFeatures(df_tst)
df_tst
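
The Easter dates used in `gettimeFeatures` do not have to be hardcoded. A minimal sketch, assuming only the Python standard library: `easter_sunday` is a hypothetical helper (not part of the original notebook) that implements the anonymous Gregorian (Meeus/Jones/Butcher) algorithm for Western Easter. The python-dateutil package offers `dateutil.easter.easter(year)` as a ready-made alternative.

```python
from datetime import date

def easter_sunday(year):
    """Western (Gregorian) Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    el = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * el) // 451
    month, day = divmod(h + el - 7 * m + 114, 31)
    return date(year, month, day + 1)

# Easter date and its day of the year for the years covered by the data
for year in range(2015, 2020):
    easter = easter_sunday(year)
    print(year, easter, easter.timetuple().tm_yday)
```

The resulting day of the year could then replace the fixed numbers in the `abs(df.day_of_year - ...) < 10` conditions.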

## Metric for submission 

In [None]:
def SMAPE(y_true, y_pred):
    diff = np.abs(y_true - y_pred) / (y_true + np.abs(y_pred)) * 200
    return diff.mean()
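
A small worked example of this metric on invented numbers. `smape_demo` is a hypothetical copy of the formula above, repeated so that the block runs on its own:

```python
import numpy as np

def smape_demo(y_true, y_pred):
    # Same formula as SMAPE above:
    # 200 * |y_true - y_pred| / (y_true + |y_pred|), averaged over all points
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return (np.abs(y_true - y_pred) / (y_true + np.abs(y_pred)) * 200).mean()

# Invented numbers: a perfect forecast scores 0, errors are measured in percent
perfect = smape_demo([100, 200], [100, 200])
off_by_ten_percent = smape_demo([100, 200], [110, 180])
print(perfect, off_by_ten_percent)
```

Lower is better, and the score is symmetric in the sense that the error is scaled by the sum of the true and predicted values rather than by the true value alone.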


# Prediction Algorithms<a class="anchor" id="PM"></a>
## Mean <a class="anchor" id="mean"></a>



In [None]:
df_grp_store = getGroupedDataFrames(df_trn,"store")
df_country={}
for istore in df_trn.store.unique():
    df_country[istore] = getGroupedDataFrames(df_grp_store[istore],"country")

df_mean = pd.DataFrame()
for istore in range (0,2):
    for icountry in range (0,3):
        ## mean num_sold per product for this store and country
        df_mean_tmp = pd.DataFrame(df_country[istore][icountry].groupby("product")["num_sold"].mean())
        df_mean_tmp["country"]=icountry
        df_mean_tmp["store"]=istore
        #print (df_mean_tmp)
        df_mean = pd.concat([df_mean,df_mean_tmp])

## "product" is the index after the groupby; make it a regular column again for the merge
df_mean = df_mean.reset_index()
df_out = df_tst.merge(df_mean,on=['product','country','store'])
## drop all feature columns so that only the submission columns remain
df_out.drop(["country","store","product", "day_of_year", "week", "quarter", "month", "year", "weekd", "weekend", "day_of_month", "easter", "year_end", "summer"],axis=1,inplace=True)
#df_out.to_csv("submission.csv",index=False)
#df_out.shape



### Result with mean: 

* When sales are set to the mean of past data, per store, per product and per country, the score on the test data is 13.04.
* I will skip the naive, snaive and other trivial methods, and move on to simple models to compare the results. 

Let's try to use the engineered variables in a Random Forest regressor and XGBoost to see the improvement. 

## Random Forest <a class="anchor" id="RandomForest"></a>



In [None]:
df_tst.columns

In [None]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
vars=['country', 'store', 'product', 'day_of_year', 'week',
       'quarter', 'month', 'year', 'weekd', 'weekend', 'day_of_month',
       'easter', 'year_end', 'summer']
X = df_trn[vars]
Y = df_trn['num_sold']
clf.fit(X,Y)
X_tst = df_tst[vars]
Y_tst=clf.predict(X_tst)


In [None]:
df_tst["num_sold"] = Y_tst
df_out = df_tst.drop(["country","store","product", "day_of_year", "week", "quarter", "month", "year", "weekd", "weekend",'day_of_month', 'easter', 'year_end', 'summer'],axis=1)

In [None]:
df_out.to_csv("submission.csv",index=False)
df_out.shape



### Result of Random Forest Regressor 
* The first iteration with the Random Forest Regressor (RFR) gives a huge improvement with respect to the mean: the score is 7.249. 

## Splitting the data 
* Let's separate the training data into two parts: the first 3 years for training and the remaining year for testing, so that I can get the score without submitting the prediction. 

In [None]:
df_train = df_trn[df_trn.year<2018]
df_test  = df_trn[df_trn.year==2018]

X_Train, Y_Train = df_train[vars], df_train["num_sold"]
X_Test, Y_True          = df_test[vars], df_test["num_sold"]

clf.fit(X_Train,Y_Train)
Y_Test=clf.predict(X_Test)



In [None]:
SMAPE(Y_True,Y_Test)

## LSTM <a class="anchor" id="LSTM"></a>

LSTM to predict the sales 

In [None]:
import keras
import matplotlib.pyplot as plt
from keras.layers import Dense,Dropout,LSTM
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler 



In [None]:
X_Train = X_Train.astype(float)
X_Test = X_Test.astype(float)
Y_Train = Y_Train.astype(float)
Y_True = Y_True.astype(float)

#scaler  = StandardScaler()
#scaler = scaler.fit(X_Train)
#X_Train_scaled = scaler.transform(X_Train)
#X_Train_scaled.shape, type(X_Train_scaled)

#scaler1  = StandardScaler()
#scaler1 = scaler1.fit(X_Test)
#X_Test_scaled = scaler1.transform(X_Test)
#X_Test_scaled.shape



In [None]:
X_Train_scaled = np.array(X_Train)
X_Test_scaled = np.array(X_Test)



In [None]:
'''
X_Train_LSTM = []
Y_Train_LSTM = [] 
X_Test_LSTM = []
Y_True_LSTM = [] 


n_future = 1 
n_past = 365
for i in range (n_past, len(X_Train_scaled)-n_future+1):
    X_Train_LSTM.append(X_Train_scaled[i-n_past:i, 0:X_Train_scaled.shape[1] ])
    Y_Train_LSTM.append(X_Train_scaled[i+n_future-1:i+n_future,0])

for i in range (n_past, len(X_Test_scaled)-n_future+1):
    X_Test_LSTM.append(X_Test_scaled[i-n_past:i, 0:X_Test_scaled.shape[1] ])
    Y_True_LSTM.append(X_Test_scaled[i+n_future-1:i+n_future,0])


X_Train_LSTM, Y_Train_LSTM = np.array(X_Train_LSTM), np.array(Y_Train_LSTM)
X_Test_LSTM, Y_True_LSTM = np.array(X_Test_LSTM), np.array(Y_True_LSTM)
'''
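
Sliding windows over past values are a common way to shape a series for an LSTM. The sketch below is only an illustration under stated assumptions: `make_windows` is a hypothetical helper, the series is invented, and the target is taken from the series itself. It is not the approach used in the cells that follow, which reshape the feature columns instead.

```python
import numpy as np

def make_windows(series, n_past, n_future=1):
    # Hypothetical helper: each sample holds `n_past` consecutive values,
    # the target is the value `n_future` steps after the end of the window.
    X, y = [], []
    for i in range(n_past, len(series) - n_future + 1):
        X.append(series[i - n_past:i])
        y.append(series[i + n_future - 1])
    # Keras LSTM layers expect input of shape (samples, time steps, features)
    X = np.array(X).reshape(-1, n_past, 1)
    return X, np.array(y)

toy_sales = np.arange(10, dtype=float)   # invented series: 0, 1, ..., 9
X, y = make_windows(toy_sales, n_past=3)
print(X.shape, y.shape)
print(X[0].ravel(), y[0])                # first window and its target
```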

In [None]:
X_Train_LSTM = X_Train_scaled.reshape(X_Train_scaled.shape[0],X_Train_scaled.shape[1],1)
Y_Train_LSTM = np.array(Y_Train)
X_Test_LSTM = X_Test_scaled.reshape(X_Test_scaled.shape[0],X_Test_scaled.shape[1],1)
#Y_True_LSTM = 
#X_Test_LSTM.shape
Y_Train_LSTM.shape, X_Train_LSTM.shape, Y_Train_LSTM

In [None]:
model = Sequential()
model.add(LSTM(units = 50 , return_sequences=True , input_shape = (X_Train_LSTM.shape[1], 1 )))

model.add(Dropout(0.2))
# Second LSTM layer
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
# Third LSTM layer
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
# Fourth LSTM layer
model.add(LSTM(units=50))
model.add(Dropout(0.2))
# The output layer
model.add(Dense(units=1))
model.compile(optimizer='adam',loss='mean_squared_error')


In [None]:
model.summary()

In [None]:
history = model.fit(X_Train_LSTM,Y_Train_LSTM,batch_size=32 , epochs=5)

In [None]:
forecast = model.predict(X_Test_LSTM)
forecast = pd.Series(forecast.reshape(forecast.shape[0]))

In [None]:
y_true = np.array(Y_True).reshape(-1)  # flatten to 1-D to match the forecast
SMAPE(y_true,forecast)




### Results of LSTM 
* As can be seen, the primitive LSTM method does not work well: the score is 53, whereas the Random Forest gives much better results. 

Methods tried so far:

1. mean 
2. random forest regressor 
3. LSTM 

From these initial investigations, RFR seems to be the best with the present set of features.

In the next version I plan to test ARIMA, SARIMA, and SARIMAX, and a hybrid model with Prophet. 

## ARIMA <a class="anchor" id="ARIMA"></a>
* ARIMA stands for AutoRegressive Integrated Moving Average. It was fun reading this book to understand time series analysis (TSA) and prediction: https://otexts.com/fpp3/arima.html 

In [None]:
from pandas.plotting import autocorrelation_plot, lag_plot
df_sel = df_train[ ( (df_train.country==0) & (df_train.store==0) & (df_train["product"]==0) )]
series  = df_sel.num_sold
lag_plot(series, lag=1)


In [None]:
autocorrelation_plot(series)

In [None]:
'''
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize':(9,7), 'figure.dpi':120})
fig, axes = plt.subplots(3, 2, sharex=True)

axes[0, 0].plot(series); axes[0, 0].set_title('Original Series')
plot_acf(series, ax=axes[0, 1])

# 1st Differencing
axes[1, 0].plot(series.diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(series.diff().dropna(), ax=axes[1, 1])

# 2nd Differencing
axes[2, 0].plot(series.diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(series.diff().diff().dropna(), ax=axes[2, 1])
'''


In [None]:
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot
model = ARIMA(series, order=(5,1,1))
model_fit = model.fit()
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
# density plot of residuals
residuals.plot(kind='kde')
pyplot.show()


* The residual error plot shows that there are features in the data which are not picked up well by the model with the defined parameter set (5,1,1). 
* The density plot of the errors shows that they are almost normally (Gaussian) distributed, but the mean is slightly shifted to lower values, i.e. slightly smaller than 0, with a width of about 200. Note that this is not a proper Gaussian: it has a slow fall on the right. 

### For forecasting 

In [None]:
series_test = df_test[ ( (df_test.country==0) & (df_test.store==0) & (df_test["product"]==0) )].num_sold
series_out = model_fit.forecast(365, alpha=0.05)  # 95% conf

In [None]:
series_out = series_out.reset_index().drop("index",axis=1)
series_test = series_test.reset_index().drop("index",axis=1)

In [None]:
SMAPE(series_test.num_sold,series_out.predicted_mean)

* It can be seen clearly that ARIMA is not able to pick up the seasonality of the data, so it makes sense to use SARIMA instead. In addition to the series itself, I would like to use the other information available, so the natural choice is SARIMAX. Let's take a look. 
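
To illustrate what the seasonal part adds, the sketch below applies lag-7 differencing to a synthetic series (trend and weekly pattern are invented). This is only the differencing step that a seasonal model with a period of 7 days would build on, not a SARIMA fit:

```python
import numpy as np
import pandas as pd

# Synthetic series (invented): linear trend plus a fixed weekly pattern
n_days = 70
trend = 0.5 * np.arange(n_days)
weekly = np.tile([0, 0, 0, 0, 0, 50, 50], n_days // 7)
series = pd.Series(100 + trend + weekly)

# Lag-7 differencing compares each day with the same weekday one week earlier,
# which removes a stable weekly pattern and leaves only the weekly trend increment
seasonal_diff = series.diff(7).dropna()
print(seasonal_diff.unique())
```

A plain first difference (`series.diff()`) would still carry the jumps into and out of the weekend, which is the kind of structure a non-seasonal ARIMA struggles with.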


## SARIMAX <a class="anchor" id="SARIMAX"></a>
Coming soon


