# Introduction 
`V1.0.1 - 2021-10-30`
### Who am I
Just a fellow Kaggle learner. I created this Notebook as practice and thought it could be useful to others.
### Who is this for
This Notebook is for people who learn from examples. Forget the boring lectures and follow along for a fun and instructive time :)
### What can I learn here
You will learn the basics needed to build a rudimentary Exploratory Data Analysis (EDA) and a multivariate LSTM network. I go over each step with explanations. Hopefully, with these building blocks,
you can go on to build much more complex models.

### Things to remember
+ Please Upvote/Like the Notebook so other people can learn from it
+ Feel free to give any recommendations/changes. 
+ I will be continuously updating the notebook. Look forward to many more upcoming changes in the future.

# File Descriptions and Data Field Information
## train.csv
- The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
- store_nbr identifies the store at which the products are sold.
- family identifies the type of product sold.
- sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.
## test.csv
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.
## sample_submission.csv
- A sample submission file in the correct format.
## stores.csv
- Store metadata, including city, state, type, and cluster.
- cluster is a grouping of similar stores.
## oil.csv
- Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
## transactions.csv
- Daily transactions per store
## holidays_events.csv
- Holidays and Events, with metadata
- NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12.
- Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up for by a day of type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
- Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).
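
The transfer rules above can be turned into a small filter. Here is a minimal sketch on a hand-made frame that mimics the columns of holidays_events.csv. The rows are illustrative only, and the helper name `effective_holidays` is my own, not something from the dataset:

```python
import pandas as pd

# Toy rows mimicking holidays_events.csv (illustrative values only)
holidays = pd.DataFrame({
    "date": ["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"],
    "type": ["Holiday", "Transfer", "Additional", "Work Day"],
    "description": ["Independencia de Guayaquil",
                    "Traslado Independencia de Guayaquil",
                    "Navidad-1",
                    "Recupero puente Navidad"],
    "transferred": [True, False, False, False],
})

def effective_holidays(df):
    """Return the dates on which people were actually off work."""
    # A transferred holiday behaves like a normal day, so drop it
    celebrated = df[~df["transferred"]]
    # A 'Work Day' is a make-up working day, not a day off
    celebrated = celebrated[celebrated["type"] != "Work Day"]
    return celebrated["date"].tolist()

print(effective_holidays(holidays))
```

On the toy rows this keeps 2012-10-12 (the day the transferred holiday was celebrated) and 2012-12-24, and drops both the original 2012-10-09 date and the make-up work day.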
## Additional Notes
- Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.
- A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other basic necessities, which greatly affected supermarket sales for several weeks after the earthquake.
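
Both notes can be expressed as simple date features. The sketch below is one possible way to do it, not something the rest of this notebook relies on. The four-week window after the earthquake is my own assumption ("several weeks" is all the description says), and the variable names are hypothetical:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2016-04-14", "2016-04-15", "2016-04-16", "2016-04-30", "2016-05-20"]))

# Payday: the 15th or the last day of the month
is_payday = (dates.dt.day == 15) | dates.dt.is_month_end

# Days elapsed since the earthquake (negative before it happened)
quake = pd.Timestamp("2016-04-16")
days_since_quake = (dates - quake).dt.days

# Assumed window: flag the four weeks following the earthquake
post_quake = days_since_quake.between(0, 27)

features = pd.DataFrame({"date": dates,
                         "is_payday": is_payday,
                         "post_quake": post_quake})
print(features)
```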

# Imports
First let us start by importing the relevant libraries that we need.

In [None]:
# Computational imports
import numpy as np   # Library for n-dimensional arrays
import pandas as pd  # Library for dataframes (structured data)

# Helper imports
import os 
import warnings
import pandas_datareader as web
import datetime as dt

# ML/DL imports
from keras.models import Sequential
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.layers import LSTM, Dense, Dropout, RepeatVector, TimeDistributed
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Plotting imports
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot

%matplotlib inline
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)

# Set seeds to make the experiment more reproducible.
from numpy.random import seed
seed(1)

# Reading the Data
Let's start by reading our data. We will store them in dataframes. 

In [None]:
path = '/kaggle/input/store-sales-time-series-forecasting/'

train_data = pd.read_csv(path+'train.csv', index_col=0)
test_data = pd.read_csv(path+'test.csv', index_col=0)
data_oil = pd.read_csv(path+'oil.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')
data_holi = pd.read_csv(path+'holidays_events.csv')
data_store =  pd.read_csv(path+'stores.csv')
data_trans = pd.read_csv(path+'transactions.csv')

# Exploratory Data Analysis (EDA)
## Exploring with Dataframe Analysis
Here we analyze each dataframe to understand the type of data we are working with. To do this, we create a simple function that we can call whenever we want to do a basic dataframe analysis.

In [None]:
def basic_eda(df):
    print("-------------------------------TOP 5 RECORDS-----------------------------")
    print(df.head(5))
    print("-------------------------------INFO--------------------------------------")
    df.info()  # info() prints its report itself and returns None
    print("-------------------------------Describe----------------------------------")
    print(df.describe())
    print("-------------------------------Columns-----------------------------------")
    print(df.columns)
    print("-------------------------------Data Types--------------------------------")
    print(df.dtypes)
    print("----------------------------Missing Values-------------------------------")
    print(df.isna().sum())  # isnull() is an alias of isna(), so one count is enough
    print("--------------------------Shape Of Data---------------------------------")
    print(df.shape)
    print("============================================================================ \n")

In [None]:
# A little bit of data exploration

print("=================================Train Data=================================")
basic_eda(train_data)
print("=================================Test data=================================")
basic_eda(test_data)
print("=================================Holidays events=================================")
basic_eda(data_holi)
print("=================================Transactions data=================================")
basic_eda(data_trans)
print("=================================Stores data=================================")
basic_eda(data_store)
print("=================================Oil data=================================")
basic_eda(data_oil)

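Luckily, the main tables show no missing values, which makes our lives easier. The one place worth double-checking in the output above is the `dcoilwtico` column of the oil data, since oil prices are not quoted on every calendar day.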
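
Should the output above reveal gaps in a daily series (the oil price is the usual candidate), they can be counted and filled in a couple of lines. This is a minimal sketch on toy values, assuming linear interpolation is acceptable. The numbers below are made up, and only the column name `dcoilwtico` comes from the dataset:

```python
import numpy as np
import pandas as pd

# Toy oil series with one missing quote and a skipped weekend (made-up values)
oil = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-03", "2013-01-04", "2013-01-07", "2013-01-08"]),
    "dcoilwtico": [93.0, np.nan, 93.3, 93.6],
})

# Count missing values per column
print(oil.isna().sum())

# Reindex to every calendar day, then interpolate linearly between known quotes
all_days = pd.date_range("2013-01-03", "2013-01-08", freq="D")
daily = (oil.set_index("date")
            .reindex(all_days)
            .interpolate(method="linear"))
print(daily)
```

Reindexing first matters: without it, the weekend days simply do not exist as rows, so a later merge on `date` would leave holes for them.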

## Exploring by plotting and analyzing graphs (using plotly)

### Sales Variation with store, family and clusters
To see the variation of sales per store, family and clusters in a subplot, we use plotly. 

In [None]:
# Creating one joined dataframe for visualization needs
df_visualization = train_data.merge(data_holi, on = 'date', how='left')
df_visualization = df_visualization.merge(data_oil, on = 'date', how='left')
df_visualization = df_visualization.merge(data_store, on = 'store_nbr', how='left')
df_visualization = df_visualization.merge(data_trans, on = ['date', 'store_nbr'], how='left')
df_visualization = df_visualization.rename(columns = {"type_x" : "holiday_type", "type_y" : "store_type"})

df_visualization['date'] = pd.to_datetime(df_visualization['date'])
df_visualization['year'] = df_visualization['date'].dt.year
df_visualization['month'] = df_visualization['date'].dt.month
df_visualization['week'] = df_visualization['date'].dt.isocalendar().week
df_visualization['quarter'] = df_visualization['date'].dt.quarter
df_visualization['day_of_week'] = df_visualization['date'].dt.day_name()
df_visualization[:3]

In [None]:
# data -------------------------------------------------------------------------------
df_store_sales = df_visualization.groupby('store_type').agg({"sales" : "mean"}).reset_index().sort_values(by='sales', ascending=False)
df_fam_sales = df_visualization.groupby('family').agg({"sales" : "mean"}).reset_index().sort_values(by='sales', ascending=False)[:10]
df_clus_sales = df_visualization.groupby('cluster').agg({"sales" : "mean"}).reset_index() 

# chart color -------------------------------------------------------------------------------
df_fam_sales['color'] = '#008000'
df_fam_sales.iloc[3:, df_fam_sales.columns.get_loc('color')] = '#00FF00'  # avoids chained assignment
df_clus_sales['color'] = '#00FF00'

# chart -------------------------------------------------------------------------------
fig = make_subplots(rows=2, cols=2, 
                    specs=[[{"type": "bar"}, {"type": "pie"}],
                           [{"colspan": 2}, None]],
                    column_widths=[0.7, 0.3], vertical_spacing=0, horizontal_spacing=0.02,
                    subplot_titles=("Top 10 Highest Product Sales", "Highest Sales in Stores", "Clusters Vs Sales"))

fig.add_trace(go.Bar(x=df_fam_sales['sales'], y=df_fam_sales['family'], marker=dict(color= df_fam_sales['color']),
                     name='Family', orientation='h'), 
                     row=1, col=1)
fig.add_trace(go.Pie(values=df_store_sales['sales'], labels=df_store_sales['store_type'], name='Store type',
                     marker=dict(colors=['#006400', '#008000','#228B22','#00FF00','#7CFC00','#00FF7F']), hole=0.7,
                     hoverinfo='label+percent+value', textinfo='label'), 
                    row=1, col=2)
fig.add_trace(go.Bar(x=df_clus_sales['cluster'], y=df_clus_sales['sales'], 
                     marker=dict(color= df_clus_sales['color']), name='Cluster'), 
                     row=2, col=1)

# styling -------------------------------------------------------------------------------
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)
fig.update_xaxes(visible=False, row=1, col=1)
fig.update_xaxes(tickmode = 'array', tickvals=df_clus_sales.cluster, ticktext=df_clus_sales.cluster, row=2, col=1)
fig.update_yaxes(visible=False, row=2, col=1)
fig.update_layout(height=500, bargap=0.2,
                  margin=dict(b=0,r=20,l=20), xaxis=dict(tickmode='linear'),
                  title_text="Average Sales Analysis",
                  template="plotly_white",
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)
fig.show()

This gives us an initial understanding of how sales vary with store type, product family, and cluster.

### Sales variation by month and year
To see the variation of sales per month and year in one plot, we use plotly. 

In [None]:
# data 
df_2013 = df_visualization[df_visualization['year']==2013][['month','sales']]
df_2013 = df_2013.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s13'})
df_2014 = df_visualization[df_visualization['year']==2014][['month','sales']]
df_2014 = df_2014.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s14'})
df_2015 = df_visualization[df_visualization['year']==2015][['month','sales']]
df_2015 = df_2015.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s15'})
df_2016 = df_visualization[df_visualization['year']==2016][['month','sales']]
df_2016 = df_2016.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s16'})
df_2017 = df_visualization[df_visualization['year']==2017][['month','sales']]
df_2017 = df_2017.groupby('month').agg({"sales" : "mean"}).reset_index()
df_2017_no = pd.DataFrame({'month': [9,10,11,12], 'sales':[0,0,0,0]})
df_2017 = pd.concat([df_2017, df_2017_no], ignore_index=True).rename(columns={'sales':'s17'})  # DataFrame.append was removed in pandas 2.0
df_year = df_2013.merge(df_2014,on='month').merge(df_2015,on='month').merge(df_2016,on='month').merge(df_2017,on='month')

# top labels
top_labels = ['2013', '2014', '2015', '2016', '2017']

colors = ['#2EB62C', '#57C84D',
          '#83D475', '#ABE098',
          '#C5E8B7']

# X axis value 
df_year = df_year[['s13','s14','s15','s16','s17']].replace(np.nan,0)
x_data = df_year.values

# y axis value (Month)
df_2013['month'] = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
y_data = df_2013['month'].tolist()

# create plotly figure
fig = go.Figure()
for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))

fig.update_layout(title='Avg Sales for each Year',
    xaxis=dict(showgrid=False, 
               zeroline=False, domain=[0.15, 1]),
    yaxis=dict(showgrid=False, showline=False,
               showticklabels=False, zeroline=False),
    barmode='stack',
    plot_bgcolor='#fff', 
    paper_bgcolor='#fff',
    margin=dict(l=0, r=50, t=100, b=10),
    showlegend=False, 
)

annotations = []
for yd, xd in zip(y_data, x_data):  
    # labeling the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='verdana', size=16,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    
    # labeling the first year (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='verdana ', size=16,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
        
    space = xd[0]
    for i in range(1, len(xd)):        
            # labeling the remaining years (on the top)
            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1,
                                        text=top_labels[i],
                                        font=dict(family='verdana ', size=16,
                                                  color='rgb(67, 67, 67)'),
                                        showarrow=False))
            space += xd[i]
            
fig.update_layout(
    annotations=annotations)
fig.show()

This graph shows that average sales are generally highest in December and that they increase from year to year. The December peak is plausibly driven by Christmas shopping. Note that the 2017 data ends in August, so the remaining months are plotted as zero for that year.

### Analyzing the relationship of Sales and Transactions amount with Oil price

In [None]:
data_oil.head()

In [None]:
plt.plot(data_oil.set_index('date').dcoilwtico, color='green', label="Oil Price")
plt.title("Oil Price vs Days")
plt.xlabel("Days")
plt.ylabel("Oil Price")
plt.legend()
plt.show()

We group by date and compute the mean values of sales and transactions for each day. We also add a 7-day exponentially weighted moving average (`ewm(span=7)`) to smooth out the day-to-day noise.

In [None]:
train_data_per_date = train_data.groupby('date').agg({'sales': 'mean'}).reset_index()
train_data_per_date['weekly_avg_sales'] = train_data_per_date['sales'].ewm(span=7, adjust=False).mean()

train_data_per_date.head()
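
For the curious, here is what `ewm(span=7, adjust=False).mean()` computes under the hood. With `adjust=False`, pandas uses the recursion `y[t] = (1 - alpha) * y[t-1] + alpha * x[t]` with `alpha = 2 / (span + 1)`. The toy numbers below are made up just to check this by hand:

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0])

span = 7
alpha = 2 / (span + 1)   # span=7 gives alpha=0.25

# Recursive definition used by pandas when adjust=False
manual = [sales.iloc[0]]
for value in sales.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)

smoothed = sales.ewm(span=span, adjust=False).mean()
print(manual)
print(smoothed.tolist())
```

So the column called `weekly_avg_sales` is not a plain 7-day average: recent days get more weight, and older days fade out gradually.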

In [None]:
transactions_per_day = data_trans.groupby('date').agg({'transactions': 'mean'}).reset_index()
transactions_per_day['weekly_avg_transactions'] = transactions_per_day['transactions'].ewm(span=7, adjust=False).mean()

transactions_per_day.head()

Let us compare sales and oil price.

In [None]:
fig=make_subplots()

fig.add_trace(go.Scatter(x=train_data_per_date.date,y=train_data_per_date.sales,name="Sales"))
fig.add_trace(go.Scatter(x=train_data_per_date.date,y=train_data_per_date.weekly_avg_sales,name="Weekly Sales"))


fig.add_trace(go.Scatter(x=data_oil.date,y=data_oil.dcoilwtico,name="Oil Price"))

fig.update_layout(autosize=True,width=900,height=500,title_text="Variation of Sales and Oil Price Through Time")
fig.update_xaxes(title_text="Days")
fig.update_yaxes(title_text="Prices")
fig.show()

Let us compare transactions and oil price.

In [None]:
fig=make_subplots()

fig.add_trace(go.Scatter(x=transactions_per_day.date,y=transactions_per_day.transactions,name="Transactions"))
fig.add_trace(go.Scatter(x=transactions_per_day.date,y=transactions_per_day.weekly_avg_transactions,name="Weekly Transactions"))

fig.add_trace(go.Scatter(x=data_oil.date,y=data_oil.dcoilwtico,name="Oil Price"))

fig.update_layout(autosize=True,width=900,height=500,title_text="Variation of Transactions and Oil Price Through Time")
fig.update_xaxes(title_text="Days")
fig.update_yaxes(title_text="Prices")
fig.show()

We can also create a correlation matrix to see the correlation between the various variables.

In [None]:
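# Align the three series on the calendar date before correlating them;
# assigning columns directly would match rows by position, not by date
df_corr = (data_oil
           .merge(train_data_per_date[['date', 'sales']], on='date', how='inner')
           .merge(transactions_per_day[['date', 'transactions']], on='date', how='inner'))

df_corr[['dcoilwtico', 'sales', 'transactions']].corr()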

Looking at the graphs and the correlation matrix, there appears to be no strong correlation between oil price and transactions. There is a slight inverse correlation between sales and oil price (which loosely reflects the economic state of the country).

# Training the model

In [None]:
path = '/kaggle/input/store-sales-time-series-forecasting/'

train_data = pd.read_csv(path+'train.csv', index_col=0)
test_data = pd.read_csv(path+'test.csv', index_col=0)
data_oil = pd.read_csv(path+'oil.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')
data_holi = pd.read_csv(path+'holidays_events.csv')
data_store =  pd.read_csv(path+'stores.csv')
data_trans = pd.read_csv(path+'transactions.csv')

## Explore the training data

In [None]:
basic_eda(train_data)

In [None]:
print(min(train_data['date']))
print(max(train_data['date']))

Let's find the numerical and categorical columns. This matters if you later want to scale your numerical data or encode your categorical data.

In [None]:
object_cols = [cname for cname in train_data.columns 
               if train_data[cname].dtype == "object" 
               and cname != "date"]

print("Categorical variables:")
object_cols 

In [None]:
num_cols = [cname for cname in train_data.columns 
            if train_data[cname].dtype in ['int64', 'float64']]

print("Numerical variables:")
num_cols 

In [None]:
all_cols = num_cols + object_cols
print(all_cols)

## Taking care of Categorical Features
Let us transform the categorical columns into numerical ones so that we can use them as features.

In [None]:
ordinal_encoder = OrdinalEncoder()
train_data[object_cols] = ordinal_encoder.fit_transform(train_data[object_cols])
train_data
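
To see what the encoder actually does, here is a small sketch on toy values (the three family names are just examples). With default settings, `OrdinalEncoder` sorts the categories it finds and replaces each value by its position in that sorted list:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"family": ["DAIRY", "BREAD/BAKERY", "DAIRY", "AUTOMOTIVE"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[["family"]])

# Categories are sorted, and each one is mapped to its position in that list
print(encoder.categories_[0])
print(codes.ravel())

# The mapping is reversible
restored = encoder.inverse_transform(codes)
print(restored.ravel())
```

One thing to keep in mind: the integer codes suggest an ordering (AUTOMOTIVE < BREAD/BAKERY < DAIRY) that has no real meaning, so for some models a one-hot encoding may be the safer choice.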

## Scaling of Numerical Features
We rescale the numerical data to the range [0, 1] (min-max normalization) so that our learning algorithm can perform better.

In [None]:
# Keep one fitted scaler per column so that predictions can be inverse-transformed later
scalers = {}

for col in num_cols:
    scalers[col] = MinMaxScaler(feature_range=(0, 1))
    train_data[col] = scalers[col].fit_transform(train_data[[col]]).ravel()

train_data.head()
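
Since the model will be trained on scaled values, its predictions also come out in scaled units and have to be mapped back before they mean anything as sales. The sketch below shows the arithmetic behind min-max scaling and its inverse with plain NumPy, on made-up numbers:

```python
import numpy as np

sales = np.array([0.0, 50.0, 200.0, 100.0])

lo, hi = sales.min(), sales.max()

# Forward transform: what min-max scaling to the range (0, 1) computes
scaled = (sales - lo) / (hi - lo)

# Inverse transform: back to the original units
restored = scaled * (hi - lo) + lo

print(scaled)
print(restored)
```

The important detail is that `lo` and `hi` must come from the same column that is being restored, which is why it pays to keep the fitted scaler (or its min and max) for the sales column around.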

## Grouping the Data
We group the data by date, averaging over all stores and product families. This leaves one row per day and makes the prediction problem much simpler.

In [None]:
train_data = train_data.groupby(['date']).agg({'sales':'mean', 'onpromotion':'mean'})
train_data.tail()

In [None]:
x_train = train_data.copy()
y_train = train_data.sales.copy()

In [None]:
x_train.head()

In [None]:
y_train.head()

## Transforming the Input data into Time-Series data
We have to transform the training data into sequences that a time-series model accepts before feeding them to our model. We use the Keras TimeseriesGenerator to create those sequences. You can also choose to split the data into sequences without using this function.

In [None]:
num_feature_input = len(x_train.columns)
history_input = 30

# length:     number of past time steps included in each input sequence
# batch_size: number of (sequence, target) samples returned per batch; with 1,
#             every batch holds a single 30-day window and the value of the day after it
generator = TimeseriesGenerator(x_train.values, y_train.values, length=history_input, batch_size=1)

# Print the first sequence: you should see 30 past days (x) for 1 predicted day (y)
x, y = generator[0]
print('%s => %s' % (x, y))

In [None]:
print(len(generator))
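
If you would rather build the sequences yourself, the windowing is only a few lines of NumPy. This is a sketch of the same idea (each window of `length` past rows is paired with the target of the following row). The helper name `make_windows` is my own, and the toy data stands in for the scaled sales and onpromotion columns:

```python
import numpy as np

def make_windows(features, targets, length):
    """Pair each block of `length` consecutive rows with the target that follows it."""
    X, y = [], []
    for end in range(length, len(features)):
        X.append(features[end - length:end])   # rows end-length .. end-1
        y.append(targets[end])                 # value at row `end`
    return np.array(X), np.array(y)

# Toy data: 6 days, 2 features
features = np.arange(12, dtype=float).reshape(6, 2)
targets = features[:, 0]

X, y = make_windows(features, targets, length=3)
print(X.shape, y.shape)
print(X[0], "=>", y[0])
```

With 6 days and a window of 3 this yields 3 samples, which matches the `len(data) - length` samples that the generator reports when `batch_size=1`.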

In [None]:
def Multi_Step_LSTM_model():
    
    # Use Keras sequential model
    model = Sequential()    
    
    # First LSTM layer with Dropout regularisation; Set return_sequences to True to feed outputs to next layer
    model.add(LSTM(units = 50, activation='relu', return_sequences = True, input_shape = (history_input, num_feature_input))) 
    model.add(Dropout(0.2))
    
    # Second LSTM layer with Dropout regularisation; Set return_sequences to True to feed outputs to next layer
    model.add(LSTM(units = 50,  activation='relu', return_sequences = True))                                    
    model.add(Dropout(0.2))
    
    # Final LSTM layer with Dropout regularisation; Set return_sequences to False since now we will be predicting with the output layer
    model.add(LSTM(units = 50))
    model.add(Dropout(0.2))
    
    # The output layer with linear activation to predict the next day's (scaled) average sales
    model.add(Dense(units=1, activation = "linear"))
    
    return model

In [None]:
model = Multi_Step_LSTM_model()
model.summary()
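
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias, which gives `4 * units * (input_dim + units + 1)` parameters. The sketch below assumes 2 input features (sales and onpromotion after grouping); the helper names are my own:

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * units * (input_dim + units + 1)

def dense_params(units, input_dim):
    # One weight per input plus a bias, for every output unit
    return units * (input_dim + 1)

num_features = 2      # assumption: 'sales' and 'onpromotion' after grouping
layer_counts = [
    lstm_params(50, num_features),   # first LSTM layer
    lstm_params(50, 50),             # second LSTM layer
    lstm_params(50, 50),             # third LSTM layer
    dense_params(1, 50),             # output layer
]
print(layer_counts, sum(layer_counts))
```

Dropout layers add no trainable parameters, so they do not appear in the count.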

Now we compile the model, choosing the loss and the optimization method. We will use the Adam optimizer since it is widely used and typically converges faster than plain stochastic gradient descent.

In [None]:
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])  # accuracy is not meaningful for regression

In [None]:
model.fit(generator, steps_per_epoch=len(generator), epochs=20, verbose=2)  # fit_generator is deprecated in TF 2.x
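
Since this is a regression problem, an error metric tells us more than the training loss alone. The competition scores submissions with the Root Mean Squared Logarithmic Error (RMSLE), so here is a minimal NumPy sketch of it. The numbers are made up, and the function expects values in the original sales units (not the scaled ones) that are non-negative:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

actual = [0.0, 10.0, 100.0]
print(rmsle(actual, actual))               # perfect forecast
print(rmsle(actual, [0.0, 12.0, 90.0]))    # imperfect forecast
```

Because of the logarithm, RMSLE penalizes relative errors: being off by 10 on a value of 100 costs less than being off by 10 on a value of 10.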

## Save and Load model if needed
I have commented the two code blocks out. Uncomment and use if you need to:

a) save the model

In [None]:
# # serialize model to JSON
# model_json = model.to_json()
# with open("model2.json", "w") as json_file:
#     json_file.write(model_json)
# # serialize weights to HDF5
# model.save_weights("model2.h5")
# print("Saved model to disk")

b) load the model

In [None]:
# # load json and create model
# from keras.models import model_from_json
# json_file = open('model2.json', 'r')
# loaded_model_json = json_file.read()
# json_file.close()
# loaded_model = model_from_json(loaded_model_json)
# # load weights into new model
# loaded_model.load_weights("model2.h5")
# print("Loaded model from disk")

# Predicting on Test data (WIP)
Here we keep 30 past days from the training data to help us start predicting the test data.

In [None]:
# split a univariate sequence into samples
def split_sequence(data, days_past, days_future):
    X, y = list(), list()
    
    for i in range(len(data)):        
        # find the end of this pattern
        end_ix = i + days_past
        out_end_ix = end_ix + days_future
        
        # check if we are beyond the sequence
        if out_end_ix > len(data):
            break
            
        # gather input and output parts of the pattern
        seq_x, seq_y = data[i:end_ix], data[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
        
    return np.array(X), np.array(y)

In [None]:
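# train_data is already scaled and aggregated per date at this point, so we keep
# its last `history_input` days and aggregate the test period in the same way.
# NOTE: the test features still need the same scaling as the training features.
last_train_days = train_data.tail(history_input)
test_per_date = test_data.groupby('date').agg({'onpromotion': 'mean'})

In [None]:
# 'sales' is NaN for the test dates: these are the values we want to predict
full_dataset = pd.concat([last_train_days, test_per_date], sort=False)
full_dataset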
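
Since this part is still a work in progress, here is only a rough sketch of how the prediction loop could look. The model predicts one day ahead, so to cover several test days we can feed each prediction back in as if it were a real observation (a recursive forecast). Everything below is an assumption on my side: the helper name `recursive_forecast`, the stand-in predictor, and the toy window of 4 days instead of 30:

```python
import numpy as np

def recursive_forecast(history, future_promos, predict_next):
    """Roll a one-step-ahead predictor forward over the test period.

    history:       array (window, 2) with columns [sales, onpromotion]
    future_promos: array (horizon,) with the known onpromotion values
    predict_next:  callable mapping a (window, 2) array to one sales value
    """
    window = np.array(history, dtype=float)
    forecasts = []
    for promo in future_promos:
        next_sales = float(predict_next(window))
        forecasts.append(next_sales)
        # Slide the window: drop the oldest day, append the predicted day
        window = np.vstack([window[1:], [next_sales, promo]])
    return np.array(forecasts)

# Stand-in predictor (hypothetical): average sales of the current window
def naive_predictor(window):
    return window[:, 0].mean()

history = np.column_stack([np.full(4, 0.5), np.zeros(4)])
forecast = recursive_forecast(history, future_promos=np.zeros(5),
                              predict_next=naive_predictor)
print(forecast)
```

With the trained network, the stand-in could be swapped for something like `lambda w: model.predict(w[np.newaxis, ...])[0, 0]`. Keep in mind that errors accumulate in a recursive forecast, since each step builds on earlier predictions.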