# Introduction 
`V1.0.1`
### Who am I
Just a fellow Kaggle learner. I created this Notebook as practice and thought it could be useful to others.
### Who is this for
This Notebook is for people who learn from examples. Forget the boring lectures and follow along for a fun, instructive time :)
### What can I learn here
You will learn the basics needed to create a rudimentary RNN/LSTM encoder-decoder network. I go through each step with explanations. Hopefully, with these building blocks,
you can go on to build much more complex models.

### Things to remember
+ Please Upvote/Like the Notebook so other people can learn from it
+ Feel free to give any recommendations/changes. 
+ I will keep updating the notebook, so expect more changes in the future.

### You can also refer to these notebooks that have helped me as well:
+ https://www.kaggle.com/li325040229/eda-and-an-encoder-decoder-lstm-with-9-features#Build-a-LSTM-Model-

+ https://www.kaggle.com/yashvi/time-series-forecasting-using-lstm-m5/notebook

# Imports
First let us start by importing the relevant libraries that we need.

In [None]:
# Computational imports
import numpy as np   # Library for n-dimensional arrays
import pandas as pd  # Library for dataframes (structured data)

# Helper imports
import os 
import re
import warnings
from tqdm import tqdm
import datetime as dt

# ML/DL imports
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, RepeatVector, TimeDistributed
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Plotting imports
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot

%matplotlib inline
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)

# Set seeds (NumPy and TensorFlow) to make the experiment more reproducible.
np.random.seed(1)
tf.random.set_seed(1)

# Allows us to see more information regarding the DataFrame
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

# Helper Functions
These are some helper functions that allow us to simplify our code and reuse common functionality.

## Show Shapes
This function is used to quickly check the shapes of our NumPy arrays. This is especially important to ensure we have the right shape for our LSTM network.

In [None]:
def show_shapes(Sequences, Targets):  # prints the shapes of the input and target arrays passed in
    print("Expected: (num_samples, timesteps, channels)")
    print("Sequences: {}".format(Sequences.shape))
    print("Targets:   {}".format(Targets.shape))   

## Downcasting
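This function is used to downcast our variables to types that take less memory, which reduces memory usage and can speed up processing. Note that `int16` only holds values up to 32,767, so make sure your integer columns fit in that range.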

In [None]:
def downcast_dtypes(df):
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols] = df[int_cols].astype(np.int16)
    return df
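
To see what downcasting buys, here is a minimal sketch that compares memory usage before and after on a small made-up frame. The column names and sizes are illustrative assumptions, not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with default 64-bit dtypes
toy = pd.DataFrame({
    "units_sold": np.arange(1000, dtype=np.int64) % 50,
    "price": np.linspace(0.5, 20.0, 1000),
})

before = toy.memory_usage(deep=True).sum()

# Same casts as downcast_dtypes: float64 -> float32, int64 -> int16
toy["price"] = toy["price"].astype(np.float32)
toy["units_sold"] = toy["units_sold"].astype(np.int16)

after = toy.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The integer values here stay below 50, so the `int16` cast is safe; with larger counts a wider integer type would be the more cautious choice.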

## Exploratory Data Analysis for pandas
This function is used to quickly check the basic attributes of our pandas DataFrame.

In [None]:
def basic_eda(df):
    print("-------------------------------TOP 5 RECORDS-----------------------------")
    print(df.head(5))
    print()
    
    print("-------------------------------INFO--------------------------------------")
    df.info()  # info() prints directly and returns None, so no print() is needed
    print()
    
    print("-------------------------------Describe----------------------------------")
    print(df.describe())
    print()
    
    print("-------------------------------Columns-----------------------------------")
    print(df.columns)
    print()
    
    print("-------------------------------Data Types--------------------------------")
    print(df.dtypes)
    print()
    
    print("----------------------------Missing Values-------------------------------")
    print(df.isnull().sum())
    print()
    
    print("----------------------------NULL values----------------------------------")
    print(df.isna().sum())
    print()
    
    print("--------------------------Shape Of Data---------------------------------")
    print(df.shape)
    print()
    
    print("============================================================================ \n")

## Split Sequences
A key component of a time-series problem is splitting our input data into sequences that we can feed to our LSTM network. These sequences depend on the required timesteps (the length of the input window) and horizon (the number of steps to predict).

In [None]:
def split_sequences(sequences, timesteps, horizon):
    # `sequences` is a 2D array: rows are time steps, the last column is the target
    Sequences, Targets = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + timesteps
        out_end_ix = end_ix + horizon-1
        # check if we are beyond the dataset
        if out_end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1:out_end_ix, -1]
        Sequences.append(seq_x)
        Targets.append(seq_y)
    # convert to arrays and report the resulting shapes once, after the loop
    Sequences, Targets = np.array(Sequences), np.array(Targets)
    show_shapes(Sequences, Targets)
    return Sequences, Targets
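
As a rough illustration of the sliding-window idea, the sketch below applies the same indexing to a tiny made-up array. The function name `split_sequences_sketch` and the toy data are assumptions for this example only. Note that with this indexing the target window starts at the last row of the input window.

```python
import numpy as np

def split_sequences_sketch(data, timesteps, horizon):
    """Sliding-window split; the last column of `data` is the target."""
    X, y = [], []
    for i in range(len(data)):
        end_ix = i + timesteps             # end of the input window (exclusive)
        out_end_ix = end_ix + horizon - 1  # end of the target window (exclusive)
        if out_end_ix > len(data):
            break
        X.append(data[i:end_ix, :-1])              # all feature columns
        y.append(data[end_ix - 1:out_end_ix, -1])  # target column only
    return np.array(X), np.array(y)

# 10 time steps, 2 feature columns + 1 target column
toy = np.arange(30, dtype=np.float32).reshape(10, 3)
X, y = split_sequences_sketch(toy, timesteps=3, horizon=2)
print(X.shape, y.shape)
```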

## Event data transform
This function is specific to this competition and is used to transform the competition's calendar data: missing event values are filled with `'unknown'` and the categorical columns are label-encoded as integers.

In [None]:
def transform(data):
    
    nan_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    for feature in nan_features:
        data[feature] = data[feature].fillna('unknown')  # assign back instead of a chained inplace fill
        
    cat = ['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI']
    for feature in cat:
        encoder = LabelEncoder()
        data[feature] = encoder.fit_transform(data[feature])
    
    return data

## Normalization 
These functions are used to normalize our data (min-max scaling to the range [0, 1]) and to reverse the normalization afterwards. Scaling usually helps the network train faster and more stably. You can also use the scikit-learn `MinMaxScaler` if you prefer.

In [None]:
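def Normalize(values):
    # Min-max scale to [0, 1] using the global minimum and maximum of the array
    values = np.array(values)
    low, high = np.percentile(values, [0, 100])
    delta = high - low
    if delta != 0:
        for i in range(0, len(values)):
            values[i] = (values[i]-low)/delta
    return values, low, high

def FNoramlize(values, low, high):
    # Inverse of Normalize: map scaled values back to the original range (modifies `values` in place)
    delta = high - low
    if delta != 0:
        for i in range(0, len(values)):
            values[i] = values[i]*delta + low
    return values

def Normalize2(values, low, high):
    # Same as Normalize, but reuses a previously computed low/high (e.g. from the training data)
    values = np.array(values)
    delta = high - low
    if delta != 0:
        for i in range(0, len(values)):
            values[i] = (values[i]-low)/delta
    return values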
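
The round trip these helpers perform can be sketched in a few vectorized lines. The array below is made up for illustration; the point is only that scaling and then inverting returns the original values.

```python
import numpy as np

values = np.array([[2.0, 4.0], [6.0, 10.0]], dtype=np.float32)

low, high = values.min(), values.max()
scaled = (values - low) / (high - low)   # forward: map to [0, 1]
restored = scaled * (high - low) + low   # inverse: back to the original units

print(scaled)
print(restored)
```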

# Reading the Data
Let's start by reading our data. We will store it in several DataFrames.

In [None]:
path = '../input/m5-forecasting-accuracy/'

train_data = pd.read_csv(path+'sales_train_validation.csv')
calendar = pd.read_csv(path+'calendar.csv')
sell_prices = pd.read_csv(path+'sell_prices.csv')
submission_file = pd.read_csv(path+'sample_submission.csv')


In [None]:
train_data.describe()

# Preparing the Data
Let's prepare our data. We will manipulate and transform the data and make it more convenient for us to use.

## Calendar Data

In [None]:
days = range(1, 1970)
time_series_columns = [f'd_{i}' for i in days]
transfer_cal = pd.DataFrame(calendar[['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI']].values.T, index=['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI'], columns= time_series_columns)
transfer_cal = transfer_cal.fillna(0)
# Raw strings avoid the invalid "\d" escape sequence warning
event_name_1_se = transfer_cal.loc['event_name_1'].apply(lambda x: x if re.search(r"^\d+$", str(x)) else np.nan).fillna(10)
event_name_2_se = transfer_cal.loc['event_name_2'].apply(lambda x: x if re.search(r"^\d+$", str(x)) else np.nan).fillna(10)

In [None]:
transfer_cal.head()

In [None]:
event_name_1_se.head()

In [None]:
calendar['date'] = pd.to_datetime(calendar['date'])
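calendar = calendar[calendar['date'] >= '2016-02-01'].copy()  # reduce memory; copy so transform() does not modify a slice
calendar = transform(calendar)
# Convert the calendar features into time series data (one row per feature, one column per day).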
transfer_cal = pd.DataFrame(calendar[['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI']].values.T,
                            index=['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI'])
transfer_cal

## Sell Price Data

In [None]:
price_fea = calendar[['wm_yr_wk','date']].merge(sell_prices, on = ['wm_yr_wk'], how = 'left')
price_fea['id'] = price_fea['item_id']+'_'+price_fea['store_id']+'_validation'
price_fea.head()

In [None]:
# Keyword arguments are required by recent pandas versions
df = price_fea.pivot(index='id', columns='date', values='sell_price')
df.head()
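
The pivot step turns the long price table (one row per item and date) into a wide table (one row per item, one column per date). A minimal sketch on made-up rows, where the ids and prices are purely illustrative:

```python
import pandas as pd

# Long format: one row per (id, date) pair
long_prices = pd.DataFrame({
    "id": ["A_validation", "A_validation", "B_validation", "B_validation"],
    "date": pd.to_datetime(["2016-02-01", "2016-02-02", "2016-02-01", "2016-02-02"]),
    "sell_price": [1.5, 1.5, 3.0, 2.5],
})

# Wide format: one row per id, one column per date
wide_prices = long_prices.pivot(index="id", columns="date", values="sell_price")
print(wide_prices)
```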

In [None]:
price_df = train_data.merge(df,on=['id'],how= 'left').iloc[:,-140:] # keep only the last 140 columns (the merged price columns)
price_df.index = train_data.id
price_df.head()

## Sales Data

In [None]:
train_data.info()

In [None]:
train_data = downcast_dtypes(train_data)
train_data.info()

In [None]:
train_data = train_data.iloc[:, -140:]
train_data.head(10)

## Combining all the data

In [None]:
time_series_col1 = train_data.columns
time_series_col2 = price_df.columns
time_series_col3 = transfer_cal.columns

print(len(time_series_col1),len(time_series_col2),len(time_series_col3))

In [None]:
price_df.columns = time_series_col1
transfer_cal.columns = time_series_col1

train_data.shape, price_df.shape, transfer_cal.shape

In [None]:
full_train_data = pd.concat([train_data, transfer_cal, price_df], axis=0)
full_train_data.tail(10)

# Exploring the DataFrame
Here we explore the DataFrames and look for anything unusual.

In [None]:
# A little bit of data exploration
print("=================================full_train_data============================")
basic_eda(full_train_data)

We notice that the DataFrame has some missing values that we want to take care of.

# Taking care of NaN (missing values) in the dataframe
There are multiple ways of doing this. I have chosen to do it with the following method:

In [None]:
full_train_data = full_train_data.bfill(axis=1)  # backfill along each row; fillna(method=...) is deprecated in recent pandas
np.sum(full_train_data.isnull().sum())

You can also choose to drop NaN columns/rows, forward-fill, or even fill with the mean. It is often best to try several methods and keep the one that gives the best results.
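
To compare these options side by side, here is a small sketch on a made-up frame (the `d_*` columns are illustrative). It also shows a caveat: backfilling leaves a trailing NaN untouched and forward-filling leaves a leading one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[np.nan, 2.0, np.nan, 4.0],
     [1.0, np.nan, 3.0, np.nan]],
    columns=["d_1", "d_2", "d_3", "d_4"],
)

backfilled = toy.bfill(axis=1)      # take the next valid value in the row
forwardfilled = toy.ffill(axis=1)   # take the previous valid value in the row
mean_filled = toy.apply(lambda row: row.fillna(row.mean()), axis=1)  # row mean

print(backfilled)
print(forwardfilled)
print(mean_filled)
```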

# Training the model
Before we start training, we have to do some intermediate steps.

## Transposing the DataFrame
We do this so we can have the index as the timesteps (days) and the columns as our features.

In [None]:
full_train_data_transposed = full_train_data.T
full_train_data_transposed.head()

## Obtaining all CAT and NUM columns
This is important if you want to do some feature scaling or encoding. 

In [None]:
object_cols = [cname for cname in full_train_data_transposed.columns 
               if full_train_data_transposed[cname].dtype == "object" 
               and cname != "date"]

print("Categorical Columns:")
len(object_cols)

In [None]:
num_cols = [cname for cname in full_train_data_transposed.columns 
            if full_train_data_transposed[cname].dtype in ['int64', 'float64', 'int16', 'float32']
            and cname not in ['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI']]

print("Numerical Columns:")
len(num_cols)

In [None]:
all_cols = num_cols + object_cols
print("All Columns:")
len(all_cols)
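
As an alternative to the list comprehensions above, pandas offers `select_dtypes` for the same kind of split. This is a sketch on a made-up frame, not the notebook's data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sales": np.array([1, 2], dtype=np.int16),
    "price": np.array([1.5, 2.5], dtype=np.float32),
    "event": ["none", "Easter"],
})

# "number" matches every integer and float dtype, including int16/float32
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, other_cols)
```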

## Splitting the training data into sequences
In this section, we split the training data into sequences that we can feed into the LSTM network. Each sequence has several variables/features, which makes this a multivariate problem. For every item we collect the last 100 days of prices and sales, together with the calendar features (events and SNAP flags) from a 100-day window shifted back by 14 days. We keep this 14-day lag because we believe it takes a while for events to show an effect on the price and the number of items sold. With `timesteps = 28` and `horizon = 28`, the model later uses 28 days of input to predict the next 28 days.

In [None]:
timesteps = 28
horizon = 28

full_train_data_sequenced = []   

for i in tqdm(range(train_data.shape[0])):      # Using tqdm to visualize the progress

    full_train_data_sequenced.append([list(t) for t in zip(full_train_data_transposed['event_name_1'][-(100+14):-(14)],
                                       full_train_data_transposed['event_type_1'][-(100+14):-(14)],
                                       full_train_data_transposed['event_name_2'][-(100+14):-(14)],     
                                       full_train_data_transposed['event_type_2'][-(100+14):-(14)],
                                       full_train_data_transposed['snap_CA'][-(100+14):-(14)],
                                       full_train_data_transposed['snap_TX'][-(100+14):-(14)],
                                       full_train_data_transposed['snap_WI'][-(100+14):-(14)],
                                       price_df.iloc[i][-100:],
                                       train_data.iloc[i][-100:])]) 

full_train_data_sequenced = np.asarray(full_train_data_sequenced, dtype=np.float32)

## Normalize the training data
Here we normalize our training data with our own in-house functions. This usually helps the model train faster and more stably. Note that `Normalize` uses a single global minimum and maximum across all features.

In [None]:
norm_full_train_data, train_low, train_high = Normalize(full_train_data_sequenced[:,-(timesteps*2):,:])

In [None]:
print(norm_full_train_data.shape)
print(train_low)
print(train_high)

# Split into sequence and target
After sequencing and normalizing the data, we slice it to create the input sequences and output targets.

In [None]:
num_features = 9

X_train = norm_full_train_data[:,-28*2:-28,:]
y_train = norm_full_train_data[:,-28:,8] 

X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], num_features))
y_train = y_train.reshape((y_train.shape[0], y_train.shape[1], 1))

show_shapes(X_train, y_train)
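
The slicing above can be checked on a small random array. In this sketch the array stands in for the normalized data (5 hypothetical items instead of the full set), and only the shapes and the position of the target feature matter:

```python
import numpy as np

timesteps, horizon, num_features = 28, 28, 9

# Stand-in for the normalized data: (samples, timesteps + horizon, features)
rng = np.random.default_rng(0)
toy = rng.random((5, timesteps + horizon, num_features)).astype(np.float32)

X = toy[:, -(timesteps + horizon):-horizon, :]  # first 28 days, all features
y = toy[:, -horizon:, num_features - 1]         # last 28 days, sales feature only
y = y[..., np.newaxis]                          # add a trailing channel axis

print(X.shape, y.shape)
```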

# Creating the LSTM Network
We are going to be creating a multivariate encoder-decoder LSTM Network with a dense layer at the end. We are using dropout as a regularisation method to combat overfitting.

In [None]:
def encoder_decoder_model():
    
    # Use Keras sequential model
    model = Sequential()
    
    # Encoder LSTM layer with Dropout regularisation; Set return_sequences to False since we are feeding last output to decoder layer
    model.add(LSTM(units = 100, activation='relu', input_shape = (X_train.shape[1], X_train.shape[2])))
    model.add(Dropout(0.2))
    
    # The fixed-length output of the encoder is repeated by the RepeatVector layer, once for each required time step in the output sequence
    model.add(RepeatVector(horizon))
    
    # Decoder LSTM layer with Dropout regularisation; Set return_sequences to True to feed each output time step to a Dense layer
    model.add(LSTM(units = 100, activation='relu', return_sequences=True))
    model.add(Dropout(0.2))
    
    # Same dense layer is repeated for each output timestep with the TimeDistributed wrapper
    model.add(TimeDistributed(Dense(units=1, activation = "linear")))
    
    return model

Let us now use the `summary` method to check the layers and output shapes of our network.

In [None]:
model = encoder_decoder_model()
model.summary()
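
To make the shapes in the summary easier to follow, the sketch below traces them with plain NumPy. Random arrays stand in for the layer outputs and the batch size of 4 is an arbitrary assumption, so only the shapes are meaningful here, not the values.

```python
import numpy as np

batch, horizon, units = 4, 28, 100
rng = np.random.default_rng(0)

# Encoder LSTM (return_sequences=False): one vector per sample
encoded = rng.random((batch, units))

# RepeatVector(horizon): copy that vector once per output time step
repeated = np.repeat(encoded[:, np.newaxis, :], horizon, axis=1)

# Decoder LSTM (return_sequences=True) keeps the time axis; stand-in only
decoded = repeated

# TimeDistributed(Dense(1)): the same weights applied at every time step
weights = rng.random((units, 1))
bias = np.zeros(1)
outputs = decoded @ weights + bias

print(encoded.shape, repeated.shape, outputs.shape)
```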

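Now we compile the model and choose an optimizer. We will use the Adam optimizer since it is widely used and typically converges faster than plain gradient descent. Note that the `accuracy` metric is designed for classification, so for this regression task the mean squared error loss is the number to watch.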

In [None]:
model.compile(optimizer='adam', loss='mean_squared_error', metrics = ['accuracy'])

## Training/Fitting time
We can finally train our model with our training data. Let's see how it does.

In [None]:
his=model.fit(X_train,y_train,epochs=15,batch_size=1000,verbose=2)

## Plotting model accuracy and loss
This step is very important since it lets you see whether your model is improving as it trains. If it isn't, you will have to create new features, tune hyperparameters, modify the RNN network or cry.

In [None]:
plt.plot(his.history['loss'])
plt.plot(his.history['accuracy'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['loss','accuracy'])
plt.show()

## Saving and Loading Models
The next two cells allow you to save and load models. This saves a lot of time whenever training is slow or computationally expensive.

In [None]:
# # serialize model to JSON
# model_json = model.to_json()
# with open("model2.json", "w") as json_file:
#     json_file.write(model_json)
# # serialize weights to HDF5
# model.save_weights("model2.h5")
# print("Saved model to disk")

In [None]:
# # load json and create model (file names match the ones saved above)
# from tensorflow.keras.models import model_from_json
# with open("model2.json", "r") as json_file:
#     loaded_model_json = json_file.read()
# loaded_model = model_from_json(loaded_model_json)
# # load weights into new model
# loaded_model.load_weights("model2.h5")
# print("Loaded model from disk")

# Testing the Model
Now that we have seen how the model does on our training data, we can move on to some more serious stuff... TEST DATA.
Why is this important, you might ask? Well, if you score plenty of goals in practice (good job, I guess) but none in the real game... there must be something wrong.

That is why we need test data to confirm that our model does well on unseen data.
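
One simple way to get such a check is to hold out the most recent days and score forecasts against them. The sketch below uses made-up sales and a naive "repeat the last 28 days" forecast as a stand-in for model predictions; it is an illustration of the idea, not the competition's official metric.

```python
import numpy as np

rng = np.random.default_rng(0)
horizon = 28

# Hypothetical daily sales for 10 items over 84 days
sales = rng.poisson(3.0, size=(10, 84)).astype(np.float32)

# Hold out the last `horizon` days as unseen data
history, actual = sales[:, :-horizon], sales[:, -horizon:]

# Naive baseline forecast: repeat the last observed `horizon` days
baseline = history[:, -horizon:]

rmse = float(np.sqrt(np.mean((baseline - actual) ** 2)))
print(f"baseline RMSE on held-out days: {rmse:.3f}")
```

A model is only worth keeping if it beats a baseline like this on data it has not seen.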

In [None]:
# Take the last 28 days of the normalized data to predict the next 28 days in the future
test_input = np.array(norm_full_train_data[:, -timesteps:, :]) # Here timesteps=28
test_input = test_input.reshape((test_input.shape[0], timesteps, num_features)) # Ensure the test input has the shape (# samples, # timesteps, # features)
print(test_input.shape)

# Predict the next 28 days 
y_test = model.predict(test_input, verbose=2)

# Concatenate the past sales (feature index 8) with the predictions
past_sales = test_input[:, :, 8].reshape(test_input.shape[0], timesteps)
predicted_sales = y_test.astype(np.float32).reshape(test_input.shape[0], horizon)
test_forecast = np.concatenate((past_sales, predicted_sales), axis=1).reshape((test_input.shape[0], timesteps + horizon, 1))
print(y_test)
print(test_forecast.shape)

In [None]:
print(y_test.shape)
print(test_forecast.shape)

In [None]:
# Reverse the normalization to obtain human-interpretable values
test_forecast = FNoramlize(test_forecast,train_low,train_high)

# Round values
test_forecast = np.rint(test_forecast)

In [None]:
# Transform into DataFrame and keep only the predictions
forecast = pd.DataFrame(test_forecast.reshape(test_forecast.shape[0],test_forecast.shape[1])).iloc[:,-28:]
forecast.columns = [f'F{i}' for i in range(1, forecast.shape[1] + 1)]
forecast[forecast < 0] = 0
forecast.head()

Here we prepare the ids to conform with `sample_submission.csv`. We want to have both validation and evaluation ids.

In [None]:
train_data = pd.read_csv(path+'sales_train_validation.csv')
validation_ids = train_data['id'].values
evaluation_ids = [i.replace('validation', 'evaluation') for i in validation_ids]

In [None]:
ids = np.concatenate([validation_ids, evaluation_ids])

In [None]:
predictions = pd.DataFrame(ids, columns=['id'])
forecast = pd.concat([forecast]*2).reset_index(drop=True)
predictions = pd.concat([predictions, forecast], axis=1)
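
The layout of the resulting table can be sketched on two made-up item ids. The ids and the constant forecast values below are placeholders; the real file uses the ids from `sales_train_validation.csv`.

```python
import numpy as np
import pandas as pd

# Two hypothetical validation ids and their evaluation counterparts
validation_ids = np.array(["ITEM_1_CA_1_validation", "ITEM_2_CA_1_validation"])
evaluation_ids = [i.replace("validation", "evaluation") for i in validation_ids]
ids = np.concatenate([validation_ids, evaluation_ids])

# Placeholder forecast: one row per item, columns F1..F28
forecast = pd.DataFrame(np.ones((2, 28)), columns=[f"F{i}" for i in range(1, 29)])

# Reuse the same forecast rows for the validation and evaluation ids
doubled = pd.concat([forecast] * 2).reset_index(drop=True)
submission = pd.concat([pd.DataFrame({"id": ids}), doubled], axis=1)

print(submission.shape)
print(submission.iloc[:, :3])
```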

# Submit Predictions
Here we write the predictions to `submission.csv`, ready to submit, and pray to god we did well :)

In [None]:
predictions.to_csv('submission.csv', index=False)  #Generate the csv file.

# Final Remarks
Thank you for going through this notebook. Please feel free to show support and comment on the notebook with advice or improvements. If you found it useful, please let me know as well :)