# 1. RNN

## 1.1 Introduction to RNN

* Suppose we are watching a movie, we keep watching the movie as at any point in time, we have the context. Because we have seen the movie until that time, then only we are able to relate everything correctly. It means that we remember everything that we have watched.
* Similarly RNN remembers everything.
* In other neural networks, all the inputs are independent of each other. But in RNN, all the inputs are related to each other.
* Lets say we have to predict the next word in a given sentence, in that case, relation among all the previous words helpsin predicting the better output. RNN remembers all these relationswhile training itself.
* In order to achieve this RNN creates networks with loopsin them, which allow it to persist the information.
* RNN carries context/information along with the journey
* They overcome from the drawback of feed forward neural networks i.e. they accept  a fixed sized vector as input (ex:- image) and produce a fixed sized vector as output(ex:- probabilities of different classes)
* Ex :- "This phone is very fast" - here, 'fast' is related to 'phone'. 'very' is used in the context of phone and fastness.
* Time series is also kind of sequence information.

## 1.2 Why RNN

* If output depends on sequence of inputs
* To retain and levearage sequence information
* If the input size is changing, for MNN it is tuff to handle
* Can handle sequential data
* Considers current input and also the previously received inputs
* Can memorize previous inputs due to its internal memory

## 1.3  Applications of RNN

* Natural Language Processing - Text sequence information
* Time Series data analysis - Timely sequence information
* Machine Translation - Sequence to sequence information(variation of RNN)
* Speech recognition - ex:- sequence in audio -> sentence in english
* Image captioning - Image -> captions/english sentences

## 1.4 Types of RNN Architectures

* Many to one : Sentment analysis, movie rating(multiclass classification)
* One to many : Image captioning
* Many to many : Parts of speech detection(many to many of same length), Machine translation(many to many of different length). Also called Encoder and Decoder model

## 1.5  Disadvantages of RNN

* Can't take care of long term dependencies ie. if later output depends on earlier input/inputs. Ex:- machine translations
* If the activation function is sigmoid/tanh, lots of multiplication of partial derivatives which are less than 1, results in Vanishing Gradient problem
* If few partial derivatives are more than 1, results in Exploding Gradient problem
* Lots of multiplication during forward/backward propagation over time results in above two problems

# 2. LSTM

## 2.1 Introduction to LSTM

* It is similar to RNN. The differences are the operations within the LSTM's cells. These operations are used to allow the LSTM to keep or forget information.
* Core concept of LSTMs are the cell state and its various gates.
* The cell state act as a transport highway that transfers relative information all the way down the sequence chain. We can think of it as a memory of the network.
* Cell state can carry relavent information thoughput the processing of the sequence.
* Cell state carry infomation from earlier time steps can make its way  to latet time steps, reducing the efforts of short term memory.
* The cell state goes on its journey, informationgets added  or removed  to the cell state via gates.
* The gates different neural networks that decide  which information is allowed to the cell state. The gates can learn what information to keep or forget during training.

## 2.2 LSTM Gates

<b>Forget gate :</b><br>
* This gate decides what information should be thrown away or kept.
* Info from previous hidden state and info from current input  is passed though sigmoid function.
* Values come out between 0 and 1. The closer to '0' means to forget, and closer to '1' means to keep.<br>

<b>Input gate :</b><br>
* To update the cell state we have input gate.

* It gives good accuracy
* It takes more time to run
* Architecture reference : https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* IMDB case study: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

# 3. GRU

* GRU uses 2 gates ie. reset gate and update gate. Where as LSTM has 3 steps.
* GRU doesnot have an internal memory.
* Reset gate decides how to combine  new input with the previous time steps menory.
* Update gate decides how much of the previous memory should be kept. Update gate is a combination of input and forget gate that we understood in LSTM.
* GRU is simpler variant of LSTM to solve vanishing gradient problem
* GRU is faster to compute back-propagation due to simpler network architecture
* GRU reference : https://www.slideshare.net/hytae/recent-progress-in-rnn-and-nlp-63762080
* GRU reference : https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be

# 4. IMDB Sentiment Classification

<b>Reference :</b><br>
* https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
* https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

<b>Embedding layers :</b>
* Turns positive integers (indexes) into dense vectors of fixed size.
e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
* https://keras.io/api/layers/core_layers/embedding/

* Xi -> sequence of words, Yi -> binary classification

<b>Libraries required :</b>

In [None]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)

<b>Dataset :</b>

In [None]:
# reference : https://keras.io/api/datasets/imdb/

# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

In [None]:
print(X_train[1])
print(type(X_train[1]))
print(len(X_train[1]))

<b>Data preparation :</b>

In [None]:
# truncate and/or pad input sequences
max_review_length = 600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

print(X_train.shape)
print(X_train[1])

<b>Model building :</b>

In [None]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words+1, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
#Refer: https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model

In [None]:
model.fit(X_train, y_train, nb_epoch=10, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

# 5. Demand Forecasting using LSTM

<b>Objective :</b><br>
    The objective is to predict the electricity consuumption of a household with a one-minute sampling rate based on the past 4 years of consumption

<b>Libraries :</b>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
from time import time

# Suppress Scientific notations and display float up to 2 decimals
np.set_printoptions(suppress=True)
pd.options.display.float_format = '{:.2f}'.format

<b>Read Data :</b>

In [4]:
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
df.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


* We need to merge 'Date' and 'Time' features into single indexed time series feature.

In [5]:
# Check the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
Date                     object
Time                     object
Global_active_power      object
Global_reactive_power    object
Voltage                  object
Global_intensity         object
Sub_metering_1           object
Sub_metering_2           object
Sub_metering_3           float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


<b>Prepare Data :</b>

For this problem we will do a Univariate time series modeling using only the Global_active_power as the variable.
So let us keep only the columns that we need, and also convert the Global_active_power to numeric

1) Create a Date-Time column
<br>2) Keep only the columns we need
<br>3) Address missing values
<br>4) Convert all the data to float
<br>5) Create columns for Year, Quarter, Month, Day, Weekday to observe the trends

In [6]:
# Create a new column for Date-time
df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)

In [7]:
# Keep only the columns we need
df = df.loc[:,['Datetime','Global_active_power']]
df.head()

Unnamed: 0,Datetime,Global_active_power
0,2006-12-16 17:24:00,4.216
1,2006-12-16 17:25:00,5.36
2,2006-12-16 17:26:00,5.374
3,2006-12-16 17:27:00,5.388
4,2006-12-16 17:28:00,3.666


In [8]:
df.tail()

Unnamed: 0,Datetime,Global_active_power
2075254,2010-11-26 20:58:00,0.946
2075255,2010-11-26 20:59:00,0.944
2075256,2010-11-26 21:00:00,0.938
2075257,2010-11-26 21:01:00,0.934
2075258,2010-11-26 21:02:00,0.932


In [9]:
# The missing values are denoted by "?". Let us check how many missing values are there in each column
df.apply(lambda x : sum(x=="?"))

Datetime                   0
Global_active_power    25979
dtype: int64

In [10]:
# Fill all the missing values with the data from the previous row (previous minute)
df.replace('?', np.nan, inplace=True)
df.fillna(method='ffill', inplace=True)
df.apply(lambda x : sum(x=='?'))

Datetime               0
Global_active_power    0
dtype: int64

In [11]:
# Convert to numeric
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'])

In [12]:
# Ensure the data is sorted
df.sort_values('Datetime', inplace=True, ascending=True)

In [13]:
# Set the Datetime column as the index
df.set_index('Datetime', inplace=True)

df.head()

Unnamed: 0_level_0,Global_active_power
Datetime,Unnamed: 1_level_1
2006-12-16 17:24:00,4.216
2006-12-16 17:25:00,5.36
2006-12-16 17:26:00,5.374
2006-12-16 17:27:00,5.388
2006-12-16 17:28:00,3.666


In [14]:
# Save the processed file for future use
df.to_csv('household_power_consumption_final.csv')

In [None]:
# Plot the average Daily consumption on different time scales
fig, ax = plt.subplots(5, figsize=(16,30))

# Daily
ax[0].plot(df['Global_active_power'].resample('D').mean())
ax[0].set_title('Mean power resampled over Days',fontweight="bold", color='g')

# Weekly
ax[1].plot(df['Global_active_power'].resample('W').mean())
ax[1].set_title('Mean power resampled over Weeks',fontweight="bold", color='g')

# Monthly
ax[2].plot(df['Global_active_power'].resample('M').mean())
ax[2].set_title('Mean power resampled over Months',fontweight="bold", color='g')

# Quarterly
ax[3].plot(df['Global_active_power'].resample('Q').mean())
ax[3].set_title('Mean power resampled over Quarters',fontweight="bold", color='g')

# Yearly
ax[4].plot(df['Global_active_power'].resample('Y').mean())
ax[4].set_title('Mean power resampled over Years',fontweight="bold", color='g')

plt.show()

<b>Check if the series is Stationary</b><br>

We will use the Dickey-Fuller test to check this

H0: The time-series has a unit root, i.e. it is NOT stationary
<br>H1: The time series is stationary

In [16]:
# Aggregate the data at a day level
daily_data = df.resample('D').mean()

# Check if there are any NaNs in the daily data
np.sum(daily_data.isna())

# Replace the NaNs by the previous day's value
daily_data.fillna(method='ffill', inplace=True)

daily_data.head()

Unnamed: 0_level_0,Global_active_power
Datetime,Unnamed: 1_level_1
2006-12-16,3.053475
2006-12-17,2.354486
2006-12-18,1.530435
2006-12-19,1.157079
2006-12-20,1.545658


In [None]:
# Define a function to check the Stationarity

from statsmodels.tsa.stattools import adfuller

def Stationarity_Check(timeseries):
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags used','#Observations'])
    print("\nResults of the Augmented Dickey Fuller Test : \n")
    print (dfoutput)

In [None]:
# Perform the DF test to verify the stationarity

Stationarity_Check(daily_data.Global_active_power.values)

From the p-value we can reject the Null hypothesis and conclude that the data is Stationary

<b>Time series modeling using LSTM</b>

Prepare the data for modeling

In [17]:
data = df['Global_active_power'].values

In [18]:
data

array([4.216, 5.36 , 5.374, ..., 0.688, 0.688, 0.688])

In [19]:
data = data.reshape((-1,1))

In [20]:
data

array([[4.216],
       [5.36 ],
       [5.374],
       ...,
       [0.688],
       [0.688],
       [0.688]])

In [21]:
data.shape

(2075259, 1)

In [None]:
# Scale the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))
data = scaler.fit_transform(data)

In [22]:
int(len(data) * 0.8)

1660207

In [None]:
# Split into training/test sets
train_size = int(len(data) * 0.8)
train, test = data[:train_size,], data[train_size:,]

In [None]:
train

In [None]:
test

In [None]:
# Prepare the data in a format required for LSTM (samples, timesteps, features)

def Create_Dataset(df, lookback=1):
    X, Y = [], []
    for i in range(len(df) - lookback - 1):
        X.append(df[i:(i+lookback), 0])
        Y.append(df[i + lookback,0])
    return np.array(X), np.array(Y)

In [None]:
lookback = 30
X_train, Y_train = Create_Dataset(train, lookback)
X_test, Y_test   = Create_Dataset(test, lookback)

In [None]:
X_train

In [None]:
Y_train

In [None]:
X_train.shape

In [None]:
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))

In [None]:
X_train

In [None]:
X_test  = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Print the data
print("X_train : \n")
print(X_train[:2])

print("\n\nY_train : \n")
print(Y_train[:2])

In [None]:
# Check the shapes of the data for modeling
print("X_train : ", X_train.shape)
print("Y_train : ", Y_train.shape)
print("\nX_test : ", X_test.shape)
print("Y_test : ", Y_test.shape)

<b>Build the LSTM Model</b>

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.callbacks import EarlyStopping # Optional

In [None]:
model = Sequential()

In [None]:
X_train.shape[1]

In [None]:
X_train.shape[2]

In [None]:
model.add(LSTM(100, input_shape=( X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))

In [None]:
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model.summary()

<b>Train the Model</b>

In [None]:
history = model.fit(X_train, Y_train, 
                    epochs=2, 
                    batch_size=50,
                    validation_data=(X_test, Y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=10)],
                    verbose=1,
                    shuffle=False)

<b>Make the predictions and convert back to the original scale</b>

In [None]:
# Make the predictions
train_predict = model.predict(X_train)
test_predict  = model.predict(X_test)

# Invert the predictions to original scale
train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])

test_predict  = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

<b>Check the RMSE scores</b>

In [None]:
from sklearn.metrics import mean_squared_error

print ("Train MAE : ", mean_squared_error(Y_train[0], train_predict[:,0]))
print ("Train RMSE : ", np.sqrt(mean_squared_error(Y_train[0], train_predict[:,0])))

print ("\nTest MAE : ", mean_squared_error(Y_test[0], test_predict[:,0]))
print ("Test RMSE : ", np.sqrt(mean_squared_error(Y_test[0], test_predict[:,0])))

Mape_train =np.mean(np.abs(Y_train[0] - train_predict[:,0])/Y_train[0])
Mape_test =np.mean(np.abs(Y_test[0] - test_predict[:,0])/Y_test[0])

print ("\nTrain MAPE : ", Mape_train)
print ("Test MAPE : ", Mape_test)

In [None]:
plt.figure(figsize=(12,5))
plt.title('Comparison of model loss')
plt.plot(history.history['loss'], label='Train loss')
plt.plot(history.history['val_loss'], label='Test loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()

<b>Plot the predicted vs actual values</b>

In [None]:
actual = Y_test[0][:200]
predicted = test_predict[:,0][:200]

plt.figure(figsize=(16,6))
plt.plot(actual, label='Actual')
plt.plot(predicted, label='Predicted')
plt.ylabel('Global_active_power', size=13)
plt.xlabel('Time Step', size=13)
plt.tight_layout()
plt.legend(fontsize=13)
plt.show()

In [None]:
##Forecasting for future time periods
y = []
for i in range(0,30):
    X1 = test[len(test)-30+i:,0]
    X1 = np.append(X1, y)
    X1 = X1.reshape(1,1,30)
    y1 = model.predict(X1)
    y.append(y1)
    print(y1)