 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Introduction to Recurrent Neural Networks (RNN)`

* special networks that channel information in such a way that we can model sequential data efficiently

* feedforward networks don't have the concept of "memory"
  * they have no notion of inputs arriving in a particular order in time
  * each input is processed independently; the only thing that changes during training is the weights

* in RNNs **information cycles in loops**
    * at every time step, a neuron considers the current input AND ALSO what it has learned from ALL previous time steps (its hidden state)
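
The loop can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not how Keras implements it: the hidden state is a single number, and the weights `w_x`, `w_h` and `b` are made-up values chosen only for the example.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state: nothing has been seen yet
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states_a = rnn_forward([1.0, 0.0, 0.0])
states_b = rnn_forward([0.0, 0.0, 0.0])

# The last two inputs are identical in both sequences, yet the final states
# differ, because the first input is carried forward through the loop.
print(states_a[-1], states_b[-1])
```

Note that the same weights are reused at every time step; only the hidden state changes as the sequence is read.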


<center><img src="https://edlitera-images.s3.amazonaws.com/RNN_folded_and_unfolded.png" width="1200"/>

source:
<br>
https://arxiv.org/abs/2005.11691

## `Types of Recurrent Neural Networks (RNN)`

* typical feed forward networks are of the **one to one** type
    * we go from some fixed-size input to some fixed-size output (e.g. image classification)

* RNNs are more flexible

* we typically divide RNNs into the following three types:
    * **one to many** - e.g. image captioning (convert an image input into a sentence of words)
    * **many to one** - e.g. emotion AI (inferring emotional state from some text)
    * **many to many** - e.g. machine translation (translating a sentence from one language to another)
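
The difference between **many to one** and **many to many** can be sketched with a toy recurrence. This is only an illustration: the weights `0.5` and `0.8` are arbitrary, the flag `return_sequences` is named after the Keras argument of the same name, and the sketch covers only the simplest many-to-many case with one output per input step (machine translation systems usually use a more involved encoder-decoder setup).

```python
import math


def run_rnn(inputs, return_sequences=False):
    """Toy one-unit RNN; the weights 0.5 and 0.8 are arbitrary illustration values."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(0.5 * x + 0.8 * h)
        outputs.append(h)
    # many to many: one output per time step; many to one: only the last one
    return outputs if return_sequences else outputs[-1]


sequence = [1.0, 2.0, 3.0, 4.0]
many_to_one = run_rnn(sequence)                          # a single value
many_to_many = run_rnn(sequence, return_sequences=True)  # one value per step

print(many_to_one)
print(len(many_to_many))
```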
    



<center><img src="https://edlitera-images.s3.amazonaws.com/RNN_types.jpeg" width="1200">


source:
<br>
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

# `Training RNNs`

* you can treat RNNs as a modified version of the MLP

* as with every other network, we go through two phases:
    * forward propagation
    * back propagation

* key difference between standard backpropagation and backpropagation through time - we take into account all former steps when calculating the new value for the weights

* very complex stuff so we will skip going through the equations

* we cover this and similar more advanced topics in our `Introduction to Deep Learning with Python` course

# `Why are vanilla RNNs not used anymore?`

* two common problems occur:
    
    * vanishing gradients - shrinking of the gradients
    * exploding gradients - growth of the gradients

**Cause of problems:**

* the recurrent multiplication during backpropagation causes the gradients to either shrink exponentially or to grow exponentially
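
A quick numerical sketch shows how fast this happens. It assumes, purely for illustration, that the gradient is multiplied by the same constant factor at every time step; in a real network the factor depends on the weights and activations.

```python
def backpropagated_gradient(factor, steps, start=1.0):
    """Multiply a gradient by the same factor once per time step."""
    gradient = start
    for _ in range(steps):
        gradient *= factor
    return gradient


vanishing = backpropagated_gradient(0.9, steps=100)
exploding = backpropagated_gradient(1.1, steps=100)

print(f"factor 0.9 after 100 steps: {vanishing:.2e}")
print(f"factor 1.1 after 100 steps: {exploding:.2e}")
```

Even a factor only slightly below or above 1 changes the gradient by several orders of magnitude over 100 steps.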

**Solution:**

* using gates we create LSTM networks and GRU networks
    * more on them later on

# `Data preprocessing for RNNs?`

* sometimes we want to perform very little preprocessing, and sometimes we want to perform a lot of preprocessing

* we must choose carefully: **whether to preprocess the data, and how much preprocessing to do, depends entirely on the problem we are trying to solve**

**Rules of thumb:**
    
* **start with no preprocessing and based on the results decide whether to perform some data preprocessing**
* data preprocessing you can safely do: removing timestamps, html code, etc.
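
As a small sketch of such "safe" preprocessing, the helper below removes HTML tags with a regular expression. The function name `strip_html` and the sample review are made up for this example.

```python
import re


def strip_html(text):
    """Replace HTML tags with a space, then collapse repeated whitespace."""
    without_tags = re.sub(r"<.*?>", " ", text)
    return " ".join(without_tags.split())


raw = "A great film.<br /><br />Loved it."  # made-up sample text
print(strip_html(raw))
```

Replacing each tag with a space (instead of an empty string) keeps the words on either side of the tag from being glued together.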

# `Long Short-Term Memory networks - LSTM`

* perform extremely well on a wide variety of problems

* designed to avoid the problems encountered with vanilla RNNs (vanishing gradients in particular)

* storing information for a long time is their default behaviour, and a byproduct of their design
    * it is not something that needs to be trained

## `LSTM vs RNN `

* the main difference between LSTMs and RNNs is in the structure that gets repeated

* in RNN, the structure that gets repeated is very simple


* that same repeating structure is much more complicated in LSTMs

* the detailed structure and inner workings of LSTM are outside the scope of this course
    * if you're interested, we do cover this in our `Introduction to Deep Learning with Python` course

### `LSTM Structure`

<center><img src="https://edlitera-images.s3.amazonaws.com/LSTM_general_overview.png" width="1200">

source:
<br>
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

# `LSTM example`

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Import everything from tensorflow.keras to avoid mixing it with the standalone keras package
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.metrics import BinaryAccuracy

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load in our data and create a Dataframe

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/imdb_dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


In [4]:
# Shuffle data

df = df.sample(frac=1).reset_index(drop=True) 

In [5]:
# Text preprocessing

# Replace HTML tags with a space (assign the result back instead of using inplace on a column)
df["review"] = df["review"].replace("<.*?>", " ", regex=True)

In [6]:
# Define independent feature

X = df["review"]

# Define dependent feature

y = df["sentiment"]

In [7]:
# Separate data into training data and testing data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

In [8]:
# Separate data into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
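
The two splits above leave three datasets. Their sizes can be checked with simple arithmetic. This sketch assumes the dataset has 50,000 rows and uses the batch size of 128 that is set later for training; it is plain arithmetic and does not call scikit-learn.

```python
import math

n_reviews = 50_000   # assumed number of rows in the dataset
test_percent = 20    # test_size=0.20 in both splits
batch_size = 128     # value used later for training

n_test = n_reviews * test_percent // 100
n_train_valid = n_reviews - n_test
n_valid = n_train_valid * test_percent // 100
n_train = n_train_valid - n_valid

# Number of batches the model sees in one epoch
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_valid, n_test)
print(steps_per_epoch)
```

Under these assumptions the final proportions are 64% training, 16% validation and 20% testing data.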

In [9]:
# Define tokenizer

tokenizer = Tokenizer(num_words=20_000, oov_token = "<OOV>")

In [10]:
# Fit tokenizer on train data

tokenizer.fit_on_texts(X_train)

In [11]:
# Define number of total words

vocab_size = len(tokenizer.word_index) + 1 

In [12]:
# Convert into sequences of integers

X_train = tokenizer.texts_to_sequences(X_train)
X_valid = tokenizer.texts_to_sequences(X_valid)
X_test = tokenizer.texts_to_sequences(X_test)
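
To make the last few cells less of a black box, here is a simplified, pure-Python sketch of what a word-level tokenizer does. It is not the Keras implementation (for example, it does not strip punctuation), and the helper names `fit_word_index` and `texts_to_sequences` are used here only for illustration.

```python
from collections import Counter


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index, num_words=None, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word, oov_index)
            if num_words is not None and index >= num_words:
                index = oov_index  # outside the num_words most frequent words
            sequence.append(index)
        sequences.append(sequence)
    return sequences


train_texts = ["good movie", "bad movie", "good good plot"]  # made-up data
word_index = fit_word_index(train_texts)

print(word_index)
print(texts_to_sequences(["good unseen movie"], word_index))
print(texts_to_sequences(["bad plot good"], word_index, num_words=4))
```

The word index is built from the training texts only, which is why the tokenizer above is fitted on `X_train` and then reused for the validation and test data.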

In [13]:
# Define values important for padding

max_length = 100
trunc_type = "post"
padding_type = "post"

In [14]:
# Pad train, validation and test data

X_train_padded = pad_sequences(X_train, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_valid_padded = pad_sequences(X_valid, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_test_padded = pad_sequences(X_test, padding=padding_type, maxlen=max_length, truncating=trunc_type)
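
With `padding="post"` and `truncating="post"`, short sequences are filled with zeros at the end and long sequences are cut off at the end. A minimal pure-Python sketch of that behaviour (the helper `pad_post` is hypothetical, not part of Keras):

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad with `value`."""
    padded = []
    for sequence in sequences:
        kept = sequence[:maxlen]                      # truncating="post"
        kept = kept + [value] * (maxlen - len(kept))  # padding="post"
        padded.append(kept)
    return padded


print(pad_post([[5, 8], [3, 9, 4, 7, 2, 6]], maxlen=4))
```

After padding, every sequence has the same length, which is what allows them to be stacked into a single array and fed to the model in batches.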

In [15]:
# Define model

embedding_dim = 100
input_dim = vocab_size

model = Sequential()
model.add(Embedding(input_dim=input_dim, output_dim=embedding_dim, input_length=max_length))
model.add(LSTM(64, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 100)          10158300  
_________________________________________________________________
lstm (LSTM)                  (None, 64)                42240     
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                1040      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 10,201,597
Trainable params: 10,201,597
Non-trainable params: 0
_________________________________________________________________
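
The parameter counts in the summary can be reproduced by hand. In the sketch below, `vocab_size` is inferred from the summary itself (10,158,300 embedding parameters divided by 100 dimensions); the LSTM formula counts four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias.

```python
vocab_size = 101_583   # inferred from the summary: 10,158,300 / 100
embedding_dim = 100
lstm_units = 64
dense_units = 16

# One vector of size embedding_dim per word index
embedding_params = vocab_size * embedding_dim

# Four weight sets, each: (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Fully connected layers: weights plus one bias per unit
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params, lstm_params, dense_params, output_params)
print(total_params)
```

Almost all of the parameters sit in the embedding layer, which is one reason pretrained embeddings are attractive.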


In [16]:
# Compile model

loss_function = BinaryCrossentropy()

metric = BinaryAccuracy()

optim = Adam(learning_rate=0.0001)

model.compile(loss=loss_function, optimizer=optim, metrics=[metric])

In [17]:
# Define training parameters

num_epochs = 10
batch_size = 128

# Train model

history = model.fit(X_train_padded, 
                    y_train, 
                    batch_size=batch_size, 
                    epochs=num_epochs, 
                    verbose=1, 
                    validation_data=(X_valid_padded, y_valid))

Epoch 1/10
Epoch 2/10
 12/250 [>.............................] - ETA: 2:11 - loss: 0.6915 - binary_accuracy: 0.5469

KeyboardInterrupt: 

In [None]:
# Save model

model.save("LSTM_example_model.h5")

In [None]:
from keras.models import load_model

# Load saved model

model = load_model("LSTM_example_model.h5")

In [None]:
# Make predictions

y_pred = model.predict(X_test_padded)

In [None]:
y_pred

In [None]:
# Select a threshold and use it to convert predictions into classes

y_pred = y_pred > 0.5

In [None]:
# Create confusion matrix

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

In [None]:
# Create classification report

print(classification_report(y_test, y_pred))
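
To see what the confusion matrix and the classification report are built from, here is a small pure-Python sketch. The probabilities and labels are made up, and the helper `confusion_counts` is hypothetical; scikit-learn computes the same quantities for you.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


probabilities = [0.91, 0.40, 0.65, 0.08, 0.55, 0.30]  # made-up model outputs
y_true = [1, 0, 1, 0, 0, 1]                           # made-up true labels

# Same thresholding step as above: probability > 0.5 means class 1
y_pred = [int(p > 0.5) for p in probabilities]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives were found

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```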

## Using pretrained embeddings

In [18]:
# Load pretrained embeddings

embedding_vector = dict()

word_embeddings = "glove.6B.100d.txt"

# Each line holds a word followed by the values of its embedding vector
with open(word_embeddings, encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coef = np.asarray(values[1:], dtype="float32")
        embedding_vector[word] = coef

In [19]:
# Create an embedding matrix to use as weights for the embedding layer

embedding_matrix = np.zeros((vocab_size, 100))

for word, i in tokenizer.word_index.items():
    embedding_value = embedding_vector.get(word)
    
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value
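
The two cells above can be illustrated on a toy scale. The sketch below uses made-up 3-dimensional vectors and a made-up word index instead of the real GloVe file, and plain lists instead of NumPy arrays, so treat it as an illustration of the idea only.

```python
# Made-up lines in the "word value value value" layout of the embeddings file
glove_lines = [
    "good 0.1 0.2 0.3",
    "movie 0.4 0.5 0.6",
]

embedding_vector = {}
for line in glove_lines:
    values = line.split()
    embedding_vector[values[0]] = [float(v) for v in values[1:]]

# Made-up tokenizer word index; "zzyzx" has no pretrained vector
word_index = {"<OOV>": 1, "good": 2, "movie": 3, "zzyzx": 4}
vocab_size = len(word_index) + 1  # row 0 is reserved for padding

embedding_matrix = [[0.0] * 3 for _ in range(vocab_size)]
for word, i in word_index.items():
    vector = embedding_vector.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # row i holds the vector of word i

for row in embedding_matrix:
    print(row)
```

Words without a pretrained vector keep an all-zero row, so the model cannot tell them apart from padding.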

In [20]:
# Define embedding layer with pretrained weights

embedding_layer = Embedding(
    vocab_size, 
    100, 
    weights=[embedding_matrix], 
    input_length=100, 
    trainable=False
)

In [21]:
# Define model

embedding_dim = 100
input_dim = vocab_size

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(64, recurrent_dropout=0.2)) 
model.add(Dropout(0.5))
model.add(Dense(32, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          10158300  
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 10,202,653
Trainable params: 44,353
Non-trainable params: 10,158,300
_________________________________________________________________


In [22]:
# Compile model

loss_function = BinaryCrossentropy()

metric = BinaryAccuracy()

optim = Adam()

model.compile(loss=loss_function, optimizer=optim, metrics=[metric])

In [23]:
# Define training parameters

num_epochs = 15
batch_size = 128

# Train model

history = model.fit(X_train_padded, 
                    y_train, 
                    batch_size=batch_size, 
                    epochs=num_epochs, 
                    verbose=1, 
                    validation_data=(X_valid_padded, y_valid))

Epoch 1/15

KeyboardInterrupt: 

In [None]:
# Save model

model.save("LSTM_model_with_pretrained")

In [None]:
# Load saved model

model = load_model("LSTM_model_with_pretrained")

In [None]:
# Make predictions

y_pred = model.predict(X_test_padded)

In [None]:
# Select a threshold and use it to convert predictions into classes

y_pred = y_pred > 0.5

In [None]:
# Create confusion matrix

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

In [None]:
# Create classification report

print(classification_report(y_test, y_pred))

# `LSTM take home exercise`

**Train an `LSTM model` to classify wines into good wines and superior wines (classes 0 and 1), using the dataset stored in the `https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv` file. Do not clean your data in any way. Instead, use it as is to prove that the model will outperform the classic Machine Learning models we trained earlier, even without text data preprocessing!**

**When done training the model, print the classification report.**

## `Solution`
