# Preface

In this notebook, we apply recurrent neural networks (RNNs) and their variants to predict whether movie reviews are positive or negative. This notebook serves the following purposes:
  * Introduce basic usage of RNN/LSTM
  * Introduce basic text data handling

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pathlib
sns.set(font_scale=1.5, style='darkgrid')

# The IMDB Dataset

We will use the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/), which consists of text reviews of various movies.

Our goal is to develop a machine learning model which can predict, given a text review, whether the sentiment of the review is positive (1) or negative (0).

The IMDB dataset is built into `keras.datasets`.

In [None]:
from tensorflow.keras.datasets import imdb

In [None]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

We can see that the inputs are already encoded as integers:

In [None]:
print(x_train[0])

These are word encodings based on frequency: more frequent words receive smaller integers. For details, have a look at the documentation of the dataset, e.g. [here](http://keras.io/datasets/#imdb-movie-reviews-sentiment-classification).

To see what the review text actually is, we write a simple function that converts the encodings back into words. This is done using the `imdb.get_word_index()` method.

In [None]:
def to_words(word_ids):
    """
    Convert list of word_ids back to words.
    Special chars for 0-3 are based on the default kwargs of
    imdb.load_data()
    """
    index_from = 3
    word_to_id = imdb.get_word_index()
    word_to_id = {k: (v + index_from) for k, v in word_to_id.items()}
    word_to_id['<PAD>'] = 0
    word_to_id['<START>'] = 1
    word_to_id['<UNK>'] = 2
    word_to_id['<UNUSED>'] = 3
    id_to_word = {value: key for key, value in word_to_id.items()}
    # Use `word_id` rather than `id` to avoid shadowing the built-in id()
    return ' '.join(id_to_word[word_id] for word_id in word_ids)

Let us look at some randomly chosen reviews:

In [None]:
idx = np.random.choice(len(x_train))
print('Sentiment: ', y_train[idx])
print('Review: ', to_words(x_train[idx]))

## Preprocessing the data for training

Let us check the length (number of words) of each review.

In [None]:
lengths_train = list(map(len, x_train))

In [None]:
ax = sns.histplot(lengths_train, kde=True)
ax.set_xlabel('Number of Words')

The number of words clearly varies from review to review, so we pad the reviews to a common length. This is performed by the `sequence.pad_sequences` function. With the default settings, any review longer than `maxlen` is truncated from the front (its last `maxlen` words are kept), and shorter reviews are padded at the front with 0s.

In [None]:
from tensorflow.keras.preprocessing import sequence

In [None]:
x_train = sequence.pad_sequences(x_train, maxlen=100)
x_test = sequence.pad_sequences(x_test, maxlen=100)
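
To make the padding and truncation rule concrete, here is a minimal NumPy sketch of the behaviour described above. It assumes the default "pre" padding and "pre" truncation; the helper name `pad_pre` and the toy sequences are made up for illustration and are not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                   # an empty sequence stays all padding
        kept = seq[-maxlen:]           # keep only the LAST maxlen tokens
        out[row, -len(kept):] = kept   # right-align; padding fills the front
    return out

# Two toy "reviews": one shorter and one longer than maxlen
toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
# [[0 5 6 7]
#  [3 4 5 6]]
```

The short review gains a leading 0, while the long one loses its first two tokens, so the end of each review is always preserved.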

# Build a Simple RNN Model

Now, we are ready to build an RNN model to learn to predict the sentiment given the review text.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.optimizers import Adam
from tqdm.keras import TqdmCallback

In [None]:
model = Sequential()

Here, we cannot work directly with `x_train` because its entries are integer encodings. Instead, we first use an `Embedding` layer that maps these integer encodings to vectors in a feature space.

For example, `Embedding(5, 2)` is a 5x2 matrix that maps integer encodings (0,...,4) into 5 real-valued vectors of 2 dimensions:

| Coding | Embedded Vectors |
| --- | --- |
| 0 | [0.5, 1.0] |
| 1 | [1.0, 1.2] |
| 2 | [0.1, -0.6] |
| 3 | [0.3, 0.5] |
| 4 | [-0.4, -0.1] |

This embedding is trainable, so we can learn to embed these encodings in the right way relevant to the task: words of similar meaning should have similar embeddings!
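
Conceptually, an embedding layer is a row lookup into a matrix. The sketch below reproduces the 5x2 example table in NumPy; the token ids are made up, and in a real model the matrix entries would be learned during training rather than fixed.

```python
import numpy as np

# The 5x2 table from the example; in practice these values are trainable.
embedding_matrix = np.array([
    [0.5, 1.0],
    [1.0, 1.2],
    [0.1, -0.6],
    [0.3, 0.5],
    [-0.4, -0.1],
])

# A toy batch of two "reviews", each three tokens long (made-up ids)
token_ids = np.array([[0, 3, 3],
                      [4, 1, 2]])

# Integer indexing selects one row per token id
embedded = embedding_matrix[token_ids]
print(embedded.shape)   # (2, 3, 2): (batch, timesteps, embedding_dim)
print(embedded[0, 1])   # [0.3 0.5], the vector for id 3
```

Each integer is replaced by its 2-dimensional vector, so a batch of shape `(batch, timesteps)` becomes `(batch, timesteps, embedding_dim)`.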

In [None]:
model.add(Embedding(20000, 128))

Now, we can add the RNN layer and append an output `Dense` layer to it.

In [None]:
model.add(SimpleRNN(128))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()
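
The parameter counts reported by `model.summary()` can be checked by hand. The following is a small arithmetic sketch, assuming the usual parameterisation of a simple RNN layer (an input weight matrix, a recurrent weight matrix, and one bias vector); the variable names are only for illustration.

```python
vocab_size, embed_dim, rnn_units = 20000, 128, 128

# Embedding: one embed_dim-dimensional vector per word id
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias, per hidden unit
rnn_params = rnn_units * (embed_dim + rnn_units + 1)

# Dense output layer: one weight per hidden unit plus a bias
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 2560000 32896 129 2593025
```

Almost all of the parameters sit in the embedding matrix, which is typical for text models with a large vocabulary.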

Since we are going to test several models, we write helper functions to train, save, and evaluate them.

In [None]:
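def train_and_save(model, path, force=False, optimizer=None):
    """
    Compile the model, then look for saved weights in path.
    If found (and force=False), load them and return None.
    Otherwise, train, save the weights to path, and return the
    training history as a DataFrame.
    """
    model_save_dir = pathlib.Path(path)

    # Create a fresh optimizer per call: a default of Adam(0.001) in the
    # signature would be built once and shared by every model.
    if optimizer is None:
        optimizer = Adam(0.001)

    model.compile(
        loss='binary_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy'],
    )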
    
    if model_save_dir.exists() and not force:
        model.load_weights(str(model_save_dir))
    else:
        history = model.fit(
            x_train,
            y_train,
            batch_size=32,
            epochs=15,
            validation_data=(x_test, y_test),
            callbacks=[TqdmCallback(verbose=1)],
            verbose=0,
        )
        model.save_weights(str(model_save_dir))
        results = pd.DataFrame(history.history)
        results['epoch'] = history.epoch
        return results

In [None]:
def evaluate(model, train_data, test_data):
    """
    Evaluate model on train/test sets
    """
    eval_train = model.evaluate(*train_data, batch_size=512, verbose=0)
    eval_test = model.evaluate(*test_data, batch_size=512, verbose=0)
    print(f'Train - loss = {eval_train[0]:.3f}, acc = {eval_train[1]:.3f} ')
    print(f'Test - loss = {eval_test[0]:.3f}, acc = {eval_test[1]:.3f} ')

In [None]:
train_and_save(model=model, path='imdb_simple_rnn.h5')

In [None]:
evaluate(model, train_data=(x_train, y_train), test_data=(x_test, y_test))
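
The two numbers printed by `evaluate` are the binary cross-entropy loss and the accuracy. As a rough sketch of how they are computed from the sigmoid outputs, the NumPy code below evaluates both on a handful of made-up predictions. It assumes a decision threshold of 0.5 for accuracy; the function names and the toy values are illustrative only.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])   # made-up sigmoid outputs

loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(f'loss = {loss:.3f}, acc = {acc:.3f}')
# loss = 0.540, acc = 0.500
```

Note that the loss also penalises correct but unconfident predictions, which is why loss and accuracy do not always move together.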

# Deep RNN

Now, we train a deep RNN model by stacking two RNN layers. This is done simply by adding another `model.add` call. Note, however, that every RNN layer except the last one needs `return_sequences` set to `True`, so that the entire hidden sequence $h^{(t)}$ is returned and can be treated as the input to the next layer.

In [None]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(SimpleRNN(128, return_sequences=True))
model.add(SimpleRNN(64, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
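
The effect of `return_sequences` can be seen in a small NumPy sketch of the recurrence $h^{(t)} = \tanh(x^{(t)} W + h^{(t-1)} U + b)$. This is an illustration under simplifying assumptions (a single example, random weights, and the hypothetical names `simple_rnn`, `W`, `U`, `b`), not the Keras implementation.

```python
import numpy as np

def simple_rnn(x, W, U, b, return_sequences=False):
    """Run a tanh RNN over x of shape (timesteps, input_dim)."""
    h = np.zeros(U.shape[0])             # initial hidden state h^(0)
    states = []
    for x_t in x:                        # iterate over timesteps
        h = np.tanh(x_t @ W + h @ U + b)
        states.append(h)
    # Either every hidden state, or only the final one
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 3, 4
x = rng.normal(size=(timesteps, input_dim))
W = rng.normal(size=(input_dim, units))
U = rng.normal(size=(units, units))
b = np.zeros(units)

h_seq = simple_rnn(x, W, U, b, return_sequences=True)
h_last = simple_rnn(x, W, U, b, return_sequences=False)
print(h_seq.shape)    # (5, 4): one hidden state per timestep
print(h_last.shape)   # (4,): only the final hidden state
```

A second RNN layer needs one input vector per timestep, which is exactly what `h_seq` provides; `h_last` is what the final layer hands to the `Dense` output.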

RNNs are notoriously hard to train. For this deeper model, we will use a smaller learning rate than before for the Adam optimizer.

In [None]:
train_and_save(model=model, path='imdb_deep_rnn.h5', optimizer=Adam(0.0001))

In [None]:
evaluate(model, train_data=(x_train, y_train), test_data=(x_test, y_test))

Observe that by going deeper we actually manage to do a little better than before!

# LSTM

Now, we can also try to improve performance using an LSTM, which makes learning long-term dependencies much easier. The implementation is very simple: we just replace the call to `SimpleRNN` with `LSTM`.

In [None]:
from tensorflow.keras.layers import LSTM

In [None]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
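
To give some intuition for what the `LSTM` layer computes, here is a NumPy sketch of a single step of the standard LSTM cell. The stacked weight layout, gate ordering, and names (`lstm_step`, `W`, `U`, `b`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g               # additive update of the cell state
    h = o * np.tanh(c)                   # hidden state exposed to next layer
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)   # (4,) (4,)

# Four gate blocks, each the size of a SimpleRNN layer with the same widths
lstm_params = 4 * 128 * (128 + 128 + 1)
print(lstm_params)        # 131584
```

The cell state `c` is updated additively and gated by `f`, which is the mechanism usually credited with making long-range information easier to carry than in a plain RNN.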

In [None]:
train_and_save(model=model, path='imdb_lstm.h5')

In [None]:
evaluate(model, train_data=(x_train, y_train), test_data=(x_test, y_test))

# Exercise

1. Play around with the above models and optimizer configurations to get better models.
2. Observe that the training accuracy is much greater than the test accuracy. What can you do to improve generalization?