# Tutorial 5 - RNNs

In this tutorial, you will build a recurrent neural network (RNN) with long short-term memory (LSTM) units to classify movie reviews from the IMDB dataset as positive or negative. As discussed in the lectures, RNNs maintain a state vector that is updated at every step, so it summarises all previous inputs in a sequence. LSTMs are a modification of the standard RNN cell: they perform additional operations to filter out unimportant information, which allows longer sequences to be learned.
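To make the idea of a state vector concrete, here is a minimal sketch of a plain (non-LSTM) recurrent update on scalar inputs. The function name `rnn_states` and the weights `w_x` and `w_h` are arbitrary values chosen for illustration; they are not taken from the lectures or from Keras.

```python
import math

def rnn_states(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Return the hidden state after each step of a scalar toy RNN."""
    states = []
    h = h0
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the state carries information forward from earlier steps.
states = rnn_states([1.0, 0.0, 0.0])
print(states)
```

The influence of the first input shrinks at every step, which hints at why plain RNNs struggle with long sequences.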

This tutorial has been adapted from the TensorFlow guides: https://www.tensorflow.org/text/tutorials/text_classification_rnn

In [None]:
# Module Imports
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

import matplotlib.pyplot as plt

# Exercise 1 - Setting up the data

As seen in the lectures, neural networks cannot operate directly on raw text, so the text must first be converted to numbers. Here, we will use a vectorisation layer, which maps each word in a sentence to an integer index in a fixed vocabulary.
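The idea behind a vectorisation layer can be sketched in plain Python. This toy version is an illustration only, not how `TextVectorization` is implemented, and the helper names (`standardise`, `build_vocab`, `encode`) are hypothetical. It reserves index 0 for padding and index 1 for unknown words, which mirrors the default convention of the Keras layer.

```python
from collections import Counter
import string

def standardise(text):
    """Lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocab(texts, max_tokens):
    """Keep the most frequent words, after the two reserved entries."""
    counts = Counter(w for t in texts for w in standardise(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def encode(text, vocab):
    """Map each word to its vocabulary index (1 if the word is unknown)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(w, 1) for w in standardise(text).split()]

texts = ["The film was great, great fun!", "The film was dull."]
vocab = build_vocab(texts, max_tokens=6)
encoded = encode("The film was fun", vocab)
print(vocab)
print(encoded)
```

Because the vocabulary size is capped, rarer words fall outside it and all share the same index.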

First, load the dataset using the code provided. You can inspect the data with the `train_dataset.take(1)` method, which returns a dataset containing a single batch of (examples, labels). Print some of the texts and their labels.

Next, set up the `TextVectorization` layer as described below. Use the `encoder.adapt()` method to generate the vocabulary, i.e. the set of words that the model will recognise. Print the top 20 words in the vocabulary, which you can access via the `encoder.get_vocabulary()` method.

Pass a few examples of the data into the encoder and print the results. We can reverse the process by indexing the vocab with the encoded vectors. The decoded texts are not exactly the same as the original texts. Why?
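The mechanics of decoding can be sketched without TensorFlow. The vocabulary and the encoded vector below are small hand-written stand-ins, not output from the real encoder; compare the decoded string with what a full sentence would look like when you answer the question above.

```python
# Hand-written stand-ins for encoder.get_vocabulary() and an encoded review.
vocab = ['', '[UNK]', 'the', 'film', 'was', 'great']
encoded = [2, 3, 4, 1, 0, 0]

# Look up each index in the vocabulary, skipping the padding index 0.
decoded = ' '.join(vocab[i] for i in encoded if i != 0)
print(decoded)
```

With a NumPy array for `vocab`, the same lookup can be written as `vocab[encoded]`.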

In [None]:
# Load the IMDB reviews dataset from the tensorflow_datasets module.
# tf.data pipelines are set up for the training and testing data (we will see
# these in later lectures).
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
Train_TextOnly = train_dataset.map(lambda text, label: text)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
### Print a few examples of the training dataset, along with their labels.
# train_dataset.take(1)



In [None]:
### Set up a TextVectorization layer, with a vocab size (max_tokens) of 1000.
### Build the vocab of the model by calling the encoder.adapt() method on the
### Train_TextOnly dataset.
# tf.keras.layers.TextVectorization, encoder.adapt

VOCAB_SIZE = 1000
encoder = 


In [None]:
### Extract the vocabulary from the encoder, print the top 20 words.
# encoder.get_vocabulary()

# vocab should be a np.array
vocab = 


In [None]:
### Encode a few example reviews and print the results.
# encoder().numpy()

encoded_example = 

In [None]:
### Decode some example texts, by using the encoded vectors as indices for the
### vocabulary.



# Exercise 2 - Model set up

We will construct our network as a Keras `Sequential` model.

First, set up the embedding layer, which will take the encoded words and learn a suitable representation of the data.
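Conceptually, an embedding layer is a lookup table with one trainable row per vocabulary index. The sketch below uses a tiny randomly initialised table in plain Python; the sizes, the initial value range, and the variable names are arbitrary choices for illustration rather than what Keras does internally.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 3

# One row of (initially random) numbers per vocabulary index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

encoded = [2, 3, 4, 1, 0, 0]

# Embedding a sequence means looking up one row per token.
embedded = [table[i] for i in encoded]

# A padding mask marks which positions hold real tokens (index != 0).
mask = [i != 0 for i in encoded]
print(mask)
```

During training the rows of the table are adjusted, so that words used in similar ways end up with similar vectors.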

Next, build the sequential model with the following layers:
```
encoder
embedder
LSTM - 64 units
Dense - 64 units, relu activation
Dense - 1 unit, sigmoid activation
```

Compile the model with the `BinaryCrossentropy` loss, the Adam optimizer, and the accuracy metric. Train the model for 10 epochs, storing the losses and metrics in a history object.
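For reference, binary cross-entropy for a true label $y$ and a predicted probability $p$ is $-[y \log p + (1 - y) \log(1 - p)]$, averaged over the batch. The sketch below computes it in plain Python; the function name is hypothetical, and it omits the small clipping of `p` that library implementations typically apply for numerical safety.

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# A confident correct prediction gives a small loss ...
confident_right = binary_crossentropy([1], [0.9])
# ... while a confident wrong prediction is penalised heavily.
confident_wrong = binary_crossentropy([1], [0.1])
print(confident_right, confident_wrong)
```

The probabilities here correspond to the output of the final sigmoid unit, which is why `from_logits=False` is the appropriate setting for this model.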

In [None]:
### Set up the embedding layer. The input_dim is the length of the vocab, and
### the output_dim is 64. You should also set mask_zero=True.
# tf.keras.layers.Embedding

embedder = 

In [None]:
### Build a Keras Sequential model, with the layers provided above.
# tf.keras.layers.LSTM, tf.keras.layers.Dense
model = tf.keras.Sequential([

])

In [None]:
### Compile the model with binary crossentropy loss, the Adam optimizer, and
### the accuracy metric.
# tf.keras.losses.BinaryCrossentropy(from_logits=False)
# tf.keras.optimizers.Adam(1e-4)

model.compile()

In [None]:
# Train the model for 10 epochs.
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

# Exercise 3 - Evaluate the model

Now that the model is trained, we can analyse its performance. Consider, during this exercise, not just what the numbers are, but the meaning behind them.

Start by calculating the test loss and test accuracy, using the `model.evaluate()` method. Print these values.

Create two plots, one for the loss and one for the accuracy, containing the training and validation values from the history object (here the validation data are batches taken from the test set). How do the test and train values compare? How do they evolve over time? Why?
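The `history.history` attribute is a dictionary that maps each metric name to a list with one value per epoch, with validation metrics prefixed by `val_`. The sketch below uses made-up placeholder numbers, not results from this model, purely to show how the two curves can be compared epoch by epoch.

```python
# Made-up placeholder values, NOT results from this tutorial's model.
history_dict = {
    'loss':     [0.69, 0.55, 0.42, 0.35],
    'val_loss': [0.68, 0.57, 0.47, 0.45],
}

# Gap between validation and training loss at each epoch.
gaps = [val - train
        for train, val in zip(history_dict['loss'], history_dict['val_loss'])]

for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: val_loss - loss = {gap:.2f}")
```

A gap that keeps growing while the training loss keeps falling is something to look for when you inspect your own plots.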

Extract a few examples from the test dataset, and use the `model.predict()` method to classify them as positive or negative. Do you agree with the model? How do the predicted labels compare to the true labels? Can you tell why it might be getting some examples wrong?
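Because the final layer is a sigmoid, each prediction is a number between 0 and 1. One common way to turn these into class labels is to threshold at 0.5, as sketched below. The probabilities and labels are made-up placeholders and the helper names are hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Label a prediction 1 (positive) if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(predicted, true):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == t for p, t in zip(predicted, true))
    return matches / len(true)

# Made-up placeholder values standing in for model.predict output and labels.
probabilities = [0.92, 0.08, 0.55, 0.40]
true_labels = [1, 0, 0, 1]

predicted = classify(probabilities)
print(predicted, accuracy(predicted, true_labels))
```

Predictions close to 0.5 are the ones the model is least sure about, so they are a good place to start when looking for misclassified reviews.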

Lastly, write a short review of the last movie you watched, and use the model to predict whether it is positive or negative. Is it right?

In [None]:
### Use the model.evaluate method to calculate the test loss and accuracy
# model.evaluate()

test_loss, test_acc = 


In [None]:
### Create plots of the training and test losses and metrics
# history.history

In [None]:
### Extract a batch of examples, and pass the texts to the model.predict method
### to classify them as positive (close to 1) or negative (close to 0).
# model.predict

examples = [(texts, labels) for texts, labels in test_dataset.take(1)]

texts = examples[0][0]
labels = examples[0][1]


In [None]:
### Write a short review of the last movie you watched and use the model to
### predict if it is positive or negative

sample_review = np.array([('Text of your review')])
