# Lab 13
## Recurrent Neural Networks
In this week’s lab we will apply a bidirectional RNN (GRU) to a binary text classification problem. Specifically, we will train a model to distinguish between positive and negative sentiment in movie reviews using the [IMDB movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/). If you are not familiar with bidirectional RNNs, see Chapter 10.3 of the textbook.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import TextVectorization, Embedding, Bidirectional, GRU, Dense
import tensorflow_datasets as tfds

## Data Loading
The following class handles data loading and preprocessing.

First we download the whole dataset and divide it into train, validation, test, and unsupervised subsets.
The unsupervised subset contains movie reviews that are not labeled.
We will need it later to initialize the text-to-number encoder.

In [None]:
tfds.disable_progress_bar()


class DataLoaderIMDB:
    """
    This class downloads and prepares the IMDB Large Movie Review Dataset.
    It splits the dataset into train, valid, test, and unsupervised subsets.
    """

    def __init__(self):
        MINI_BATCH_SIZE = 64
        VALID_SPLIT = 0.1

        dataset, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

        valid_size = int(info.splits["train"].num_examples * VALID_SPLIT)

        self._valid_dataset = dataset["train"].take(valid_size)
        self._train_dataset = dataset["train"].skip(valid_size)
        self._test_dataset = dataset["test"]
        self._unsupervised_dataset = dataset["unsupervised"]

        # example = next(self._train_dataset.as_numpy_iterator())
        # print("text:", example[0])
        # print("label:", example[1])

        self._train_dataset = self._train_dataset.shuffle(10000)

        # Batch
        self._train_dataset = self._train_dataset.batch(MINI_BATCH_SIZE)
        self._valid_dataset = self._valid_dataset.batch(MINI_BATCH_SIZE)
        self._test_dataset = self._test_dataset.batch(MINI_BATCH_SIZE)
        self._unsupervised_dataset = self._unsupervised_dataset.batch(MINI_BATCH_SIZE)

        # Prefetch
        self._train_dataset = self._train_dataset.prefetch(tf.data.AUTOTUNE)
        self._valid_dataset = self._valid_dataset.prefetch(tf.data.AUTOTUNE)
        self._test_dataset = self._test_dataset.prefetch(tf.data.AUTOTUNE)
        self._unsupervised_dataset = self._unsupervised_dataset.prefetch(
            tf.data.AUTOTUNE
        )

    @property
    def train_dataset(self):
        return self._train_dataset

    @property
    def valid_dataset(self):
        return self._valid_dataset

    @property
    def test_dataset(self):
        return self._test_dataset

    @property
    def unsupervised_dataset(self):
        return self._unsupervised_dataset
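
The `take` / `skip` split used in `DataLoaderIMDB` can be mimicked on a toy sequence in plain Python. The sketch below is only an illustration and not part of the lab code: the helpers `take`, `skip`, and `batch` are hypothetical stand-ins for the corresponding `tf.data.Dataset` methods, and the twenty integers stand in for labeled reviews.

```python
from itertools import islice


def take(items, n):
    """Return the first n items (stand-in for Dataset.take)."""
    return list(islice(items, n))


def skip(items, n):
    """Return everything after the first n items (stand-in for Dataset.skip)."""
    return list(islice(items, n, None))


def batch(items, batch_size):
    """Group items into lists of at most batch_size elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


VALID_SPLIT = 0.1
examples = list(range(20))  # toy stand-in for the labeled training reviews

valid_size = int(len(examples) * VALID_SPLIT)
valid = take(examples, valid_size)  # first 10 % become the validation set
train = skip(examples, valid_size)  # the remaining 90 % are used for training

print("valid:", valid)
print("train batches:", batch(train, 8))
```

Because the two subsets come from disjoint positions of the same sequence, no validation example leaks into the training set. As with `Dataset.batch` and its default settings, the last batch may be smaller than the others.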

## Model
This class defines the neural network as a subclass of `tf.keras.Model`.

Each element of our dataset consists of a string of words (one movie review per element).
However, neural networks operate on numbers and cannot process letters or sentences directly.
Therefore, we first have to convert each word into a vector using a `TextVectorization` layer in combination with an `Embedding` layer.

- The `TextVectorization` layer simply converts the text to a sequence of token indices (basically a lookup table which maps from word to integer).
- An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors. If you want to know more about embedding layers, see [here](http://jalammar.github.io/illustrated-word2vec/).
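
The two preprocessing steps above can be sketched with NumPy. This is a simplified illustration, not how Keras implements the layers: the toy vocabulary, the `vectorize` helper, and the embedding size of 4 are assumptions made for the example. Index 0 is reserved for padding and index 1 for unknown words, mirroring the convention of `TextVectorization`.

```python
import numpy as np

# Toy vocabulary. Index 0 is the padding token, index 1 the unknown-word token.
vocab = ["", "[UNK]", "the", "movie", "was", "great", "bad"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def vectorize(text, length):
    """Lowercase, split on whitespace, map words to indices, pad with 0."""
    tokens = text.lower().split()
    ids = [word_to_index.get(tok, 1) for tok in tokens][:length]
    return ids + [0] * (length - len(ids))


rng = np.random.default_rng(0)
EMBEDDING_DIM = 4  # the lab model uses 64
# One row vector per vocabulary entry (randomly initialized, trainable in a real model).
embedding_matrix = rng.normal(size=(len(vocab), EMBEDDING_DIM))

batch = ["The movie was great", "Terrible movie"]
token_ids = np.array([vectorize(t, length=5) for t in batch])
vectors = embedding_matrix[token_ids]  # (batch, time) -> (batch, time, dim)
mask = token_ids != 0  # positions that mask_zero=True treats as real tokens

print(token_ids)
print(vectors.shape)
print(mask)
```

The word "movie" maps to the same index, and therefore to the same vector, in both reviews, while the unseen word "Terrible" falls back to the unknown-word index 1.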

To complete the movie review classifier, we add a bidirectional RNN on top of these preprocessing layers.
The concatenated final states of the two RNNs are then used as an input for two feed-forward network layers with output size of 64 and 1, respectively.
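
To make "concatenated final states" concrete, here is a minimal NumPy sketch of a bidirectional GRU. It is an illustration under stated assumptions: `TinyGRU` is a hypothetical helper with random, untrained weights, and it uses one common textbook formulation of the GRU equations, which differs in details from the default Keras implementation.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class TinyGRU:
    """Minimal GRU (hypothetical helper, randomly initialized, no training)."""

    def __init__(self, input_dim, units, rng):
        # Stacked weights for update gate z, reset gate r, and candidate state.
        self.W = rng.normal(scale=0.1, size=(3, input_dim, units))
        self.U = rng.normal(scale=0.1, size=(3, units, units))
        self.b = np.zeros((3, units))
        self.units = units

    def final_state(self, sequence):
        """Run over a (time, input_dim) sequence and return the last state."""
        h = np.zeros(self.units)
        for x in sequence:
            z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])
            h_cand = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = z * h + (1.0 - z) * h_cand
        return h


rng = np.random.default_rng(0)
sequence = rng.normal(size=(7, 4))  # 7 time steps of 4-dimensional embeddings

forward_gru = TinyGRU(input_dim=4, units=3, rng=rng)
backward_gru = TinyGRU(input_dim=4, units=3, rng=rng)

h_forward = forward_gru.final_state(sequence)
h_backward = backward_gru.final_state(sequence[::-1])  # reads right to left
features = np.concatenate([h_forward, h_backward])

print(features.shape)  # (6,)
```

With 64 units per direction, as in this lab, the concatenated feature vector has 128 entries, which is the input size of the first feed-forward layer.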

The following cell shows one possible way to define the RNN model. There are many other ways to define such a model.

In [None]:
class MyModel(Model):
    """
    RNN for text classification of movie reviews.

    Parameters
    ----------
    vocabulary : tf.data.Dataset
        Batched dataset of review strings used to build the vocabulary of the text vectorizer
    """

    def __init__(self, vocabulary, **kwargs):
        super().__init__(**kwargs)

        self.enc_1 = TextVectorization(max_tokens=1000)
        self.enc_1.adapt(vocabulary)
        self.emb_1 = Embedding(
            input_dim=len(self.enc_1.get_vocabulary()), output_dim=64, mask_zero=True
        )
        self.bid_1 = Bidirectional(GRU(64))
        self.dense_1 = Dense(64, activation="relu")
        self.dense_2 = Dense(1)

    def call(self, x, training=False):
        """
        Forward pass through RNN.

        Parameters
        ----------
        x : Tensor string (None,)
            Mini-batch of review strings which corresponds to RNN input
        training : bool, optional
            Training or testing mode, by default False

        Returns
        -------
        Tensor float32 (None, 1)
            Logits of classifier (pre-sigmoid activations)
        """
        t_enc_1 = self.enc_1(x)
        t_emb_1 = self.emb_1(t_enc_1)
        t_bid_1 = self.bid_1(t_emb_1)
        t_dense_1 = self.dense_1(t_bid_1)
        out = self.dense_2(t_dense_1)

        return out
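
Since the model returns logits rather than probabilities, it can help to see how the two relate. The following sketch uses only the standard library; the helper names `sigmoid` and `bce_from_logit` and the three example logits are made up for illustration. The loss uses the usual numerically stable rewriting of binary cross-entropy on logits.

```python
import math


def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy computed directly on a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit, label in [(2.0, 1), (-1.5, 0), (0.3, 0)]:
    prob = sigmoid(logit)
    predicted = int(logit > 0.0)  # same decision as prob > 0.5
    print(
        f"logit={logit:+.1f} prob={prob:.3f} pred={predicted} "
        f"label={label} loss={bce_from_logit(logit, label):.3f}"
    )
```

A logit of 0 corresponds to a probability of 0.5, so thresholding the logit at 0 gives the same class decision as thresholding the probability at 0.5.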

## Training process

In [None]:
%tensorboard --logdir ./logs

In [None]:
TENSORBOARD_PATH = "./logs"

data_loader = DataLoaderIMDB()

train_dataset = data_loader.train_dataset
valid_dataset = data_loader.valid_dataset
test_dataset = data_loader.test_dataset
unsupervised_dataset = data_loader.unsupervised_dataset

# Create vocabulary for text vectorizer
vocab_1 = train_dataset.map(lambda text, label: text)
vocab_2 = unsupervised_dataset.map(lambda text, label: text)
vocabulary = vocab_1.concatenate(vocab_2)

model = MyModel(vocabulary)

# Compile the Keras model to configure the training process
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
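    # The model outputs logits, so threshold at 0.0 (sigmoid(0.0) = 0.5)
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0, name="accuracy")],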
)

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=TENSORBOARD_PATH, write_graph=False
)

# Train model
model.fit(
    train_dataset,
    epochs=10,
    validation_data=valid_dataset,
    callbacks=[tensorboard_callback],
)

# Test model
test_loss, test_accuracy = model.evaluate(test_dataset)

print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")
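
The vocabulary-building step performed by `adapt` with a `max_tokens` limit can be approximated in plain Python. This is a rough sketch, not the Keras implementation: the three-sentence corpus and the names `MAX_TOKENS` and `toy_vocabulary` are invented for the example, and the ordering of equally frequent words may differ from what `TextVectorization` produces.

```python
from collections import Counter

MAX_TOKENS = 6  # the lab model uses 1000

corpus = [
    "a great movie",
    "a bad movie",
    "great acting and a great plot",
]

# Count how often each whitespace-separated word occurs in the corpus.
counts = Counter(word for text in corpus for word in text.split())

# Two slots are reserved: "" for padding and "[UNK]" for unknown words,
# so only MAX_TOKENS - 2 of the most frequent words are kept.
most_common = [word for word, _ in counts.most_common(MAX_TOKENS - 2)]
toy_vocabulary = ["", "[UNK]"] + most_common

print(toy_vocabulary)
```

Rare words such as "acting" or "plot" do not make it into the vocabulary and would be mapped to the unknown-word token at encoding time.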