
# TP 5b - RNN applications

This lab session will show you how RNNs can be applied to text data. A lot of the workload is dedicated to preparing data; this is true for deep learning in general, but text requires special considerations. Do not worry too much about training the actual RNN, as that is the generic and easy part you have already done in the previous lab.

In [None]:
import random
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from timeit import default_timer as timer
import pickle

random.seed(2111994)

## Dataset

The dataset we will be using is a corpus of user-submitted movie reviews from **IMDB**.

The task you will attempt to perform is **sentiment analysis**. A 0 or 1 label is provided for each review, indicating whether the user rated the movie as bad (0) or good (1) overall.

Run the cell below to load the train and test sets.

In [None]:
import tensorflow_datasets as tfds

N_LIMIT_TRAIN = 10000
N_LIMIT_TEST = 5000

ds_train = list(tfds.load('imdb_reviews', split='train').take(N_LIMIT_TRAIN))
ds_test = list(tfds.load('imdb_reviews', split='test').take(N_LIMIT_TEST))

Preview one item from the dataset:

In [None]:
ds_test[0]

A few caveats about the format:
- both datasets are lists of dictionaries with two keys each, *label* and *text*
- the corresponding values are TF tensors; they need to be converted to plain Python/NumPy values before use
- the *text* is stored as a byte string (as signaled by the *b* before the quote), so it must also be decoded into a regular string

The next cell will take care of those conversions.

In [None]:
def convert_dict(d):
    d["label"] = d["label"].numpy()
    d["text"] = d["text"].numpy().decode("utf-8")
    return d

ds_train = [convert_dict(d) for d in ds_train]
ds_test = [convert_dict(d) for d in ds_test]
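
To see the decoding step in isolation, here is a minimal sketch that uses a plain Python `bytes` literal as a stand-in for what `.numpy()` returns on a string tensor. The sample text is made up for illustration.

```python
# Stand-in for the value returned by d["text"].numpy(): a bytes object.
raw_text = b"A caf\xc3\xa9 scene I loved."

# Decoding turns the UTF-8 bytes into a regular Python string.
decoded_text = raw_text.decode("utf-8")

print(type(raw_text).__name__, "->", type(decoded_text).__name__)
print(decoded_text)
```

Without the decoding step, string methods that expect `str` arguments (and most text libraries) would reject or mishandle the value.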

## 1 - Preprocessing

The purpose of this section is to turn your text data into something usable for an RNN: **vectors**. The steps are the following:
- build a **tokenizer** for breaking text into words
- determine a reduced **vocabulary** for your text data
- convert your text to lists of integers



The first step is to *cleanly* break down your text into words. This is actually more difficult than it seems; you will try your code on the following string.
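
As a rough illustration of why this is harder than it seems, the sketch below splits a made-up review fragment on whitespace only. It is not the approach used in this lab; it only shows what a naive split leaves behind.

```python
# Made-up review fragment containing punctuation and an HTML line break.
text = "Great movie, isn't it?<br />Loved it."

# Splitting on whitespace leaves punctuation and markup glued to the words.
naive_tokens = text.split()
print(naive_tokens)
```

Tokens such as `movie,` or `/>Loved` would each count as a distinct "word", which is why a proper tokenizer and a clean-up step are needed.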


In [None]:
sample_string = ds_test[0]["text"]

We will rely on a library called *NLTK*. Import it with the following cell:

In [None]:
import nltk
nltk.download("punkt")

#### QUESTION 1
Use *nltk.word_tokenize* to separate the string into tokens.

In [None]:
sample_string_tokens = # TODO
print(sample_string_tokens)

#### QUESTION 2

Remove punctuation from each token and convert it to lowercase. Also remove the stray *br* tokens, which are left over from HTML line-break tags (`<br />`) in the reviews, if any are present.

In [None]:
sample_string_tokens_clean = # TODO
print(sample_string_tokens_clean)
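
One possible way to approach this kind of clean-up is sketched below on a hand-written token list that stands in for tokenizer output. The list `raw_tokens` and the use of `string.punctuation` are assumptions made for this example; other choices (for instance keeping apostrophes) are equally valid.

```python
import string

# Hand-written tokens standing in for the output of a tokenizer.
raw_tokens = ["Great", "movie", ",", "is", "n't", "it", "?",
              "<", "br", "/", ">", "Loved", "it", "."]

# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans("", "", string.punctuation)

clean_tokens = []
for token in raw_tokens:
    word = token.translate(strip_punctuation).lower()
    # Skip tokens that were pure punctuation and the leftover "br" markup.
    if word and word != "br":
        clean_tokens.append(word)

print(clean_tokens)
```

Note that this simple rule also drops any genuine word spelled "br", and turns `n't` into `nt`; whether that matters depends on the vocabulary you end up keeping.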

#### QUESTION 3

Complete *tokenize* so that it returns a list of separate cleaned-up (lowercase, no punctuation, no *br*) words.

In [None]:
def tokenize(string_in):
    # TODO
    return string_out

## Vocabulary

Now that you are able to break down your text into words, you will determine what **vocabulary** you will be using.

#### QUESTION 4

Randomly sample 2000 reviews from your entire dataset using *random.sample* and store them in *sampled_reviews*. How many words do they contain in total (*tokenize* will help)? How many **distinct** words do they contain?

In [None]:
full_dataset = ds_train + ds_test
# TODO

We don't need that many words; we will keep just enough of the most frequent ones to cover 95% of the text in the 2000 reviews you sampled, and discard the rest.

#### QUESTION 5


Rank the words you encountered by number of appearances, and display the ranking on a plot (maybe just the top 100 for better readability).

In [None]:
appearances = {}
# TODO

It looks like a small number of words account for most of the text in the reviews. The word distribution you just saw follows [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law). Interestingly, this phenomenon is observed in many natural languages, not just English.
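
The idea of "coverage" can be sketched on a tiny made-up corpus: rank the words by frequency, then track what fraction of all tokens the top-ranked words account for. The corpus and variable names below are illustrative only.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized.
tokens = "the film was good the cast was good the end".split()

# most_common() ranks words from most to least frequent.
ranking = Counter(tokens).most_common()

# Track what fraction of the text the top-ranked words account for.
total = len(tokens)
covered = 0
coverage = []
for word, count in ranking:
    covered += count
    coverage.append((word, covered / total))

print(coverage)
```

Reading the cumulative fractions from left to right tells you how many of the top-ranked words are needed to reach a given coverage threshold.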

#### QUESTION 6

Append enough words to *vocab_list* to cover 95% of the text in *sampled_reviews*. Complete the code for *vocab_dict* so that it maps each word to its index in *vocab_list*.

In [None]:
vocab_list = ["___"]  # placeholder entry at index 0
# TODO: fill the rest
vocab_dict = # TODO
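
A word-to-index mapping of this kind can be sketched with `enumerate`. The short word list below is made up; index 0 holds the `___` placeholder, which presumably lines up with the zero used for padding later on.

```python
# Made-up vocabulary; index 0 holds the placeholder entry.
toy_vocab_list = ["___", "the", "was", "good"]

# Map each word to its position in the list.
toy_vocab_dict = {word: index for index, word in enumerate(toy_vocab_list)}

print(toy_vocab_dict["good"])
```

The dictionary makes lookups by word fast, while the list keeps the reverse mapping from index back to word.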

#### QUESTION 7

Time to put it all together: convert the text of every item in *ds_train* and *ds_test* to a list of indices (use *vocab_dict*), discarding words that do not belong to your vocabulary. **Pad with zeros or crop to give all lists a length of 320**.

In [None]:
ds_train = # TODO
ds_test = # TODO
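
The pad-or-crop step can be sketched as a small helper. The name `pad_or_crop` is hypothetical, and padding at the end (rather than at the start) is an assumption, since the text above does not say which side to pad.

```python
def pad_or_crop(indices, length, pad_value=0):
    """Return a list of exactly `length` items (hypothetical helper)."""
    # Keep at most `length` items, then fill the remainder with pad_value.
    cropped = indices[:length]
    return cropped + [pad_value] * (length - len(cropped))

print(pad_or_crop([4, 2, 7], 5))           # shorter input gets padded
print(pad_or_crop([4, 2, 7, 1, 9, 3], 5))  # longer input gets cropped
```

Giving every sequence the same length is what allows the whole dataset to be stacked into a single rectangular tensor.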

## 2 - Vectorization

We've successfully converted the text from the dataset to integer sequences. The next step is to turn those into sequences of "meaningful" **word vectors**, ready to throw into an RNN. This wouldn't be easy to get from scratch; to speed things up we'll rely on a set of pretrained word vectors. Load them with the following cells.

In [None]:
!wget --content-disposition https://seafile.unistra.fr/f/1daee01e85904416878e/?dl=1

In [None]:
import pickle

with open("word_vectors.pickle", "rb") as f:
    word_vecs = pickle.load(f)

This dictionary maps words to 250-D vectors. They were generated by an embedding trained using the Word2Vec methodology on a text corpus from Wikipedia.

#### QUESTION 8

Play with your word vectors for a bit. Look up the vectors for "queen", "cool", "warm". Which word in *word_vecs* is the closest (use the *dist* function provided below) to *cold + (new - old)*? Why does that make sense? You may try other combinations if you want.
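
For experimenting, a nearest-word search can be sketched as follows. Both helpers are assumptions: `dist` is taken here to be the Euclidean distance (the lab's own definition may differ), `closest_word` is a hypothetical name, and the 2-D vectors are made up so that the analogy works by construction.

```python
import numpy as np

def dist(u, v):
    """Euclidean distance between two vectors (assumed definition)."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

def closest_word(target, vectors, exclude=()):
    """Return the word whose vector is nearest to `target` (hypothetical helper)."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: dist(vectors[w], target))

# Made-up 2-D vectors; the real word_vecs has 250 dimensions.
toy_vecs = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([1.0, 0.0]),
    "king":  np.array([0.0, 1.0]),
    "queen": np.array([1.0, 1.0]),
}

# king + (woman - man), excluding the input words from the search.
target = toy_vecs["king"] + (toy_vecs["woman"] - toy_vecs["man"])
print(closest_word(target, toy_vecs, exclude={"king", "woman", "man"}))
```

Excluding the input words is a common precaution, because with real embeddings the nearest vector to the result is often one of the words used to build it.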

#### QUESTION 9

Build a (n_vocab, 250) numpy array; the i-th row should contain the vector for the i-th word in your vocabulary.

In [None]:
embedding_array = # TODO

We can convert this to a tf.Tensor. We will use this later during training and testing to convert indices to vectors.

In [None]:
embedding_tensor = tf.Variable(embedding_array, trainable=False)
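
What the index-to-vector conversion does to shapes can be sketched in plain NumPy: indexing the embedding array with a batch of integer sequences gathers one row per index, which is the same row-gathering idea that *tf.nn.embedding_lookup* applies to a single tensor. The toy sizes below are made up.

```python
import numpy as np

# Toy embedding table: 4 vocabulary entries, 3 dimensions each.
toy_embedding = np.arange(12, dtype=np.float32).reshape(4, 3)

# A batch of 2 padded index sequences of length 5.
batch_indices = np.array([[1, 3, 2, 0, 0],
                          [2, 2, 1, 3, 0]])

# Integer-array indexing gathers one row of the table per index.
batch_vectors = toy_embedding[batch_indices]
print(batch_vectors.shape)
```

With the real data, a batch of shape (batch_size, 320) would become (batch_size, 320, 250) in the same way.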

## 3 - Training

The RNN model below is virtually the same as in the previous lab, except for two things: *tf.nn.embedding_lookup* converts the integer indices coming from the input to the corresponding word vectors, and a fully connected layer is inserted before the actual RNN to shrink each word vector further (from 250 to 8 dimensions).

In [None]:
N_CLASSES = 2  # binary sentiment labels: bad (0) or good (1)

class RNN_model(tf.keras.Model):
  def __init__(self, memory_size):
    super().__init__()
    self._dense_1 = tf.keras.layers.Dense(
      units=8,
      activation="relu"
    )
    self._cell = tf.keras.layers.LSTMCell(memory_size)
    self._rnn = tf.keras.layers.RNN(self._cell)
    self._dense_2 = tf.keras.layers.Dense(
      units=N_CLASSES,
      activation="softmax"
    )

  def call(self, x):
    res = x
    res = tf.nn.embedding_lookup(embedding_tensor, res)
    res = self._dense_1(res)
    res = self._rnn(res)
    res = self._dense_2(res)
    return res

Let's wrap the data into tf Datasets as usual:

In [None]:
x_train = [d["text"] for d in ds_train]
y_train = [d["label"] for d in ds_train]
# The test split must come from ds_test, not ds_train.
x_test = [d["text"] for d in ds_test]
y_test = [d["label"] for d in ds_test]

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

#### QUESTION 10
Train the RNN for 10 epochs and evaluate it.