<a href="https://colab.research.google.com/github/thalitadru/ml-class-epf/blob/main/LabTextRNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with RNNs
## Preamble: installing and importing packages

In [None]:
try:
    import datasets
except ModuleNotFoundError:
    !pip install datasets
    import datasets

In [None]:
try:
    from unidecode import unidecode
except ModuleNotFoundError:
    !pip install unidecode
    from unidecode import unidecode

In [None]:
import os
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
SEED=34

## Load training dataset

We are going to work with a [dataset of movie reviews in French collected by the AlloCine website](https://huggingface.co/datasets/allocine). 
This dataset can be retrieved using the [`datasets` library from the company Hugging Face](https://huggingface.co/docs/datasets/index).

The next cells load some information on the dataset:

In [None]:
DATA_HANDLE = "allocine"

In [None]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder(DATA_HANDLE)

Checking the dataset description, we can see it is intended to be used for binary sentiment analysis:

In [None]:
ds_builder.info.description

Each element in the dataset has two features: the review text itself, and the associated label:

In [None]:
ds_builder.info.features

Now we are going to load the training data:

In [None]:
from datasets import load_dataset

train_ds = load_dataset(DATA_HANDLE, split="train")

As seen in `ds_builder.info.features`, each data sample has two fields: the `review` text and its `label`. Here is the review text for one particular sample:

In [None]:
train_ds[10]['review']

### Normalizing characters
Some of the tools we'll be using later cannot flawlessly handle all Unicode characters. To avoid problems, we will normalize all characters to their closest ASCII equivalent using the function `unidecode` (imported from the [`unidecode` package](https://pypi.org/project/Unidecode/)).

The function basically replaces all characters bearing [diacritic signs](https://en.wikipedia.org/wiki/Diacritic) with their corresponding plain character, as well as any symbols with close ASCII equivalents. The result is a text with no accents, cedillas, no € symbol, etc.

In [None]:
unidecode(train_ds[10]['review'])
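
As a rough illustration of the diacritic-stripping part of this normalization, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not how `unidecode` is implemented, and the helper name `strip_diacritics` and the sample sentence are made up for this example. Unlike `unidecode`, it does not transliterate symbols such as €.

```python
import unicodedata

def strip_diacritics(text):
    # NFKD splits an accented character into its base character
    # followed by one or more combining marks (accent, cedilla, ...).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "Un très bon film, ça vaut le détour !"  # made-up sentence
print(strip_diacritics(sample))
```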

We will use the method `map` to apply this transformation to all `review` texts:

In [None]:
train_ds = train_ds.map(lambda sample: {'review': unidecode(sample['review']), 'label': sample['label']})

### Creating a TF dataset

The current dataset object is not in a format recognized by TensorFlow.
The `datasets` library provides a method to convert individual samples to the TensorFlow format:

In [None]:
train_ds.with_format("tf")[10]

It is also possible to convert the entire object into a batched `tf.data.Dataset`:

In [None]:
BATCH_SIZE=80

In [None]:
tf.keras.utils.set_random_seed(SEED)
train_tfds = train_ds.to_tf_dataset(
            columns=["review"],
            label_cols=["label"],
            batch_size=BATCH_SIZE,
            shuffle=True
            )

In [None]:
for example, label in train_tfds.take(1):
    print('Example batch shape: ', example.shape)
    print('Label batch shape: ', label.shape)
    print('text: ', example.numpy()[10,...])
    print('label: ', label.numpy()[10,...])

Check here how many batches there are in the dataset with the method `cardinality()`:

In [None]:
train_tfds.cardinality()
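
The number of batches follows directly from the number of samples and the batch size. The sketch below shows the arithmetic; the function name `num_batches` and the sample count are hypothetical, and whether the last incomplete batch is kept depends on the `drop_remainder` setting used when the dataset was built.

```python
import math

def num_batches(num_samples, batch_size, drop_remainder=False):
    if drop_remainder:
        # The last, incomplete batch is discarded.
        return num_samples // batch_size
    # The last batch is kept even if it holds fewer samples.
    return math.ceil(num_samples / batch_size)

# Hypothetical dataset of 1000 samples, batched by 80
print(num_batches(1000, 80))
print(num_batches(1000, 80, drop_remainder=True))
```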

## TODO Loading validation data

We now load the validation data:

In [None]:
# TODO set the split to validation
val_ds = ...

We must repeat the same pre-treatment steps applied to the training set:

In [None]:
# TODO apply the same character normalization operation you applied to the training set
val_ds = ...

In [None]:
tf.keras.utils.set_random_seed(SEED)
# TODO convert the val_ds into a tf dataset
val_tfds = ...

Check how many validation batches we have:

In [None]:
# TODO Use the method cardinality on the tf.data.Dataset object
val_tfds ...

## Text encoding layers


### Text vectorization

The simplest way to process text for training is using the [`TextVectorization` layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). This layer has many capabilities, but in this notebook we stick to the default behavior.


#### TODO Creating and fitting the vectorizer

First we create the layer with default parameters. We only need to specify an upper limit for the vocabulary size, using the keyword argument `max_tokens`:


In [None]:
VOCAB_SIZE = 1000
# TODO set the max_tokens argument to VOCAB_SIZE
encoder = tf.keras.layers.TextVectorization(
    # TODO your code here
    ...
    )

encoder

We then need to train `encoder` (our vectorizer) on our training texts. This encoder is fitted in an **unsupervised** manner: we only use the texts, not the labels. Moreover, this encoder needs to be fully fitted before training any subsequent NN model (since it defines the vector space on which NN models will work).
In Keras, this type of training uses a different method: `.adapt` (instead of `fit`). 

`.adapt` must receive a different version of the dataset, which contains only the review text, without any labels. We can do this transformation using the method `.map`:



In [None]:
train_tfds_txt = train_tfds.map(lambda text, label: text)

Now we can pass the text-only dataset to the layer's `.adapt` method:

In [None]:
# TODO adapt the encoder to the texts in the training dataset
encoder...

#### Checking the vocabulary
The `.adapt` method sets the layer's **vocabulary**. Here are the first 50 tokens:
- the first is an empty-string token, corresponding to zero-padded sequence positions;
- the second, `[UNK]`, stands for any unknown token; all unknown tokens are encoded with value 1;
- the remaining tokens are words sorted by frequency of appearance in the text corpus.

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:50]

Once the vocabulary is set, the layer can encode text into indices (following the vocabulary order). That is, `de` is encoded as 2, `et` encoded as 3, `le` encoded as 4, `a` encoded as 5, and so on.
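
To make the vocabulary construction concrete, here is a minimal pure-Python sketch of the same idea: standardize the text, count words, keep the most frequent ones, and reserve indices 0 and 1 for padding and unknown words. The helper names and the toy corpus are made up for illustration, and details such as tie-breaking between equally frequent words may differ from what `TextVectorization` does.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and remove ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    counts = Counter(word for t in texts for word in standardize(t).split())
    # Most frequent words first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Two slots are reserved: "" for padding and "[UNK]" for unknown words.
    return ["", "[UNK]"] + [word for word, _ in ranked[: max_tokens - 2]]

def encode(text, vocab):
    index = {word: i for i, word in enumerate(vocab)}
    # Words outside the vocabulary all map to index 1 ([UNK]).
    return [index.get(word, 1) for word in standardize(text).split()]

toy_corpus = ["Un film superbe, un casting superbe !", "Un film sans interet."]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=5)
print(toy_vocab)
print(encode("Un film nul", toy_vocab))
```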


#### Example of encoded sequence



Let us look back at the first two samples in the batch of training samples loaded previously:


In [None]:
example[:2]


After encoding, the tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed `output_sequence_length`):

In [None]:
encoded_example = encoder(example).numpy()
encoded_example[:2]

#### Dealing with variable sequence lengths

You have just seen that the `encoder` layer pads the end of each sequence so that all sequences in a batch have the same length.


To see an example, let's compute the encoding for a short review. Since the batch contains only this review, no padding needs to be done:

In [None]:
short_review = "rien a redire"

batch = np.array([short_review])

We can confirm this by checking the encoder output for this batch:

In [None]:
encoder(batch)

Note that each one of the three words is represented by the corresponding vocabulary index. No zeros are added.

Now see how the encoder pads the end of shorter sequences with zeros.
First we include a long review in the batch. 



In [None]:
long_review = ("un tres bon film qui vaut au moins 3 etoile car le casting est"
            " superbe avec notamment Rachel Hurd-Wood qui est exeptionnelle")

batch = np.array([short_review,
                 long_review])

The `encoder` will need to pad the short sentence so it matches the length of the longest one in the batch:


In [None]:
encoder(batch)

For the short review, the first 3 values are the same as before; the rest of the sequence is filled with zeros.
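
The padding behavior observed above can be sketched in a few lines of plain Python. This is only an illustration: `pad_batch` is a hypothetical helper, the token indices are invented, and the optional `output_sequence_length` argument mimics the idea of a fixed length (pad or truncate) rather than reproducing the layer's implementation.

```python
def pad_batch(sequences, output_sequence_length=None, pad_value=0):
    # Target length: the longest sequence in the batch,
    # unless a fixed length is requested.
    if output_sequence_length is None:
        target = max(len(seq) for seq in sequences)
    else:
        target = output_sequence_length
    padded = []
    for seq in sequences:
        seq = list(seq)[:target]                                # truncate if too long
        padded.append(seq + [pad_value] * (target - len(seq)))  # pad at the end
    return padded

short_ids = [118, 5, 742]              # invented indices for a 3-word review
long_ids = [13, 57, 91, 16, 8, 402]    # invented indices for a 6-word review

print(pad_batch([short_ids]))                                     # no padding needed
print(pad_batch([short_ids, long_ids]))                           # short one gets zeros
print(pad_batch([short_ids, long_ids], output_sequence_length=4))
```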

#### Decoding an encoded sequence
Let's focus on the beginning of the first batch sample:

In [None]:
print("Example (first 50 chars): ", example[0].numpy()[:50])
print("Encoded example (first 10 words):", encoded_example[0][:10])
print("Decoded with vocabulary (first 10 words): ", " ".join(vocab[encoded_example[0][:10]]))


With the default settings, the process is not completely reversible. There are two main reasons for that:

1. The default value for `TextVectorization`'s `standardize` argument is `"lower_and_strip_punctuation"`: punctuation and letter case are lost.
2. The limited vocabulary size: any infrequent word that did not make it into the top list (here top-1000) is assigned the code `1`, corresponding to the `[UNK]` unknown token.

Here we compare original text and encoded-decoded text for some batch samples:

In [None]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Encoded-decoded: ", " ".join(vocab[encoded_example[n]]))
  print()

### Word embedding
An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors.



In theory, this operation is equivalent to one-hot encoding each word in a sequence, then passing the sequence through a `tf.keras.layers.Dense` layer. This implementation however avoids explicit one-hot encoding and works with index-lookup to be more computationally efficient.

Here is an example of some one-hot encoded words with a 5-word vocabulary:
![one hot encoded](https://www.tensorflow.org/static/text/guide/images/one-hot.png)

After a projection to $\mathbb{R}^4$, these same words get represented in 4-D:
![Embedding example](https://www.tensorflow.org/static/text/guide/images/embedding2.png)


Although it does not explicitly build one-hot vectors, an `Embedding` layer has trainable weights just like a `Dense` layer. These weights project all vocabulary words into a common vector space. After training (on enough data), words with similar meanings often have similar vectors. 
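
The equivalence between "one-hot encoding followed by a bias-free linear layer" and "index lookup" can be checked on a toy example. The sketch below uses plain Python lists; the sizes (5 words, 4 dimensions) match the figures above, while the random weights and helper names are made up for illustration.

```python
import math
import random

TOY_VOCAB_SIZE = 5   # toy vocabulary size
TOY_DIM = 4          # toy embedding dimension

rng = random.Random(0)
# One row of weights per vocabulary index.
weights = [[rng.uniform(-1, 1) for _ in range(TOY_DIM)]
           for _ in range(TOY_VOCAB_SIZE)]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def linear_no_bias(x, w):
    # Vector-matrix product x @ w (a linear layer without bias or activation).
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def embedding_lookup(index, w):
    # Simply return the row stored for this word index.
    return w[index]

word_index = 3
via_one_hot = linear_no_bias(one_hot(word_index, TOY_VOCAB_SIZE), weights)
via_lookup = embedding_lookup(word_index, weights)
print(all(math.isclose(a, b) for a, b in zip(via_one_hot, via_lookup)))
```

The lookup touches a single row, while the one-hot product multiplies by every row of the weight matrix, which is why the lookup is cheaper for large vocabularies.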
    


#### TODO Creating the layer
The code below creates a [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer that represents words with 64-dimensional vectors (as set via the argument `output_dim`):


In [None]:
# set seed for reproducibility
tf.keras.utils.set_random_seed(SEED)

# TODO check the documentation and complete the call
embedding = tf.keras.layers.Embedding(
        # TODO set the input dimension to be the length of the encoder vocabulary
        ...,
        # TODO set the output dimension to 64
        ...,
        # Use masking to handle the variable sequence lengths
        mask_zero=True)

#### Dealing with variable sequence lengths
The embedding layer uses [masking](https://www.tensorflow.org/guide/keras/masking_and_padding) to handle the varying sequence lengths. 

Masking allows the layer to **ignore the portions that got zero-padded** by the `encoder` layer. 



We have activated it by declaring the layer with the keyword argument `mask_zero=True`.




In [None]:
print(embedding.supports_masking)

We will include both the `encoder` and `embedding` layers in the following models.

## LSTM Model

### Description
![A drawing of the information flow in the model](https://github.com/tensorflow/text/blob/master/docs/tutorials/images/bidirectional.png?raw=1)

Above is a diagram of the model. This model can be built as a `tf.keras.Sequential`.

1. The first layer is the `encoder`, which converts the text to a sequence of token indices.

2. After the encoder is an `embedding` layer, which converts the sequences of word indices to sequences of vectors.

3. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep. 
Here we use a recurrent layer of the [`LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) type.

4. Additionally, we use the [`tf.keras.layers.Bidirectional`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) wrapper. This propagates the input forward and backward through the RNN layer and then concatenates the final outputs. It is often used with text sequences (but is generally unsuitable for causal tasks such as time series forecasting, since the backward pass sees future values).

   * The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.  

   * The main disadvantage of a bidirectional RNN is that you can't efficiently compute online predictions for a word stream, since new words keep getting added at the end of the sequence.

5. After the RNN has converted the sequence to a single vector, the two `layers.Dense` layers do some final processing and convert this vector representation to a single logit as the classification output. 
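
The bidirectional wrapper described in the list above can be sketched with a toy recurrence in plain Python. This is not an LSTM: the update rule, the helper names, and the input values are invented purely to show the forward pass, the backward pass, and the concatenation of the two final states.

```python
def toy_rnn_final_state(sequence, units):
    state = [0.0] * units
    for x in sequence:
        # Toy recurrence: each unit mixes its previous state with the new input.
        state = [0.5 * s + (k + 1) * x for k, s in enumerate(state)]
    return state

def toy_bidirectional(sequence, units):
    forward = toy_rnn_final_state(sequence, units)         # reads left to right
    backward = toy_rnn_final_state(sequence[::-1], units)  # reads right to left
    return forward + backward                              # concatenate both states

output = toy_bidirectional([1.0, 2.0, 3.0], units=64)
print(len(output))  # twice the number of units
```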


### TODO Declaration


In [None]:
# set seed for reproducibility
tf.keras.utils.set_random_seed(SEED)

# TODO complete the call below
model = tf.keras.Sequential([
    # We reuse the encoder and embedding layers previously created
    encoder,
    embedding,
    # TODO add a bidirectional LSTM layer with 64 units
    ...,
    # TODO add a dense layer with 64 units and relu activation
    ...,
    # TODO add an output layer
    ...
])

Observe the output shapes in the model summary. Note that the `encoder` layer has a 2D output shape `(None, None)`:
- The first dimension `None`, as usual, is a placeholder for the batch dimension
- The second dimension `None` is a placeholder for the sequence length dimension. Since the sequence length varies from batch to batch, this dimension does not have a fixed size.



In [None]:
model.summary()

Note that since we are using a bidirectional LSTM, the forward and backward passes each produce 64 outputs, which are concatenated into an output of size `2*64=128`.
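
The parameter counts in the summary can be cross-checked by hand. The sketch below uses the standard formulas for embedding, LSTM (with bias) and dense layers; it assumes the TODOs were completed exactly as specified (vocabulary of 1000 tokens, 64-dimensional embedding, bidirectional LSTM with 64 units, dense layer with 64 units, single-logit output). If your choices differ, your numbers will differ too.

```python
def embedding_params(vocab_size, output_dim):
    # One vector of size output_dim per vocabulary entry.
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates, plus the cell candidate),
    # each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embed_dim, lstm_units, dense_units = 1000, 64, 64, 64  # assumed values

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "bidirectional_lstm": 2 * lstm_params(embed_dim, lstm_units),  # two directions
    "dense": dense_params(2 * lstm_units, dense_units),
    "output": dense_params(dense_units, 1),
}
for name, n in counts.items():
    print(name, n)
```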

### A note on masking padded sequences


All the layers after the `Embedding` support masking, meaning they all ignore padding in short sequences:



In [None]:
print([(layer.name, layer.supports_masking) for layer in model.layers])

This means that **predictions for a given sample should remain the same, regardless of zero-padding**.

To see an example, let's compute predictions for a short review. 


In [None]:
print(short_review)


Since the batch contains only this review, no padding needs to be done.

Applying the model to this batch will give the following prediction:

In [None]:
# predict on a sample text without padding.
batch = np.array([short_review])

predictions = model.predict(batch)
print(predictions[0])

Now we check that even if the sentence needs padding, the corresponding model output remains the same. First we include a long review in the batch. As seen before, the encoder will pad the shorter sequence up to the length of the longest sequence.

In [None]:
print(long_review)

In [None]:
batch = np.array([short_review,
                 long_review])

Now, we compute predictions on the new batch. The prediction for the short review should be identical:

In [None]:
# predict on a sample text with padding
batch = np.array([short_review,
                 long_review])


predictions = model.predict(batch)
print(predictions)
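
Why does masking make the prediction independent of padding? A minimal sketch, using an invented toy recurrence rather than a real LSTM: when a timestep is masked, the recurrent state is simply carried over unchanged, so trailing zeros have no effect on the final state.

```python
def toy_rnn(indices, use_mask):
    state = 0.0
    for idx in indices:
        if use_mask and idx == 0:
            # Masked (padded) position: skip the update, keep the state as is.
            continue
        state = 0.5 * state + idx  # toy update rule (not an LSTM)
    return state

unpadded = [3, 7, 2]        # invented token indices
padded = [3, 7, 2, 0, 0]    # same sequence, zero-padded

print(toy_rnn(unpadded, use_mask=True))
print(toy_rnn(padded, use_mask=True))    # same final state as without padding
print(toy_rnn(padded, use_mask=False))   # padding changes the result
```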

### TODO Compile

In [None]:
# TODO complete the compile call
model.compile(...)
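
Since the model description above ends with a single logit, it helps to recall how a logit relates to a probability and to the binary cross-entropy loss. The following is a standalone sketch in plain Python, not the Keras implementation; it assumes label 1 denotes the positive class, and the function names are made up for this example.

```python
import math

def sigmoid(logit):
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def bce_from_logit(label, logit):
    # Binary cross-entropy computed directly from the logit,
    # in a form that avoids overflow for large |logit|.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

for logit in (-2.0, 0.0, 3.0):
    probability = sigmoid(logit)
    predicted_class = int(logit > 0)  # logit > 0 is the same as probability > 0.5
    print(logit, round(probability, 3), predicted_class,
          round(bce_from_logit(1, logit), 3))
```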

### Callbacks and logs


In [None]:
# dictionary to keep history output from fit calls
logs = {}

# directory in which model checkpoints and logs are saved
LOG_DIR = 'logs'

def best_model_path(model_name):
    base_dir  = os.path.join(LOG_DIR, model_name)
    return os.path.join(base_dir, 'best_val_accuracy.ckpt')

def callback_list(model_name):
    base_dir  = os.path.join(LOG_DIR, model_name)
    tb_cb = tf.keras.callbacks.TensorBoard(base_dir)
    ckpt = tf.keras.callbacks.ModelCheckpoint(
         best_model_path(model_name),
         monitor='val_accuracy',
         mode='max', 
         verbose=0,
         save_best_only=True)
    backup_dir = os.path.join(base_dir, 'backup_checkpoint')
    bkp = tf.keras.callbacks.BackupAndRestore(
        backup_dir)
    return [tb_cb, ckpt, bkp]
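
The checkpoint callback above is configured with `monitor='val_accuracy'`, `mode='max'` and `save_best_only=True`. The decision rule this implies can be sketched as follows; the function name and the accuracy values are invented for illustration, and the sketch only mimics the "save when the monitored value improves" logic.

```python
def epochs_that_save(val_accuracies):
    best = float("-inf")
    saved_epochs = []
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:      # mode='max': higher is better
            best = accuracy
            saved_epochs.append(epoch)   # the checkpoint would be overwritten here
    return saved_epochs

# Invented validation accuracies for 5 epochs
print(epochs_that_save([0.80, 0.85, 0.84, 0.88, 0.87]))
```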

### Tensorboard

In [None]:
# TODO load tensorboard extension
%load_ext tensorboard

In [None]:
# TODO call tensorboard on your log directory
%tensorboard --logdir logs

### Fit
Complete the fit call with training and validation data. Train the model for 10 epochs. With Colab's GPU backend, this should take you around 20 minutes. In the meantime, **go back to Moodle and check this week's quiz ✔✍ 😀**

In [None]:
MODEL_NAME = 'LSTM'
logs[MODEL_NAME] = model.fit(
    # complete the fit call
    ...,
    callbacks=callback_list(MODEL_NAME)
    )

## Stacking 2 LSTM layers

When using the output of a recurrent layer as input to another recurrent layer, the second layer should process not only the final output but all the intermediate outputs (one for each position in the sequence). To do that, we need to tell the Keras layer to return these intermediate outputs. This behavior is controlled by the `return_sequences` constructor argument:

* If `False`, the layer returns only the last output for each input sequence (a 2D tensor of shape `(batch_size, output_features)`). This is the default, used in the previous model.

* If `True`, the layer returns the full sequence of successive outputs, one per timestep (a 3D tensor of shape `(batch_size, timesteps, output_features)`).

Here is what the flow of information looks like with `return_sequences=True`:

![layered_bidirectional](https://github.com/tensorflow/text/blob/master/docs/tutorials/images/layered_bidirectional.png?raw=1)
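
The difference between the two settings can be illustrated with a toy recurrence in plain Python. The update rule and input values are invented and this is not an LSTM; the point is only that a stacked recurrent layer needs one input per timestep, which is what `return_sequences=True` provides.

```python
def toy_rnn(sequence, return_sequences=False):
    state = 0.0
    outputs = []
    for x in sequence:
        state = 0.5 * state + x   # toy update rule
        outputs.append(state)     # one output per timestep
    # Either the whole sequence of outputs, or only the last one.
    return outputs if return_sequences else outputs[-1]

inputs = [1.0, 2.0, 3.0]
print(toy_rnn(inputs))                          # last output only
print(toy_rnn(inputs, return_sequences=True))   # one output per timestep

# Stacking: the second layer consumes the full output sequence of the first.
print(toy_rnn(toy_rnn(inputs, return_sequences=True)))
```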

### TODO Declaration

In [None]:
# set seed for reproducibility
tf.keras.utils.set_random_seed(SEED)

# TODO complete the model declaration 
model2 = tf.keras.Sequential([
    encoder,
    embedding,
    # TODO set the LSTM layer to return sequences
    ...,
    # TODO add another Bidirectional LSTM layer with 32 units
    ...,
    tf.keras.layers.Dense(1)
])

**Note:**
- Because we set `return_sequences=True`, the output of the first LSTM layer still has three dimensions, like its input, so it can be passed to another recurrent layer.
- The second LSTM layer behaves as in the previous model, with a 2D output.


In [None]:
model2.summary()

### TODO Compile and fit

In [None]:
# TODO complete the compile call
model2.compile(...)

This model takes longer to train because there are more forward and backward computations to be done with the addition of the extra RNN layer. 10 epochs should take about 35 min on Colab with GPU backend. To limit the time spent, we will **train for 5 epochs only.**

In [None]:
MODEL_NAME = 'Stack2LSTM'
logs[MODEL_NAME] = model2.fit(
    # TODO complete the fit call
    ...,
    callbacks=callback_list(MODEL_NAME)
    )

### Comments


If we use a randomly initialized embedding, this model does not outperform the previous one even after 10 (long) epochs. We can however reuse the embedding that was trained with the previous model, in hopes it gives us a head start (and this was the strategy used in the code above). Reusing the trained embedding lets us match and then surpass the previous model's performance after 3 epochs.

Nonetheless, keep in mind that simply training the previous model for 10 extra epochs might lead to similar improvements, though this remains to be tested. If that were the case, the computational overhead of a second recurrent layer would be harder to justify.
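
Reusing the same `embedding` object in both models means the two models share one set of weights, not two copies. The toy classes below (invented for this sketch, unrelated to the Keras classes) illustrate the consequence: an update made through one model is visible from the other.

```python
class ToyEmbedding:
    def __init__(self, weights):
        self.weights = weights

class ToyModel:
    def __init__(self, embedding):
        # The model keeps a reference to the layer, not a copy of it.
        self.embedding = embedding

    def train_step(self, delta):
        # Stand-in for a gradient update of the embedding weights.
        self.embedding.weights = [w + delta for w in self.embedding.weights]

shared = ToyEmbedding([0.0, 0.0])
first_model = ToyModel(shared)
second_model = ToyModel(shared)

first_model.train_step(1.0)
print(second_model.embedding.weights)   # already reflects the first model's update

second_model.train_step(0.5)
print(first_model.embedding.weights)    # the first model sees the change too
```

One practical consequence is that continuing to train the second model also modifies the embedding used by the first one.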

In [None]:
# Optional: save all logs and checkpoints to a compressed archive you can download
#!tar -czf logs.tgz logs

## TODO Test time!
1. Load the test split for this dataset
2. Apply the same pre-processing steps used for training and validation
3. Load your best model from the corresponding model checkpoint and evaluate it on the test set. What was your accuracy?
4. Write your own fake movie review (positive or negative) and process it through your model. Did it get correctly classified?

In [None]:
#TODO your code here

# References
This notebook is based on the following tutorials:
- [Text classification with an RNN | TensorFlow documentation](https://www.tensorflow.org/text/tutorials/text_classification_rnn)
- [Word embeddings | TensorFlow documentation](https://www.tensorflow.org/text/guide/word_embeddings)

TensorFlow documentation is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license, with code samples under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).


Additional references/sources are:

- [Load a dataset from the Hub | Huggingface Datasets documentation](https://huggingface.co/docs/datasets/load_hub)
- [Using Datasets with TensorFlow | Huggingface Datasets documentation](https://huggingface.co/docs/datasets/use_with_tensorflow)
- [Masking and padding | TensorFlow documentation](https://www.tensorflow.org/guide/keras/masking_and_padding)
