<table align="center">
  <td align="center"><a target="_blank" href="https://colab.research.google.com/github/sherifmost/DeepLearning/blob/master/Labs/lab7/lab7.ipynb">
        <img src="http://introtodeeplearning.com/images/colab/colab.png?v2.0"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# Assignment 4: Transformer-based Sentiment Analysis

![Transformer Sentiment Analysis](https://github.com/sherifmost/DeepLearning/blob/master/Labs/lab7/Cover.png?raw=1)

## 4.1 Problem Statement

In this assignment, you will build a transformer encoder **from scratch** and use it as part of a classifier for movie review sentiment analysis on the IMDB dataset.

The full classifier consists of the following 3 main components:
1. Tokenizer: we will use a readily available BERT tokenizer
2. Transformer Encoder: you will implement a transformer encoder from scratch that takes in the outputs of the tokenizer and uses multi-headed attention blocks and feed-forward networks to obtain an output feature representation.
3. Classification Head: we will use a fully connected network that takes the output feature representation from the transformer encoder and outputs a single logit, which a sigmoid converts into the probability of a positive review.

The IMDB dataset consists of 25,000 training examples and 25,000 testing examples, and the reviews vary in length.

We will rely on both quantitative evaluation (using the accuracy metric) and qualitative evaluation (by inspecting the model's predictions on some test samples and comparing them to the ground-truth labels).

**IMPORTANT NOTE:** Change the runtime type on Google Colab to GPU. This assignment is computationally demanding and will run very slowly on the default CPU runtime.

Click on "Runtime" => "Change runtime type" => make sure that GPU is selected under "Hardware accelerator".

Now let's walk through the code and point out the parts you need to fill in.

**MAKE SURE YOU KEEP ALL THE OUTPUTS FOR THE SUBMISSION**

## 4.2 Problem Details

### 4.2.1 Installing and Importing the Needed Packages

Note: You might need to restart your session after running the following cell. **If prompted to do so**, just click restart session and run the cells again. Otherwise, continue running the cells without restarting.

In [None]:
# Need to install this particular version of tensorflow_text as it allows integrating the BERT tokenizer into the model
################################### YOU MIGHT NEED TO RESTART YOUR SESSION AFTER RUNNING THIS CELL ###################################
!pip install -U "tensorflow-text==2.13.*"

In [None]:
# The datasets package will be used for loading the IMDB dataset
!pip install datasets

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
# Needed for the tokenizer part of the model
import tensorflow_hub as hub
import tensorflow_text
from datasets import Dataset, DatasetDict, load_dataset
import matplotlib.pyplot as plt

### 4.2.2 The IMDB Dataset

In this assignment, we will use the [IMDB dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), consisting of 25k training movie reviews and 25k testing movie reviews with a label (0/1) for each review indicating whether it is negative or positive.

Let's load the dataset to the session and inspect it.

#### 4.2.2.1 Loading the dataset

In [None]:
# Load the training and testing datasets
dataset_train = load_dataset("imdb", split = "train")
dataset_test = load_dataset("imdb", split = "test")

# Convert the training dataset to a temporary dataframe to inspect it
# You can use the interactive table option of the dataframe to inspect the dataset as you like
pd.DataFrame.from_dict(dataset_train)

### 4.2.3 The BERT Tokenizer

Tokenization converts raw text into a sequence of integer ids drawn from a fixed vocabulary. This gives the transformer model a numerical input that it can process.

Here we use the BERT tokenizer to obtain three arrays: the numerical token id for each word or word piece in the text, type ids indicating which sentence each token belongs to (we have a single input sentence, so we should expect all tokens to have a type id of *zero*), and an attention mask that lets us include only the positions of the sequence that contain valid text.

**You can read more about the BERT tokenizer [here](https://www.analyticsvidhya.com/blog/2021/09/an-explanatory-guide-to-bert-tokenizer/).**
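To make the three outputs concrete, here is a toy sketch in plain Python. It is not the BERT tokenizer: the vocabulary `toy_vocab`, the helper `toy_encode`, and the sequence length of 8 are hypothetical, and it splits on whitespace instead of using WordPiece. Only the special-token ids (0, 100, 101, and 102) are taken from the uncased BERT vocabulary.

```python
# Toy illustration only: a whitespace tokenizer with a made-up vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102  # special-token ids in uncased BERT

# Hypothetical vocabulary; the real one has about 30k WordPiece entries.
toy_vocab = {"this": 1001, "is": 1002, "an": 1003, "amazing": 1004, "movie": 1005}

def toy_encode(text, seq_length=8):
    """Return the three arrays a BERT-style preprocessor produces for one sentence."""
    tokens = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    # Reserve two positions for the start and end markers, truncating long inputs.
    tokens = [CLS_ID] + tokens[: seq_length - 2] + [SEP_ID]
    n_pad = seq_length - len(tokens)
    return {
        "input_word_ids": tokens + [PAD_ID] * n_pad,
        "input_mask": [1] * len(tokens) + [0] * n_pad,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,             # single sentence: all zeros
    }

encoded = toy_encode("this is an amazing movie")
for name, values in encoded.items():
    print(name, values)
```

Compare the printed arrays with what the real tokenizer returns in the cells that follow; the real output is padded to a longer fixed length.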

In [None]:
# Obtaining the BERT tokenizer from tensorflow_hub
tokenizer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3", name = "tokenizer")

Let's have a look at the tokenizer in action.

*Now pause for a second and think about the following:*

* What do the output arrays obtained from the tokenizer ('input_word_ids', 'input_mask', and 'input_type_ids') represent? How will you use them as input to your transformer encoder?
* Why do the output arrays have the same number of elements?
* What do the first and last elements in the 'input_word_ids' array represent?

Note that as this tokenizer is a tensorflow model, we can include it directly as part of our full model regardless of how you implement the transformer encoder.

In [None]:
tokenizer(["this is an amazing movie!"])

### 4.2.4 The Transformer Encoder

<img src="https://github.com/sherifmost/DeepLearning/blob/master/Labs/lab7/Encoder_Abstract.png?raw=1" width="300" height="500">

**The main grading criterion for this part is that your implementation correctly follows the transformer encoder architecture and runs without errors. For the hyperparameters, you can choose any values you like as long as they give satisfactory testing accuracy at the end without underfitting or overfitting.**

Here you will implement your transformer encoder.

Your encoder should follow the transformer encoder architecture and should include the following:


*   Word Embeddings and Position Embeddings: you can use Tensorflow's [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer (Note that we don't need the token type embedding since our input consists of a single sentence).
*   Multiple Consecutive Blocks (**at least 2 blocks**) of:
  * Multi-Headed Self-Attention: you can use Tensorflow's [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) layer
  * Skip Connection and Normalization: you can use Tensorflow's [Add](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Add) and [LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) layers
  * Intermediate Feed-Forward Network: you can use Tensorflow's [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer
  * Another Skip Connection and Normalization
  * [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layers as needed to avoid overfitting

The input of the encoder will be the arrays output by the tokenizer, while the output of the encoder will be the output of the feed-forward network in the final block.
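As a conceptual aid for the attention step listed above, the following NumPy sketch computes single-head scaled dot-product self-attention with a padding mask. It is a simplification, not the Keras layer: it uses the input directly as queries, keys, and values, whereas `MultiHeadAttention` learns separate projections and runs several heads in parallel. The function name `masked_self_attention` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def masked_self_attention(x, input_mask):
    """Single-head self-attention with queries = keys = values = x.

    x:          (batch, seq_len, d_model) token representations
    input_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    d_model = x.shape[-1]
    # Similarity of every query position with every key position.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d_model)  # (batch, seq, seq)
    # Broadcast the padding mask over the query axis: padded keys get a large negative score.
    key_mask = input_mask[:, np.newaxis, :].astype(bool)     # (batch, 1, seq)
    scores = np.where(key_mask, scores, -1e9)
    # Softmax over the key axis (subtract the max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))         # one sentence, 4 positions, 8 features
input_mask = np.array([[1, 1, 1, 0]])  # the last position is padding
output, weights = masked_self_attention(x, input_mask)
print(weights[0].round(3))             # the last column is zero: padding is never attended to
```

In Keras, the masking is supplied through the `attention_mask` argument when calling `MultiHeadAttention`, which is documented with shape `(batch, query_length, key_length)`. One way to get there from the `(batch, seq_len)` tokenizer mask is to add a query axis, as done for `key_mask` in the sketch.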

You can check this [diagram](https://raw.githubusercontent.com/gmihaila/ml_things/master/notebooks/pytorch/bert_inner_workings/bert_inner_workings.png) showcasing the full architecture of a BERT encoder model. Use it as a guide when building your own transformer encoder, but don't follow its hyperparameters exactly, as the full BERT encoder takes a long time to train.

**TODO:** fill in the missing code to define a transformer encoder model.

In [None]:
# TODO: fill in this function to define a transformer encoder
# The input to the encoder should be the outputs of the tokenizer inspected above
# The output of the encoder should be a flat feature vector representing the accumulated representation for the features extracted by the encoder

# Make sure your encoder follows the typical transformer architecture and uses each of the following imported tensorflow layers
from tensorflow.keras.layers import Input, Dense, Embedding, LayerNormalization, Dropout, Flatten, MultiHeadAttention, Add

# This function should return a tensorflow Model()
# Note that building the model with the functional API (as in the past assignments) keeps the code simpler
def get_transformer_encoder(vocab_size = 30522):
  # Inputs
  input_word_ids = Input(shape=(None,), dtype=tf.int32, name='input_word_ids')
  input_mask = Input(shape=(None,), dtype=tf.int32, name='input_mask')
  # Position indexes for the input tokens
  position_indexes = tf.range(start=0, limit=tf.shape(input_word_ids)[1], delta=1)

  # ToDo: Define the Embedding layers
  word_embedding_layer = #ToDo
  positions_embedding_layer = #ToDo

  # Combine embeddings
  embeddings = word_embedding_layer(input_word_ids) + positions_embedding_layer(position_indexes)

  # ToDo: Implement Transformer attention and feedforward blocks given the created embeddings.
  # Use the input_mask to mask the attention operation. Add dropout regularization as needed to handle overfitting.
  # Implement at least 2 consecutive attention and feedforward blocks.

  # First Block
  block_output1 = #ToDo

  # Second Block
  block_output2 = #ToDo

  # Add more blocks if needed to handle underfitting

  # ToDo: add here the output of the feedforward network of the last block
  blocks_output = #ToDo

  # The encoder output is a layer normalization on the last feed forward network output
  encoder_output = LayerNormalization(epsilon=1e-6)(blocks_output)
  encoder_output = Flatten()(encoder_output)

  return Model(inputs=[input_word_ids, input_mask], outputs=encoder_output)

### 4.2.5 The Full Classifier Model

Now let's define the full classification model by combining the tokenizer with your implemented transformer encoder model and adding a classification head to the encoder's output.

In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  tokenizer_output = tokenizer(text_input)
  encoder_inputs = {
      'input_word_ids': tokenizer_output['input_word_ids'],
      'input_mask': tokenizer_output['input_mask'],
  }

  # Use here your defined encoder that takes the output of the tokenizer
  encoder = get_transformer_encoder()
  encoder_outputs = encoder(encoder_inputs)

  output = tf.keras.layers.Dropout(0.2)(encoder_outputs)
  output = tf.keras.layers.Dense(1, activation=None, name='classification_output')(output)
  return tf.keras.Model(text_input, output)

Take a look at your model's summary. Note the number of parameters and their size in MBs (transformers are large models and require extensive resources for training and storage).

In [None]:
model = build_classifier_model()
model.summary()
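To get a feel for where the parameters in the summary come from, this short sketch estimates the size of the word embedding table alone. The vocabulary size of 30522 is the default used in `get_transformer_encoder`; the embedding dimension of 64 and the helper name `embedding_size` are hypothetical, and the estimate assumes 4-byte float32 weights.

```python
def embedding_size(vocab_size, embed_dim, bytes_per_param=4):
    """Parameter count and size in MiB of an embedding table stored as float32."""
    n_params = vocab_size * embed_dim
    return n_params, n_params * bytes_per_param / 2**20

# embed_dim=64 is an illustrative choice, not a recommended value.
n_params, size_mib = embedding_size(vocab_size=30522, embed_dim=64)
print(f"{n_params:,} parameters, about {size_mib:.2f} MiB")
```

Even with a small embedding dimension, the embedding table is likely to account for a large share of the total, since it scales with the vocabulary size.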

Take a look at the model's structure as a diagram.

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, expand_nested=True)

### 4.2.6 Training the Model and Performing Quantitative Evaluation

The following function helps us plot the training and validation accuracy and loss after training.

In [None]:
# Plotting the training history
def plot_train_history(hist,
                       metric = 'accuracy'):

  fig = plt.figure(figsize=(10, 5))
  # Get training and validation loss histories
  trainingLoss = hist.history['loss']
  valLoss = hist.history['val_loss']

  # Create count of the number of epochs
  epochCount = range(1, len(trainingLoss) + 1)

  # Visualize loss history
  fig.add_subplot(1,2,1)
  plt.plot(epochCount, trainingLoss, 'r--')
  plt.plot(epochCount, valLoss, 'b-')
  plt.legend(['Training Loss', 'Val Loss'])
  plt.xlabel('Epoch')
  plt.ylabel('Loss')

  # Get training and validation accuracy histories
  trainingAcc = hist.history[metric]
  valAcc = hist.history['val_' + metric]

  # Create count of the number of epochs
  epoch_count = range(1, len(trainingAcc) + 1)

  # Visualize accuracy history
  fig.add_subplot(1,2,2)
  plt.plot(epoch_count, trainingAcc, 'r--')
  plt.plot(epoch_count, valAcc, 'b-')
  plt.legend(['Training Accuracy', 'Validation Accuracy'])
  plt.xlabel('Epoch')
  plt.ylabel('Accuracy')

#### 4.2.6.1 Defining the Training Hyperparameters

In [None]:
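# Defining the loss function and the evaluation metric
# The classification head outputs logits, so the accuracy threshold is 0 (a probability of 0.5)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy(threshold=0.0)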
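Because the classification head has no activation, the model outputs logits, and the loss applies the sigmoid internally. The plain-Python sketch below checks that cross-entropy computed from a logit matches the textbook formula applied to the sigmoid probability, and that thresholding logits at 0 is the same as thresholding probabilities at 0.5. The helper names and the sample values are hypothetical, and this is an illustration of the math rather than the Keras implementation.

```python
import math

def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(label, logit):
    """Binary cross-entropy computed directly from the logit (numerically stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(label, probability):
    """Textbook binary cross-entropy computed from a probability."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))

# Hypothetical (label, logit) pairs.
for label, logit in [(1, 2.0), (0, 2.0), (1, -0.3), (0, 0.4)]:
    probability = sigmoid(logit)
    # Both formulations give the same loss value.
    assert math.isclose(bce_from_logit(label, logit), bce_from_probability(label, probability))
    # A logit above 0 is exactly a probability above 0.5.
    assert (logit > 0.0) == (probability > 0.5)
    print(f"label={label} logit={logit:+.1f} probability={probability:.3f} "
          f"loss={bce_from_logit(label, logit):.4f}")
```

The last pair shows why the threshold matters: a logit of 0.4 is a positive prediction (probability of about 0.6), yet it would be counted as negative if raw logits were compared against 0.5.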

In [None]:
# You are encouraged to experiment with tuning the following hyperparameters to handle overfitting/underfitting
epochs = 15
batch_size = 32
lr = 3e-5
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr)

#### 4.2.6.2 Compiling and Training the Model

In [None]:
# Compiling the model using the loss and evaluation metrics
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metrics)

**TODO:** Make sure to handle any underfitting (*training accuracy should exceed 90%*) or overfitting during training. You can try early stopping and/or regularization methods.
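The idea behind early stopping can be sketched in a few lines of plain Python: stop once the validation loss has failed to improve for a fixed number of epochs (the patience) and keep the weights from the best epoch. The function `early_stopping_epoch` and the loss values are hypothetical. In Keras, similar behaviour is available through `tf.keras.callbacks.EarlyStopping` (with arguments such as `monitor`, `patience`, and `restore_best_weights`), passed to `model.fit` via `callbacks`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses.

    Training stops once the validation loss has not improved for `patience` consecutive epochs.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran for all epochs.
    return len(val_losses), best_epoch

# Hypothetical validation losses: improvement stalls after epoch 3.
val_losses = [0.62, 0.48, 0.41, 0.43, 0.47, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stop after epoch {stop_epoch}, keep weights from epoch {best_epoch}")
```

A validation loss that rises while the training loss keeps falling, as in the made-up values above, is the typical sign of overfitting that this rule reacts to.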

In [None]:
# ToDo: Make sure to handle overfitting/underfitting
# Running the training and obtaining a plot for it
history = model.fit(dataset_train['text'],
                    dataset_train['label'],
                    validation_data = (dataset_test['text'], dataset_test['label']),
                    epochs=epochs,
                    batch_size=batch_size)

plot_train_history(history, metric = 'binary_accuracy')

### 4.2.7 Qualitative Evaluation

In [None]:
# This function converts the model's output probability into a sentiment (Positive or Negative)
def get_sentiment(probability_positive):
  if probability_positive > 0.5:
    return "Positive"
  else:
    return "Negative"

Let's check the model's output compared to random review samples from the testing data. You can run the following cell multiple times to see different examples!

In [None]:
# Randomly select a test review and check the model's output on it
# You can run it multiple times to check different samples
# After running this cell, keep your output
for i in range(10):
  random_id = np.random.randint(0, len(dataset_test))
  test_review = dataset_test['text'][random_id]
  test_label = dataset_test['label'][random_id]
  print("Review: ", test_review)
  print("Ground Truth Sentiment: ", get_sentiment(test_label))
  # Prediction probability of the Positive review, i.e., 1 using sigmoid function:
  probability_positive = 1/(1 + np.exp(-model.predict([test_review], verbose = 0)[0][0]))
  probability_negative = 1 - probability_positive
  print("Predicted Sentiment: ", get_sentiment(probability_positive))
  print("Prediction Probability: Positive({}), Negative({})".format(probability_positive, probability_negative))
  print("-"*200)

## 4.3 Conclusion

That's it! Congratulations on training a transformer-based sentiment analysis model.

Make sure you deliver all the submission requirements and keep the outputs in the notebook!