# Sentiment Classification With a Transformer

In this notebook, we return to the sentiment classification task on reviews from the Internet Movie Database (www.imdb.com). Each review carries a binary label: positive (label 1) or negative (label 0).

## Set-up
First, we load the libraries needed for this task. We use Keras and TensorFlow for this code example, so we load the relevant parts of these frameworks:

In [None]:
!pip install tensorflow_datasets

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Conv1D, MaxPooling1D, Flatten, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant

In [None]:
tf.config.run_functions_eagerly(True)
tf.experimental.numpy.experimental_enable_numpy_behavior()

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
# some more general libraries for evaluation purposes:
import matplotlib.pyplot as plt
import datetime

In [None]:
import pickle

Define parameters:

In [None]:
VOCAB_SIZE = 5000  # Only consider the 5,000 most frequent words
MAX_LEN = 200  # Pad or truncate each review to 200 tokens (pad_sequences keeps the last 200 by default)

EMBED_DIM = 100  # Embedding size for each token
NUM_HEADS = 3  # Number of attention heads
FF_DIM = 32  # Hidden layer size in feed forward network inside transformer

NUM_EPOCHS = 50

batch_size = 32

Prepare Google Drive (if used):

In [None]:
use_gdrive = False

if use_gdrive:
    from google.colab import drive
    drive.mount('/content/gdrive')
    targetDir_root = 'gdrive/MyDrive/CAS_AIS_2024_FS/Results/'
else:
    targetDir_root = './'
    
targetDir_models = targetDir_root + 'trainedWeights/'
targetDir_results = targetDir_root + 'PerformanceMeasures/'

## Data Loading

We now use a different, more direct way to get the IMDb data. Using the `tf.keras.datasets` subpackage, we can directly load an integer-encoded representation of the movie reviews, in which each word is replaced by an integer index:

In [None]:
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = tf.keras.utils.pad_sequences(x_train, maxlen=MAX_LEN)
x_val = tf.keras.utils.pad_sequences(x_val, maxlen=MAX_LEN)
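
To make the effect of `pad_sequences` concrete, the following plain-Python sketch mimics its default behaviour (`padding='pre'`, `truncating='pre'`): shorter sequences are left-padded with zeros, and longer ones keep only their last `maxlen` entries. The helper `pad_pre` and the toy token ids are hypothetical and written only for illustration; this is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with its default 'pre' settings."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                # truncating='pre': drop tokens from the front
        padding = [value] * (maxlen - len(seq))  # padding='pre': fill with zeros at the front
        padded.append(padding + seq)
    return padded


toy_reviews = [[11, 12], [21, 22, 23, 24, 25, 26]]  # made-up token ids
print(pad_pre(toy_reviews, maxlen=4))
# prints [[0, 0, 11, 12], [23, 24, 25, 26]]
```

With the defaults used in this notebook, a review longer than `MAX_LEN` therefore contributes its final `MAX_LEN` tokens, not its opening ones.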

## Transformer


In [None]:
import os

In [None]:
if use_gdrive:
    %run gdrive/MyDrive/CAS_AIS_2024_FS/Colab_Notebooks/W2_3.4_Transformers.ipynb
else:
    %run W2_3.4_Transformers.ipynb

## Review Rater with Transformers
Using the building blocks defined in the previous notebook, we can simply combine the necessary parts. Here, we use the encoder part of the transformer and add a few layers as a classification head.

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
class ReviewRater(tf.keras.Model):
    def __init__(self, *, num_layers, embed_dim, num_heads, ff_dim,
                 input_vocab_size, target_vocab_size, dropout_rate=0.1):
        super().__init__()
        # Encoder Block
        self.encoder = Encoder(num_layers=num_layers, embed_dim=embed_dim,
                               num_heads=num_heads, ff_dim=ff_dim,
                               vocab_size=input_vocab_size,
                               dropout_rate=dropout_rate)

        # custom classification head
        self.globAvgPool = tf.keras.layers.GlobalAveragePooling1D()

        self.dropout = tf.keras.layers.Dropout(dropout_rate)

        self.dense20 = tf.keras.layers.Dense(20, activation="relu")
        self.dense2  = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        x = self.encoder(inputs)
        x = self.globAvgPool(x)
        x = self.dropout(x)
        x = self.dense20(x)
        outputs = self.dense2(x)

        return outputs
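
The classification head above first collapses the encoder output (one feature vector per token) into a single vector by averaging over the token axis, and finally maps it to a probability with a sigmoid unit. The following plain-Python sketch illustrates these two steps on made-up numbers. It omits dropout and the 20-unit ReLU layer, and the names `global_average_pool`, `sigmoid`, `weights`, and `bias` are assumptions introduced for illustration, not part of the model.

```python
import math


def global_average_pool(sequence):
    """Average a list of per-token feature vectors into one feature vector."""
    num_tokens = len(sequence)
    return [sum(feature) / num_tokens for feature in zip(*sequence)]


def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up encoder output: 3 tokens with 2 features each
encoder_output = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = global_average_pool(encoder_output)

# Made-up weights and bias of a single sigmoid output unit
weights, bias = [0.5, -0.5], 0.0
probability = sigmoid(sum(w * x for w, x in zip(weights, pooled)) + bias)

print(pooled, probability)
# prints [2.0, 2.0] 0.5
```

Averaging makes the head independent of the sequence length: whatever the number of tokens, the dense layers always receive one vector of size `embed_dim`.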

In [None]:
model_5kW_trans_RR = ReviewRater(num_layers=1, embed_dim=EMBED_DIM, num_heads=NUM_HEADS,
                                 ff_dim=FF_DIM, input_vocab_size=VOCAB_SIZE,
                                 target_vocab_size=VOCAB_SIZE)

In [None]:
model_5kW_trans_RR.compile(loss = BinaryCrossentropy(from_logits=False),
                           optimizer = 'adam', metrics = ['accuracy'])

As with the previous models, we only train the model from scratch if needed; otherwise, we load the pre-trained weights and the results from files:

In [None]:
train_from_scratch = True

model_name = 'model_5kW_trans_RR'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

if train_from_scratch:
    myRRHistory = model_5kW_trans_RR.fit(
        x_train, y_train,
        validation_data = (x_val, y_val),
        epochs = NUM_EPOCHS, verbose = 1,
        callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                    verbose=False, restore_best_weights=True)])

    myRRHistory_dict = myRRHistory.history
    resDict_5kW_trans_RR = {}
    resDict_5kW_trans_RR['train_loss'] = myRRHistory_dict['loss']
    resDict_5kW_trans_RR['val_loss'] = myRRHistory_dict['val_loss']
    resDict_5kW_trans_RR['train_accuracy'] = myRRHistory_dict['accuracy']
    resDict_5kW_trans_RR['val_accuracy'] = myRRHistory_dict['val_accuracy']
    resDict_5kW_trans_RR['epochs'] = range(1, len(resDict_5kW_trans_RR['train_accuracy']) + 1)
    resDict_5kW_trans_RR['model_name'] = model_name
    
    # save weights and results
    model_5kW_trans_RR.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_5kW_trans_RR, f)
else:
    model_5kW_trans_RR.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_5kW_trans_RR = pickle.load(input_file)
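
The `EarlyStopping` callback in the training cell monitors `val_accuracy` and, with `patience=5`, halts training once five consecutive epochs pass without a new best value; `restore_best_weights=True` then resets the model to the weights of the best epoch. The sketch below reproduces this bookkeeping in plain Python. The function name `early_stopping_epoch` and the accuracy values are made up for illustration; this is not the Keras implementation.

```python
def early_stopping_epoch(val_accuracy, patience=5):
    """Return (epoch at which training stops, epoch with the best value)."""
    best_value = float("-inf")
    best_epoch = 0
    wait = 0  # number of epochs since the last improvement
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # patience never exhausted: training runs through all epochs
    return len(val_accuracy), best_epoch


# Made-up validation accuracies, NOT results from this notebook
toy_history = [0.85, 0.87, 0.86, 0.865, 0.86, 0.85, 0.84, 0.83]
print(early_stopping_epoch(toy_history, patience=5))
# prints (7, 2)
```

In this toy run, the best value occurs in epoch 2, training stops after epoch 7, and the weights from epoch 2 would be the ones kept.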

In [None]:
model_name = 'model_5kW_trans_RR'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

with open(model_result_file, 'rb') as input_file:
    resDict_5kW_trans_RR = pickle.load(input_file)

Now we visualize the development of the accuracy and the loss over the training epochs:

In [None]:
plt.subplot(2, 1, 1)
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['train_loss'],
         'g:', label = resDict_5kW_trans_RR['model_name'] +', Training loss')
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['val_loss'],
         'g',  label = resDict_5kW_trans_RR['model_name'] +', Validation loss')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

plt.subplot(2, 1, 2)
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['train_accuracy'],
         'k:', label = resDict_5kW_trans_RR['model_name'] +', Training acc')
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['val_accuracy'],
         'k',  label = resDict_5kW_trans_RR['model_name'] +', Validation acc')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.xlabel('Epochs')
plt.show()

Using the results from the previous notebook on LSTMs, we can compare the accuracy of, for example, the 1-layer LSTM network with that of our transformer network.

In [None]:
model_name = 'model_5kW_ae100_1LSTM_ADAM'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

with open(model_result_file, 'rb') as input_file:
    resDict_ae100_1LSTM = pickle.load(input_file)

In [None]:
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['train_accuracy'],
         'k:', label = resDict_5kW_trans_RR['model_name'] +', Training acc')
plt.plot(resDict_5kW_trans_RR['epochs'], resDict_5kW_trans_RR['val_accuracy'],
         'k',  label = resDict_5kW_trans_RR['model_name'] +', Validation acc')

plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['train_accuracy'],
         'g:', label = resDict_ae100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['val_accuracy'],
         'g',  label = resDict_ae100_1LSTM['model_name'] +', Validation acc')

plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.xlabel('Epochs')
plt.show()

We see that the transformer-based solution already achieves good accuracy after the first training epoch, but then suffers from overfitting. The LSTM-based classifier with adapted pretrained embeddings takes longer to reach good performance, but then outperforms the transformer-based solution.
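
To back such a visual comparison with numbers, one could summarize each result dictionary by its best validation epoch and by the gap between training and validation accuracy, which serves as a simple indicator of overfitting. The sketch below assumes the dictionary layout used in this notebook (`epochs`, `train_accuracy`, `val_accuracy`, `model_name`). The helper `summarize_run` is hypothetical, and the toy values are placeholders, not measured results.

```python
def summarize_run(res_dict):
    """Return the best validation epoch, its accuracy, and the train/val gaps."""
    val_acc = res_dict['val_accuracy']
    train_acc = res_dict['train_accuracy']
    # index of the (first) epoch with the highest validation accuracy
    best_idx = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {
        'model_name': res_dict['model_name'],
        'best_epoch': list(res_dict['epochs'])[best_idx],
        'best_val_accuracy': val_acc[best_idx],
        'gap_at_best': train_acc[best_idx] - val_acc[best_idx],
        'final_gap': train_acc[-1] - val_acc[-1],
    }


# Placeholder values for illustration only, NOT results from this notebook
toy_results = {
    'model_name': 'toy_model',
    'epochs': range(1, 5),
    'train_accuracy': [0.80, 0.90, 0.95, 0.98],
    'val_accuracy': [0.85, 0.86, 0.84, 0.83],
}
summary = summarize_run(toy_results)
print(summary)
```

Calling `summarize_run(resDict_5kW_trans_RR)` and `summarize_run(resDict_ae100_1LSTM)` in this notebook would give the corresponding figures for the two models; a `final_gap` that is much larger than `gap_at_best` indicates that the model kept fitting the training data after validation accuracy had peaked.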