# Exercise project 5 - Transformer networks

The goal of this notebook was to implement a Transformer-based text classification model for sentiment analysis. The model architecture was taken from the Keras example on text classification with a Transformer, which shows how to process and classify text data using Transformer layers.

The dataset consisted of 50,000 IMDB movie reviews, split equally between training and validation sets (25,000 each). The vocabulary was limited to the 20,000 most frequent words, and each review was truncated or padded to a fixed length of 200 tokens so that all sequences had the same length.
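To make the truncation and padding step concrete, here is a minimal pure-Python sketch of what happens to a single review. It assumes the default `pad_sequences` settings used in this notebook (`padding="pre"`, `truncating="pre"`), and the helper name `pad_or_truncate` is hypothetical; it is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Illustrative stand-in for pad_sequences on one sequence.

    Assumes maxlen > 0 and the 'pre' defaults for padding and truncating.
    """
    if len(seq) >= maxlen:
        # 'pre' truncation drops tokens from the front, keeping the last maxlen
        return list(seq[-maxlen:])
    # 'pre' padding adds the fill value in front of the sequence
    return [value] * (maxlen - len(seq)) + list(seq)


short = pad_or_truncate([5, 8, 13], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 13]
print(long)   # [3, 4, 5, 6, 7]
```

Under these defaults, a review longer than 200 tokens keeps its last 200 tokens, and a shorter one is left-padded with zeros.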


Reference: https://keras.io/examples/nlp/text_classification_with_transformer/


In [None]:
import keras
from keras import ops
from keras import layers

## Implement a Transformer block as a layer

In [None]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

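    def call(self, inputs):
        # Self-attention: the sequence attends to itself (query = value = inputs)
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output)
        # Residual connection followed by layer normalization
        out1 = self.layernorm1(inputs + attn_output)
        # Position-wise feed-forward network, again with residual + layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)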
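For intuition about what the attention sub-layer in this block computes, here is a rough NumPy sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: a single head, no batch dimension, no masking, and no output projection. The function name and the random weight matrices are hypothetical, and this is not how `layers.MultiHeadAttention` is implemented internally.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on a (seq_len, dim) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every position with every other position, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted average of the value vectors
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, embed_dim = 4, 8  # toy sizes, not the notebook's 200 and 32
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Each row of `weights` sums to 1, so every output position is a mixture of all positions in the sequence.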

## Implement embedding layer
This layer uses two separate embedding layers: one for the tokens and one for the token indices (positions).

In [None]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

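    def call(self, x):
        # Use the actual sequence length of the input batch
        maxlen = ops.shape(x)[-1]
        # Position indices 0 .. maxlen - 1
        positions = ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        # Position embeddings broadcast across the batch dimension
        return x + positions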
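The sum of token and position embeddings can be illustrated with plain NumPy lookups. This is only a sketch: the tables are random, the sizes are toy values chosen for readability, and the variable names are hypothetical rather than taken from the layer above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 10, 4, 3  # toy sizes for illustration
token_table = rng.normal(size=(vocab_size, embed_dim))
pos_table = rng.normal(size=(maxlen, embed_dim))

tokens = np.array([[2, 7, 7, 1]])        # shape (batch=1, maxlen)
positions = np.arange(tokens.shape[-1])  # [0, 1, 2, 3]

# Look up both tables and add; the position rows broadcast over the batch
embedded = token_table[tokens] + pos_table[positions]
print(embedded.shape)  # (1, 4, 3)

# Token 7 appears at positions 1 and 2 but gets two different vectors
print(np.allclose(embedded[0, 1], embedded[0, 2]))  # False
```

The second check shows why the position table matters: without it, repeated tokens would be indistinguishable regardless of where they occur.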

## Download and prepare dataset

In [None]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Keep at most 200 tokens per review (pad_sequences truncates from the front by default)
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)

25000 Training sequences
25000 Validation sequences


## Create classifier model using transformer layer
The Transformer layer outputs one vector for each time step of the input sequence. Here, we take the mean across all time steps and apply a feed-forward network on top of it to classify the text.

In [None]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

The architecture is a simplified Transformer: a single encoder-style block followed by pooling and a small classification head, aimed at text classification rather than language generation or translation.
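The pooling step that turns one vector per time step into one vector per review is just a mean over the time axis. Here is a small NumPy sketch with made-up numbers, assuming the `(batch, steps, features)` layout that `GlobalAveragePooling1D` uses by default:

```python
import numpy as np

# One review, three time steps, two features per step (toy values)
x = np.array([[[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]]])  # shape (batch=1, steps=3, features=2)

# Average over the time-step axis, leaving one vector per review
pooled = x.mean(axis=1)
print(pooled.shape)  # (1, 2)
print(pooled)        # [[3. 4.]]
```

Since no mask is used in this notebook's model, the padded positions are presumably included in the average as well.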


## Train and Evaluate




The model was trained with the Adam optimizer and sparse categorical cross-entropy as the loss function, since the labels were integers rather than one-hot vectors. Training ran for 2 epochs with a batch size of 32.
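To see why the sparse variant fits integer labels, the following sketch compares the two losses for a single sample. The helper names `sparse_ce` and `one_hot_ce` and the probabilities are hypothetical; Keras additionally averages the per-sample values over the batch.

```python
import math


def sparse_ce(label, probs):
    """Cross-entropy when the label is a class index."""
    return -math.log(probs[label])


def one_hot_ce(one_hot, probs):
    """Cross-entropy when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.2, 0.8]                  # softmax output for one review
sparse = sparse_ce(1, probs)        # label given as the integer 1
dense = one_hot_ce([0, 1], probs)   # same label, one-hot encoded
print(round(sparse, 4))             # 0.2231
print(math.isclose(sparse, dense))  # True
```

Both forms give the same value; the sparse version simply skips the one-hot conversion.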


In [None]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
)

Epoch 1/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 120s 147ms/step - accuracy: 0.7160 - loss: 0.5180 - val_accuracy: 0.8796 - val_loss: 0.2856
Epoch 2/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 114s 146ms/step - accuracy: 0.9266 - loss: 0.1965 - val_accuracy: 0.8704 - val_loss: 0.3271


## Personal Analysis / Reflection

The model reached 92.66% training accuracy and 87.04% validation accuracy in the second epoch, which I found impressive for only 2 epochs of training. However, between epoch 1 and epoch 2 the training loss fell from 0.5180 to 0.1965 while the validation loss rose from 0.2856 to 0.3271 and validation accuracy dipped from 87.96% to 87.04%, which indicates that the model was starting to overfit.
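The overfitting signal can be checked directly from the per-epoch losses. The sketch below uses the values printed in the training log; in the notebook the same lists should be available as `history.history["loss"]` and `history.history["val_loss"]`. The variable names are my own.

```python
# Per-epoch losses copied from the training output above
train_loss = [0.5180, 0.1965]
val_loss = [0.2856, 0.3271]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Training loss still falling while validation loss rises hints at overfitting
overfitting_hint = train_loss[-1] < train_loss[0] and val_loss[-1] > val_loss[0]

print(best_epoch)        # 1
print(overfitting_hint)  # True
```

With only two epochs this is a weak signal, but it suggests that the weights after epoch 1 generalized slightly better than those after epoch 2.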

The Transformer block appeared to capture relationships between words well, even across sequences of 200 tokens. The embedding layer with positional embeddings supplied word-order information that self-attention alone does not have. Despite training for only 2 epochs, the model performed well, showing that it learned the dataset's patterns effectively.
