# Lab 05 - Transformers
The lab is adopted from the [popular PyTorch sentiment analysis tutorial by bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

In this notebook we will be using the transformer model, first introduced in [this](https://arxiv.org/abs/1706.03762) paper. Specifically, we will be using the BERT (Bidirectional Encoder Representations from Transformers) model from [this](https://arxiv.org/abs/1810.04805) paper.

Transformer models are considerably larger than anything else covered so far in the module. As such we are going to use the [transformers library](https://github.com/huggingface/transformers) to `get pre-trained transformers and use them as our embedding layers`. We will freeze (not train) the transformer and only train a remainder of the model which learns from the representations produced by our frozen pre-trained Transformer. In this lab, we will be using a multi-layer Bi-directional Gated Recurrent Unit (BiGRU), however any neural network (MLP/LSTM/...) can learn from these representations.

Additionally, it's important to note that transformer models like BERT come with their own pre-trained tokenizers. Since the pre-training process has already used a tokenizer to format the data prior to model input, you must utilize the same tokenizer designed to work seamlessly with it's respective model, ensuring that the text input is appropriately formatted. This includes tokenizing the text into tokens understood by the model, adding necessary special tokens, and converting these tokens into their corresponding ID numbers from the model's vocabulary, which is also a crucial part of our learning in this lab. As a result, we do not need to perform manual tokenization or vocabulary mapping steps, which significantly simplifies the preprocessing pipeline. By leveraging these integrated tokenizers, we ensure that our text data is processed in a manner that is fully compatible with the transformer model's requirements, allowing us to focus on fine-tuning the model for our specific task - `sentiment analysis`.

Furthermore, BERT in an autoencoder stacked with up to 12 encoder layers as shown to you in the vanilla Transformers architecture. Different transformer models (like many popular decoder models - GPT/Llama/DeepSeek) utilize distinct tokenizers tailored to their architecture and training paradigms. For instance, models such as LLaMA (https://medium.com/@vyperius117/understanding-the-llama2-tokenizer-working-with-the-tokenizer-locally-using-transformers-2e0f9e69d786), GPT-4(https://www.youtube.com/watch?v=zduSFxRajkE&si=qXbrvMZSqPlwdKT6), and Mistral(https://keras.io/api/keras_nlp/models/mistral/mistral_tokenizer/) each come with their specialized tokenizers. One notable tokenizer that is widely used across various models is the SentencePiece tokenizer. Unlike traditional tokenizers that operate on the word level and may struggle with languages without clear word boundaries, SentencePiece tokenizes text at the subword level. This approach allows for a more flexible handling of unknown words, better preservation of linguistic information, and improved model performance across diverse languages. SentencePiece does not rely on pre-tokenization and works directly on the raw text, making it highly versatile and effective for a wide range of NLP tasks.

In [20]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a target="_blank" href="https://colab.research.google.com/github/surrey-nlp/NLP-2025/blob/main/lab05/lab05-Transformers%2BTransferLearning.ipynb">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>'
)
display(colab_button)

In [21]:
# Install dependencies
%pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
%pip install torchdata==0.6.1 torchtext==0.15.2 portalocker==2.7.0
%pip install ipywidgets transformers tqdm



## Preparing Data

First, as always, let's set the random seeds for deterministic results.

In [None]:
import torch
import torchtext

SEED = 1234
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

print("PyTorch Version: ", torch.__version__)
print("torchtext Version: ", torchtext.__version__)
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'}.")

PyTorch Version:  1.11.0+cu113
torchtext Version:  0.12.0
Using GPU.


The transformer has already been trained with a specific vocabulary, which means we need to train with the exact same vocabulary and also tokenize our data in the same way that the transformer did when it was initially trained.

Luckily, the transformers library has tokenizers for each of the transformer models provided. In this case we are using the BERT model which ignores casing (i.e. will lower case every word). We get this by loading the pre-trained `bert-base-uncased` tokenizer.

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The `tokenizer` has a `vocab` attribute which contains the actual vocabulary we will be using. We can check how many tokens are in it by checking its length.

In [None]:
len(tokenizer.vocab)

30522

Using the tokenizer is as simple as calling `tokenizer.tokenize` on a string. This will tokenize and lower case the data in a way that is consistent with the pre-trained transformer model.

In [None]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']


##### Handling punctuations
The above text contains punctuations and it is also considered a seperate token. Now lets see some examples how spacing characters like tabs and newlines are handled by the BERT tokenizer.

In [None]:
# original input string
print(tokenizer(['hello world']))

# input string with tab (\t) character
print(tokenizer(['hello	world']))

# input string with newline (\n) character
print(tokenizer(['''
    hello
    world
''']))

{'input_ids': [[101, 7592, 2088, 102]], 'token_type_ids': [[0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1]]}
{'input_ids': [[101, 7592, 2088, 102]], 'token_type_ids': [[0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1]]}
{'input_ids': [[101, 7592, 2088, 102]], 'token_type_ids': [[0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1]]}


Whitespace is added before and after every punctuation character. This allows punctuation characters to be treated as separate input tokens, apart from the words that they are connected with in the input string.

For example, the string "hello, world!" is split into following 6 tokens : \
[CLS] <br>
hello <br>
,  <br>
world <br>  
!  <br>
[SEP] <br>

In [None]:
print(tokenizer(['hello, world!']))

##### Out-of-vocabulary tokens
The BERT Tokenizer’s vocabulary contains a limited set of unique tokens, which means that there is a possibility of coming across a token that is not present in the vocabulary. To handle such cases, the vocabulary contains a special token, [UNK] which is used to represent any “out-of-vocabulary” input token.

In [None]:
# Print only the 'input_ids'
print(tokenizer(['hello world 👋'])['input_ids'])

# Use f-string for formatting (Python 3.6+) to access the token corresponding to id 100
token_with_id_100 = list(tokenizer.get_vocab().keys())[list(tokenizer.get_vocab().values()).index(100)]
print(f"Token with id 100: {token_with_id_100}")

## Or, if you're using an older version of Python, use the .format() method
#print("Token with id 100: {}".format(token_with_id_100))

We can numericalize tokens using our vocabulary using `tokenizer.convert_tokens_to_ids`.

In [None]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]


The transformer was also trained with special tokens to mark the beginning and end of the sentence, detailed [here](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel). As well as a standard padding and unknown token. We can also get these from the tokenizer.

**Note**: the tokenizer does have a beginning of sequence and end of sequence attributes (`bos_token` and `eos_token`) but these are not set and should not be used for this transformer.

In [None]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can get the indexes of the special tokens by converting them using the vocabulary...

In [23]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
## Add correct call for converting pad_token to id.
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


...or by explicitly getting them from the tokenizer.

In [25]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id
## Add correct tokenizer token id for unknown token.

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


Another thing we need to handle is that the model was trained on sequences with a defined maximum length - it does not know how to handle sequences longer than it has been trained on. We can get the maximum length of these input sizes by checking the `max_model_input_sizes` for the version of the transformer we want to use. In this case, it is 512 tokens.

In [26]:
max_input_length = tokenizer.model_max_length
print(max_input_length)

512


Much like in the previous labs, we will need to define a pipeline component that will call the tokenizer and handle all the tokenization for us. We will also convert the tokenizer's vocab to a torchtext `Vocab` object.

In [28]:
from torchtext.vocab import vocab
from torchtext.data.utils import get_tokenizer

class TransformerTokenizer(torch.nn.Module):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer

    def forward(self, input):
        if isinstance(input, list):
            tokens = []
            for text in input:
                tokens.append(self.tokenizer.tokenize(text))
            return tokens
        elif isinstance(input, str):
            return self.tokenizer.tokenize(input)
        raise ValueError(f"Type {type(input)} is not supported.")

tokenizer_vocab = vocab(tokenizer.vocab, min_freq=0)

We will then define our text processing pipeline.

1. First we use the tokenizer to tokenize the text.
2. Then we convert each token to its vocabulary ID.
3. We will then cut the text to a maximum length. Note that the actual length we truncate to is 2 tokens shorter than the maximum length allowed by the model. This is because we will add two more tokens, one at the begginning and one at the end.
4. Add the Beginning of Sentence token a the beginning.
5. Add the End of Sentence token at the end.
6. Convert to tensor and pad

In [29]:
import torchtext.transforms as T

text_transform = T.Sequential(
    TransformerTokenizer(tokenizer),  # Tokenize
    T.VocabTransform(tokenizer_vocab),  # Convert to vocab IDs
    T.Truncate(max_input_length - 2),  # Cut to max length
    T.AddToken(token=tokenizer_vocab["[CLS]"], begin=True),  # BOS token
    T.AddToken(token=tokenizer_vocab["[SEP]"], begin=False),  # EOS token
    T.ToTensor(padding_value=tokenizer_vocab["[PAD]"]),  # Convert to tensor and pad
)

We load the data and create the validation splits as before.

**WARNING**: this will download the data in a hidden folder ".data" and will take some time

In [31]:
from torchtext.datasets import IMDB
from torchtext.data.functional import to_map_style_dataset

# Load dataset
train_data_full, test_data_full = IMDB(root="./", split=("train", "test"))

# Convert to map style
train_data_full = to_map_style_dataset(train_data_full)
test_data_full = to_map_style_dataset(test_data_full)
## Add same function call from above but for test_data_full.

Since the dataset is substantial (several thousand examples), for this lab we will limit this to just a few ($1000$, but feel free to reduce further) so that the training can finish in approximately 5-10 min. Obviously the model will not really perform well with such small amount of data, so ideally you would run this overnight or on a GPU device to get the model to train properly.

In [34]:
from torch.utils.data import random_split

print("Full train data:", len(train_data_full))
print("Full test data:", len(test_data_full))

N_SAMPLES = 1_000

# Validation split
split_ratio = 0.7  # 70/30 split
train_samples = int(split_ratio * N_SAMPLES)
valid_samples = int((1 - split_ratio) * N_SAMPLES)
test_samples = N_SAMPLES
rest_samples = len(train_data_full + test_data_full) - (2 * N_SAMPLES)  # Rest of the data

# Split the entire dataset (train + test) *randomly* into our new train, valid, test sets
train_data, valid_data, test_data, rest_data = random_split(train_data_full + test_data_full, [train_samples, valid_samples, test_samples, rest_samples])

print("\nTrimmed train data:", len(train_data))
print("Validation data:", len(valid_data))
print("Trimmed test data:", len(test_data))
print(rest_samples)

Full train data: 25000
Full test data: 25000

Trimmed train data: 700
Validation data: 300
Trimmed test data: 1000
48000


In [64]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 700
Number of validation examples: 300
Number of testing examples: 1000


Although we've handled the vocabulary for the text, we still need to build the vocabulary for the labels.

In [105]:
from collections import OrderedDict, Counter

## CHALLENGE - VERIFY IF THE BELOW LINE IS CORRECT? If not, rectify based on your understanding of mapping labels.
#label_vocab = vocab(OrderedDict([("neg", 0), ("pos", 1)]))

In [106]:
label_map = vocab(Counter(("neg","pos")))

In [107]:
print(label_map.get_stoi())


{'pos': 1, 'neg': 0}


Finally, the label processing pipeline:

In [108]:
label_transform = T.Sequential(
    T.LabelToIndex(label_vocab.get_itos()),  # Convert to integer
    T.ToTensor(),  # Convert to tensor
)

As before, we create the `DataLoader`s.

Note that the batch size is smaller than usual. This is mostly to speed up training. In a usual scenario this should be 128.

In [40]:
from torch.utils.data import DataLoader

#The dataset from torchtext is labelled with 1 and 2, we therefore need to map this to string so that the transform function works
mapping = {1: 'neg', 2: 'pos'}
BATCH_SIZE = 64

def collate_batch(batch):
    labels, texts = zip(*batch)

    #We map the numerical labels to string labels
    labels = [mapping[label] for label in labels]
    labels = label_transform(list(labels))
    texts = text_transform(list(texts))

    return labels.float().to(DEVICE), texts.to(DEVICE)

def _get_dataloader(data):
    return DataLoader(data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

train_dataloader = _get_dataloader(train_data)
valid_dataloader = _get_dataloader(valid_data)
test_dataloader = _get_dataloader(test_data)

## Build the Model

Next, we'll load the pre-trained model, making sure to load the same model as we did for the tokenizer.

In [41]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Next, we'll define our actual model.

Instead of using an embedding layer to get embeddings for our text, we'll be using the pre-trained transformer model. These embeddings will then be fed into a GRU to produce a prediction for the sentiment of the input sentence. We can get the embedding dimension size (called the `hidden_size`) from the transformer via its config attribute. The rest of the initialization is standard.

**Challenge**: Fill in the `TODO` segments to define the model's standard PyTorch layers.

Within the forward pass, we wrap the transformer in a `no_grad` to ensure no gradients are calculated over this part of the model. The transformer actually returns the embeddings for the whole sequence as well as a *pooled* output. The [documentation](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel) states that the pooled output is "usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence", hence we will not be using it. The rest of the forward pass is the standard implementation of a recurrent model, where we take the hidden state over the final time-step, and pass it through a linear layer to get our predictions.

In [42]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()

        self.bert = bert
        self.embedding_dim = bert.config.to_dict()['hidden_size']

        # TODO - Define a GRU layer with n_layers layers
        # bidirectionality conditional on the bidirectional variable, and
        # dropout if there are more than two layers present.
        # Note that the batch dimension should be first.
        # You can take a look at Lab 6 for inspiration on PyTorch's recurrent unit API,
        # or look at the GRU documentation:
        # https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
        self.rnn = ...

        # TODO - Define a linear layer that takes the GRU output and transforms it to a dimensionality
        # of output_dim.
        # Hint: consider what the in_features argument should be if the GRU is bidirectional and each
        # direction has dimensionality of hidden_dim
        self.out = ...

        # TODO - Define a dropout layer
        self.dropout = ...

    def forward(self, text):

        with torch.no_grad():
            embedded = self.bert(text)[0]

        _, hidden = self.rnn(embedded)

        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])

        return self.out(hidden)

Next, we create an instance of our model using standard hyperparameters.

In [43]:
HIDDEN_DIM = 64  # 256 is better, less than 64 is not very favourable.
OUTPUT_DIM = 1  # We only need one neuron as output
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

We can check how many parameters the model has. Our standard models have under 5M, but this one has 110M! Luckily, most of these parameters are from the transformer and we will not be training those.

In [44]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 109,482,240 trainable parameters


In order to freeze paramers (not train them) we need to set their `requires_grad` attribute to `False`. To do this, we simply loop through all of the `named_parameters` in our model and if they're a part of the `bert` transformer model, we set `requires_grad = False`.

In [45]:
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False

In [46]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 0 trainable parameters


We can now see that our model has under 3M trainable parameters, making it almost comparable to the `FastText` model. However, the text still has to propagate through the transformer which causes training to take considerably longer.

We can double check the names of the trainable parameters, ensuring they make sense. As we can see, they are all the parameters of the GRU (`rnn`) and the linear layer (`out`).

In [47]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

## Train the Model

As is standard, we define our optimizer and criterion (loss function).

In [48]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [49]:
criterion = nn.BCEWithLogitsLoss()

Place the model and criterion onto the GPU (if available)

In [50]:
model = model.to(DEVICE)
criterion = criterion.to(DEVICE)

Next, we'll define functions for: calculating accuracy, performing a training epoch, performing an evaluation epoch and calculating how long a training/evaluation epoch takes.

In [51]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc

In [52]:
from tqdm import tqdm

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in tqdm(iterator, desc="\tTraining"):
        optimizer.zero_grad()

        labels, texts, lengths = batch  # TODO: this has to match the order in collate_batch
        predictions = model(texts, lengths).squeeze(1)
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [53]:
from tqdm import tqdm

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in tqdm(iterator, desc="\tEvaluation"):
            labels, texts, lengths = batch  # TODO: this has to match the order in collate_batch
            predictions = model(texts, lengths).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [54]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we'll train our model. This takes considerably longer than any of the previous models due to the size of the transformer. Even though we are not training any of the transformer's parameters we still need to pass the data through the model which takes a considerable amount of time on a standard GPU.

The performance won't be great due to using a subset of the data, a small number of epochs and small batches, but raising those values should yield considerably better performance.

In [112]:
# TODO: You will get an error when you run this section, read the error message and fix the issue.
# Hint: Take a look at your train and evaluate function

N_EPOCHS = 5

best_valid_loss = float('inf')
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'} for training.")

for epoch in range(N_EPOCHS):
    print(f'Epoch: {epoch+1:02}')
    start_time = time.time()

    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

    valid_loss, valid_acc = evaluate(model, valid_dataloader, criterion)
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'transformer-model.pt')

Using CPU for training.
Epoch: 01


	Training:   0%|          | 0/11 [00:00<?, ?it/s]


RuntimeError: Token neg not found and default index is not set

We'll load up the parameters that gave us the best validation loss and try these on the test set - which gives us our best results so far!

In [None]:
model.load_state_dict(torch.load('transformer-model.pt'))

# If you want to load a model trained on a GPU, but the current device is on CPU, then you need to explicitly state that
# >>> model.load_state_dict(torch.load('tut6-model.pt', map_location=torch.device('cpu')))

In [None]:
test_loss, test_acc = evaluate(model, test_dataloader, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

## Inference

We'll then use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a fake batch dimension and then pass it through our model.

In [None]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(DEVICE)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

In [None]:
predict_sentiment(model, tokenizer, "This film is terrible")

In [None]:
predict_sentiment(model, tokenizer, "This film is great")