<a href="https://colab.research.google.com/github/redmondoisin/ml_short_projects/blob/main/modded_bert_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Imports etc

In [None]:
!pip install datasets tokenizers
!pip install torch --upgrade
from IPython.display import clear_output
clear_output()

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from datasets import load_dataset
import random
import math
import time
from tqdm import tqdm
from contextlib import nullcontext
from google.colab import drive

SEED = 1234
random.seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
!nvidia-smi

Mon Mar 31 01:34:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# 1. Tokenizer and Dataset Preparation

In [None]:
special_tokens = ["[PAD]", "[UNK]", "[SEP]", "[MASK]"]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=10000, min_frequency=5, special_tokens=special_tokens)
tokenizer.pre_tokenizer = Whitespace()
wiki_train_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split='train')
tokenizer.train_from_iterator(wiki_train_dataset['text'], trainer=trainer)
tokenizer.save("./wiki-text-bpe.tokenizer.json")
tokenizer = Tokenizer.from_file("./wiki-text-bpe.tokenizer.json")
print(f"Vocab Size: {tokenizer.get_vocab_size()}")

class MaskedLanguageModelingDataset(Dataset):
    def __init__(self, texts: list, tokenizer: Tokenizer, seq_length: int, mask_prob: float = 0.15) -> None:
        self.texts = texts
        self.tokenizer = tokenizer
        self.seq_length = seq_length
        self.mask_prob = mask_prob
        self.pad_token_id = self.tokenizer.token_to_id("[PAD]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")
        self.vocab_size = self.tokenizer.get_vocab_size()

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> tuple:
        encoded_text = self.tokenizer.encode(self.texts[idx]).ids
        if len(encoded_text) < self.seq_length:
            encoded_text += [self.pad_token_id] * (self.seq_length - len(encoded_text))
        else:
            encoded_text = encoded_text[:self.seq_length]
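        input_ids = encoded_text.copy()
        labels = [-100] * self.seq_length  # -100 is skipped by nn.CrossEntropyLoss(ignore_index=-100)
        # NOTE: this loop visits every position, including the [PAD] positions appended above.
        for i in range(self.seq_length):
            if random.random() < self.mask_prob:
                labels[i] = input_ids[i]  # the target is the original token
                rand_val = random.random()
                if rand_val < 0.8:
                    # 80% of selected tokens: replace with [MASK]
                    input_ids[i] = self.mask_token_id
                elif rand_val < 0.9:
                    # 10%: replace with a random token; the remaining 10% stay unchanged
                    input_ids[i] = random.randint(0, self.vocab_size - 1)
        return torch.tensor(input_ids, dtype=torch.long), torch.tensor(labels, dtype=torch.long)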

seq_length = 512
train_dataset = MaskedLanguageModelingDataset(wiki_train_dataset['text'], tokenizer, seq_length)
wiki_valid_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split='validation')
wiki_test_dataset  = load_dataset("wikitext", "wikitext-2-raw-v1", split='test')
valid_dataset = MaskedLanguageModelingDataset(wiki_valid_dataset['text'], tokenizer, seq_length)
test_dataset  = MaskedLanguageModelingDataset(wiki_test_dataset['text'], tokenizer, seq_length)

Vocab Size: 10000


# 2. Model Definition

In [None]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_size: int, heads: int) -> None:
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.proj_out = nn.Linear(embed_size, embed_size)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        batch_size, seq_len, embed_size = x.size()
        all_queries = self.queries(x).view(batch_size, seq_len, self.heads, self.head_dim).transpose(1, 2)
        all_keys = self.keys(x).view(batch_size, seq_len, self.heads, self.head_dim).transpose(1, 2)
        all_values = self.values(x).view(batch_size, seq_len, self.heads, self.head_dim).transpose(1, 2)

        scores = (all_queries @ all_keys.transpose(-1, -2)) * (1 / math.sqrt(self.head_dim))

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        out = (attn_weights @ all_values).transpose(1, 2).contiguous().view(batch_size, seq_len, embed_size)
        return self.proj_out(out)

class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_size: int, ff_size: int) -> None:
        super().__init__()
        self.fc1 = nn.Linear(embed_size, ff_size)
        self.fc2 = nn.Linear(ff_size, embed_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_size: int, heads: int, ff_size: int) -> None:
        super().__init__()
        self.norm_attn = nn.LayerNorm(embed_size)
        self.norm_ff = nn.LayerNorm(embed_size)
        self.self_attention = MultiHeadSelfAttention(embed_size, heads)
        self.feed_forward = PositionwiseFeedForward(embed_size, ff_size)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attention(self.norm_attn(x), mask)
        x = x + self.feed_forward(self.norm_ff(x))
        return x

class PositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, embed_size: int) -> None:
        super().__init__()
        self.embedding = nn.Embedding(max_seq_len, embed_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device).expand(x.size(0), -1)
        return x + self.embedding(positions)

class BERTModel(nn.Module):
    def __init__(self, vocab_size: int, embed_size: int, ff_size: int, num_layers: int, num_heads: int,
                 max_seq_len: int = 512, pad_token_id: int = 0) -> None:
        super().__init__()
        self.pad_token_id = pad_token_id
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEmbedding(max_seq_len, embed_size)
        self.layers = nn.ModuleList([TransformerEncoderLayer(embed_size, num_heads, ff_size) for _ in range(num_layers)])
        self.un_embedding = nn.Linear(embed_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        mask = (input_ids != self.pad_token_id).unsqueeze(1).unsqueeze(2)
        x = self.embedding(input_ids)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return self.un_embedding(x)


# 3. Training and Evaluation

In [None]:
load_from_drive = input("Load model from Google Drive? (y/n): ").strip().lower()
drive_mounted = False

drive.mount('/content/drive')
drive_mounted = os.path.exists('/content/drive/MyDrive')

vocab_size = tokenizer.get_vocab_size()
embed_size = 768
ff_size = 3072
num_layers = 12
num_heads = 12
BATCH_SIZE = 64

if load_from_drive == "y":
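    model = BERTModel(vocab_size, embed_size, ff_size, num_layers, num_heads,
                      max_seq_len=seq_length, pad_token_id=tokenizer.token_to_id("[PAD]")).to(device)
    model_path = input("Enter the full path to your model file in Google Drive: ").strip()
    try:
        # map_location lets a checkpoint saved on a GPU load on a CPU-only runtime
        model.load_state_dict(torch.load(model_path, map_location=device))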
        print("Model loaded successfully from Google Drive.")
    except Exception as e:
        print(f"An error occurred while loading the model: {e}")
    further_training = (input("Would you like to further train the model? (y/n): ").strip().lower() == "y")
else:
    further_training = True
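    model = BERTModel(
        vocab_size=vocab_size,
        embed_size=embed_size,
        ff_size=ff_size,
        num_layers=num_layers,
        num_heads=num_heads,
        max_seq_len=seq_length,
        pad_token_id=tokenizer.token_to_id("[PAD]"),
    ).to(device)

    def init_weights(m: nn.Module) -> None:
        # Initialise only Linear and Embedding weights from N(0, 0.02).
        # LayerNorm keeps its default gain of 1 and bias of 0; drawing those
        # from N(0, 0.02) would scale every normalised activation towards zero.
        if isinstance(m, (nn.Linear, nn.Embedding)):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.zeros_(m.bias)
    model.apply(init_weights)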

if further_training:
    num_epochs = int(input("Enter the number of epochs to train: ").strip())
    optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    # The `verbose` argument is deprecated in recent PyTorch releases; the current LR is printed every epoch below.
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1)
    criterion = nn.CrossEntropyLoss(ignore_index=-100)  # positions labelled -100 do not contribute to the loss
    num_workers = 4 if torch.cuda.is_available() else 0
    pin_memory = torch.cuda.is_available()

    if torch.cuda.is_available():
        autocast = lambda: torch.amp.autocast(device_type='cuda')
        scaler = torch.amp.GradScaler()
    else:
        autocast = nullcontext
        scaler = None

    def train(model: nn.Module, dataset: Dataset, optimizer: optim.Optimizer, criterion: nn.Module, clip: float) -> float:
        model.train()
        epoch_loss = 0.0
        dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=num_workers, pin_memory=pin_memory)
        for input_ids, labels in tqdm(dataloader, total=len(dataloader), desc='Training'):
            input_ids, labels = input_ids.to(device), labels.to(device)
            optimizer.zero_grad()
            with autocast():
                output = model(input_ids)
                loss = criterion(output.view(-1, output.shape[-1]), labels.view(-1))
            if scaler is not None:
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
                scaler.step(optimizer)
                scaler.update()
            else:
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
                optimizer.step()
            epoch_loss += loss.item()
        return epoch_loss / len(dataloader)

    def evaluate(model: nn.Module, dataset: Dataset, criterion: nn.Module) -> float:
        model.eval()
        epoch_loss = 0.0
        dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)
        with torch.no_grad():
            for input_ids, labels in tqdm(dataloader, total=len(dataloader), desc='Eval'):
                input_ids, labels = input_ids.to(device), labels.to(device)
                with autocast():
                    output = model(input_ids)
                    loss = criterion(output.view(-1, output.shape[-1]), labels.view(-1))
                epoch_loss += loss.item()
        return epoch_loss / len(dataloader)

    def epoch_time(start_time: float, end_time: float) -> tuple:
        elapsed = end_time - start_time
        return int(elapsed / 60), int(elapsed % 60)

    best_valid_loss = float('inf')
    for epoch in range(num_epochs):
        start_time = time.time()
        train_loss = train(model, train_dataset, optimizer, criterion, clip=1.0)
        valid_loss = evaluate(model, valid_dataset, criterion)
        end_time = time.time()
        mins, secs = epoch_time(start_time, end_time)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'BERT-LM-model.pt')
        scheduler.step(valid_loss)
        print(f"Epoch {epoch+1} | Time: {mins}m {secs}s | LR: {optimizer.param_groups[0]['lr']:.1e}")
        print(f"Train Loss: {train_loss:.3f}, PPL: {math.exp(train_loss):.3f}")
        print(f"Val Loss:   {valid_loss:.3f}, PPL: {math.exp(valid_loss):.3f}")
        if drive_mounted:
            if input("Save to Drive? (y/n): ").strip().lower() == "y":
                torch.save(model.state_dict(), '/content/drive/MyDrive/dl-for_nlp-week-6-model.pt')


Load model from Google Drive? (y/n): y
Mounted at /content/drive
Enter the full path to your model file in Google Drive: /content/drive/MyDrive/dl-for_nlp-week-6-model.pt
Model loaded successfully from Google Drive.
Would you like to further train the model? (y/n): n


# 4. Inference and Testing

In [None]:
def predict_masked_tokens(model: nn.Module, tokenizer: Tokenizer, sentence: str, temperature: float = 0.7, top_k: int = 50) -> str:
    model.eval()
    modified = sentence.replace("<mask>", "[MASK]")
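    # Truncate to the positional-embedding limit so overlong inputs do not raise an index error
    encoded = tokenizer.encode(modified).ids[:seq_length]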
    input_ids = torch.tensor([encoded], dtype=torch.long).to(device)
    with torch.no_grad():
        with (torch.amp.autocast(device_type='cuda') if torch.cuda.is_available() else nullcontext()):
            output = model(input_ids)
    predicted = []
    mask_token_id = tokenizer.token_to_id("[MASK]")
    for idx, token_id in enumerate(input_ids[0]):
        if token_id == mask_token_id:
            logits = output[0, idx] / temperature
            topk_logits, topk_ids = torch.topk(logits, top_k)
            probs = torch.softmax(topk_logits, dim=-1)
            pred_id = topk_ids[torch.multinomial(probs, 1).item()].item()
            decoded = tokenizer.decode([pred_id]).strip().replace("##", "")
            predicted.append(decoded)
    for token in predicted:
        modified = modified.replace("[MASK]", token, 1)
    return modified

def evaluate_validation_masked_tokens(model: nn.Module, tokenizer: Tokenizer, validation_texts: list, num_samples: int = 20) -> float:
    correct = 0
    count = 0
    samples = random.sample(validation_texts, min(num_samples, len(validation_texts)))
    for text in samples:
        words = text.split()
        if len(words) < 2:
            continue
        idx = random.randint(0, len(words) - 1)
        original = words[idx]
        words[idx] = "<mask>"
        masked = " ".join(words)
        prediction = predict_masked_tokens(model, tokenizer, masked)
        pred_word = prediction.split()[idx] if idx < len(prediction.split()) else ""
        print(f"Original: {original}, Predicted: {pred_word}")
        if pred_word.strip().lower() == original.strip().lower():
            correct += 1
        count += 1
    acc = correct / count if count > 0 else 0.0
    print(f"Validation Accuracy: {acc:.3f}")
    return acc

_ = evaluate_validation_masked_tokens(model, tokenizer, wiki_valid_dataset['text'])

if input("Save final model to Drive? (y/n): ").strip().lower() == "y" and drive_mounted:
    torch.save(model.state_dict(), '/content/drive/MyDrive/dl-for_nlp-week-6-model.pt')
    print("Final model saved.")


Original: him, Predicted: The
Original: Montgomery, Predicted: that
Original: 27, Predicted: '
Original: the, Predicted: is
Original: Human, Predicted: '
Original: =, Predicted: In
Original: Rugby, Predicted: was
Original: recording, Predicted: .
Original: Spacey, Predicted: the
Original: and, Predicted: a
Validation Accuracy: 0.000
Save final model to Drive? (y/n): n


# DL for NLP Lab 6 Report

## Methodology

The `MultiHeadSelfAttention` class was adapted from a GPT-style causal attention mechanism to a BERT-style bidirectional one by removing the causal mask, so that each token can attend to both preceding and following tokens. The only mask that remains is the padding mask built in `BERTModel.forward`, which hides `[PAD]` keys. Bidirectional attention is what masked language modelling requires: a masked token is predicted from the context on both sides of it.
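
The difference between the two masking schemes can be sketched in plain Python. The helper names `causal_mask` and `padding_mask` are illustrative and do not appear in the notebook, and the padding id of `0` is an assumption based on `[PAD]` being the first special token.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """GPT-style: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def padding_mask(input_ids: list[int], pad_id: int = 0) -> list[list[bool]]:
    """BERT-style: every query may attend to every non-padding key."""
    keep = [tok != pad_id for tok in input_ids]
    return [keep[:] for _ in input_ids]


ids = [17, 42, 99, 0]  # three real tokens followed by one [PAD] (hypothetical ids)

causal = causal_mask(len(ids))
bidirectional = padding_mask(ids)

# Under the causal mask the first token cannot see any later token...
print(causal[0])         # [True, False, False, False]
# ...whereas the bidirectional mask exposes all real tokens and hides only the padding.
print(bidirectional[0])  # [True, True, True, False]
```

In the notebook the same padding-only pattern is produced by `(input_ids != self.pad_token_id)` and broadcast across heads and query positions.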

The `MaskedLanguageModelingDataset` class constructs training samples for masked language modelling, the core objective of BERT pretraining. Each position is selected with probability `mask_prob` (0.15 by default), and selected tokens follow an 80-10-10 strategy: 80% are replaced by the `[MASK]` token, 10% by a random token, and 10% are left unchanged. Because some targets stay visible or are corrupted, the model cannot rely on `[MASK]` always marking the position to predict. Labels are initialised to `-100`, the `ignore_index` of the loss function, so only selected positions contribute to the loss. Furthermore, the complete Wikitext dataset was loaded in an attempt to improve model performance.
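
A minimal sketch of the 80-10-10 rule on a plain list of token ids is given here. It is not the dataset class itself: the function name `mask_tokens` is hypothetical, the ids `mask_id=3` and `pad_id=0` are assumptions based on the order of `special_tokens`, and, as a deliberate variation, padding positions are never selected.

```python
import random

IGNORE_INDEX = -100  # value skipped by nn.CrossEntropyLoss(ignore_index=-100)


def mask_tokens(ids, mask_id, pad_id, vocab_size, mask_prob=0.15, rng=random):
    """Apply 80-10-10 masking to real tokens only; returns (input_ids, labels)."""
    input_ids = list(ids)
    labels = [IGNORE_INDEX] * len(ids)
    for i, tok in enumerate(ids):
        if tok == pad_id or rng.random() >= mask_prob:
            continue  # padding is never selected; most real tokens are left alone
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            input_ids[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            input_ids[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token visible
    return input_ids, labels


rng = random.Random(1234)
tokens = [11, 12, 13, 14, 15, 16] + [0] * 4  # hypothetical ids, padded with pad_id = 0
# mask_prob is raised to 0.5 only so that this short example selects something.
masked, labels = mask_tokens(tokens, mask_id=3, pad_id=0, vocab_size=10000,
                             mask_prob=0.5, rng=rng)
print(masked)
print(labels)
```

Skipping padding keeps trivially predictable `[PAD]` targets out of the loss, which may otherwise pull the average loss down without reflecting how well real tokens are predicted.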

Several optimisations were implemented to accelerate training and inference. Mixed-precision training through PyTorch's automatic mixed precision (AMP) uses the Tensor Cores available on Colab GPUs to raise throughput and reduce memory use. `torch.amp.GradScaler` scales the loss before the backward pass so that small half-precision gradients do not underflow, and the gradients are unscaled again before clipping. Gradient clipping with `torch.nn.utils.clip_grad_norm_` (maximum norm `1.0`) then bounds the size of each update and guards against exploding gradients.
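
The effect of norm-based clipping can be illustrated without PyTorch. The function below is a simplified stand-in written for this report, not the library implementation: it assumes all gradients sit in one flat list and rescales them together whenever their combined L2 norm exceeds `max_norm`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0
```

The direction of the gradient is preserved and only its length changes, which is why clipping has to follow `scaler.unscale_(optimizer)`: the threshold is meant to apply to the true gradients, not the loss-scaled ones.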

Moreover, when a GPU is available the `DataLoader` is configured with parallel data loading (`num_workers=4`) and pinned memory (`pin_memory=True`), which speeds up transfers from CPU to GPU memory and helps keep the GPU busy. The AdamW optimiser (learning rate `5e-5`, weight decay `0.01`) is paired with a `ReduceLROnPlateau` scheduler that halves the learning rate once the validation loss has failed to improve for more than one consecutive epoch.
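
The scheduler's behaviour with `factor=0.5` and `patience=1` can be traced with a small simulation. This is a simplified model written for illustration: it ignores the improvement threshold and cooldown options of the real scheduler, and the loss values are made up.

```python
def plateau_schedule(losses, lr=5e-5, factor=0.5, patience=1):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: two consecutive epochs without improvement trigger a cut.
lrs = plateau_schedule([3.0, 2.5, 2.6, 2.7, 2.4])
print(lrs)  # [5e-05, 5e-05, 5e-05, 2.5e-05, 2.5e-05]
```

A single bad epoch is tolerated, so a noisy validation loss does not immediately shrink the learning rate.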

## Discussion

The model trains on the masked language modelling task, but the results are not promising, regardless of how much data or model capacity was added. Perplexity stabilised at around `2.4` on both the training and validation sets, even when training on an A100 GPU with the entire dataset, while the sampled validation check in Section 4 scored an accuracy of `0.000`. A perplexity this low alongside no correct predictions supports the suspicion that the masking process is at fault, despite considerable effort spent refining it. Another possibility is that an older, weaker checkpoint stored on Google Drive was loaded for evaluation. Overall, the task felt close to completion, but the model did not reach good results within the available time and computational resources.
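
To put the perplexity figure in context, the sketch below converts between loss and perplexity and shows how the mean loss would shift if padded positions received labels, as happens whenever the masking loop runs over all `seq_length` positions rather than only the real tokens. The function names and the numbers passed to `diluted_loss` are hypothetical and are not measurements from this notebook.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


vocab_size = 10000
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly
print(round(uniform_loss, 3))        # 9.21
print(round(math.log(2.4), 3))       # 0.875, the loss behind a perplexity of 2.4


def diluted_loss(real_loss: float, pad_fraction: float, pad_loss: float = 0.0) -> float:
    """Mean loss when a fraction of labelled positions are trivially predictable padding."""
    return (1.0 - pad_fraction) * real_loss + pad_fraction * pad_loss


# Hypothetical: real tokens are predicted poorly, but most labelled positions are padding.
mixed = diluted_loss(real_loss=6.0, pad_fraction=0.85)
print(round(perplexity(mixed), 2))
```

Under these assumed numbers the reported perplexity would look far better than the model's performance on real tokens, which is one way a low perplexity and a sampled accuracy of zero could coexist.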

