# DS4440 - Practical Neural Networks
## Understanding Transformers & Tokenization 

___
**Instructor** : Prof. Steve Schmidt <br/>
**Teaching Assistants** : Vishwajeet Hogale (hogale.v@northeastern.edu) | Chaitanya Agarwal (agarwal.cha@northeastern.edu)

## Problem Statement  
In this notebook, we'll explore **Transformers** using a popular NLP dataset. Transformers are a class of deep learning models designed to understand and generate natural language by learning the underlying contextual relationships in text.

### We will:  
- Build a **Transformer-based model** to generate synthetic text  
- Understand how **Self-Attention** and **Positional Encoding** work together in an encoder-decoder setup (or a decoder-only setup for language modeling)  
- Explore how Transformers can be used for **text generation**, **translation**, and **summarization**  

### What makes Transformers special?  
- They employ **Self-Attention Mechanisms** to capture dependencies across all positions in the input sequence  
- They enable **parallel processing** of tokens, leading to more efficient training compared to traditional sequential models  
- They have become the foundation for state-of-the-art models such as **BERT, GPT, and T5**  
- Transformers are versatile and have revolutionized tasks in **language understanding, text synthesis, and beyond**
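
As a rough illustration of the self-attention idea in the first bullet, the sketch below computes scaled dot-product attention on three made-up token vectors in plain Python. It assumes Q = K = V = x, so it leaves out the learned projection matrices, multiple heads, and masking that a real Transformer layer would have. The function names are illustrative and are not part of this notebook's code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption).

    x is a list of token vectors (seq_len x d). A real layer would first
    apply learned projections to obtain Q, K, and V.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this token's query with every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all value vectors
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

# Three toy 2-dimensional token vectors (made-up values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three positions and sums to 1. Because every row is computed independently of the others, the positions can be processed in parallel.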


## 0. Setup and Load libraries

The cell below installs the libraries and packages this notebook needs, so that the remaining cells run without import errors.

In [1]:
! pip install -r requirements.txt



## 1. Data Gathering

## WikiText-2 Dataset Overview

WikiText-2 is a widely used benchmark dataset for language modeling and text generation tasks. It is curated from Wikipedia articles, providing a rich source of natural language data that is ideal for training and evaluating NLP models.

### Dataset Details

- **Source:** Extracted from Wikipedia articles
- **Content:** Long-form natural language text
- **Token Count:** Approximately 2 million tokens
- **Splits:** 
  - **Training Set:** Main portion of the data for training language models
  - **Validation Set:** Used for hyperparameter tuning and model selection
  - **Test Set:** Reserved for evaluating the final model performance

### Applications

WikiText-2 is particularly suited for:
- **Language Modeling:** Training models to predict the next word in a sequence.
- **Text Generation:** Generating coherent and contextually relevant text.
- **Transfer Learning:** Fine-tuning pre-trained models for specialized tasks.
- **Benchmarking:** Comparing the performance of various language modeling techniques.

### Advantages of Using WikiText-2

- **Rich Linguistic Content:** Maintains detailed natural language structure and a diverse vocabulary.
- **Real-World Data:** The dataset reflects authentic written language, making it valuable for developing robust models.
- **Community Standard:** Widely adopted in the NLP community, facilitating reproducibility and comparison across research works.

### Accessing WikiText-2

The dataset is easily accessible through popular libraries such as Hugging Face's `datasets`. This allows you to quickly download, preprocess, and integrate it into your NLP projects.




In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import AutoTokenizer
import random
from tqdm import tqdm


### Load the Dataset
We use the WikiText-2 raw dataset, which is small enough to run on a laptop. We first drop the empty lines and then concatenate the first 100 non-empty lines of the training split into a single string.

In [3]:
# Load the WikiText-2 dataset (all splits)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def concatenate_text(split, num_lines=100):
    # Keep non-empty lines and join only the first num_lines of them
    lines = [line for line in split["text"] if line.strip() != ""]
    return " ".join(lines[:num_lines])

# Use a much smaller subset of the training data
train_text = concatenate_text(dataset["train"], num_lines=100)


In [4]:
train_text

' = Valkyria Chronicles III = \n  Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n  The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series

### Before we pass the text through a neural network, a few extra steps need to be performed.

## Tokenizers: A Simple Overview

Tokenizers are essential tools in natural language processing (NLP) that convert raw text into a format that models can understand. They break down text into smaller units called *tokens*—which can be words, subwords, or even individual characters—and then map these tokens to unique numerical identifiers (token IDs).

### How Tokenizers Work

- **Text Splitting:** They segment sentences into tokens based on rules or learned patterns.
- **Mapping to IDs:** Each token is converted into a number using a vocabulary. This mapping allows models to process text as numerical data.
- **Handling Special Tokens:** Tokenizers often add special tokens (like `[CLS]` and `[SEP]` for BERT) to indicate the beginning, separation, or end of sequences.
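
The three steps above can be sketched with a toy word-level tokenizer. This is a minimal illustration under simplifying assumptions (whitespace and punctuation splitting, a vocabulary built from one sentence); `ToyTokenizer` is a hypothetical class, not the Hugging Face API used later in this notebook.

```python
import re

class ToyTokenizer:
    """Word-level tokenizer with a fixed vocabulary (illustrative only)."""

    def __init__(self, corpus):
        specials = ["[UNK]", "[CLS]", "[SEP]"]
        words = sorted(set(self._split(corpus)))
        # Vocabulary: token string -> integer ID, and the reverse mapping
        self.id_of = {tok: i for i, tok in enumerate(specials + words)}
        self.token_of = {i: tok for tok, i in self.id_of.items()}

    @staticmethod
    def _split(text):
        # Text splitting: lowercase, then separate words from punctuation
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def encode(self, text, add_special_tokens=True):
        tokens = self._split(text)
        if add_special_tokens:
            tokens = ["[CLS]"] + tokens + ["[SEP]"]
        unk = self.id_of["[UNK]"]
        # Mapping to IDs: unseen tokens fall back to [UNK]
        return [self.id_of.get(tok, unk) for tok in tokens]

    def decode(self, ids):
        return " ".join(self.token_of[i] for i in ids)

toy = ToyTokenizer("hello, how are you doing today?")
ids = toy.encode("Hello, how are you?")
print(ids)
print(toy.decode(ids))
```

Real tokenizers differ mainly in the splitting step: they learn subword units from data rather than splitting on whitespace, which keeps the vocabulary small and greatly reduces unknown tokens.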

### How Models Use Tokenizers

- **Transformers and Other Models:** Models such as Transformers, LSTMs, or even CNNs for NLP require numerical input. Tokenizers provide this by turning text into sequences of token IDs.
- **Embedding Layers:** The token IDs are passed to an embedding layer that transforms them into dense vector representations. These vectors capture semantic information about the tokens.
- **Sequence Processing:** The models then process these sequences to perform tasks like text classification, translation, or generation.
- **Fine-Tuning:** When models are fine-tuned on specific tasks, the tokenized inputs ensure consistency with the pre-training phase, making the learned representations effective for new tasks.

In summary, tokenizers bridge the gap between human language and machine-readable input, enabling complex models to understand and generate text.
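
As a rough picture of the embedding step mentioned above, an embedding layer is a lookup table with one row per token ID. The sketch below uses plain Python lists and random values; in PyTorch, `nn.Embedding` performs the same lookup with trainable rows. The helper names and sizes are made up for illustration.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one row per token ID: [seq_len] -> [seq_len, dim]."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 1, 3], table)
print(len(vectors), len(vectors[0]))   # sequence length, embedding dimension
print(vectors[0] == vectors[2])        # the same ID always maps to the same vector
```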


### Initialize and Compare Two Tokenizers

In [5]:
# Sample text to demonstrate tokenization differences
sample_text = "Hello, how are you doing today?"

In [6]:
# Initialize tokenizers
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")



tokens_gpt2 = tokenizer_gpt2.tokenize(sample_text)
tokens_bert = tokenizer_bert.tokenize(sample_text)

print("GPT-2 tokens:", tokens_gpt2)
print("BERT tokens:", tokens_bert)


GPT-2 tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', 'Ġdoing', 'Ġtoday', '?']
BERT tokens: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']


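You'll notice that GPT-2's byte-level BPE tokenizer preserves case and marks a token that follows a space with the `Ġ` prefix, while BERT's uncased WordPiece tokenizer lowercases the text and drops the spaces. Every word in this sample is common enough to stay whole; rarer words are split into subword pieces, which WordPiece marks with a "##" prefix on continuation pieces.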
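
To see where "##" pieces come from, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single word. The vocabulary is a made-up toy set, not BERT's real one, and `wordpiece_split` is a hypothetical helper.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##izer", "##s", "play", "##ing"}
print(wordpiece_split("tokenization", toy_vocab))
print(wordpiece_split("tokenizers", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```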

### Tokenize the Full Text and Build the Data Pipeline
We tokenize the entire training text (without adding extra special tokens) and then create a dataset class that returns fixed‑length chunks for language modeling.
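
The fixed-length chunking can be pictured on a short list of made-up token IDs. The sketch below mirrors the idea in plain Python: each input window is paired with the same window shifted one token ahead, which serves as the next-token targets. `make_windows` is an illustrative helper, not part of the notebook's pipeline.

```python
def make_windows(token_ids, block_size):
    """Return (input, target) pairs where target is input shifted by one token."""
    pairs = []
    for i in range(len(token_ids) - block_size):
        x = token_ids[i : i + block_size]
        y = token_ids[i + 1 : i + block_size + 1]
        pairs.append((x, y))
    return pairs

toy_ids = [10, 11, 12, 13, 14, 15]
for x, y in make_windows(toy_ids, block_size=4):
    print(x, "->", y)
```

A sequence of N tokens yields N minus `block_size` overlapping windows, so consecutive examples share most of their tokens.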

In [7]:
# Tokenize full training text
encodings_gpt2 = tokenizer_gpt2(train_text, return_tensors="pt", add_special_tokens=False)
encodings_bert = tokenizer_bert(train_text, return_tensors="pt", add_special_tokens=False)


Token indices sequence length is longer than the specified maximum sequence length for this model (9067 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (8620 > 512). Running this sequence through the model will result in indexing errors

These warnings refer to the maximum input lengths of the pretrained GPT-2 and BERT models. They can be ignored here because we only borrow the tokenizers and feed short blocks of tokens to our own LSTM.


### Define a dataset class that slices the long token sequence into blocks of a fixed size:

In [8]:
class LMTokenDataset(Dataset):
    def __init__(self, encodings, block_size=4):
        # Squeeze to remove batch dimension (we provided one long string)
        self.input_ids = encodings["input_ids"].squeeze()  # Shape: [total_tokens]
        self.block_size = block_size

    def __len__(self):
        # Total number of training examples is the total token length minus the block size
        return self.input_ids.size(0) - self.block_size

    def __getitem__(self, idx):
        # x is a block of tokens; y is the same block shifted one token ahead (the next-token targets)
        x = self.input_ids[idx : idx + self.block_size]
        y = self.input_ids[idx + 1 : idx + self.block_size + 1]
        return x, y


In [9]:
block_size = 4

# GPT-2 tokenized dataset and loader
dataset_gpt2 = LMTokenDataset(encodings_gpt2, block_size)
dataloader_gpt2 = DataLoader(dataset_gpt2, batch_size=4, shuffle=True)

# BERT tokenized dataset and loader
dataset_bert = LMTokenDataset(encodings_bert, block_size)
dataloader_bert = DataLoader(dataset_bert, batch_size=4, shuffle=True)


### Define a Simple LSTM Language Model

In [10]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        # x: [batch_size, seq_len]
        x_embed = self.embed(x)
        out, hidden = self.lstm(x_embed, hidden)
        logits = self.fc_out(out)  # [batch_size, seq_len, vocab_size]
        return logits, hidden


Initialize separate models using each tokenizer’s vocabulary size:

In [11]:
vocab_size_gpt2 = tokenizer_gpt2.vocab_size
vocab_size_bert = tokenizer_bert.vocab_size

model_gpt2 = LSTMLanguageModel(vocab_size_gpt2)
model_bert = LSTMLanguageModel(vocab_size_bert)

# Define a shared loss function and a separate Adam optimizer for each model
criterion = nn.CrossEntropyLoss()
optimizer_gpt2 = optim.Adam(model_gpt2.parameters(), lr=1e-3)
optimizer_bert = optim.Adam(model_bert.parameters(), lr=1e-3)


### Feeding Tokenized Data into the LSTM

In [12]:
# For GPT-2 tokenized data
batch_gpt2 = next(iter(dataloader_gpt2))
x_gpt2, y_gpt2 = batch_gpt2
logits_gpt2, _ = model_gpt2(x_gpt2)
print("GPT-2 model logits shape:", logits_gpt2.shape)
# Expected shape: [batch_size, block_size, vocab_size_gpt2]

# For BERT tokenized data
batch_bert = next(iter(dataloader_bert))
x_bert, y_bert = batch_bert
logits_bert, _ = model_bert(x_bert)
print("BERT model logits shape:", logits_bert.shape)
# Expected shape: [batch_size, block_size, vocab_size_bert]


GPT-2 model logits shape: torch.Size([4, 4, 50257])
BERT model logits shape: torch.Size([4, 4, 30522])


This shows that regardless of which tokenizer is used, the resulting integer token IDs can be passed directly into an LSTM model for next‑token prediction. You could proceed to train the models using a loop (with proper handling like gradient clipping, multiple epochs, and validation) and later experiment with text generation.
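
Gradient clipping, mentioned above, rescales the gradients when their combined norm is too large, which helps keep recurrent models from taking unstable steps. The sketch below shows clipping by global L2 norm on made-up numbers in plain Python. In PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`, typically called between `loss.backward()` and `optimizer.step()`; the helper here is only an illustration of the arithmetic.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients down so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Made-up gradients for two parameter tensors, flattened to lists
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```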

In [13]:
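# Define device: prefer CUDA, then Apple's MPS backend, otherwise fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print("Using device:", device)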

# Move models to the device
model_gpt2 = model_gpt2.to(device)
model_bert = model_bert.to(device)

# Training hyperparameters
num_epochs = 3
criterion = nn.CrossEntropyLoss()

# ----- Train GPT-2 Tokenized Model -----
print("Training GPT-2 Tokenized Model")
model_gpt2.train()
for epoch in range(1, num_epochs + 1):
    total_loss_gpt2 = 0.0
    for x, y in tqdm(dataloader_gpt2):
        # Move batch data to the selected device
        x = x.to(device)
        y = y.to(device)
        
        optimizer_gpt2.zero_grad()            # Clear gradients
        logits, _ = model_gpt2(x)              # Forward pass: obtain logits
        # Reshape logits and targets for loss computation
        loss = criterion(logits.view(-1, vocab_size_gpt2), y.view(-1))
        loss.backward()                        # Backpropagation
        optimizer_gpt2.step()                   # Update weights
        total_loss_gpt2 += loss.item()
    avg_loss_gpt2 = total_loss_gpt2 / len(dataloader_gpt2)
    print(f"GPT-2 Epoch {epoch}/{num_epochs} | Average Loss: {avg_loss_gpt2:.4f}")

# ----- Train BERT Tokenized Model -----
print("\nTraining BERT Tokenized Model")
model_bert.train()
for epoch in range(1, num_epochs + 1):
    total_loss_bert = 0.0
    for x, y in tqdm(dataloader_bert):
        # Move batch data to the selected device
        x = x.to(device)
        y = y.to(device)
        
        optimizer_bert.zero_grad()            # Clear gradients
        logits, _ = model_bert(x)              # Forward pass: obtain logits
        # Reshape logits and targets for loss computation
        loss = criterion(logits.view(-1, vocab_size_bert), y.view(-1))
        loss.backward()                        # Backpropagation
        optimizer_bert.step()                   # Update weights
        total_loss_bert += loss.item()
    avg_loss_bert = total_loss_bert / len(dataloader_bert)
    print(f"BERT Epoch {epoch}/{num_epochs} | Average Loss: {avg_loss_bert:.4f}")

Using device: mps
Training GPT-2 Tokenized Model


100%|██████████| 2266/2266 [04:07<00:00,  9.14it/s]


GPT-2 Epoch 1/3 | Average Loss: 5.4512


100%|██████████| 2266/2266 [04:00<00:00,  9.43it/s]


GPT-2 Epoch 2/3 | Average Loss: 2.4916


100%|██████████| 2266/2266 [04:02<00:00,  9.35it/s]


GPT-2 Epoch 3/3 | Average Loss: 1.4530

Training BERT Tokenized Model


100%|██████████| 2154/2154 [02:20<00:00, 15.28it/s]


BERT Epoch 1/3 | Average Loss: 5.4839


100%|██████████| 2154/2154 [02:17<00:00, 15.64it/s]


BERT Epoch 2/3 | Average Loss: 2.6388


100%|██████████| 2154/2154 [02:20<00:00, 15.30it/s]

BERT Epoch 3/3 | Average Loss: 1.5352
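
To put the average losses above in context, two quick calculations help. `nn.CrossEntropyLoss` reports the mean negative log-likelihood in nats, so its exponential is the per-token perplexity, and a model that guesses uniformly over the vocabulary would score a loss of ln(vocab_size). The sketch below applies both to the numbers reported in this notebook; the helper names are illustrative.

```python
import math

def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the average per-token cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy)

def uniform_baseline_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)

# Final-epoch training losses and vocabulary sizes reported in this notebook
runs = {"GPT-2": (1.4530, 50257), "BERT": (1.5352, 30522)}
for name, (loss, vocab_size) in runs.items():
    print(f"{name}: perplexity {perplexity(loss):.2f}, "
          f"uniform baseline loss {uniform_baseline_loss(vocab_size):.2f}")
```

These are training-set figures on a 100-line subset, so they say little about how well either model generalizes. The two losses are also measured per token under different tokenizations, so they are not directly comparable with each other.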





In [14]:
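def generate_text(model, tokenizer, prompt, max_tokens=50):
    model.eval()
    # Encode prompt into token IDs and move to device.
    # Note: for BERT, encode() also adds [CLS]/[SEP], which the model never saw
    # during training; pass add_special_tokens=False to leave them out.
    tokens = tokenizer.encode(prompt, return_tensors="pt").to(device)
    hidden = None
    step_input = tokens  # The first step consumes the whole prompt
    with torch.no_grad():  # No gradients are needed for generation
        for _ in range(max_tokens):
            logits, hidden = model(step_input, hidden)
            # Get logits for the last token and compute probabilities
            next_token_logits = logits[0, -1, :]
            probs = torch.softmax(next_token_logits, dim=-1)
            # Sample the next token from the probability distribution
            next_token = torch.multinomial(probs, num_samples=1)
            # The hidden state already summarizes the past, so feed only the new token next
            step_input = next_token.unsqueeze(0)
            tokens = torch.cat([tokens, step_input], dim=1)
    # Decode tokens to string and return
    return tokenizer.decode(tokens[0])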

# Example usage with the GPT-2 based model:
prompt = "Did you know that"
generated_text = generate_text(model_gpt2, tokenizer_gpt2, prompt)
print("Generated Text (GPT-2 Tokenization):", generated_text)

# Example usage with the BERT based model:
generated_text = generate_text(model_bert, tokenizer_bert, prompt)
print("Generated Text (BERT Tokenization):", generated_text)


Generated Text (GPT-2 Tokenization): Did you know that a garrison 500 strong . H were issued M1816 / M work . The game 's DLC , and both 2011 in Japan , it is Dahau . At the general unpopularity of the J certain members of the arsenal with the Gallian '
Generated Text (BERT Tokenization): [CLS] did you know that [SEP] flintlocks from the little rock arsenal was classified in 1860 as arsenal that the standard ammunition made ", for material had been the time for at this ammunition, clubs was established ; performed : " machinery was made ", ", are a splendid
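
The generation loop above samples directly from the softmax distribution. A common variation is to divide the logits by a temperature first: values below 1 sharpen the distribution toward the most likely token, and values above 1 flatten it. The sketch below illustrates the effect on three made-up logits in plain Python; `sample_with_temperature` is a hypothetical helper, not part of the notebook's code.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)
for t in (0.1, 1.0, 2.0):
    _, probs = sample_with_temperature(toy_logits, temperature=t, rng=rng)
    print(t, [round(p, 3) for p in probs])
```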
