# Report: GPT-2 Training – Architecture and Methodology

This report explains the GPT-2 training process, its architecture, and the mathematical foundations behind its operation. The GPT-2 model is a large-scale language model based on the Transformer architecture. Its primary training objective is **next-token prediction** – predicting the next token in a sequence given the preceding tokens.

---

## 1. Overview of GPT-2

GPT-2 is designed as a decoder-only Transformer model. Unlike encoder–decoder architectures used in machine translation, GPT-2 uses only the Transformer decoder, which is autoregressive. This means that it generates text sequentially by predicting one token at a time, using the previously generated tokens as context.

---

## 2. Model Architecture

The GPT-2 architecture consists of several key components:

### 2.1 Token and Positional Embeddings

Each input token is mapped to a high-dimensional vector using a token embedding matrix. Since the Transformer has no inherent sense of token order, positional embeddings are added to provide sequential information.

- **Token Embedding:**  
  Given a vocabulary size $V$ and embedding dimension $d_{model}$, the token embedding matrix is:
  $$ E \in \mathbb{R}^{V \times d_{model}} $$

- **Positional Embedding:**  
  For a sequence of maximum length $L$, the positional embedding matrix is:
  $$ P \in \mathbb{R}^{L \times d_{model}} $$

For an input sequence of token IDs $x = [x_1, x_2, \dots, x_n]$, the input representation is computed as:
$$
\text{InputEmbedding}(x) = E[x] + P[0:n]
$$

### 2.2 Transformer Blocks

GPT-2 consists of $N$ identical Transformer blocks. Each block contains two sub-layers: a multi-head self-attention mechanism and a feedforward neural network.

#### 2.2.1 Multi-Head Self-Attention

The self-attention mechanism allows the model to focus on different parts of the sequence simultaneously. In multi-head attention, the input is linearly projected into query ($Q$), key ($K$), and value ($V$) vectors. For each attention head, the output is computed as:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$
where:
- $d_k$ is the dimension of each head.
- The softmax function ensures the weights sum to one.

Multiple heads allow the model to capture diverse relationships:
$$
\text{MultiHead}(x) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$
with $W^O$ being a learned projection matrix.

#### 2.2.2 Feedforward Network

After self-attention, a position-wise feedforward network is applied. The feedforward network consists of several linear layers interleaved with non-linear activation functions (typically ReLU). Mathematically, for an input $z$, the feedforward network can be described as:
$$
\text{FFN}(z) = W_2 \cdot \text{ReLU}(W_1 \cdot z + b_1) + b_2
$$

In our implementation, this feedforward network is extended into 10 sequential layers, each applying:
$$
z_{i+1} = \text{ReLU}(W_i z_i + b_i)
$$
for $i = 1, \dots, 10$, with $z_1 = z$.

#### 2.2.3 Residual Connections and Layer Normalization

Both sub-layers (self-attention and feedforward) use residual connections and layer normalization:
$$
\tilde{z} = \text{LayerNorm}(z + \text{Sublayer}(z))
$$

---

## 3. Training Objective: Next-Token Prediction

GPT-2 is trained using the next-token prediction task. The goal is to maximize the likelihood of the next token given the previous tokens.

Given a sequence of tokens $x = [x_1, x_2, \dots, x_n]$, the training objective is to maximize the conditional probability:
$$
P(x) = \prod_{t=1}^{n} P(x_t \mid x_{<t})
$$

### 3.1 Loss Function

The loss function used is the cross-entropy loss. For each token prediction, if $\hat{y}_t$ is the model's output logits for token $t$, and $y_t$ is the true token, the loss is given by:
$$
\mathcal{L} = -\sum_{t=1}^{n} \log P(y_t \mid x_{<t}) = -\sum_{t=1}^{n} \log \left( \frac{\exp(\hat{y}_{t, y_t})}{\sum_{j=1}^{V} \exp(\hat{y}_{t, j})} \right)
$$

This loss is averaged over all tokens in the batch and serves as the training signal for backpropagation.

---

## 4. Autoregressive Text Generation

Once trained, GPT-2 generates text autoregressively. Starting from a prompt, the model predicts one token at a time. At each step, the most recently generated tokens are used as input, and the model computes a probability distribution over the vocabulary for the next token:
$$
x_{t+1} \sim \text{softmax}\left(\frac{\hat{y}_{t}}{\text{temperature}}\right)
$$

The **temperature** parameter controls randomness:
- A temperature $< 1$ makes the model more confident (less random).
- A temperature $> 1$ increases randomness.

The generation loop terminates when an end-of-sequence token is produced or after a predetermined maximum length.

---

## 5. Summary

The GPT-2 model leverages the power of the Transformer architecture to model complex language patterns through self-attention and deep feedforward networks. The training process uses next-token prediction with a cross-entropy loss, guiding the model to capture long-range dependencies in text. Through iterative, autoregressive generation, GPT-2 is capable of producing coherent and contextually relevant text.

This report outlines the mathematical underpinnings and architecture of GPT-2, providing insight into how the model processes input tokens, learns representations, and generates text.



In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset

# ----------------------------
# Simple Tokenizer Definition
# ----------------------------
class SimpleTokenizer:
    def __init__(self, corpus):
        # Split corpus by whitespace and build vocabulary.
        tokens = corpus.split()
        unique_tokens = set(tokens)
        # Reserve indices 0-3 for special tokens.
        self.vocab = {"<PAD>": 0, "<UNK>": 1, "<MASK>": 2, "<EOS>": 3}
        for i, token in enumerate(unique_tokens):
            self.vocab[token] = i + 4
        self.inv_vocab = {i: token for token, i in self.vocab.items()}
        self.mask_token_id = self.vocab["<MASK>"]

    def encode(self, text):
        tokens = text.split()
        # Append EOS token at the end.
        token_ids = [self.vocab.get(token, self.vocab["<UNK>"]) for token in tokens] + [self.vocab["<EOS>"]]
        return token_ids

    def decode(self, token_ids):
        tokens = [self.inv_vocab.get(i, "<UNK>") for i in token_ids]
        return " ".join(tokens)

# ----------------------------
# Transformer Components
# ----------------------------
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, dropout=dropout)
        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
        # Feedforward network that is 10 layers deep.
        ff_layers = []
        for _ in range(10):
            ff_layers.append(nn.Linear(d_model, d_model))
            ff_layers.append(nn.ReLU())
        self.feedforward = nn.Sequential(*ff_layers)
        
    def forward(self, x):
        # x shape: (seq_length, batch_size, d_model)
        attn_output, _ = self.attn(x, x, x)
        x = self.layernorm1(x + self.dropout(attn_output))
        ff_output = self.feedforward(x)
        x = self.layernorm2(x + self.dropout(ff_output))
        return x

# ----------------------------
# GPT-2–like Model Definition
# ----------------------------
class GPT2Model(nn.Module):
    def __init__(self, vocab_size, d_model=255, n_heads=5, n_layers=4, max_seq_length=128, dropout=0.1):
        super(GPT2Model, self).__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.positional_embedding = nn.Embedding(max_seq_length, d_model)
        
        # Stack transformer blocks.
        self.layers = nn.ModuleList([TransformerBlock(d_model, n_heads, dropout) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        # Final projection to vocabulary.
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_seq_length = max_seq_length

    def forward(self, input_ids):
        # input_ids shape: (batch_size, seq_length)
        batch_size, seq_length = input_ids.size()
        positions = torch.arange(0, seq_length, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_length)
        x = self.token_embedding(input_ids) + self.positional_embedding(positions)
        x = x.transpose(0, 1)  # (seq_length, batch_size, d_model)
        for layer in self.layers:
            x = layer(x)
        x = self.ln_f(x)
        x = x.transpose(0, 1)  # back to (batch_size, seq_length, d_model)
        logits = self.head(x)
        return logits

# ----------------------------
# Dataset Definition
# ----------------------------
class TextDataset(Dataset):
    def __init__(self, text, tokenizer, seq_length=128, mask_prob=0.15):
        self.tokenizer = tokenizer
        tokens = tokenizer.encode(text)
        # Split tokens into sequences of fixed length.
        self.sequences = []
        for i in range(0, len(tokens) - seq_length, seq_length):
            self.sequences.append(tokens[i:i+seq_length])
        self.seq_length = seq_length
        self.mask_prob = mask_prob

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        input_seq = []
        target_seq = []
        # Randomly mask tokens with mask_prob.
        for token in seq:
            if random.random() < self.mask_prob:
                input_seq.append(self.tokenizer.mask_token_id)
            else:
                input_seq.append(token)
            target_seq.append(token)
        input_seq = torch.tensor(input_seq, dtype=torch.long)
        target_seq = torch.tensor(target_seq, dtype=torch.long)
        return input_seq, target_seq

# ----------------------------
# Training Setup
# ----------------------------
def train(model, dataloader, optimizer, device, epochs=3):
    model.train()
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # ignore PAD token
    for epoch in range(epochs):
        total_loss = 0.0
        for batch_idx, (input_seq, target_seq) in enumerate(dataloader):
            input_seq = input_seq.to(device)
            target_seq = target_seq.to(device)
            optimizer.zero_grad()
            logits = model(input_seq)  # (batch_size, seq_length, vocab_size)
            loss = loss_fn(logits.view(-1, logits.size(-1)), target_seq.view(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            if batch_idx % 10 == 0:
                print(f"Epoch {epoch+1} Batch {batch_idx}: Loss = {loss.item():.4f}")
        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Average Loss: {avg_loss:.4f}")

# ----------------------------
# Text Generation Function
# ----------------------------
def generate_text(model, tokenizer, prompt, max_length=50, temperature=1.0, device="cpu"):
    model.eval()
    # Encode the prompt.
    input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.long).unsqueeze(0).to(device)
    generated = input_ids.tolist()[0]
    with torch.no_grad():
        for _ in range(max_length):
            # Use only the last max_seq_length tokens.
            input_seq = input_ids[:, -model.max_seq_length:]
            logits = model(input_seq)
            # Focus on the last token's logits.
            logits = logits[:, -1, :] / temperature
            probabilities = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probabilities, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            generated.append(next_token.item())
            if next_token.item() == tokenizer.vocab["<EOS>"]:
                break
    return tokenizer.decode(generated)

# ----------------------------
# Main Execution
# ----------------------------
if __name__ == "__main__":
    # Load a generic text corpus from Hugging Face.
    # Here we use the WikiText-2 raw dataset.
    dataset_hf = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Concatenate all text entries into one large corpus.
    corpus = "\n".join(dataset_hf["text"])
    print("Corpus loaded. Corpus length:", len(corpus))

    # Initialize the tokenizer and dataset.
    tokenizer = SimpleTokenizer(corpus)
    # You might choose a smaller sequence length for testing; adjust as needed.
    dataset = TextDataset(corpus, tokenizer, seq_length=128, mask_prob=0.15)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

    # Define device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Instantiate model.
    vocab_size = len(tokenizer.vocab)
    model = GPT2Model(vocab_size, d_model=255, n_heads=5, n_layers=4, max_seq_length=128, dropout=0.1)
    model.to(device)
    
    # Define optimizer.
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    
    # Train the model.
    train(model, dataloader, optimizer, device, epochs=3)
    
    # Generate sample text.
    prompt = "In a village"
    sample = generate_text(model, tokenizer, prompt, max_length=50, temperature=1.0, device=device)
    print("\nGenerated Text:")
    print(sample)


Corpus loaded. Corpus length: 10929707
Epoch 1 Batch 0: Loss = 11.4190
Epoch 1 Batch 10: Loss = 11.1744
Epoch 1 Batch 20: Loss = 10.7312
Epoch 1 Batch 30: Loss = 10.2201
Epoch 1 Batch 40: Loss = 9.7670
Epoch 1 Batch 50: Loss = 9.4062
Epoch 1 Batch 60: Loss = 9.2608
Epoch 1 Batch 70: Loss = 8.8148
Epoch 1 Batch 80: Loss = 8.6505
Epoch 1 Batch 90: Loss = 8.4590
Epoch 1 Batch 100: Loss = 8.1591
Epoch 1 Batch 110: Loss = 7.7683
Epoch 1 Batch 120: Loss = 7.7228
Epoch 1 Batch 130: Loss = 7.4813
Epoch 1 Batch 140: Loss = 7.3624
Epoch 1 Batch 150: Loss = 7.5579
Epoch 1 Batch 160: Loss = 7.0650
Epoch 1 Batch 170: Loss = 6.8210
Epoch 1 Batch 180: Loss = 6.9552
Epoch 1 Batch 190: Loss = 6.6358
Epoch 1 Batch 200: Loss = 6.5709
Epoch 1 Batch 210: Loss = 6.6885
Epoch 1 Batch 220: Loss = 6.4037
Epoch 1 Batch 230: Loss = 6.5217
Epoch 1 Batch 240: Loss = 6.3920
Epoch 1 Batch 250: Loss = 6.2245
Epoch 1 Batch 260: Loss = 6.3280
Epoch 1 Batch 270: Loss = 6.1458
Epoch 1 Batch 280: Loss = 6.0368
Epoch 1 Bat

In [5]:
# ----------------------------
# Main Execution
# ----------------------------
if __name__ == "__main__":
    # Load a generic text corpus from Hugging Face.
    # Here we use the WikiText-2 raw dataset.
    dataset_hf = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Concatenate all text entries into one large corpus.
    corpus = "\n".join(dataset_hf["text"])
    print("Corpus loaded. Corpus length:", len(corpus))

    # Initialize the tokenizer and dataset.
    tokenizer = SimpleTokenizer(corpus)
    dataset = TextDataset(corpus, tokenizer, seq_length=128, mask_prob=0.15)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

    # Define device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Instantiate model.
    vocab_size = len(tokenizer.vocab)
    model = GPT2Model(vocab_size, d_model=255, n_heads=5, n_layers=4, max_seq_length=128, dropout=0.1)
    model.to(device)
    
    # Define optimizer.
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    
    # Train the model for 10 epochs.
    train(model, dataloader, optimizer, device, epochs=10)
    
    # Generate sample text.
    prompt = "In a village"
    sample = generate_text(model, tokenizer, prompt, max_length=50, temperature=1.0, device=device)
    print("\nGenerated Text:")
    print(sample)


Corpus loaded. Corpus length: 10929707
Epoch 1 Batch 0: Loss = 11.3968
Epoch 1 Batch 10: Loss = 11.1051
Epoch 1 Batch 20: Loss = 10.6287
Epoch 1 Batch 30: Loss = 10.2193
Epoch 1 Batch 40: Loss = 9.8461
Epoch 1 Batch 50: Loss = 9.4496
Epoch 1 Batch 60: Loss = 9.0708
Epoch 1 Batch 70: Loss = 8.7852
Epoch 1 Batch 80: Loss = 8.6846
Epoch 1 Batch 90: Loss = 8.3715
Epoch 1 Batch 100: Loss = 8.1207
Epoch 1 Batch 110: Loss = 7.7645
Epoch 1 Batch 120: Loss = 7.8222
Epoch 1 Batch 130: Loss = 7.5582
Epoch 1 Batch 140: Loss = 7.2951
Epoch 1 Batch 150: Loss = 7.1902
Epoch 1 Batch 160: Loss = 6.8851
Epoch 1 Batch 170: Loss = 6.8737
Epoch 1 Batch 180: Loss = 6.7968
Epoch 1 Batch 190: Loss = 6.9937
Epoch 1 Batch 200: Loss = 6.8764
Epoch 1 Batch 210: Loss = 6.5269
Epoch 1 Batch 220: Loss = 6.6886
Epoch 1 Batch 230: Loss = 6.5595
Epoch 1 Batch 240: Loss = 6.2645
Epoch 1 Batch 250: Loss = 6.3690
Epoch 1 Batch 260: Loss = 6.1817
Epoch 1 Batch 270: Loss = 6.1428
Epoch 1 Batch 280: Loss = 6.2256
Epoch 1 Bat