# Week 3.1: Introducing GPT Architecture

In this notebook, we'll explore the GPT (Generative Pre-trained Transformer) architecture, building on the transformer concepts we learned in Week 2. We'll examine how GPT2 specifically works, why its design choices are effective, and implement key components of the architecture.

## 1. Introduction to GPT Architecture

GPT models are a family of transformer-based language models that use a decoder-only architecture. Unlike the original transformer, which has both an encoder and a decoder, GPT models consist only of stacked transformer decoder blocks, with the encoder-decoder cross-attention removed.

Key characteristics of GPT architecture:
- **Decoder-only**: Uses only the decoder part of the transformer
- **Causal attention**: Each token can only attend to itself and previous tokens (not future tokens)
- **Pre-training and fine-tuning**: Trained on large text corpora and then fine-tuned for specific tasks
- **Autoregressive generation**: Generates text by predicting one token at a time
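
To make the causal-attention constraint concrete, here is a minimal dependency-free sketch. The function name `causal_attention_weights` and the toy score matrix are illustrative assumptions, not part of GPT2 itself. It applies a row-wise softmax in which position `i` may only see positions `0..i`.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, ignoring future positions."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i
        visible = scores[i][: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions (j > i) receive exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# With equal scores, each row spreads its weight evenly over the visible prefix
weights = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 2) for w in row])
```

The first row puts all of its weight on the first token, while the last row averages over all four positions. This is the lower-triangular pattern that autoregressive generation relies on.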

## 2. Setting Up

Let's start by importing the necessary libraries:

In [1]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


## 3. GPT2 Configuration

GPT2 was released in four sizes that differ in embedding dimension, number of layers, and number of attention heads. Here's a class to encapsulate these configuration options:

In [2]:
class GPT2Config:
    """Configuration class for GPT2 model variants"""
    
    def __init__(self, vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12, 
                 dropout=0.1, activation_function="gelu"):
        self.vocab_size = vocab_size  # Size of the vocabulary
        self.n_positions = n_positions  # Maximum sequence length
        self.n_embd = n_embd  # Embedding dimension
        self.n_layer = n_layer  # Number of layers
        self.n_head = n_head  # Number of attention heads
        self.dropout = dropout  # Dropout probability
        self.activation_function = activation_function  # Activation function ("gelu" or "relu")

# Define different GPT2 model sizes
gpt2_small_config = GPT2Config()  # 124M parameters
gpt2_medium_config = GPT2Config(n_embd=1024, n_layer=24, n_head=16)  # 355M parameters
gpt2_large_config = GPT2Config(n_embd=1280, n_layer=36, n_head=20)  # 774M parameters
gpt2_xl_config = GPT2Config(n_embd=1600, n_layer=48, n_head=25)  # 1.5B parameters

# We'll use the small config for our implementation
config = gpt2_small_config
print(f"Model config: {config.__dict__}")

Model config: {'vocab_size': 50257, 'n_positions': 1024, 'n_embd': 768, 'n_layer': 12, 'n_head': 12, 'dropout': 0.1, 'activation_function': 'gelu'}
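
One detail worth checking is how the embedding dimension is divided across attention heads. The short sketch below copies the `(n_embd, n_layer, n_head)` values from the configurations above and computes the per-head size; the `variants` dictionary is only an illustrative container.

```python
# (n_embd, n_layer, n_head) copied from the configurations above
variants = {
    "small": (768, 12, 12),
    "medium": (1024, 24, 16),
    "large": (1280, 36, 20),
    "xl": (1600, 48, 25),
}

head_sizes = {}
for name, (n_embd, n_layer, n_head) in variants.items():
    # The attention module requires n_embd to split evenly across heads
    assert n_embd % n_head == 0
    head_sizes[name] = n_embd // n_head
    print(f"{name:>6}: n_embd={n_embd}, n_head={n_head}, head_size={head_sizes[name]}")
```

All four variants end up with a head size of 64, so the larger models add more heads and more layers rather than wider heads.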


## 4. Building Blocks of GPT2

Now, let's implement the key components of the GPT2 architecture.

### 4.1 Layer Normalization

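GPT2 uses Layer Normalization to stabilize the activations in each layer. Each token's feature vector is normalized to zero mean and unit variance along the embedding dimension, then rescaled by a learned weight and shifted by an optional learned bias.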

In [3]:
class LayerNorm(nn.Module):
    """Layer normalization module with optional bias"""
    
    def __init__(self, ndim, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
        
    def forward(self, x):
        # Calculate the mean and variance along the last dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        
        # Normalize
        x = (x - mean) / torch.sqrt(var + 1e-5)
        
        # Scale and shift
        if self.bias is not None:
            x = self.weight * x + self.bias
        else:
            x = self.weight * x
            
        return x
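
As a sanity check, the normalization step can be compared against PyTorch's built-in `F.layer_norm`. This is a small sketch with toy tensor sizes chosen for illustration; it repeats the arithmetic of the forward pass above with weight 1 and bias 0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, sequence_length, n_embd), toy sizes

# Manual normalization, mirroring the forward pass above (weight=1, bias=0)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# PyTorch's built-in functional layer norm over the last dimension
builtin = F.layer_norm(x, normalized_shape=(8,), eps=1e-5)

max_diff = (manual - builtin).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two results should agree up to floating-point error, which suggests that `nn.LayerNorm` could be used as a drop-in replacement when a custom module is not needed.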

### 4.2 Multi-Head Causal Self-Attention

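GPT2 relies on causal (masked) self-attention, the same masking used in the decoder of the original transformer. Each token can attend only to itself and earlier tokens, never to future ones, which is essential for autoregressive generation. The implementation below computes the query, key, and value projections for all heads with a single linear layer, then applies a lower-triangular mask before the softmax.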

In [4]:
class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention module"""
    
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        
        # Key, query, value projections
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=True)
        
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=True)
        
        # Regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        
        # Save hyperparameters
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_size = config.n_embd // config.n_head
        
        # Register a buffer for the causal mask to avoid future tokens
        self.register_buffer(
            "mask", 
            torch.tril(torch.ones(config.n_positions, config.n_positions))
            .view(1, 1, config.n_positions, config.n_positions)
        )
        
    def forward(self, x):
        # x shape: (batch, sequence_length, embedding_dimension)
        batch_size, sequence_length, _ = x.size()
        
        # Calculate query, key, values for all heads in batch
        # (batch, sequence_length, 3*n_embd)
        qkv = self.c_attn(x)
        
        # Split into query, key, value and heads
        # Each has shape (batch, n_head, sequence_length, head_size)
        q, k, v = qkv.split(self.n_embd, dim=2)
        q = q.view(batch_size, sequence_length, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(batch_size, sequence_length, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(batch_size, sequence_length, self.n_head, self.head_size).transpose(1, 2)
        
        # Compute attention scores
        # (batch, n_head, sequence_length, sequence_length)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        
        # Apply causal mask - sets attention scores for future tokens to -inf
        att = att.masked_fill(self.mask[:, :, :sequence_length, :sequence_length] == 0, float('-inf'))
        
        # Apply softmax and dropout
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        
        # Apply attention to values
        # (batch, n_head, sequence_length, head_size)
        y = att @ v
        
        # Restore original shape and project back
        # (batch, sequence_length, n_embd)
        y = y.transpose(1, 2).contiguous().view(batch_size, sequence_length, self.n_embd)
        
        # Output projection and dropout
        y = self.resid_dropout(self.c_proj(y))
        
        return y
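
One way to convince yourself that the mask works is to perturb the last token and confirm that earlier outputs do not move. The sketch below uses a standalone single-head function, `causal_attention`, written for this check (an assumption, not the module above), with toy tensor sizes.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    att = att.masked_fill(~mask, float("-inf"))
    return F.softmax(att, dim=-1) @ v

torch.manual_seed(0)
x = torch.randn(1, 5, 8)  # toy (batch, sequence_length, head_size)
out_a = causal_attention(x, x, x)

# Perturb only the last position
x_b = x.clone()
x_b[:, -1, :] += 1.0
out_b = causal_attention(x_b, x_b, x_b)

# Earlier positions must be unaffected; the last position should change
prefix_unchanged = torch.allclose(out_a[:, :-1], out_b[:, :-1])
last_changed = not torch.allclose(out_a[:, -1], out_b[:, -1])
print(prefix_unchanged, last_changed)
```

The same test can be applied to the full `CausalSelfAttention` module, provided dropout is disabled by calling `.eval()` first.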

### 4.3 MLP Block

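The MLP (Multi-Layer Perceptron) block in GPT2 consists of two linear layers with a GELU activation function in between. The first layer expands the embedding dimension by a factor of 4 (768 → 3072 for GPT2 small), and the second projects it back.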

In [5]:
class MLP(nn.Module):
    """MLP module with GELU activation"""
    
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=True)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=True)
        self.dropout = nn.Dropout(config.dropout)
        
        # Choose activation function
        if config.activation_function == "gelu":
            self.activation = F.gelu
        elif config.activation_function == "relu":
            self.activation = F.relu
        else:
            raise ValueError(f"Unknown activation function: {config.activation_function}")
    
    def forward(self, x):
        # First linear layer and activation
        x = self.c_fc(x)
        x = self.activation(x)
        
        # Second linear layer and dropout
        x = self.c_proj(x)
        x = self.dropout(x)
        
        return x
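
To see how GELU differs from ReLU, the following dependency-free sketch evaluates both on a few sample inputs. It also includes the tanh approximation of GELU, which the original GPT2 code used; PyTorch's `F.gelu` computes the exact form by default. The helper names are illustrative.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  gelu={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative values through, and the two GELU variants agree closely over this range.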

### 4.4 Transformer Block

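The transformer block combines the attention and MLP blocks with layer normalization and residual connections. Layer normalization is applied to the input of each sub-block (pre-normalization), and the sub-block's output is added back to the unnormalized input.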

In [6]:
class Block(nn.Module):
    """Transformer block: communication followed by computation"""
    
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd)
        self.mlp = MLP(config)
    
    def forward(self, x):
        # Self-attention with residual connection
        x = x + self.attn(self.ln_1(x))
        
        # MLP with residual connection
        x = x + self.mlp(self.ln_2(x))
        
        return x
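
The difference between this pre-norm arrangement and the post-norm arrangement of the original transformer comes down to where the normalization sits relative to the residual addition. A minimal sketch, using a single `nn.Linear` as a stand-in for the attention or MLP sub-block (an assumption made purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8
ln = nn.LayerNorm(n_embd)
sublayer = nn.Linear(n_embd, n_embd)  # stand-in for attention or the MLP

x = torch.randn(2, 4, n_embd)

# Pre-norm (as in the Block above): normalize the sub-block input,
# leaving the residual path untouched
pre_norm_out = x + sublayer(ln(x))

# Post-norm (original transformer): normalize after adding the residual
post_norm_out = ln(x + sublayer(x))

print(pre_norm_out.shape, post_norm_out.shape)
```

In the pre-norm form the input `x` passes through the block unchanged on the residual path, which is generally considered to make deep stacks easier to train.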

## 5. Complete GPT2 Model

Now let's put everything together to create the complete GPT2 model.

In [7]:
class GPT2(nn.Module):
    """GPT-2 Language Model"""
    
    def __init__(self, config):
        super().__init__()
        
        # Token embedding
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        
        # Position embedding
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        
        # Dropout
        self.drop = nn.Dropout(config.dropout)
        
        # Transformer blocks
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        
        # Final layer normalization
        self.ln_f = LayerNorm(config.n_embd)
        
        # Language modeling head
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Weight tying: the token embedding matrix is tied to the LM head weight matrix
        self.wte.weight = self.lm_head.weight
        
        # Initialize weights
        self.apply(self._init_weights)
        
        # Save hyperparameters
        self.config = config
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        # idx: (batch, sequence_length) of token indices
        device = idx.device
        batch_size, sequence_length = idx.size()
        assert sequence_length <= self.config.n_positions, f"Sequence length ({sequence_length}) exceeds model's maximum length ({self.config.n_positions})"
        
        # Get token embeddings
        # (batch, sequence_length, n_embd)
        token_embeddings = self.wte(idx)
        
        # Get position embeddings
        # Create position ids tensor: 0, 1, 2, ..., sequence_length-1
        position_ids = torch.arange(0, sequence_length, dtype=torch.long, device=device).unsqueeze(0)
        
        # (batch, sequence_length, n_embd)
        position_embeddings = self.wpe(position_ids)
        
        # Add token and position embeddings
        x = self.drop(token_embeddings + position_embeddings)
        
        # Apply transformer blocks
        for block in self.blocks:
            x = block(x)
        
        # Apply final layer normalization
        x = self.ln_f(x)
        
        # Language modeling logits
        # (batch, sequence_length, vocab_size)
        logits = self.lm_head(x)
        
        # Calculate loss if targets are provided
        loss = None
        if targets is not None:
            # Reshape logits to (batch*sequence_length, vocab_size)
            logits_view = logits.view(-1, logits.size(-1))
            # Reshape targets to (batch*sequence_length,)
            targets_view = targets.view(-1)
            # Cross entropy loss
            loss = F.cross_entropy(logits_view, targets_view)
        
        return logits, loss
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate new tokens beyond the context provided in idx"""
        # idx: (batch, sequence_length) of token indices
        for _ in range(max_new_tokens):
            # Crop the sequence to the maximum allowed length
            idx_cond = idx if idx.size(1) <= self.config.n_positions else idx[:, -self.config.n_positions:]
            
            # Get model predictions
            logits, _ = self(idx_cond)
            
            # Focus on the last token's prediction
            logits = logits[:, -1, :] / temperature
            
            # Optional: top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            
            # Append to the existing sequence
            idx = torch.cat((idx, idx_next), dim=1)
        
        return idx
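
The effect of `temperature` and `top_k` in `generate` is easier to see on a tiny example. The sketch below reimplements the same two steps in plain Python on made-up logits; `next_token_probs` is a hypothetical helper, not part of the model.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Apply temperature scaling, then optional top-k filtering, then softmax."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep logits >= the k-th largest; everything else gets zero probability
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [l if l >= kth else float("-inf") for l in scaled]
    return softmax(scaled)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration
sharp = next_token_probs(logits, temperature=0.5)
flat = next_token_probs(logits, temperature=2.0)
top2 = next_token_probs(logits, top_k=2)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in top2])
```

Lower temperatures concentrate probability on the most likely token, higher temperatures flatten the distribution, and top-k removes the tail entirely before sampling.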

## 6. Creating a Model Instance

Let's create an instance of our GPT2 model with the small configuration:

In [8]:
model = GPT2(config).to(device)

# Calculate number of parameters
num_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {num_params:,}")

Number of parameters: 124,439,808
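
The printed total can be reproduced by hand from the hyperparameters. The sketch below is plain arithmetic over the layer shapes defined earlier; the token embedding is counted once because it shares its weight with `lm_head`, and the causal mask is a buffer rather than a parameter.

```python
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias

wte = vocab_size * n_embd   # token embeddings (shared with lm_head, counted once)
wpe = n_positions * n_embd  # position embeddings
layer_norm = 2 * n_embd     # weight and bias

attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = 2 * layer_norm + attn + mlp

total = wte + wpe + n_layer * block + layer_norm  # last term is the final ln_f
print(f"{total:,}")  # 124,439,808
```

Roughly two thirds of the parameters sit in the twelve transformer blocks, and most of the remainder in the token embedding matrix.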


## 7. Basic Training Loop

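Let's implement a basic training loop function for our GPT2 model. The dataloader is expected to yield `(inputs, targets)` batches of token ids, where `targets` is `inputs` shifted one position to the left so that each position is trained to predict the next token: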

In [9]:
def train(model, dataloader, optimizer, max_epochs):
    """Basic training loop for GPT2"""
    model.train()
    losses = []
    
    for epoch in range(max_epochs):
        epoch_losses = []
        for batch_idx, (inputs, targets) in enumerate(dataloader):
            # Move data to device
            inputs = inputs.to(device)
            targets = targets.to(device)
            
            # Forward pass
            _, loss = model(inputs, targets)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # Record loss
            loss_value = loss.item()
            epoch_losses.append(loss_value)
            
            # Print progress
            if batch_idx % 10 == 0:
                print(f"Epoch: {epoch+1}/{max_epochs}, Batch: {batch_idx}, Loss: {loss_value:.4f}")
        
        # Record average epoch loss
        avg_loss = sum(epoch_losses) / len(epoch_losses)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1} Average Loss: {avg_loss:.4f}")
    
    # Plot training loss
    plt.figure(figsize=(10, 6))
    plt.plot(losses)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training Loss')
    plt.grid(True)
    plt.show()
    
    return losses
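
The training loop assumes that each batch pairs inputs with next-token targets. A minimal sketch of how such pairs might be cut from a token stream follows; `make_examples` and the toy token list are illustrative assumptions, and a real pipeline would wrap the result in a `Dataset` and `DataLoader`.

```python
import math

def make_examples(token_ids, block_size):
    """Split a token stream into (inputs, targets) pairs, with targets shifted by one."""
    examples = []
    for start in range(0, len(token_ids) - block_size, block_size):
        chunk = token_ids[start : start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

tokens = list(range(10))  # stand-in for real token ids
examples = make_examples(tokens, block_size=4)
for inputs, targets in examples:
    print(inputs, "->", targets)

# Sanity check for early training: a model that predicts uniformly over the
# vocabulary has cross-entropy ln(vocab_size)
uniform_loss = math.log(50257)
print(f"{uniform_loss:.2f}")  # 10.82
```

A freshly initialized model should start with a loss in the neighborhood of this uniform baseline; a much larger initial value often points to an initialization or data problem.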

## 8. Text Generation with GPT2

We've already implemented a generation method in our GPT2 class. Let's create a helper function to actually generate text using a trained model and a tokenizer:

In [10]:
def generate_text(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0, top_k=40):
    """Generate text using the GPT2 model"""
    model.eval()  # Set model to evaluation mode
    
    # Tokenize the prompt
    prompt_tokens = tokenizer.encode(prompt)
    input_ids = torch.tensor([prompt_tokens], dtype=torch.long).to(device)
    
    # Generate new tokens
    with torch.no_grad():
        output_ids = model.generate(
            input_ids, 
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )
    
    # Decode the generated tokens
    generated_text = tokenizer.decode(output_ids[0].tolist())
    
    return generated_text
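
`generate_text` only requires a tokenizer object with `encode` and `decode` methods. For quick experiments, a character-level tokenizer like the hypothetical `CharTokenizer` below would satisfy that interface, assuming the model's `vocab_size` is at least as large as the number of distinct characters.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer exposing encode/decode."""

    def __init__(self, text):
        self.chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.chars[i] for i in ids)

tokenizer = CharTokenizer("hello world")
ids = tokenizer.encode("hello")
print(ids, tokenizer.decode(ids))  # [3, 2, 4, 4, 5] hello
```

To match GPT2's 50,257-token vocabulary, a byte-level BPE tokenizer would be needed instead (for example, the `gpt2` encoding in the `tiktoken` package).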

## 9. Key Architectural Innovations in GPT2

GPT2 made several changes relative to the original GPT and the original transformer:

1. **Layer Normalization Position**: GPT2 uses a modified pre-normalization scheme, where layer normalization is applied before the self-attention and feed-forward blocks, rather than after as in the original transformer.

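2. **Scaled Initialization**: To account for accumulation along the residual path in deeper models, the GPT2 paper scales the weights of residual layers at initialization by 1/√N, where N is the number of residual layers. Our `_init_weights` method uses a fixed standard deviation of 0.02 and omits this scaling for simplicity.

3. **Increased Context Length**: GPT2 handles sequences of up to 1024 tokens, double the 512 tokens of the original GPT.

4. **Larger Vocabulary**: The vocabulary was expanded to 50,257 tokens: 256 byte-level base tokens, 50,000 BPE merges, and one end-of-text token.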

5. **Byte Pair Encoding (BPE)**: GPT2 uses BPE tokenization with byte-level operations, allowing it to encode any text without out-of-vocabulary tokens.

6. **Sparse Attention Patterns (later models)**: GPT2 itself uses dense causal attention in every layer. GPT-3 alternates dense and locally banded sparse attention patterns to reduce computational cost.
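
The depth-dependent initialization in item 2 can be sketched as simple arithmetic. The GPT2 paper describes scaling residual-layer weights by 1/√N; the code below assumes N = 2 × `n_layer` (one attention and one MLP residual addition per block), which is a common interpretation rather than something stated in this notebook. `residual_proj_std` is a hypothetical helper.

```python
import math

base_std = 0.02  # same base std as _init_weights above

def residual_proj_std(n_layer, base_std=base_std):
    # Assumption: N = 2 * n_layer residual additions (attention + MLP per block)
    return base_std / math.sqrt(2 * n_layer)

scaled_stds = {n_layer: residual_proj_std(n_layer) for n_layer in (12, 24, 36, 48)}
for n_layer, std in scaled_stds.items():
    print(f"n_layer={n_layer}: std={std:.5f}")
```

Under this assumption, the scaled standard deviation would be applied only to the two `c_proj` layers in each block, since those feed directly into the residual stream.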

## 10. Why GPT2 Architecture Works Well

The GPT2 architecture is effective for several reasons:

1. **Pretraining and Transfer Learning**: By learning from a vast corpus of text, the model acquires general language understanding that can be adapted to various tasks.

2. **Causal Attention Mechanism**: Causal attention matches the language-modeling objective. Because no position ever sees later tokens, a single forward pass yields a valid next-token prediction at every position in the sequence.

3. **Layer Normalization and Residual Connections**: These help stabilize training and allow for building deeper networks.

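4. **Scaled Dot-Product Attention**: Every token can weigh every earlier token directly in a single step, and dividing the scores by the square root of the head size keeps the softmax inputs in a well-behaved range.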

5. **Uniform Stacked Blocks**: Every layer uses the same block structure, each with its own parameters, so depth can be scaled by changing `n_layer` alone and successive blocks can build increasingly abstract representations.

6. **Weight Tying**: Sharing one weight matrix between the input embedding and the output layer removes about 38.6M parameters from GPT2 small (50,257 × 768) and often improves language-modeling performance.
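
Weight tying can be verified directly in PyTorch. The sketch below uses toy sizes chosen for illustration and checks that, after tying, the embedding and the output layer refer to the same underlying parameter.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 16  # toy sizes for illustration

wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# Both modules now reference the same Parameter object
wte.weight = lm_head.weight

shared = wte.weight.data_ptr() == lm_head.weight.data_ptr()

# Count each distinct parameter once, keyed by object identity
unique_params = {id(p): p.numel() for p in list(wte.parameters()) + list(lm_head.parameters())}
total = sum(unique_params.values())
print(shared, total)  # True 1600
```

Tying works without a transpose because `nn.Embedding` stores its weight as `(vocab_size, n_embd)` and `nn.Linear(n_embd, vocab_size)` stores its weight in the same shape.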

## 11. Conclusion

In this notebook, we've explored the GPT2 architecture, implemented its key components, and discussed why it's effective for language modeling and generation tasks. The GPT architecture has been foundational for modern language models and continues to evolve in models like GPT-3, GPT-4, and beyond.

In the next notebooks, we'll dive deeper into advanced transformer components and techniques for fine-tuning these models for specific tasks.