# GPT-2 Small (124M) Architecture

This notebook implements the **GPT-2 Small** model architecture from scratch using PyTorch.

## Model Overview
| Component | Value |
|-----------|-------|
| Parameters | ~124 Million |
| Embedding Dimension | 768 |
| Attention Heads | 12 |
| Transformer Layers | 12 |
| Context Length | 1024 tokens |
| Vocabulary Size | 50,257 |

## Architecture Components
1. **Token + Positional Embeddings** → Input representation
2. **Multi-Head Self-Attention** → Captures relationships between tokens
3. **Feed-Forward Network** → Processes each position independently
4. **Layer Normalization** → Stabilizes training
5. **Residual Connections** → Enables gradient flow

## 1. Configuration

In [7]:
import torch
import torch.nn as nn

# GPT-2 Small (124M) Configuration
# These hyperparameters define the model architecture
GPT2_CONFIG = {
    "vocab_size": 50257,      # BPE vocabulary size (50,000 merges + 256 bytes + 1 special token)
    "context_length": 1024,   # Maximum sequence length the model can process
    "emb_dim": 768,           # Embedding dimension (hidden size)
    "n_heads": 12,            # Number of attention heads (768/12 = 64 dim per head)
    "n_layers": 12,           # Number of transformer blocks
    "drop_rate": 0.1,         # Dropout rate for regularization
    "qkv_bias": True          # Whether to use bias in Q, K, V projections
}
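
As a rough cross-check of the ~124M figure in the overview table, the parameter count can be derived from these hyperparameters alone. The sketch below is not part of the original notebook: `count_gpt2_params` is a hypothetical helper, and it assumes the layer layout used by the classes in this notebook (biases on the attention output projection and both feed-forward layers, no bias on the output head).

```python
# Hypothetical helper: count parameters from the config alone, without building the model.
def count_gpt2_params(cfg, tie_weights=False):
    d, v, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    qkv_bias = d if cfg["qkv_bias"] else 0

    tok_emb = v * d      # token embedding table
    pos_emb = ctx * d    # positional embedding table

    # Attention: three d x d projections (optional bias) plus an output projection with bias
    attn = 3 * (d * d + qkv_bias) + (d * d + d)
    # Feed-forward: d -> 4d -> d, both layers with bias
    ff = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * (2 * d)
    block = attn + ff + norms

    final_norm = 2 * d
    out_head = 0 if tie_weights else v * d  # assumed bias-free output head

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


cfg = {
    "vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": True,
}

head_dim = cfg["emb_dim"] // cfg["n_heads"]
print(head_dim)                                          # 64
print(f"{count_gpt2_params(cfg):,}")                     # 163,037,184
print(f"{count_gpt2_params(cfg, tie_weights=True):,}")   # 124,439,808
```

The untied total is larger than 124M because the output head stores its own `vocab_size × emb_dim` matrix; the ~124M figure corresponds to sharing that matrix with the token embedding.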

## 2. Multi-Head Self-Attention

The core mechanism that allows tokens to "attend" to other tokens in the sequence.

**Key Operations:**
- **Q, K, V Projections**: Transform input into Query, Key, Value vectors
- **Scaled Dot-Product**: `Attention(Q,K,V) = softmax(QK^T / √d_k) × V`
- **Causal Masking**: Prevents attending to future tokens (autoregressive)
- **Multi-Head**: Runs attention in parallel across multiple "heads" for richer representations
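
The operations listed above can be sketched for a single head on a toy input. This is a minimal illustration, not the notebook's implementation: the sizes `n` and `d_k` are arbitrary assumptions, and the random tensors stand in for the projected Q, K, V.

```python
import torch

torch.manual_seed(0)
n, d_k = 4, 8                      # toy sizes, chosen only for illustration
q = torch.randn(n, d_k)            # stand-ins for projected queries, keys, values
k = torch.randn(n, d_k)
v = torch.randn(n, d_k)

scores = (q @ k.T) / d_k ** 0.5    # (n, n) scaled similarity scores

# Causal mask: True strictly above the diagonal, i.e. at future positions
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(scores, dim=-1)   # each row sums to 1; future positions get weight 0
out = weights @ v                         # (n, d_k) weighted mix of value vectors

print(weights)
```

Because the first token can only attend to itself, the first row of `weights` puts all of its mass on position 0, and `out[0]` equals `v[0]`.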

In [8]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Dimension per head (768/12 = 64)
        
        # Linear projections for Query, Key, Value
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        
        # Output projection to combine all heads
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        
        # Causal mask: 1s strictly above the diagonal mark future positions,
        # which forward() fills with -inf so they receive zero attention weight
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, n, _ = x.shape  # batch_size, num_tokens, embedding_dim
        
        # Project to Q, K, V and reshape for multi-head: (b, n, d) -> (b, heads, n, head_dim)
        q = self.W_query(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention: (Q @ K^T) / sqrt(d_k)
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        
        # Apply causal mask (set future positions to -inf before softmax)
        attn = attn.masked_fill(self.mask[:n, :n].bool(), float("-inf"))
        
        # Softmax to get attention weights, then apply dropout
        attn = self.dropout(torch.softmax(attn, dim=-1))
        
        # Apply attention to values and reshape back: (b, heads, n, head_dim) -> (b, n, d_out)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, n, self.d_out)
        
        return self.out_proj(out)  # Final linear projection

## 3. Layer Normalization, GELU & Feed-Forward Network

**LayerNorm**: Normalizes across the embedding dimension (not batch), stabilizing training.

**GELU (Gaussian Error Linear Unit)**: Smooth activation function used in GPT-2.
- Formula: `GELU(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))`

**FeedForward**: Two-layer MLP that expands (×4) then contracts the dimension.
- `768 → 3072 → 768`
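
To see how closely the tanh formula above tracks the exact definition `GELU(x) = x · Φ(x)`, the two can be compared on a few scalar inputs. This is a small standard-library sketch; the function names `gelu_exact` and `gelu_tanh` are made up for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # Exact definition: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation, matching the formula given above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    diff = abs(gelu_exact(x) - gelu_tanh(x))
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}  |diff|={diff:.2e}")
```

The two curves agree closely over this range, which is why the approximation can be swapped in without noticeably changing the model's behavior.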

In [9]:
class LayerNorm(nn.Module):
    """Layer Normalization - normalizes across embedding dimension"""
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Small constant for numerical stability
        self.scale = nn.Parameter(torch.ones(emb_dim))   # Learnable gain (gamma)
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # Learnable bias (beta)

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)                    # Mean across embedding dim
        var = x.var(dim=-1, keepdim=True, unbiased=False)      # Variance across embedding dim
        norm_x = (x - mean) / torch.sqrt(var + self.eps)       # Normalize
        return self.scale * norm_x + self.shift                 # Scale and shift


class GELU(nn.Module):
    """Gaussian Error Linear Unit - smooth activation function"""
    def forward(self, x):
        # Approximation used in GPT-2 (faster than exact GELU)
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x ** 3)
        ))


class FeedForward(nn.Module):
    """Position-wise Feed-Forward Network (MLP)"""
    def __init__(self, cfg):
        super().__init__()
        # Expand to 4x, apply GELU, then project back
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # 768 -> 3072
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])   # 3072 -> 768
        )

    def forward(self, x):
        return self.layers(x)

## 4. Transformer Block

Each transformer block follows the **Pre-Norm** architecture (GPT-2 style):

```
x → LayerNorm → Attention → Dropout → + (residual)
                                      ↓
x → LayerNorm → FeedForward → Dropout → + (residual) → output
```

**Residual connections** allow gradients to flow directly through the network, enabling training of deep models.
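
One way to see what the Pre-Norm ordering buys is to silence the sublayer and look at what remains. The sketch below is an illustration under assumptions, not the notebook's block: `nn.LayerNorm` and a zero-initialized `nn.Linear` stand in for the custom `LayerNorm` and for the attention or feed-forward sublayer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
x = torch.randn(2, 4, d)          # (batch, tokens, emb_dim), toy sizes

norm = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)        # stand-in for attention or the feed-forward network
nn.init.zeros_(sublayer.weight)   # zero the sublayer so only the residual path carries signal
nn.init.zeros_(sublayer.bias)

with torch.no_grad():
    pre_norm = x + sublayer(norm(x))    # Pre-Norm: normalize inside the residual branch
    post_norm = norm(x + sublayer(x))   # Post-Norm: normalize after the residual sum

print(torch.equal(pre_norm, x))         # True: the skip path passes x through unchanged
print(torch.allclose(post_norm, x))     # normalization rescales x, so this generally differs
```

With Pre-Norm, the input reaches the output through an unmodified identity path, so gradients have a direct route through every block. With Post-Norm, the normalization sits on that path.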

In [10]:
class TransformerBlock(nn.Module):
    """Single Transformer Block with Pre-LayerNorm (GPT-2 style)"""
    def __init__(self, cfg):
        super().__init__()
        # Multi-head self-attention
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            dropout=cfg["drop_rate"],
            num_heads=cfg["n_heads"],
            qkv_bias=cfg["qkv_bias"]
        )
        # Feed-forward network
        self.ff = FeedForward(cfg)
        
        # Layer normalization (applied before attention and FFN)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        
        # Dropout for residual connections
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Pre-norm + Attention + Residual connection
        x = x + self.drop(self.att(self.norm1(x)))
        
        # Pre-norm + FFN + Residual connection
        x = x + self.drop(self.ff(self.norm2(x)))
        
        return x

## 5. Complete GPT Model

The full GPT-2 architecture:

1. **Token Embedding**: Converts token IDs → vectors (50257 × 768)
2. **Positional Embedding**: Adds position information (1024 × 768)
3. **Transformer Blocks**: 12 stacked blocks for deep processing
4. **Final LayerNorm**: Normalizes before output projection
5. **Output Head**: Projects back to vocabulary size for next-token prediction
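
Steps 1 and 2 combine by broadcasting: the positional embedding has no batch dimension, so the same position vectors are added to every sequence in the batch. A minimal sketch with toy sizes (the values below are illustrative assumptions, not the GPT-2 configuration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, context_length, emb_dim = 100, 16, 8   # toy sizes, not the GPT-2 values

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

idx = torch.randint(0, vocab_size, (2, 5))         # (batch=2, seq_len=5) token IDs
tok = tok_emb(idx)                                 # (2, 5, 8): one vector per token
pos = pos_emb(torch.arange(idx.shape[1]))          # (5, 8): one vector per position
x = tok + pos                                      # broadcasts over the batch -> (2, 5, 8)

print(tok.shape, pos.shape, x.shape)
```

Because positions are looked up with `torch.arange(seq_len)`, any input longer than `context_length` would index past the end of the positional table, which is why the context length caps the sequence length.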

In [11]:
class GPTModel(nn.Module):
    """Complete GPT-2 Model"""
    def __init__(self, cfg):
        super().__init__()
        # Token embedding: vocab_size -> emb_dim
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        
        # Positional embedding: context_length -> emb_dim
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        
        # Embedding dropout
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        # Stack of transformer blocks
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        
        # Final layer normalization
        self.final_norm = LayerNorm(cfg["emb_dim"])
        
        # Output projection: emb_dim -> vocab_size (no bias, often tied with tok_emb)
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, idx):
        batch_size, seq_len = idx.shape
        
        # Get token embeddings
        tok_emb = self.tok_emb(idx)  # (batch, seq_len, emb_dim)
        
        # Get positional embeddings
        pos_emb = self.pos_emb(torch.arange(seq_len, device=idx.device))  # (seq_len, emb_dim)
        
        # Combine embeddings and apply dropout
        x = self.drop_emb(tok_emb + pos_emb)
        
        # Pass through transformer blocks
        x = self.trf_blocks(x)
        
        # Final normalization
        x = self.final_norm(x)
        
        # Project to vocabulary size (logits)
        logits = self.out_head(x)  # (batch, seq_len, vocab_size)
        
        return logits

## 6. Model Verification

Let's verify the model is correctly built by:
1. Counting total parameters (~124M expected)
2. Testing with a sample input
3. Checking output shape

In [13]:
# Initialize model
model = GPTModel(GPT2_CONFIG)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"✅ Total parameters: {total_params:,}")

# GPT-2 shares one matrix between tok_emb and out_head (weight tying).
# This model keeps them separate, so subtract the duplicate to get the comparable count.
params_with_tying = total_params - model.out_head.weight.numel()
print(f"✅ With weight tying: {params_with_tying:,} (~124M)")

# Test with sample input
batch_size, seq_len = 2, 64
sample_input = torch.randint(0, GPT2_CONFIG["vocab_size"], (batch_size, seq_len))

# Forward pass
model.eval()
with torch.no_grad():
    output = model(sample_input)

# Verify output shape
expected_shape = (batch_size, seq_len, GPT2_CONFIG["vocab_size"])
print(f"\n✅ Input shape: {sample_input.shape}")
print(f"✅ Output shape: {output.shape}")
print(f"   Expected: {expected_shape}")

# Verify correctness
assert output.shape == expected_shape, "❌ Output shape mismatch!"
print("\n🎉 Model architecture is correct!")

✅ Total parameters: 163,037,184
✅ With weight tying: 124,439,808 (~124M)

✅ Input shape: torch.Size([2, 64])
✅ Output shape: torch.Size([2, 64, 50257])
   Expected: (2, 64, 50257)

🎉 Model architecture is correct!
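
The "with weight tying" figure above is obtained by subtraction; the model as defined still holds two separate `vocab_size × emb_dim` matrices. A minimal sketch of actual tying follows, using a hypothetical toy class (`TinyTiedLM` is not part of this notebook) and assuming the output head has no bias, so that its weight has the same shape as the embedding table.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Hypothetical toy model used only to illustrate weight tying."""
    def __init__(self, vocab_size, emb_dim, tie_weights=True):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
        if tie_weights:
            # Both weights have shape (vocab_size, emb_dim), so one Parameter can serve both
            self.out_head.weight = self.tok_emb.weight

    def forward(self, idx):
        return self.out_head(self.tok_emb(idx))

def n_params(module):
    # Module.parameters() yields each shared Parameter once
    return sum(p.numel() for p in module.parameters())

untied = TinyTiedLM(100, 8, tie_weights=False)
tied = TinyTiedLM(100, 8, tie_weights=True)

print(n_params(untied))  # 1600
print(n_params(tied))    # 800
```

Applied to the full model, the same assignment (`out_head.weight = tok_emb.weight`) would make the counted total match the ~124M figure directly, at the cost of coupling the input and output representations.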
