# Chapter 4: The Complete GPT Model

Welcome to the fourth notebook in our LLM from Scratch series! In this chapter, we'll bring everything together and build the **complete GPT model** - a fully functional transformer language model.

## What You'll Learn

1. **Token and positional embeddings**: Converting discrete tokens to continuous vectors
2. **Weight tying**: Sharing parameters between embeddings and output
3. **Scaled initialization**: Ensuring stable training for deep networks
4. **Complete forward pass**: From token IDs to next-token predictions
5. **Loss computation**: Cross-entropy for language modeling
6. **Model architecture**: Understanding GPT-2/GPT-3 design choices
7. **Hands-on experimentation**: Building and using a real GPT model

This is where all the pieces come together!

## 1. GPT Architecture Overview

The complete GPT model has the following components:

```
Input Token IDs: [15496, 11, 995, 0]  ("Hello, world!")
        ↓
┌───────────────────────────────────────┐
│ Token Embedding (vocab_size, d_model) │
│    + Position Embedding (max_seq, d)  │
│    + Dropout                          │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│ Transformer Block 1                   │
│  - Multi-Head Attention               │
│  - Feedforward Network                │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│ Transformer Block 2                   │
└───────────────────────────────────────┘
       ...
┌───────────────────────────────────────┐
│ Transformer Block N                   │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│ Final LayerNorm                       │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│ Output Projection (d_model, vocab)    │
│ (shares weights with Token Embedding) │
└───────────────────────────────────────┘
        ↓
Logits: [batch, seq_len, vocab_size]
        ↓
Softmax → Next Token Probabilities
```

## 2. Token Embeddings

**Token embeddings** convert discrete token IDs into continuous vectors.

### Why We Need Them:

- Neural networks process **continuous values**, not discrete IDs
- A token ID such as 1234 is an arbitrary index, not a quantity: ID 1235 is no closer in meaning to it than ID 42
- Embeddings learn **semantic representations** where similar words have similar vectors

### How They Work:

```python
# Embedding layer is a lookup table
embedding = nn.Embedding(num_embeddings=50257, embedding_dim=768)

# Shape: (vocab_size, d_model) = (50257, 768)
# Each token has a 768-dimensional vector

# Lookup: convert token ID → vector
token_id = 1234
vector = embedding(torch.tensor([token_id]))  # → (1, 768)
```

### Key Properties:

- **Learnable**: Vectors are learned during training
- **Dense**: Each dimension can be any real number
- **Semantic**: Similar words end up with similar vectors
- **Large**: vocab_size × d_model parameters (e.g., 50,257 × 768 = 38,597,376 ≈ 38.6M params)
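
To make the "lookup table" idea concrete, here is a minimal sketch using toy sizes (the 10-token vocabulary and 4-dimensional vectors are assumptions chosen for readability, not values from this series). It checks that calling an `nn.Embedding` is the same as indexing rows of its weight matrix, and also the same as multiplying one-hot vectors by that matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model = 10, 4          # toy sizes, chosen only for illustration
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

token_ids = torch.tensor([3, 7, 3])  # the same ID appears twice on purpose

# 1) Calling the layer is plain row indexing into the weight matrix.
looked_up = embedding(token_ids)                 # (3, d_model)
indexed = embedding.weight[token_ids]            # (3, d_model)

# 2) Equivalent view: one-hot vectors multiplied by the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (3, vocab_size)
via_matmul = one_hot @ embedding.weight          # (3, d_model)

same_as_indexing = torch.equal(looked_up, indexed)
same_as_matmul = torch.allclose(looked_up, via_matmul)
n_params = sum(p.numel() for p in embedding.parameters())  # vocab_size * d_model
```

The one-hot view is useful for intuition, but the indexing path is what makes embeddings cheap: no `vocab_size`-wide multiplication is ever performed.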

## 3. Positional Embeddings

**Positional embeddings** encode position information into the model.

### Why We Need Them:

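Self-attention without positional information is **permutation equivariant**: reordering the input tokens only reorders the outputs.
- Each token in "cat sat mat" gets the same output vector as it does in "sat cat mat"
- Word order is crucial for language!
- So we need to inject position information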
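
The claim that attention cannot see word order can be checked numerically. The sketch below is a bare single-head attention with no mask and no positional signal; the function name `self_attention` and the random weights are illustrative assumptions, not the implementation in `src.llm`. Reordering the input rows only reorders the output rows.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no mask and no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_model = 3, 8
# float64 keeps the comparison free of rounding noise
x = torch.randn(seq_len, d_model, dtype=torch.float64)  # stand-ins for "cat sat mat"
w_q, w_k, w_v = (torch.randn(d_model, d_model, dtype=torch.float64) for _ in range(3))

perm = torch.tensor([1, 0, 2])  # reorder to "sat cat mat"

out = self_attention(x, w_q, w_k, w_v)
out_permuted = self_attention(x[perm], w_q, w_k, w_v)

# Each token gets the same output vector; only the row order changes.
is_equivariant = torch.allclose(out[perm], out_permuted)
```

Adding a position-dependent vector to each row of `x` before attention breaks this symmetry, which is exactly what positional embeddings are for.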

### Two Approaches:

#### 1. Sinusoidal (Original Transformer)
Fixed mathematical formula:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

- ✅ Defined for any position, so it can in principle be applied to sequences longer than those seen in training
- ❌ Fixed, cannot adapt to data
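
As a rough illustration of the sinusoidal formula, the following standard-library sketch builds the encoding table row by row. The helper name `sinusoidal_position_encoding` and the toy sizes are assumptions for this example only.

```python
import math

def sinusoidal_position_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table following the sin/cos formula."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos dimensions")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table

pe = sinusoidal_position_encoding(max_len=50, d_model=8)
```

Position 0 always encodes to alternating 0s and 1s (sin(0) and cos(0)), and low dimensions oscillate quickly while high dimensions change slowly, so every position gets a distinct pattern without any learned parameters.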

#### 2. Learned (GPT-2/GPT-3)
```python
# Learned embedding layer
pos_embedding = nn.Embedding(max_seq_len, d_model)
```
- ✅ Can learn optimal encodings for the data
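- ✅ Performs comparably in practice (the original Transformer paper reported nearly identical results for learned and sinusoidal encodings)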
- ❌ Limited to max_seq_len

### Our Implementation (GPT-2 Style):

```python
# Token embedding: which word is it?
token_emb = token_embedding(input_ids)  # (batch, seq_len, d_model)

# Position embedding: where in the sequence?
seq_len = input_ids.shape[1]
positions = torch.arange(seq_len, device=input_ids.device)  # [0, 1, 2, ..., seq_len-1]
pos_emb = position_embedding(positions)  # (seq_len, d_model)

# Combine
x = token_emb + pos_emb  # (batch, seq_len, d_model)
```
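
The snippet above relies on broadcasting: the position table has no batch dimension, yet it is added to every sequence in the batch. Here is a self-contained sketch with toy sizes (all sizes are assumptions for illustration) that runs the same three steps and exposes the shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, max_seq_len, d_model = 100, 16, 8   # toy sizes for illustration
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_len, d_model)

input_ids = torch.randint(0, vocab_size, (2, 5))  # (batch=2, seq_len=5)
batch_size, seq_len = input_ids.shape
assert seq_len <= max_seq_len, "sequence longer than the position table"

token_emb = token_embedding(input_ids)                          # (2, 5, 8)
positions = torch.arange(seq_len, device=input_ids.device)      # (5,)
pos_emb = position_embedding(positions)                         # (5, 8)

# (2, 5, 8) + (5, 8): pos_emb is broadcast over the batch dimension.
x = token_emb + pos_emb
```

The length check matters for learned positions: an index at or beyond `max_seq_len` has no row in the table and raises an error inside the embedding lookup.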

## 4. Weight Tying

**Weight tying** shares parameters between input embeddings and output projection.

### The Idea:

```python
# WITHOUT weight tying
token_embedding = nn.Embedding(vocab_size, d_model)              # vocab_size × d_model params
output_projection = nn.Linear(d_model, vocab_size, bias=False)   # vocab_size × d_model params
# Total: 2 × vocab_size × d_model parameters

# WITH weight tying
token_embedding = nn.Embedding(vocab_size, d_model)              # vocab_size × d_model params
output_projection = nn.Linear(d_model, vocab_size, bias=False)   # weight shape: (vocab_size, d_model)
output_projection.weight = token_embedding.weight                # SHARED!
# Total: 1 × vocab_size × d_model parameters (50% reduction!)
```

### Why This Makes Sense:

- **Input embedding**: "Which vector represents token X?"
- **Output projection**: "Which token does vector Y represent?"
- These are **closely related operations** (tokens → vectors, then vectors → token scores), so it makes sense to share weights!
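
A quick way to convince yourself that tying really shares one matrix is to check object identity and count unique parameters. This is a minimal sketch with toy sizes; it assumes a bias-free output layer, which may differ from how `GPTModel` is actually built.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 32   # toy sizes for illustration

token_embedding = nn.Embedding(vocab_size, d_model)
# bias=False so the output layer holds nothing besides the shared matrix.
output_projection = nn.Linear(d_model, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features) = (vocab_size, d_model),
# which matches the embedding table's shape, so the Parameter can be shared directly.
output_projection.weight = token_embedding.weight

shares_storage = output_projection.weight.data_ptr() == token_embedding.weight.data_ptr()

# Count unique parameters (by identity) across both layers.
unique = {id(p): p for m in (token_embedding, output_projection) for p in m.parameters()}
n_unique_params = sum(p.numel() for p in unique.values())

hidden = torch.randn(2, 5, d_model)      # stand-in for the final hidden states
logits = output_projection(hidden)       # (2, 5, vocab_size)
```

Because both layers hold the same `Parameter`, gradients from the input lookup and from the output projection accumulate into one tensor during training.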

### Benefits:

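1. ✅ **Reduces parameters**: Halves the combined embedding and output-projection parameters
2. ✅ **Better generalization**: Fewer parameters to fit, and every output-side gradient also updates the input embeddings
3. ✅ **Improved performance**: Press & Wolf (2017) report lower perplexity with tied embeddings
4. ✅ **Widely used**: GPT-2 and BERT tie these weights, although some later LLMs keep them separate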

See: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)

## 5. Scaled Initialization

**Scaled initialization** is crucial for training deep transformers.

### The Problem:

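In deep networks, activations can **explode** or **vanish** as they pass through many layers. The numbers below are illustrative (doubling per layer), not measurements:
```
Layer 1: x has std=1.0
Layer 2: x has std=2.0  (growing!)
Layer 3: x has std=4.0
...
Layer 96: x has std≈4e28  (exploded!)
```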

### The Solution (GPT-2/GPT-3):

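Scale the initial weights of residual projections by $\frac{1}{\sqrt{2N}}$, where N is the number of transformer blocks. The factor of 2 appears because each block adds to the residual stream twice: once after attention and once after the FFN.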

```python
import math

# Normal initialization
nn.init.normal_(weight, mean=0.0, std=0.02)

# Scaled initialization for residual projections
nn.init.normal_(weight, mean=0.0, std=0.02 / math.sqrt(2 * n_layers))
```

### Which Layers Get Scaled?

Only **residual projections**:
1. Attention output projection (`attn.out_proj`)
2. FFN output projection (`ffn.fc2`)

These are the layers that **add** to the residual stream.

### Why It Works:

- At initialization, residual paths contribute **less**
- Model starts close to **identity** (x ≈ x + 0)
- Gradually learns to add meaningful transformations
- Helps prevent activation growth in very deep networks (GPT-3 has 96 layers)
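
The effect of the $\frac{1}{\sqrt{2N}}$ factor can be sketched with simple variance bookkeeping. The toy model below is an assumption made for illustration, not a measurement of a real network: the residual stream starts with variance 1, each block adds two branch outputs that are independent of the stream, and each unscaled branch has standard deviation `branch_std`. Under those assumptions variances add, so the stream's variance grows linearly with depth unless the branches are scaled down.

```python
import math

def residual_stream_std(n_layers: int, branch_std: float = 1.0, scaled: bool = False) -> float:
    """Std of the residual stream after n_layers blocks under a toy model.

    Toy assumptions: the stream starts with variance 1, every block adds two
    independent branch outputs (attention and FFN), and each unscaled branch
    output has standard deviation branch_std.
    """
    n_residual_adds = 2 * n_layers
    scale = 1.0 / math.sqrt(n_residual_adds) if scaled else 1.0
    added_variance = n_residual_adds * (scale * branch_std) ** 2
    return math.sqrt(1.0 + added_variance)

for n in (6, 12, 96):
    unscaled = residual_stream_std(n)
    scaled = residual_stream_std(n, scaled=True)
    print(f"n_layers={n:>2}: unscaled std={unscaled:.2f}, scaled std={scaled:.2f}")
```

With scaling, the total variance added by all 2N branches equals that of a single unscaled branch, so the result no longer depends on depth. Real networks violate the independence assumption, so treat this as intuition rather than a prediction.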

## 6. Hands-On: Building a Complete GPT Model

Let's create and experiment with our GPT implementation!

In [None]:
import sys
sys.path.append('..')

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from src.llm import GPTModel, ModelConfig, Tokenizer

# Set random seed
torch.manual_seed(42)

print("PyTorch version:", torch.__version__)

### 6.1 Create a Small GPT Model

In [None]:
# Create a small GPT configuration
config = ModelConfig(
    vocab_size=50257,  # GPT-2 vocabulary
    max_seq_len=256,   # Maximum sequence length
    d_model=256,       # Model dimension
    n_layers=6,        # Number of transformer blocks
    n_heads=8,         # Number of attention heads
    d_ff=1024,         # Feedforward dimension (4 × d_model)
    dropout=0.1,       # Dropout probability
)

# Create model
model = GPTModel(config)

print(f"Model Configuration:")
print(f"  Vocabulary size: {config.vocab_size:,}")
print(f"  Max sequence length: {config.max_seq_len}")
print(f"  Model dimension: {config.d_model}")
print(f"  Number of layers: {config.n_layers}")
print(f"  Attention heads: {config.n_heads}")
print(f"  FFN dimension: {config.d_ff}")

print(f"\nModel Statistics:")
total_params = model.num_parameters()
non_emb_params = model.num_parameters(exclude_embeddings=True)
print(f"  Total parameters: {total_params:,}")
print(f"  Non-embedding parameters: {non_emb_params:,}")
print(f"  Embedding parameters: {total_params - non_emb_params:,}")
print(f"  Embedding %: {(total_params - non_emb_params) / total_params * 100:.1f}%")

### 6.2 Forward Pass: From Tokens to Logits

In [None]:
# Create tokenizer
tokenizer = Tokenizer()

# Encode some text
text = "The quick brown fox jumps over the lazy dog"
input_ids = tokenizer.encode(text)
print(f"Input text: {text}")
print(f"Token IDs: {input_ids}")
print(f"Num tokens: {len(input_ids)}")

# Convert to tensor
input_tensor = torch.tensor([input_ids])  # Add batch dimension
print(f"\nInput shape: {input_tensor.shape}  # (batch_size, seq_len)")

# Forward pass
model.eval()
with torch.no_grad():
    logits, _ = model(input_tensor)

print(f"\nOutput shape: {logits.shape}  # (batch_size, seq_len, vocab_size)")
print(f"Logits for last position: {logits[0, -1, :5]}... (first 5 vocab items)")

### 6.3 Next Token Prediction

In [None]:
# Get next token probabilities
next_token_logits = logits[0, -1, :]  # Last position
next_token_probs = torch.softmax(next_token_logits, dim=-1)

# Get top-5 most likely next tokens
top_k = 5
top_probs, top_indices = torch.topk(next_token_probs, top_k)

print(f"Top {top_k} most likely next tokens:")
print(f"{'Token':<30} {'Token ID':<12} {'Probability':<12}")
print("-" * 60)
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx.item()])
    print(f"{repr(token):<30} {idx.item():<12} {prob.item():<12.6f}")

### 6.4 Visualizing Token and Position Embeddings

In [None]:
# Get token embeddings for our input
with torch.no_grad():
    token_emb = model.token_embedding(input_tensor)  # (1, seq_len, d_model)
    
    seq_len = input_tensor.shape[1]
    positions = torch.arange(seq_len)
    pos_emb = model.position_embedding(positions)  # (seq_len, d_model)

# Visualize embeddings
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Token embeddings
im1 = ax1.imshow(token_emb[0].T.numpy(), aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
ax1.set_xlabel('Token Position', fontsize=12)
ax1.set_ylabel('Embedding Dimension', fontsize=12)
ax1.set_title('Token Embeddings', fontsize=14)
plt.colorbar(im1, ax=ax1, label='Value')

# Position embeddings
im2 = ax2.imshow(pos_emb.T.numpy(), aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
ax2.set_xlabel('Token Position', fontsize=12)
ax2.set_ylabel('Embedding Dimension', fontsize=12)
ax2.set_title('Position Embeddings', fontsize=14)
plt.colorbar(im2, ax=ax2, label='Value')

plt.tight_layout()
plt.show()

print("Observations:")
print("- Token embeddings: each token gets its own vector")
print("- Position embeddings: each position gets its own vector")
print("- Untrained model: both look like random noise; structure only emerges with training")
print("- Combined: Both 'what' and 'where' information")

### 6.5 Loss Computation for Language Modeling

In [None]:
# For language modeling, targets are inputs shifted by one position
# Input:  "The cat sat"
# Target: "cat sat on" (predict next token at each position)

# Create targets (an unshifted copy here, for illustration only)
input_ids = torch.as_tensor(input_ids).reshape(1, -1)  # (1, seq_len); safe to re-run
targets = input_ids.clone()

# In practice, we shift during training
# Input: input_ids[:, :-1]  (all except last)
# Target: input_ids[:, 1:]  (all except first)

# For this example, use same for both (not typical)
logits, loss = model(input_ids, targets=targets, return_loss=True)

print(f"Input shape: {input_ids.shape}")
print(f"Logits shape: {logits.shape}")
print(f"Loss: {loss.item():.4f}")
print(f"Perplexity: {torch.exp(loss).item():.2f}")

print("\nNote: High loss/perplexity is expected for untrained model!")
print("A uniform guess over the vocabulary gives loss ≈ ln(vocab_size) = ln(50257) ≈ 10.8")
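
The 10.8 baseline and the input/target shift can both be reproduced with the standard library alone. In this sketch the helper `cross_entropy` and the token IDs are hypothetical, chosen only to show the arithmetic; it does not call the notebook's model.

```python
import math

def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

vocab_size = 50257
uniform = [1.0 / vocab_size] * vocab_size   # a model that knows nothing

loss = cross_entropy(uniform, target=1234)  # any target gives the same value
perplexity = math.exp(loss)

# Shifting: position t is trained to predict token t + 1.
token_ids = [10, 20, 30, 40]     # hypothetical IDs for a 4-token sequence
inputs = token_ids[:-1]          # [10, 20, 30]
targets = token_ids[1:]          # [20, 30, 40]
```

Perplexity is the exponential of the loss, so a uniform guess has a perplexity equal to the vocabulary size: the model is as uncertain as if it were choosing among all 50,257 tokens.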

## 7. Comparing Model Sizes

Let's compare our model to famous GPT models:

In [None]:
# Famous GPT configurations
gpt_configs = [
    {"name": "Our Tiny", "d_model": 256, "n_layers": 6, "n_heads": 8},
    {"name": "GPT-2 Small", "d_model": 768, "n_layers": 12, "n_heads": 12},
    {"name": "GPT-2 Medium", "d_model": 1024, "n_layers": 24, "n_heads": 16},
    {"name": "GPT-2 Large", "d_model": 1280, "n_layers": 36, "n_heads": 20},
    {"name": "GPT-2 XL", "d_model": 1600, "n_layers": 48, "n_heads": 25},
    {"name": "GPT-3", "d_model": 12288, "n_layers": 96, "n_heads": 96},
]

print(f"{'Model':<20} {'d_model':<10} {'Layers':<10} {'Heads':<10} {'Parameters':<15}")
print("-" * 75)

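for cfg in gpt_configs:
    config = ModelConfig(
        vocab_size=50257,
        d_model=cfg["d_model"],
        n_layers=cfg["n_layers"],
        n_heads=cfg["n_heads"],
        d_ff=4 * cfg["d_model"],  # Standard 4x expansion
    )
    # Build on the "meta" device (PyTorch 2.0+) so no weight memory is allocated;
    # a real GPT-3-sized model would need roughly 700 GB in float32.
    # Assumes GPTModel's constructor only creates and initializes tensors.
    with torch.device("meta"):
        model = GPTModel(config)
    params = model.num_parameters()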
    
    print(f"{cfg['name']:<20} {cfg['d_model']:<10} {cfg['n_layers']:<10} "
          f"{cfg['n_heads']:<10} {params:>12,}")

print("\nNote: GPT-3 has 175 BILLION parameters!")
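
Parameter counts can also be estimated in closed form, without building any model. The sketch below assumes a GPT-2-style layout: biases on every linear layer, two LayerNorms per block plus a final one, a 4× feedforward expansion, and an output projection tied to the token embedding. The function name `estimate_gpt_params` is hypothetical, and the context lengths (1024 for GPT-2, 2048 for GPT-3) are the published values rather than settings from this notebook.

```python
def estimate_gpt_params(vocab_size: int, max_seq_len: int, d_model: int, n_layers: int) -> int:
    """Closed-form parameter count for a GPT-2-style model (assumptions in the text above)."""
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    attention = 4 * d_model * d_model + 4 * d_model      # Q, K, V, output: weights + biases
    ffn = 8 * d_model * d_model + 5 * d_model            # d -> 4d -> d: weights + biases
    layer_norms = 4 * d_model                            # two LayerNorms, scale + shift each
    per_block = attention + ffn + layer_norms
    final_ln = 2 * d_model
    return token_emb + pos_emb + n_layers * per_block + final_ln

gpt2_small = estimate_gpt_params(50257, 1024, 768, 12)
gpt3 = estimate_gpt_params(50257, 2048, 12288, 96)

print(f"GPT-2 Small: {gpt2_small:,}")
print(f"GPT-3:       {gpt3:,}")
```

The number of heads does not appear in the formula: heads split `d_model` into slices but do not change the size of the projection matrices. The dominant term is `12 * d_model**2` per block, which is why widening the model grows the parameter count much faster than deepening it.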

## 8. Text Generation with GPT

Let's use our model to generate text!

In [None]:
# Create a small model for generation
config = ModelConfig(
    vocab_size=50257,
    max_seq_len=128,
    d_model=128,
    n_layers=4,
    n_heads=4,
)
model = GPTModel(config)
model.eval()

# Input prompt
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

print(f"Prompt: {prompt}")
print(f"Generating...\n")

# Generate with different temperatures
for temperature in [0.5, 1.0, 1.5]:
    generated = model.generate(
        input_tensor,
        max_new_tokens=20,
        temperature=temperature,
        top_k=50
    )
    
    generated_text = tokenizer.decode(generated[0].tolist())
    print(f"Temperature {temperature}:")
    print(f"  {generated_text}")
    print()

print("Note: Output is random/poor quality because model is untrained!")
print("After training on real data, generation quality improves dramatically.")
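
The internals of `model.generate` are not shown in this notebook, so the following is only a sketch of what temperature and top-k sampling usually mean. The helper `sample_next_token` and the four logits are hypothetical; the real method may differ in details.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index from raw logits with temperature and optional top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]

    # Keep only the top_k highest-scoring candidates (all of them if top_k is None).
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    # Numerically stable softmax weights over the surviving candidates.
    max_logit = scaled[candidates[0]]
    weights = [math.exp(scaled[i] - max_logit) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical logits for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_next_token(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(200)]
```

Dividing by a temperature below 1 widens the gaps between logits, so sampling concentrates on the top token; a temperature above 1 flattens the distribution. Top-k then removes the long tail entirely, which is why only indices 0 and 1 can appear in `samples` here.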

## 9. Understanding Model Components

In [None]:
# Analyze parameter distribution
config = ModelConfig(d_model=256, n_layers=6, n_heads=8)
model = GPTModel(config)

component_params = {
    "Token Embedding": sum(p.numel() for p in model.token_embedding.parameters()),
    "Position Embedding": sum(p.numel() for p in model.position_embedding.parameters()),
    "Transformer Blocks": sum(p.numel() for p in model.blocks.parameters()),
    "Final LayerNorm": sum(p.numel() for p in model.ln_final.parameters()),
}

total = sum(component_params.values())

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
ax1.pie(
    component_params.values(),
    labels=component_params.keys(),
    autopct='%1.1f%%',
    startangle=90
)
ax1.set_title('Parameter Distribution', fontsize=14)

# Bar chart
bars = ax2.barh(list(component_params.keys()), list(component_params.values()))
ax2.set_xlabel('Number of Parameters', fontsize=12)
ax2.set_title('Parameters by Component', fontsize=14)
ax2.ticklabel_format(axis='x', style='plain')

# Add value labels
for i, (component, params) in enumerate(component_params.items()):
    ax2.text(params, i, f' {params:,}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print(f"Total parameters: {total:,}")
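print("\nNote: In a model this small, the token embedding can hold more parameters than the transformer blocks")
print("      (assuming the default vocab_size is GPT-2's 50,257); at GPT-2/GPT-3 scale the blocks dominate")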
print(f"Note: lm_head shares weights with token_embedding, so not counted separately")

## 10. Key Takeaways

Let's recap what we've learned about the complete GPT model:

1. **Complete architecture = Embeddings + Transformer Blocks + Output**:
   - Token embeddings: Convert IDs to vectors
   - Position embeddings: Encode sequence position
   - Transformer blocks: Process information
   - Output projection: Predict next token

2. **Embeddings are crucial**:
   - Token embeddings learn semantic representations
   - Position embeddings give attention access to token order (attention alone is permutation equivariant)
   - Combined with addition: simple and effective

3. **Weight tying reduces parameters**:
   - Share weights between input and output embeddings
   - 50% reduction in embedding parameters
   - Better generalization and performance

4. **Scaled initialization enables deep networks**:
   - Scale residual projections by 1/√(2N)
   - Prevents activation explosion
   - Important for training very deep models stably

5. **Forward pass**: Token IDs → Embeddings → Transformer Blocks → Logits
   - Causal masking prevents looking at future tokens
   - Softmax converts logits to probabilities
   - Cross-entropy loss for training

6. **Model scaling**:
   - Increase d_model, n_layers, n_heads for more capacity
   - GPT-3: 175B parameters!
   - In large models most parameters sit in the transformer blocks; in small models the embeddings can dominate

7. **Generation**:
   - Autoregressive: one token at a time
   - Temperature controls randomness
   - Quality depends on training data

## Next Steps

Now that we have a complete GPT model, we're ready to learn about **training and text generation** - how to train the model on data and generate high-quality text!

Continue to **Notebook 05: Training and Generation** →

---

## Further Reading

- [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (GPT-2 paper)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (GPT-3 paper)
- [Using the Output Embedding to Improve Language Models](https://arxiv.org/abs/1608.05859) (Weight tying)
- [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/) (Jay Alammar)