In [None]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

## Day 9, Task 2: Hyperparameter Exploration & Analysis of the GPT Model 🟡

---

### 🎯 **Goal**

Starting from the pre-trained GPT model built in Day 8 (`hands_on_part_7_scaling.ipynb` + `model_weights.pth`), you will:
1. **Load and evaluate** the pre-trained baseline model
2. **Experiment** with architectural changes: vary `n_head`, `n_layer`, `n_embed`, `dropout`, and `block_size`
3. **Retrain** smaller variants from scratch
4. **Compare** validation losses across configurations
5. **Visualise** attention weight heatmaps to understand what the model focuses on
6. **Analyse** how each hyperparameter affects generation quality

---

### 📋 **Agenda**

| Section | Topic | Description |
|:-------:|-------|-------------|
| 1 | **Setup & Data Pipeline** | Load PyTorch, dataset, tokenizer, batching |
| 2 | **Model Architecture** | Full Transformer definition (from Day 8) |
| 3 | **Load Pre-trained Baseline** | Load `model_weights.pth` and evaluate |
| 4 | **Attention Weight Visualisation** | Extract and plot attention heatmaps from the baseline |
| 5 | **Hyperparameter Experiments** | Train smaller variants with different configs |
| 6 | **Results Comparison** | Bar charts and tables comparing all configurations |
| 7 | **Generation Quality Comparison** | Side-by-side text samples from each model |
| 8 | **Written Analysis** | Discuss findings and draw conclusions |

---

### 🎓 **Skills Tested**

- ✅ Transformer architecture understanding (Days 7–8)
- ✅ Attention mechanism visualisation
- ✅ Matplotlib / plotting skills
- ✅ Critical analysis of model behaviour
- ✅ Experimental methodology: changing one variable at a time

Let's explore what makes a Transformer tick! 🚀

---
## Section 1: Setup & Data Pipeline

We reuse the exact same data pipeline from `hands_on_part_7_scaling.ipynb`:
- Load Tiny Shakespeare
- Character-level tokenizer (`stoi` / `itos`)
- 90/10 train/val split
- `get_batch()` and `estimate_loss()` functions

These stay the same across all experiments; only the model configuration changes (including `block_size`, which is passed to `get_batch()` and `estimate_loss()`).

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import time
import copy

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

### Load and Tokenize the Dataset

In [None]:
# Load the tiny shakespeare dataset
with open("tiny_shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Tokenizer
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Encode and split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(len(data) * 0.9)
train_data = data[:n]
val_data = data[n:]

print(f"Vocabulary size: {vocab_size}")
print(f"Training tokens: {len(train_data):,}")
print(f"Validation tokens: {len(val_data):,}")

### Batch Loader & Loss Estimation

These functions take `block_size` and `batch_size` as arguments instead of reading global variables, so the same code works for every experiment.
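
The relationship between inputs and targets can be sketched without PyTorch. The snippet below is a minimal illustration, not the notebook's loader: the helper `make_example` and the toy token list are assumptions made for this example. For a window starting at `start`, the target sequence is the input shifted one position to the right, so position `t` of `x` is trained to predict position `t` of `y`.

```python
def make_example(tokens, start, block_size):
    """Return (x, y) where y is x shifted one token to the right."""
    x = tokens[start : start + block_size]
    y = tokens[start + 1 : start + block_size + 1]
    return x, y


toy_tokens = list(range(10, 20))  # hypothetical token ids 10..19
x, y = make_example(toy_tokens, start=2, block_size=4)
print(x)  # [12, 13, 14, 15]
print(y)  # [13, 14, 15, 16]
```

Because `y` needs one extra token beyond the end of `x`, the largest valid start index is `len(tokens) - block_size - 1`. This matches the upper bound `len(d) - block_size` that `get_batch()` passes to `torch.randint`, which is exclusive.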

In [None]:
def get_batch(split, block_size, batch_size=32):
    """Sample a random mini-batch."""
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i : i + block_size] for i in ix])
    y = torch.stack([d[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)


@torch.no_grad()
def estimate_loss(model, block_size, batch_size=32, eval_iters=200):
    """Average loss over many batches for stable measurement."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, block_size, batch_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

---
## Section 2: Model Architecture (from Day 8)

We define the full Transformer model (`Head`, `MultiHeadAttention`, `FeedForward`, `Block`, and `Transformer`) exactly as built in Day 8 Part 7.

The key difference: we **parameterise everything** via a config dictionary so we can easily swap hyperparameters for each experiment.

In [None]:
class Head(nn.Module):
    """Single head of self-attention."""

    def __init__(self, n_embed, head_size, block_size, dropout):
        super().__init__()
        self.key   = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, return_weights=False):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
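        # Scores are scaled by C = n_embed, as in the Day 8 code behind the pre-trained
        # weights; textbook scaled dot-product attention uses head_size**-0.5 instead.
        wei = q @ k.transpose(-2, -1) * C**-0.5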
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        if return_weights:
            raw_weights = wei.clone()  # save before dropout
        wei = self.dropout(wei)
        out = wei @ v
        if return_weights:
            return out, raw_weights
        return out


class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""

    def __init__(self, n_embed, num_heads, head_size, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([
            Head(n_embed, head_size, block_size, dropout) for _ in range(num_heads)
        ])
        self.proj = nn.Linear(n_embed, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, return_weights=False):
        if return_weights:
            results = [h(x, return_weights=True) for h in self.heads]
            outs = [r[0] for r in results]
            weights = [r[1] for r in results]
            out = torch.cat(outs, dim=-1)
            out = self.dropout(self.proj(out))
            return out, weights
        else:
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            out = self.dropout(self.proj(out))
            return out


class FeedForward(nn.Module):
    """Simple feed-forward network with ReLU."""

    def __init__(self, n_embed, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed * 4),
            nn.ReLU(),
            nn.Linear(n_embed * 4, n_embed),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: LayerNorm → MultiHead Attention → LayerNorm → FFN, with residual connections."""

    def __init__(self, n_embed, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embed // n_head
        self.sa   = MultiHeadAttention(n_embed, n_head, head_size, block_size, dropout)
        self.ffwd = FeedForward(n_embed, dropout)
        self.ln1  = nn.LayerNorm(n_embed)
        self.ln2  = nn.LayerNorm(n_embed)

    def forward(self, x, return_weights=False):
        if return_weights:
            sa_out, weights = self.sa(self.ln1(x), return_weights=True)
            x = x + sa_out
            x = x + self.ffwd(self.ln2(x))
            return x, weights
        else:
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x


class Transformer(nn.Module):
    """GPT-style decoder-only Transformer."""

    def __init__(self, vocab_size, n_embed, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.block_size = block_size
        self.token_embedding_table    = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.ModuleList([
            Block(n_embed, n_head, block_size, dropout) for _ in range(n_layer)
        ])
        self.ln_f    = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, y=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        if y is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            y = y.view(B * T)
            loss = F.cross_entropy(logits, y)

        return logits, loss

    @torch.no_grad()  # sampling needs no gradients; this avoids building a graph
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

    def get_attention_weights(self, idx, layer_idx=0):
        """
        Forward pass that also returns the attention weights
        from a specific transformer block (layer).
        """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        weights = None
        for i, block in enumerate(self.blocks):
            if i == layer_idx:
                x, weights = block(x, return_weights=True)
            else:
                x = block(x)
        return weights  # list of (B, T, T) tensors, one per head

### Helper: Count Model Parameters

In [None]:
def count_parameters(model):
    """Count total trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

---
## Section 3: Load the Pre-trained Baseline Model

We load the model from `model_weights.pth` using the **original Day 8 hyperparameters**:

| Parameter | Value |
|-----------|-------|
| `n_embed` | 384 |
| `n_head` | 6 |
| `n_layer` | 6 |
| `block_size` | 256 |
| `dropout` | 0.2 |
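
As a cross-check on the parameter count printed after loading, the total can be worked out by hand from the layer shapes in Section 2. The sketch below is an illustration under stated assumptions: the function name `estimate_params` is made up for this example, and `vocab_size=65` is assumed (the usual character count for Tiny Shakespeare), so substitute the value your own run prints.

```python
def estimate_params(vocab_size, n_embed, n_head, n_layer, block_size):
    """Count trainable parameters for the Transformer defined in Section 2."""
    head_size = n_embed // n_head
    # Key, query and value projections (no bias) for every head
    attn_qkv = 3 * n_embed * head_size * n_head
    # Output projection applied after concatenating the heads (weights + bias)
    attn_proj = n_embed * n_embed + n_embed
    # Feed-forward: n_embed -> 4*n_embed -> n_embed, both layers with bias
    ffwd = (n_embed * 4 * n_embed + 4 * n_embed) + (4 * n_embed * n_embed + n_embed)
    # Two LayerNorms per block, each with a weight and a bias vector
    layer_norms = 2 * 2 * n_embed
    per_block = attn_qkv + attn_proj + ffwd + layer_norms

    embeddings = vocab_size * n_embed + block_size * n_embed  # token + position tables
    final_norm = 2 * n_embed
    lm_head = n_embed * vocab_size + vocab_size
    return embeddings + n_layer * per_block + final_norm + lm_head


total = estimate_params(vocab_size=65, n_embed=384, n_head=6, n_layer=6, block_size=256)
print(f"{total:,}")  # 10,788,929
```

The `tril` mask is registered as a buffer, not a parameter, so it does not appear in the count. If `count_parameters(baseline_model)` reports a different number, the most likely cause is a different vocabulary size.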

In [None]:
# ── Baseline configuration (from Day 8 Part 7) ──────────────────
baseline_config = {
    'n_embed':    384,
    'n_head':     6,
    'n_layer':    6,
    'block_size': 256,
    'dropout':    0.2,
}

# Create model with baseline config
baseline_model = Transformer(
    vocab_size = vocab_size,
    **baseline_config
).to(device)

# Load pre-trained weights
baseline_model.load_state_dict(
    torch.load('model_weights.pth', map_location=device)
)
baseline_model.eval()

print("✅ Baseline model loaded successfully!")
print(f"   Parameters: {count_parameters(baseline_model):,}")

### Evaluate the Baseline

In [None]:
baseline_losses = estimate_loss(baseline_model, block_size=256, batch_size=32)
print(f"Baseline - Train loss: {baseline_losses['train']:.4f}, Val loss: {baseline_losses['val']:.4f}")

### Generate Sample Text from Baseline

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = baseline_model.generate(context, max_new_tokens=500)
print("=" * 60)
print("  BASELINE MODEL - Generated Text")
print("=" * 60)
print(decode(generated[0].tolist()))
print("=" * 60)

---
## Section 4: Attention Weight Visualisation 🔍

One of the most powerful ways to understand a Transformer is to **look at what it's paying attention to**.

Each attention head in each layer produces a $(T \times T)$ weight matrix, where entry $(i, j)$ tells us:

> "When producing the output for position $i$, how much does the model attend to position $j$?"

We'll extract these weights from the **pre-trained baseline** and plot them as heatmaps.
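
To make the shape of these matrices concrete, here is a minimal pure-Python sketch of how one causal attention matrix is formed from raw scores. It stands in for the `masked_fill` + `softmax` steps in `Head.forward`; the function name `causal_attention_weights` and the toy scores are assumptions made for this example.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only attend to positions 0..i
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight
        row = [e / total for e in exps] + [0.0] * (T - i - 1)
        weights.append(row)
    return weights


toy_scores = [[0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0]]  # hypothetical raw query-key scores
W = causal_attention_weights(toy_scores)
for row in W:
    print([round(w, 3) for w in row])
```

Every row sums to 1, and everything above the diagonal is zero, which is why all the heatmaps in this section are lower-triangular. The first row is always `[1, 0, 0, ...]` because the first token can only attend to itself.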

### Prepare a Short Input Sequence

We pick a short passage from Shakespeare for clear visualisation. Attention maps on long sequences are hard to read!

In [None]:
# Pick a short, recognisable passage
sample_text = "First Citizen:\nBefore we proceed"
sample_tokens = encode(sample_text)
sample_tensor = torch.tensor([sample_tokens], dtype=torch.long, device=device)

T = len(sample_tokens)
print(f"Input text: '{sample_text}'")
print(f"Tokens ({T}): {sample_tokens}")
print(f"Characters: {[itos[t] for t in sample_tokens]}")

### Extract and Plot Attention Weights

We'll visualise attention heads from **Layer 0** (first layer) and **Layer 5** (last layer) of the baseline model to see how attention patterns differ between early and late layers.

In [None]:
def plot_attention_heads(model, input_tensor, layer_idx, n_heads_to_show=6):
    """
    Extract attention weights from a specific layer and plot each head as a heatmap.
    """
    model.eval()
    with torch.no_grad():
        weights = model.get_attention_weights(input_tensor, layer_idx=layer_idx)

    T = input_tensor.shape[1]
    token_labels = [itos[input_tensor[0, i].item()] for i in range(T)]
    # Make whitespace characters visible
    display_labels = []
    for ch in token_labels:
        if ch == '\n':
            display_labels.append('\\n')
        elif ch == ' ':
            display_labels.append('␣')
        else:
            display_labels.append(ch)

    n_heads = min(len(weights), n_heads_to_show)
    fig, axes = plt.subplots(1, n_heads, figsize=(4 * n_heads, 4))
    if n_heads == 1:
        axes = [axes]

    for h in range(n_heads):
        ax = axes[h]
        w = weights[h][0].cpu().numpy()  # (T, T), first batch element
        im = ax.imshow(w, cmap='viridis', vmin=0, vmax=1)
        ax.set_title(f'Head {h}', fontsize=11)
        ax.set_xticks(range(T))
        ax.set_xticklabels(display_labels, fontsize=7, rotation=90)
        ax.set_yticks(range(T))
        ax.set_yticklabels(display_labels, fontsize=7)
        if h == 0:
            ax.set_ylabel('Query position (output)', fontsize=9)
        ax.set_xlabel('Key position (input)', fontsize=9)

    fig.suptitle(f'Attention Weights - Layer {layer_idx}', fontsize=14, y=1.02)
    fig.colorbar(im, ax=axes, shrink=0.6, label='Attention Weight')
    plt.tight_layout()
    plt.show()

### Layer 0: First Transformer Block

Early layers tend to learn **local patterns**: attending to nearby characters, the previous character, or specific character types (vowels, consonants, etc.).

In [None]:
plot_attention_heads(baseline_model, sample_tensor, layer_idx=0)

### Layer 5: Last Transformer Block

Later layers tend to learn more **abstract, long-range patterns**: attending to semantically related positions or structural patterns.

In [None]:
plot_attention_heads(baseline_model, sample_tensor, layer_idx=5)

### 🔍 What to Look For in the Heatmaps

| Pattern | What It Means |
|---------|---------------|
| **Strong diagonal** | Head attends mostly to the current token itself |
| **Band just below the diagonal** | Head attends to the immediately previous token (like a bigram) |
| **Vertical stripes** | Head attends to a specific position regardless of query position |
| **Uniform rows** | Head distributes attention evenly (less specialised) |
| **Block patterns** | Head groups tokens (e.g., within a word, within a line) |
| **Triangular bottom-left** | Natural causal pattern: later tokens attend to more context |

Different heads learn **different roles**, which is why multi-head attention is powerful!
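
If judging "uniform" versus "focused" by eye is difficult, one optional way to put a number on it is the entropy of each attention row. This is a sketch of that idea rather than part of the task; the helper `row_entropy` and the two example rows are assumptions made for this illustration.

```python
import math


def row_entropy(row):
    """Shannon entropy (in nats) of one attention row; zero weights are skipped."""
    return sum(-p * math.log(p) for p in row if p > 0)


focused = [0.0, 0.0, 1.0, 0.0]      # all weight on a single key
uniform = [0.25, 0.25, 0.25, 0.25]  # weight spread evenly over four keys
print(row_entropy(focused))  # zero: the head looks at exactly one position
print(row_entropy(uniform))  # log(4), the maximum for four visible keys
```

With causal masking, row $i$ can see only $i + 1$ positions, so its maximum possible entropy is $\log(i + 1)$. Comparing a row's entropy with that ceiling indicates how close to uniform it is.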

### Average Attention Across All Heads in Each Layer

Let's also see the **average** attention pattern per layer. This smooths out individual head quirks and shows the layer's overall behaviour.

In [None]:
n_layers = baseline_config['n_layer']
fig, axes = plt.subplots(1, n_layers, figsize=(4 * n_layers, 4))

T = sample_tensor.shape[1]
display_labels = []
for i in range(T):
    ch = itos[sample_tensor[0, i].item()]
    if ch == '\n':
        display_labels.append('\\n')
    elif ch == ' ':
        display_labels.append('␣')
    else:
        display_labels.append(ch)

for layer_idx in range(n_layers):
    with torch.no_grad():
        weights = baseline_model.get_attention_weights(sample_tensor, layer_idx=layer_idx)
    
    # Average across all heads
    avg_w = torch.stack([w[0] for w in weights]).mean(dim=0).cpu().numpy()

    ax = axes[layer_idx]
    im = ax.imshow(avg_w, cmap='viridis', vmin=0, vmax=0.5)
    ax.set_title(f'Layer {layer_idx}', fontsize=11)
    ax.set_xticks(range(T))
    ax.set_xticklabels(display_labels, fontsize=6, rotation=90)
    ax.set_yticks(range(T))
    ax.set_yticklabels(display_labels, fontsize=6)

fig.suptitle('Average Attention Weights per Layer (Baseline Model)', fontsize=14, y=1.02)
fig.colorbar(im, ax=axes, shrink=0.6, label='Avg Attention Weight')
plt.tight_layout()
plt.show()

---
## Section 5: Hyperparameter Experiments 🧪

Now for the core experiment! We'll train **smaller variants** of the Transformer, changing **one hyperparameter at a time** compared to a reduced baseline.

### Experimental Design

**Why not use the full baseline config?**  
The baseline (384 embed, 6 heads, 6 layers) takes 15-30 min to train on GPU. Instead, we use a **reduced baseline** that trains in 2-5 minutes, then vary one parameter at a time.

| Experiment | What Changes | Settings Compared |
|:----------:|:------------|:----------------|
| **Baseline (small)** | nothing | `n_embed=64, n_head=4, n_layer=4, block_size=128, dropout=0.1` |
| **Exp A** | `n_head` | 1 head vs 4 heads vs 8 heads |
| **Exp B** | `n_layer` | 1 layer vs 4 layers vs 8 layers |
| **Exp C** | `n_embed` | 32 vs 64 vs 128 |
| **Exp D** | `dropout` | 0.0 vs 0.1 vs 0.3 |
| **Exp E** | `block_size` | 32 vs 128 vs 256 |
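
The "one variable at a time" rule can be checked mechanically. The sketch below is a hypothetical helper (`make_variant` is not part of the notebook): it copies the reduced baseline, applies an override, and rejects configurations where `n_embed` is not divisible by `n_head`. That check matters because `Block` computes `head_size = n_embed // n_head`, and the concatenated heads must be exactly `n_embed` wide to feed the projection layer.

```python
def make_variant(base, **overrides):
    """Copy the base config, apply overrides, and check head divisibility."""
    cfg = {**base, **overrides}
    if cfg['n_embed'] % cfg['n_head'] != 0:
        raise ValueError(
            f"n_embed={cfg['n_embed']} is not divisible by n_head={cfg['n_head']}"
        )
    return cfg


base = {'n_embed': 64, 'n_head': 4, 'n_layer': 4, 'block_size': 128, 'dropout': 0.1}
variant = make_variant(base, n_head=8)

# Which keys differ from the baseline? For a clean experiment there should be one.
changed = {k for k in base if base[k] != variant[k]}
print(changed)  # {'n_head'}
```

Every value in the table above passes this check: 64 is divisible by 1, 4, and 8, and both 32 and 128 are divisible by 4.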

### Training Helper Function

We define a reusable function that creates a model, trains it, records the loss curve, and returns everything we need for comparison.

In [None]:
def train_variant(config, label, max_iters=3000, eval_interval=500, learning_rate=3e-4, batch_size=32):
    """
    Train a Transformer variant from scratch and return results.
    
    Args:
        config: dict with n_embed, n_head, n_layer, block_size, dropout
        label: string name for this experiment
        max_iters: training steps
        eval_interval: how often to evaluate
        learning_rate: optimiser LR
        batch_size: mini-batch size
    
    Returns:
        dict with model, losses, generated text, timing info
    """
    print(f"\n{'='*60}")
    print(f"  Training: {label}")
    print(f"  Config: {config}")
    
    torch.manual_seed(1337)
    
    model = Transformer(vocab_size=vocab_size, **config).to(device)
    n_params = count_parameters(model)
    print(f"  Parameters: {n_params:,}")
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    
    train_losses = []
    val_losses = []
    steps_list = []
    
    start_time = time.time()
    
    for step in range(max_iters):
        if step % eval_interval == 0 or step == max_iters - 1:
            losses = estimate_loss(model, config['block_size'], batch_size)
            train_losses.append(losses['train'])
            val_losses.append(losses['val'])
            steps_list.append(step)
            print(f"  Step {step:5d}: train={losses['train']:.4f}, val={losses['val']:.4f}")
        
        xb, yb = get_batch('train', config['block_size'], batch_size)
        logits, loss = model(xb, yb)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    
    elapsed = time.time() - start_time
    print(f"  Training time: {elapsed:.1f}s")
    
    # Generate sample text
    model.eval()
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    generated = model.generate(context, max_new_tokens=300)
    sample_text = decode(generated[0].tolist())
    
    return {
        'model': model,
        'label': label,
        'config': config,
        'n_params': n_params,
        'train_losses': train_losses,
        'val_losses': val_losses,
        'steps': steps_list,
        'final_train_loss': train_losses[-1],
        'final_val_loss': val_losses[-1],
        'elapsed': elapsed,
        'sample_text': sample_text,
    }

### Define All Experiment Configurations

In [None]:
# ── Reduced baseline config ──────────────────────────────────────
small_baseline = {
    'n_embed':    64,
    'n_head':     4,
    'n_layer':    4,
    'block_size': 128,
    'dropout':    0.1,
}

# ── Experiment configs (change ONE parameter at a time) ──────────
experiments = {
    # Baseline
    'Baseline (small)': small_baseline,

    # Experiment A: vary n_head
    'n_head=1':  {**small_baseline, 'n_head': 1},
    'n_head=8':  {**small_baseline, 'n_head': 8},

    # Experiment B: vary n_layer
    'n_layer=1': {**small_baseline, 'n_layer': 1},
    'n_layer=8': {**small_baseline, 'n_layer': 8},

    # Experiment C: vary n_embed
    'n_embed=32':  {**small_baseline, 'n_embed': 32, 'n_head': 4},  # 32/4=8 head_size
    'n_embed=128': {**small_baseline, 'n_embed': 128, 'n_head': 4}, # 128/4=32 head_size

    # Experiment D: vary dropout
    'dropout=0.0': {**small_baseline, 'dropout': 0.0},
    'dropout=0.3': {**small_baseline, 'dropout': 0.3},

    # Experiment E: vary block_size
    'block_size=32':  {**small_baseline, 'block_size': 32},
    'block_size=256': {**small_baseline, 'block_size': 256},
}

print(f"Total experiments to run: {len(experiments)}")
for name, cfg in experiments.items():
    print(f"  • {name}")

### Run All Experiments

⚠️ **This cell may take a while** depending on your hardware. Each experiment trains for 3,000 steps, which takes ~1-3 minutes on GPU or ~3-8 minutes on CPU per experiment.

In [None]:
results = {}

for name, config in experiments.items():
    results[name] = train_variant(config, label=name, max_iters=3000, eval_interval=500)

print("\n" + "=" * 60)
print("  ALL EXPERIMENTS COMPLETE! ✅")
print("=" * 60)

---
## Section 6: Results Comparison 📊

Now let's visualise and compare the results across all experiments.
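
Two reference points can help when reading the loss values that follow. The sketch below assumes a vocabulary of 65 characters and uses a made-up loss of 1.5 purely as an example input; neither number comes from a run of this notebook. `F.cross_entropy` reports the loss in nats, so a model that guesses uniformly scores $\ln(\text{vocab\_size})$, and $e^{\text{loss}}$ gives the effective number of equally likely characters the model is choosing between.

```python
import math


def uniform_guess_loss(vocab_size):
    """Cross-entropy (nats) of predicting every character with equal probability."""
    return math.log(vocab_size)


def perplexity(loss):
    """Effective number of equally likely choices implied by a loss value."""
    return math.exp(loss)


print(f"{uniform_guess_loss(65):.3f}")  # 4.174
print(f"{perplexity(1.5):.2f}")         # 4.48
```

A step-0 loss close to the uniform-guess value is a quick sign that a freshly initialised model and the data pipeline are wired up correctly.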

### 6.1 - Summary Table

In [None]:
# Print a summary table
print(f"{'Experiment':<20} {'Params':>10} {'Train Loss':>12} {'Val Loss':>10} {'Time (s)':>10}")
print("-" * 65)
for name, r in results.items():
    print(f"{name:<20} {r['n_params']:>10,} {r['final_train_loss']:>12.4f} {r['final_val_loss']:>10.4f} {r['elapsed']:>10.1f}")

### 6.2 - Validation Loss Bar Chart (All Experiments)

In [None]:
names = list(results.keys())
val_losses_all = [results[n]['final_val_loss'] for n in names]
train_losses_all = [results[n]['final_train_loss'] for n in names]

fig, ax = plt.subplots(figsize=(14, 6))
x_pos = np.arange(len(names))
width = 0.35

bars1 = ax.bar(x_pos - width/2, train_losses_all, width, label='Train Loss', color='#1E88E5', alpha=0.8)
bars2 = ax.bar(x_pos + width/2, val_losses_all, width, label='Val Loss', color='#E53935', alpha=0.8)

ax.set_xlabel('Experiment', fontsize=12)
ax.set_ylabel('Cross-Entropy Loss', fontsize=12)
ax.set_title('Train & Validation Loss Across All Experiments', fontsize=14)
ax.set_xticks(x_pos)
ax.set_xticklabels(names, rotation=45, ha='right', fontsize=9)
ax.legend()
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars2:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
            f'{height:.3f}', ha='center', va='bottom', fontsize=7)

plt.tight_layout()
plt.show()

### 6.3 - Training Curves: Effect of Number of Heads

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

head_exps = ['n_head=1', 'Baseline (small)', 'n_head=8']
colors = ['#FF7043', '#1E88E5', '#66BB6A']

for name, color in zip(head_exps, colors):
    r = results[name]
    ax1.plot(r['steps'], r['train_losses'], marker='o', label=f"{name} (train)", color=color, linestyle='-')
    ax2.plot(r['steps'], r['val_losses'], marker='s', label=f"{name} (val)", color=color, linestyle='-')

ax1.set_title('Training Loss - Varying n_head', fontsize=12)
ax1.set_xlabel('Step'); ax1.set_ylabel('Loss'); ax1.legend(); ax1.grid(alpha=0.3)
ax2.set_title('Validation Loss - Varying n_head', fontsize=12)
ax2.set_xlabel('Step'); ax2.set_ylabel('Loss'); ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 6.4 - Training Curves: Effect of Number of Layers

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

layer_exps = ['n_layer=1', 'Baseline (small)', 'n_layer=8']

for name, color in zip(layer_exps, colors):
    r = results[name]
    ax1.plot(r['steps'], r['train_losses'], marker='o', label=f"{name} (train)", color=color, linestyle='-')
    ax2.plot(r['steps'], r['val_losses'], marker='s', label=f"{name} (val)", color=color, linestyle='-')

ax1.set_title('Training Loss - Varying n_layer', fontsize=12)
ax1.set_xlabel('Step'); ax1.set_ylabel('Loss'); ax1.legend(); ax1.grid(alpha=0.3)
ax2.set_title('Validation Loss - Varying n_layer', fontsize=12)
ax2.set_xlabel('Step'); ax2.set_ylabel('Loss'); ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 6.5 - Training Curves: Effect of Embedding Dimension

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

embed_exps = ['n_embed=32', 'Baseline (small)', 'n_embed=128']

for name, color in zip(embed_exps, colors):
    r = results[name]
    ax1.plot(r['steps'], r['train_losses'], marker='o', label=f"{name} (train)", color=color, linestyle='-')
    ax2.plot(r['steps'], r['val_losses'], marker='s', label=f"{name} (val)", color=color, linestyle='-')

ax1.set_title('Training Loss - Varying n_embed', fontsize=12)
ax1.set_xlabel('Step'); ax1.set_ylabel('Loss'); ax1.legend(); ax1.grid(alpha=0.3)
ax2.set_title('Validation Loss - Varying n_embed', fontsize=12)
ax2.set_xlabel('Step'); ax2.set_ylabel('Loss'); ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 6.6 - Training Curves: Effect of Dropout

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

dropout_exps = ['dropout=0.0', 'Baseline (small)', 'dropout=0.3']

for name, color in zip(dropout_exps, colors):
    r = results[name]
    ax1.plot(r['steps'], r['train_losses'], marker='o', label=f"{name} (train)", color=color, linestyle='-')
    ax2.plot(r['steps'], r['val_losses'], marker='s', label=f"{name} (val)", color=color, linestyle='-')

ax1.set_title('Training Loss - Varying dropout', fontsize=12)
ax1.set_xlabel('Step'); ax1.set_ylabel('Loss'); ax1.legend(); ax1.grid(alpha=0.3)
ax2.set_title('Validation Loss - Varying dropout', fontsize=12)
ax2.set_xlabel('Step'); ax2.set_ylabel('Loss'); ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 6.7 - Training Curves: Effect of Block Size (Context Length)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

block_exps = ['block_size=32', 'Baseline (small)', 'block_size=256']

for name, color in zip(block_exps, colors):
    r = results[name]
    ax1.plot(r['steps'], r['train_losses'], marker='o', label=f"{name} (train)", color=color, linestyle='-')
    ax2.plot(r['steps'], r['val_losses'], marker='s', label=f"{name} (val)", color=color, linestyle='-')

ax1.set_title('Training Loss - Varying block_size', fontsize=12)
ax1.set_xlabel('Step'); ax1.set_ylabel('Loss'); ax1.legend(); ax1.grid(alpha=0.3)
ax2.set_title('Validation Loss - Varying block_size', fontsize=12)
ax2.set_xlabel('Step'); ax2.set_ylabel('Loss'); ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 6.8 - Parameter Count vs. Validation Loss

Does throwing more parameters at the problem always help?

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

for name, r in results.items():
    ax.scatter(r['n_params'], r['final_val_loss'], s=100, zorder=5)
    ax.annotate(name, (r['n_params'], r['final_val_loss']),
                textcoords="offset points", xytext=(5, 5), fontsize=8)

ax.set_xlabel('Number of Parameters', fontsize=12)
ax.set_ylabel('Final Validation Loss', fontsize=12)
ax.set_title('Parameters vs. Validation Loss', fontsize=14)
ax.set_xscale('log')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

---
## Section 7: Generation Quality Comparison ✍️

Numbers tell one story, but **reading the generated text** tells another. Let's compare outputs side by side.

In [None]:
for name, r in results.items():
    print(f"\n{'='*60}")
    print(f"  {name}")
    print(f"  Val Loss: {r['final_val_loss']:.4f} | Params: {r['n_params']:,}")
    print(f"{'='*60}")
    # Show first 300 chars of generated text
    print(r['sample_text'][:300])
    print(f"{'─'*60}")

### Attention Heads: Small Baseline vs. Single Head

Let's compare how attention looks in the small baseline (4 heads) vs. a single-head model. The single-head model must cram all attention patterns into one head, while the multi-head model can specialise.

In [None]:
# Prepare a short input for the small models
short_text = "KING HENRY:"
short_tokens = encode(short_text)

# We need block_size to be at least as long as our input
for name in ['Baseline (small)', 'n_head=1']:
    r = results[name]
    cfg = r['config']
    model = r['model']
    model.eval()
    
    # Trim to fit the model's block_size (shorter inputs need no padding)
    tokens = short_tokens[:cfg['block_size']]
    inp = torch.tensor([tokens], dtype=torch.long, device=device)
    
    print(f"\n--- {name} ---")
    plot_attention_heads(model, inp, layer_idx=0, n_heads_to_show=cfg['n_head'])

---
## Section 8: Written Analysis 📝

### 8.1 - Effect of `n_head` (Number of Attention Heads)

**Observations:**
- **1 head:** The single head must learn all attention patterns by itself. Typically results in slightly higher validation loss compared to 4 heads.
- **4 heads (baseline):** Good balance. Each head can specialise on different pattern types (e.g., one head for local bigram-like patterns, another for longer-range structure).
- **8 heads:** With `n_embed=64`, each head only gets an 8-dimensional subspace (`64 / 8 = 8`). This may be too small, since each head then has limited capacity. Performance may plateau or slightly degrade vs. 4 heads at this embedding size.

**Takeaway:** More heads help up to a point, but each head needs enough dimensions to be effective. The sweet spot depends on `n_embed`.
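
The trade-off can be made concrete with a few lines of arithmetic based on the `Head` and `Block` definitions in Section 2. This is an illustrative calculation and not output from a training run; the `table` dictionary is introduced only for this example.

```python
n_embed = 64  # embedding size of the reduced baseline
table = {}
for n_head in (1, 4, 8):
    head_size = n_embed // n_head
    # Each head owns key, query and value matrices of shape (n_embed, head_size)
    qkv_params = 3 * n_embed * head_size * n_head
    table[n_head] = (head_size, qkv_params)
    print(f"n_head={n_head}: head_size={head_size}, q/k/v parameters={qkv_params:,}")
```

The key, query, and value parameter count is the same in all three cases (12,288). Experiment A therefore changes how the attention capacity is divided between heads, not how much of it there is, which is why the parameter counts in the summary table should be identical across the `n_head` variants.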

---

### 8.2 - Effect of `n_layer` (Number of Transformer Blocks)

**Observations:**
- **1 layer:** Very limited. It can only do a single round of "communication" between characters, and produces lower-quality text resembling a souped-up bigram model.
- **4 layers (baseline):** Significant improvement. Multiple rounds of attention allow the model to build up understanding of longer-range patterns.
- **8 layers:** More parameters and more processing depth. May improve slightly, but at this model scale diminishing returns set in. Also slower to train.

**Takeaway:** Depth (layers) is one of the most impactful hyperparameters. Going from 1→4 layers is typically a dramatic improvement; 4→8 tends to show diminishing returns for small models.

---

### 8.3 - Effect of `n_embed` (Embedding Dimension)

**Observations:**
- **32 dimensions:** Very compact representations. The model struggles to encode enough information per token.
- **64 dimensions (baseline):** Good balance of expressiveness and training speed.
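- **128 dimensions:** Richer representations: each token carries more information. Typically the best validation loss among the three, but takes longer to train and uses roughly 4× as many parameters as the baseline, because the weight matrices in every block grow with the square of `n_embed`.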

**Takeaway:** `n_embed` has a multiplicative effect on model size (affects every layer). Larger embeddings almost always help, but at increasing computational cost.
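
To make the "multiplicative effect" concrete, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions, not a count of the notebook's actual model: it assumes a character vocabulary of 65 symbols (`vocab_size=65` is a guess for the Shakespeare dataset), a feed-forward layer with 4× hidden width, and it ignores biases and LayerNorm weights. The function name `approx_param_count` is hypothetical.

```python
def approx_param_count(n_embed: int, n_layer: int,
                       vocab_size: int = 65, block_size: int = 128) -> int:
    """Rough weight count for a GPT-style model; biases and LayerNorm are ignored."""
    embeddings = (vocab_size + block_size) * n_embed  # token + position tables
    attention = 4 * n_embed ** 2                      # Q, K, V and output projection
    feed_forward = 8 * n_embed ** 2                   # two linear layers, 4x hidden width
    lm_head = n_embed * vocab_size                    # final projection to the vocabulary
    return embeddings + n_layer * (attention + feed_forward) + lm_head


baseline = approx_param_count(64, 4)
for n_embed in (32, 64, 128):
    total = approx_param_count(n_embed, 4)
    print(f"n_embed={n_embed:>3}: ~{total:,} weights ({total / baseline:.2f}x baseline)")
```

The per-block terms scale with `n_embed ** 2`, so doubling the embedding size roughly quadruples the total, while the embedding tables and output head only double. For exact numbers, count the parameters of the trained model directly rather than relying on this estimate.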

---

### 8.4 - Effect of `dropout`

**Observations:**
- **dropout=0.0:** No regularisation. The model may overfit (train loss much lower than val loss), especially with many training steps.
- **dropout=0.1 (baseline):** Mild regularisation. Usually gives the best val loss for models of this size.
- **dropout=0.3:** Aggressive regularisation. May actually hurt training: the model can't learn patterns fast enough because too many neurons are dropped.

**Takeaway:** Dropout is a double-edged sword. Too little → overfitting; too much → underfitting. The right value depends on model size and training duration.
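
As a reminder of what the `dropout` value actually does, the following is a minimal pure-Python sketch of inverted dropout, the scheme PyTorch's `nn.Dropout` follows: during training each activation is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`; in evaluation mode the input passes through unchanged. The function `inverted_dropout` is written for illustration only and is not part of the notebook.

```python
import random


def inverted_dropout(values, p, training=True, rng=None):
    """Zero each value with probability p and rescale the rest by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)          # evaluation mode: identity
    rng = rng or random
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0] * 10
print(inverted_dropout(activations, p=0.1, rng=random.Random(0)))
print(inverted_dropout(activations, p=0.3, rng=random.Random(0)))
print(inverted_dropout(activations, p=0.3, training=False))
```

At `p=0.3` roughly a third of the activations vanish on every forward pass, which is the noise the model has to learn through. This is also why `model.eval()` matters before measuring validation loss or plotting attention.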

---

### 8.5 - Effect of `block_size` (Context Length)

**Observations:**
- **block_size=32:** Very short context. The model can only "see" 32 characters back. Can't capture patterns spanning multiple words or lines.
- **block_size=128 (baseline):** Around 20–30 words of context. Captures most local patterns in Shakespeare.
- **block_size=256:** Longer context, which allows the model to understand paragraph-level structure. However, the $O(T^2)$ cost of attention means training is slower.

**Takeaway:** Longer context usually improves generation quality, but the quadratic cost of attention makes it expensive. This is why modern models use techniques such as sparse attention, which computes fewer attention scores, and FlashAttention, which computes exact attention with far less memory traffic.
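
The quadratic growth is easy to see by counting attention scores. This sketch assumes standard full self-attention, where each head computes one `T × T` score matrix per layer for a sequence of length `T = block_size`; the helper name and the default of 4 heads (the baseline value) are chosen for illustration.

```python
def attention_matrix_entries(block_size: int, n_head: int = 4) -> int:
    """Score entries computed per layer for one sequence: one T x T matrix per head."""
    return n_head * block_size * block_size


base = attention_matrix_entries(128)
for block_size in (32, 128, 256):
    entries = attention_matrix_entries(block_size)
    print(f"block_size={block_size:>3}: {entries:>7,} scores per layer "
          f"({entries / base:.2f}x baseline)")
```

Doubling the context from 128 to 256 quadruples the attention work per layer, whereas the rest of the block (the linear projections and feed-forward layers) only doubles.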

---

### 8.6 - Overall Conclusions

1. **Embedding dimension** (`n_embed`) has the strongest effect on model quality: it controls the "width" of representations at every layer.
2. **Number of layers** (`n_layer`) is the second most impactful: depth allows the model to compose increasingly abstract features.
3. **Number of heads** (`n_head`) matters less than embed/layers at this scale, but enables specialisation.
4. **Block size** improves quality but at quadratic cost, and the returns diminish beyond a point.
5. **Dropout** is primarily about preventing overfitting; the optimal value depends on model/data size.

> **The key insight:** the Transformer architecture is remarkably scalable. The *same* architecture works from our tiny 200K-parameter model all the way to models with hundreds of billions of parameters (GPT-3 has 175 billion): you just turn up these same knobs!

---
## Summary ✅

### What We Did
- ✅ Loaded the pre-trained baseline GPT model from Day 8
- ✅ Visualised attention weight heatmaps across layers and heads
- ✅ Trained 11 model variants, each changing one hyperparameter
- ✅ Compared validation losses with bar charts and training curves
- ✅ Plotted parameter count vs. validation loss (scaling analysis)
- ✅ Compared generated text quality side by side
- ✅ Analysed the role of each hyperparameter in depth

### Key Takeaways

| Hyperparameter | Increases Capacity? | Main Risk | Sweet Spot (small model) |
|:-:|:-:|:-:|:-:|
| `n_embed` ↑ | ✅ Width | Memory/speed | 64–128 |
| `n_layer` ↑ | ✅ Depth | Diminishing returns | 4–6 |
| `n_head` ↑ | ✅ Specialisation | Too-small head dim | 4 at embed=64 |
| `block_size` ↑ | ✅ Context | $O(T^2)$ cost | 128–256 |
| `dropout` ↑ | ❌ Regularisation | Under-fitting | 0.1–0.2 |

### Next Steps

Now that you understand *how* each hyperparameter affects the model, you could:
- Try **combinations** (e.g., larger embed + more layers)
- Add a **learning rate scheduler** (warmup + cosine decay)
- Experiment with **different optimisers** (SGD vs. Adam vs. AdamW)
- Try **different datasets** (code, poetry, song lyrics)
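
For the learning rate scheduler idea, one possible starting point is a hand-written warmup plus cosine decay schedule. This is a sketch, not the notebook's training code: the function name `lr_at` and every default value (`max_lr`, `min_lr`, `warmup_steps`, `max_steps`) are placeholders to be tuned for your own run.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup_steps: int = 100, max_steps: int = 5000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)  # 0 -> 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))            # 1 -> 0
    return min_lr + cosine * (max_lr - min_lr)


for step in (0, 50, 99, 100, 2500, 4999, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.6f}")
```

In a PyTorch training loop, the returned value could be written into the `lr` entry of each group in `optimizer.param_groups` before calling `optimizer.step()`.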