# Chapter 2: Transformers - Hands-On Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ttsugriy/mechinterp-first-principles/blob/main/notebooks/02-transformers.ipynb)

This notebook accompanies [Chapter 2: Transformers as Matrix Multiplication Machines](https://ttsugriy.github.io/mechinterp-first-principles/chapters/02-transformers.html).

**What you'll do:**
1. Load GPT-2 and run a forward pass
2. Inspect attention patterns
3. Explore how the residual stream evolves across layers
4. Compare attention and MLP contributions

**Time:** ~20 minutes

## Setup

In [None]:
!pip install transformer-lens circuitsvis -q

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer
import circuitsvis as cv

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 1. Load the Model

In [None]:
model = HookedTransformer.from_pretrained("gpt2-small", device=device)

print(f"Model: GPT-2 Small")
print(f"  Layers: {model.cfg.n_layers}")
print(f"  Attention heads per layer: {model.cfg.n_heads}")
print(f"  Model dimension (d_model): {model.cfg.d_model}")
print(f"  Vocabulary size: {model.cfg.d_vocab}")
print(f"  Total parameters: {sum(p.numel() for p in model.parameters()):,}")

## 2. Run a Forward Pass

In [None]:
# Our test prompt
text = "The capital of France is"

# Tokenize (TransformerLens prepends a beginning-of-sequence token by default)
tokens = model.to_tokens(text)
print(f"Input: '{text}'")
print(f"Tokens: {tokens.shape} = {[model.to_string(t) for t in tokens[0]]}")

In [None]:
# Run the model and cache all intermediate activations
logits, cache = model.run_with_cache(tokens)

print(f"Output logits shape: {logits.shape}")
print(f"  (batch_size, sequence_length, vocabulary_size)")
print(f"\nCached {len(cache)} activation tensors")

In [None]:
# What does the model predict?
final_logits = logits[0, -1]  # Logits at the last token position
probs = torch.softmax(final_logits, dim=-1)  # Compute once instead of once per loop iteration
top_tokens = torch.topk(final_logits, k=10)

print("Top 10 predictions:")
for rank, (idx, logit) in enumerate(zip(top_tokens.indices, top_tokens.values), start=1):
    print(f"  {rank}. '{model.to_string(idx)}' (logit={logit:.2f}, prob={probs[idx]:.1%})")

## 3. Visualize Attention Patterns

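Attention patterns show, for each query token, how much weight it places on each key token. GPT-2 uses causal attention, so a token can attend only to itself and to earlier tokens, and each row of the pattern sums to 1.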
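To make the structure of a pattern concrete, here is a minimal, dependency-free sketch of how raw query-key scores become attention weights. The function name `causal_attention_pattern` and the toy scores are made up for illustration. The scores are assumed to be already scaled, and nothing here touches GPT-2's actual weights.

```python
import math

def causal_attention_pattern(scores):
    """Turn a square matrix of raw query-key scores into attention weights.

    Row i is the query position and column j is the key position.
    Keys with j > i are masked out, then each row is softmax-normalized.
    (Hypothetical helper for illustration, not a TransformerLens function.)
    """
    pattern = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # causal mask: keys 0..i only
        m = max(visible)                           # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Masked (future) positions get exactly zero weight.
        pattern.append(weights + [0.0] * (len(row) - i - 1))
    return pattern

# Toy scores for a 3-token sequence (invented numbers, not from GPT-2).
toy_scores = [
    [0.5, 9.0, 9.0],
    [1.0, 1.0, 9.0],
    [0.0, 2.0, 1.0],
]
toy_pattern = causal_attention_pattern(toy_scores)
for row in toy_pattern:
    print([round(w, 3) for w in row])
```

Note that the large scores above the diagonal have no effect: the first token can only attend to itself, which is why the first row of any causal pattern is all weight on position 0.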

In [None]:
# Get attention patterns from layer 5
attention_pattern = cache["pattern", 5]  # Shape: (batch, heads, query_pos, key_pos)
print(f"Attention pattern shape: {attention_pattern.shape}")
print(f"  (batch, n_heads, seq_len, seq_len)")

In [None]:
# Visualize attention for all heads in layer 5
token_strs = [model.to_string(t) for t in tokens[0]]

cv.attention.attention_patterns(
    tokens=token_strs,
    attention=attention_pattern[0]  # Remove batch dimension
)

In [None]:
# Plot a single head's attention pattern
head_idx = 0
layer_idx = 5

attn = cache["pattern", layer_idx][0, head_idx].cpu().numpy()

plt.figure(figsize=(8, 6))
plt.imshow(attn, cmap='Blues')
plt.xticks(range(len(token_strs)), token_strs, rotation=45, ha='right')
plt.yticks(range(len(token_strs)), token_strs)
plt.xlabel('Key (attending to)')
plt.ylabel('Query (attending from)')
plt.title(f'Attention Pattern: Layer {layer_idx}, Head {head_idx}')
plt.colorbar(label='Attention weight')
plt.tight_layout()
plt.show()

## 4. Explore the Residual Stream

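The residual stream is the running sum that every component reads from and writes to. It starts as the embeddings, and each layer adds its attention output and then its MLP output.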
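The accumulation can be sketched with plain lists. This is a simplified illustration under a loud assumption: the attention and MLP outputs are fixed toy vectors, whereas in the real model each one is computed from the current (layer-normalized) stream. The names `run_toy_blocks`, `toy_embedding`, and `toy_blocks` are hypothetical.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def run_toy_blocks(embedding, blocks):
    """Accumulate (attn_out, mlp_out) pairs into a residual stream.

    Returns the stream at every block boundary: index 0 is the stream
    before block 0 (like "resid_pre" at layer 0), and index k is the
    stream after block k - 1 (like "resid_post" at layer k - 1).
    """
    stream = list(embedding)
    states = [stream]
    for attn_out, mlp_out in blocks:
        resid_mid = add_vectors(stream, attn_out)   # attention writes first
        stream = add_vectors(resid_mid, mlp_out)    # then the MLP writes
        states.append(stream)
    return states

# Invented numbers: a 3-dimensional "model" with two blocks.
toy_embedding = [1.0, 0.0, -1.0]
toy_blocks = [
    ([0.5, 0.5, 0.0], [0.0, 1.0, 0.0]),
    ([-1.0, 0.0, 0.25], [0.0, 0.0, 2.0]),
]
toy_states = run_toy_blocks(toy_embedding, toy_blocks)
for k, state in enumerate(toy_states):
    print(f"boundary {k}: {state}")
```

Because every update is an addition, the final stream equals the embedding plus the sum of all component outputs, which is what makes it possible to attribute parts of the final state to individual components.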

In [None]:
# Get residual stream at different layers
resid_0 = cache["resid_pre", 0]    # After embedding, before layer 0
resid_6 = cache["resid_pre", 6]    # Input to layer 6, i.e. after layers 0-5
resid_11 = cache["resid_post", 11] # After the final layer (layer 11)

print("Residual stream shapes (all the same):")
print(f"  After embedding: {resid_0.shape}")
print(f"  Before layer 6:  {resid_6.shape}")
print(f"  After layer 11:  {resid_11.shape}")

In [None]:
# How much does the residual stream change through the network?
# Measure cosine similarity between early and late representations

def cosine_sim(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# For the last token position
sim_0_6 = cosine_sim(resid_0[0, -1], resid_6[0, -1]).item()
sim_6_11 = cosine_sim(resid_6[0, -1], resid_11[0, -1]).item()
sim_0_11 = cosine_sim(resid_0[0, -1], resid_11[0, -1]).item()

print("Cosine similarity of residual stream (last position):")
print(f"  Embedding ↔ before layer 6:      {sim_0_6:.3f}")
print(f"  Before layer 6 ↔ after layer 11: {sim_6_11:.3f}")
print(f"  Embedding ↔ after layer 11:      {sim_0_11:.3f}")
print("\nValues well below 1 mean the stream's direction has changed substantially.")

In [None]:
# Visualize residual stream norm through layers
norms = []
for layer in range(model.cfg.n_layers):
    resid = cache["resid_pre", layer][0, -1]  # Last token
    norms.append(resid.norm().item())

# Add the output of the final layer (resid_post), so the last point is the final stream
norms.append(cache["resid_post", model.cfg.n_layers - 1][0, -1].norm().item())

plt.figure(figsize=(10, 4))
plt.plot(norms, 'o-')
plt.xlabel('Layer')
plt.ylabel('Residual Stream Norm')
plt.title('Residual Stream Norm by Layer (Last Token)')
plt.grid(True, alpha=0.3)
plt.show()

## 5. Component Contributions

Each attention head and MLP adds its output to the residual stream. Let's use the norm of each layer's output as a rough measure of how much it writes.
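One caveat about layer-level norms is worth seeing in miniature. An attention layer's output is the sum of its heads' outputs (plus an output bias in GPT-2, omitted here), so heads that write in opposite directions partly cancel. The sketch below uses invented two-dimensional vectors, and the names `toy_head_outputs` and `toy_attn_out` are hypothetical.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def sum_vectors(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

# Invented per-head outputs for one position: heads 0 and 1 partly cancel.
toy_head_outputs = [
    [3.0, 0.0],
    [-2.0, 0.0],
    [0.0, 4.0],
]

toy_attn_out = sum_vectors(toy_head_outputs)       # what the layer writes in total
toy_attn_norm = norm(toy_attn_out)
head_norms = [norm(h) for h in toy_head_outputs]   # what each head writes on its own

print(f"Layer output: {toy_attn_out}, norm = {toy_attn_norm:.3f}")
print(f"Per-head norms: {head_norms}, sum = {sum(head_norms):.3f}")
```

The layer norm can never exceed the sum of the per-head norms, and it can be much smaller, so a small layer-level norm does not by itself show that every head in that layer is inactive.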

In [None]:
# Get the output of each attention layer and MLP
attn_out = cache["attn_out", 5][0, -1]  # Layer 5 attention output, last position
mlp_out = cache["mlp_out", 5][0, -1]    # Layer 5 MLP output, last position

print(f"Attention output norm: {attn_out.norm():.2f}")
print(f"MLP output norm: {mlp_out.norm():.2f}")
print("\nMLP outputs are often larger than attention outputs, but this varies by layer and prompt.")

In [None]:
# Compare all layers
attn_norms = [cache["attn_out", l][0, -1].norm().item() for l in range(model.cfg.n_layers)]
mlp_norms = [cache["mlp_out", l][0, -1].norm().item() for l in range(model.cfg.n_layers)]

x = np.arange(model.cfg.n_layers)
width = 0.35

plt.figure(figsize=(12, 4))
plt.bar(x - width/2, attn_norms, width, label='Attention', alpha=0.8)
plt.bar(x + width/2, mlp_norms, width, label='MLP', alpha=0.8)
plt.xlabel('Layer')
plt.ylabel('Output Norm')
plt.title('Attention vs MLP Contribution by Layer')
plt.legend()
plt.xticks(x)
plt.grid(True, alpha=0.3, axis='y')
plt.show()

## Exercises

### Exercise 1: Different prompts
Try different prompts and observe how attention patterns change.

### Exercise 2: Find the "previous token" head
Some heads learn to attend to the previous token. Can you find one?
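As a starting hint, one possible way to score a head is to average the weight each query places on the position immediately before it. This is a sketch, not the chapter's prescribed solution; `previous_token_score` and both toy patterns are hypothetical and use invented numbers.

```python
def previous_token_score(pattern):
    """Average attention that each query (from position 1 on) pays to position i - 1.

    `pattern` is a square list of lists: rows are queries, columns are keys.
    A score near 1 suggests a previous-token head; a score near 0 suggests not.
    """
    n = len(pattern)
    if n < 2:
        raise ValueError("need at least two positions to have a previous token")
    return sum(pattern[i][i - 1] for i in range(1, n)) / (n - 1)

# Invented 3-token patterns for two imaginary heads.
toy_prev_head = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
]
toy_first_token_head = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.9, 0.05, 0.05],
]

print(f"previous-token-like head: {previous_token_score(toy_prev_head):.3f}")
print(f"first-token-like head:    {previous_token_score(toy_first_token_head):.3f}")
```

With the notebook's cache, the same function should accept `cache["pattern", layer][0, head].tolist()`. Position 1 cannot distinguish "previous token" from "first token", since both are position 0, so longer prompts give a more reliable score.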

### Exercise 3: Track a specific token
How does the representation of "France" change through the layers?

In [None]:
# Exercise 1: Your code here
# Try: "Once upon a time, there was a"
# What does the model predict? What does attention look like?

## Summary

You've now:
1. Loaded GPT-2 and run a forward pass
2. Visualized attention patterns
3. Explored how the residual stream evolves through layers
4. Compared attention vs MLP contributions

**Next:** [Chapter 3: The Residual Stream](https://ttsugriy.github.io/mechinterp-first-principles/chapters/03-residual-stream.html)