#### Thursday, January 15, 2026

Generated by [Gemini](https://gemini.google.com/app/d13797287071656d)

    I am interested in getting started with studying alignment in large language models. I want to be able to run everything locally against a 4090 gpu. Can you please provide a learning path.

# Mechanistic Interpretability: Hello World
**Target Hardware:** Single GPU (RTX 4090 recommended)

**Goal:** Visualize attention heads and find 'Induction Heads' (attention circuits strongly implicated in in-context learning).

In [1]:
# Install necessary libraries if you haven't already
# !pip install transformer_lens circuitsvis plotly torch accelerate

In [2]:
import torch
import transformer_lens
from transformer_lens import HookedTransformer
import circuitsvis as cv # The standard library for interp visualizations
import plotly.express as px

# Use the GPU (the 4090 here) when CUDA is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
transformer_lens.__spec__

ModuleSpec(name='transformer_lens', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7e92d7562300>, origin='/home/rob/PythonEnvironments/Safety-Research/.circuit-tracer/lib/python3.12/site-packages/transformer_lens/__init__.py', submodule_search_locations=['/home/rob/PythonEnvironments/Safety-Research/.circuit-tracer/lib/python3.12/site-packages/transformer_lens'])

In [4]:
print(f"Using device: {device}")
# print(f"TransformerLens version: {transformer_lens.__version__}")

# Disable gradient tracking: this notebook only runs inference, and skipping the autograd graph saves a lot of VRAM
torch.set_grad_enabled(False)

Using device: cuda


torch.autograd.grad_mode.set_grad_enabled(mode=False)

### 1. Load the Model
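`gpt2-small` is the standard 'lab rat' for learning and is the recommended starting point (Option 1 below).
The cell as run here has Option 2, Llama-3-8B, active. To start with the smaller model, comment out Option 2 and uncomment Option 1.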

In [5]:
# --- OPTION 1: The Lab Rat (GPT-2 Small) ---
# Great for learning concepts. 12 Layers, 12 Heads.
# model_name = "gpt2-small"
# model = HookedTransformer.from_pretrained(
#     model_name,
#     center_writing_weights=False,
#     center_unembed=False,
#     fold_ln=False,
#     device=device
# )

# --- OPTION 2: The Real Deal (Llama-3-8B) ---
# Active in this run. Use this when you want to analyze a modern model on your 4090.
# You will need to accept the license on Hugging Face and log in via `huggingface-cli login`
model_name = "meta-llama/Meta-Llama-3-8B"
model = HookedTransformer.from_pretrained(
    model_name,
    device=device,
    dtype=torch.bfloat16, # Halves weight memory versus float32; needed to fit in 24GB VRAM
)

print(f"Loaded {model_name} onto {device}")

# 26m 50.5s

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



Loaded pretrained model meta-llama/Meta-Llama-3-8B into HookedTransformer
Loaded meta-llama/Meta-Llama-3-8B onto cuda
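
The `bfloat16` choice in the cell above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 8.03 billion parameters for Llama-3-8B (an approximate figure, not read from the checkpoint) and counts weights only; activations and the cache come on top. `weight_memory_gib` is a hypothetical helper written for this illustration.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 8.03e9  # assumption: approximate parameter count for Llama-3-8B

bf16_gib = weight_memory_gib(N_PARAMS, 2)  # bfloat16 stores 2 bytes per parameter
fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32 stores 4 bytes per parameter

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"float32 weights:  {fp32_gib:.1f} GiB")
```

Under these assumptions the bfloat16 weights fit within 24 GB with some headroom, while the float32 weights alone would not.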


### 2. Run with Cache
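A standard PyTorch forward pass returns only the final output. `run_with_cache` returns the output and a cache of the intermediate activations (attention patterns, residual stream, MLP activations, and so on) recorded at the model's hook points.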

In [6]:
# A simple prompt designed to trigger "Induction Heads" (repeating patterns)
# The model should predict "Potter" after the second "Harry".
prompt = "Harry Potter and the Sorcerer's Stone. Harry"

# run_with_cache returns two things:
# 1. logits: The predictions (standard output)
# 2. cache: A dictionary containing every internal activation
logits, cache = model.run_with_cache(prompt)

# Let's verify the model actually predicted "Potter"
next_token_logits = logits[0, -1] # Get logits for the last token
prediction_id = next_token_logits.argmax()
prediction = model.to_string(prediction_id)

print(f"Prompt: {prompt}")
print(f"Model Prediction: '{prediction}'")

Prompt: Harry Potter and the Sorcerer's Stone. Harry
Model Prediction: ' Potter'
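
The `argmax` above reports which token wins but not by how much. Converting logits to probabilities with a softmax shows the model's confidence. The sketch below uses made-up logits over a four-token toy vocabulary (both `toy_vocab` and `toy_logits` are hypothetical, not values from the model) and plain Python so it runs without a GPU.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for a toy vocabulary.
toy_vocab = [" Potter", " Styles", " and", " the"]
toy_logits = [9.0, 6.5, 3.0, 2.0]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(f"Top token: {toy_vocab[best]!r} with probability {probs[best]:.3f}")
```

On the real tensor, the equivalent is `next_token_logits.float().softmax(dim=-1)`.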


### 3. Visualize Attention Patterns
Hover over the grid below to see which tokens each head attended to.

In [7]:
# Get the tokens as a list of strings for the labels
tokens = model.to_str_tokens(prompt)

# Extract attention patterns from a specific layer (e.g., Layer 5)
# Cached shape: [batch, heads, queries, keys]; the [0] below selects the only
# batch item, leaving [heads, queries, keys]
layer_to_visualize = 5
attention_pattern = cache["pattern", layer_to_visualize][0]

print(f"Visualizing Attention for Layer {layer_to_visualize}")

# Create the interactive visualization
cv.attention.attention_patterns(
    tokens=tokens,
    attention=attention_pattern
)

Visualizing Attention for Layer 5
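
The tuple key `cache["pattern", layer_to_visualize]` is shorthand. TransformerLens expands it into a full hook name such as `blocks.5.attn.hook_pattern` (the library's own helper for this is `transformer_lens.utils.get_act_name`). The sketch below imitates that expansion for a handful of attention activations only; `attn_hook_name` is a hypothetical stand-in written for illustration, not the library's implementation.

```python
# Short names covered by this simplified sketch (attention activations only).
ATTN_ACTIVATIONS = {"q", "k", "v", "z", "pattern", "attn_scores"}

def attn_hook_name(short_name: str, layer: int) -> str:
    """Build the full hook name for an attention activation in a given layer."""
    if short_name not in ATTN_ACTIVATIONS:
        raise ValueError(f"Not covered by this sketch: {short_name!r}")
    return f"blocks.{layer}.attn.hook_{short_name}"

print(attn_hook_name("pattern", 5))
```

Knowing the full name is useful when browsing `cache.keys()` or attaching hooks to a specific activation.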


### 4. Finding the "Induction Head"
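We search for heads that attend strongly to the token *after* the previous occurrence of the current token. In this prompt, that means attention from the final "Harry" (the query) to the first "Potter" (the key).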
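
The rule above can be turned into a small helper that computes, for every position, which earlier position an induction head would be expected to attend to. This is a minimal sketch: `induction_targets` and `toy_tokens` are hypothetical, and the token list is an assumed tokenization with a BOS token at index 0. Tokens are compared after stripping whitespace, because the first "Harry" likely has no leading space while the second one does.

```python
def induction_targets(tokens):
    """For each position, return the index of the token that followed the most
    recent earlier occurrence of the same token, or None if there is none."""
    last_seen = {}  # normalized token -> most recent position
    targets = []
    for position, token in enumerate(tokens):
        key = token.strip()  # treat "Harry" and " Harry" as the same token
        previous = last_seen.get(key)
        targets.append(previous + 1 if previous is not None else None)
        last_seen[key] = position
    return targets

# Assumed tokenization of a shortened prompt, for illustration only.
toy_tokens = ["<bos>", "Harry", " Potter", " and", " the", " Stone", ".", " Harry"]

targets = induction_targets(toy_tokens)
print(f"Final token should attend to index {targets[-1]}: {toy_tokens[targets[-1]]!r}")
```

With real tokens from `model.to_str_tokens(prompt)`, the last entry of the result gives the key index to inspect, which avoids hard-coding it.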

In [8]:
# Get all attention patterns from all layers
# Shape: [layer, batch, head, query, key]
attention_cache = torch.stack([cache["pattern", l] for l in range(model.cfg.n_layers)])

# Target (query): the last "Harry" (index -1)
# Source (key): the first "Potter" (index 2, assuming a BOS token at index 0 and "Harry" at index 1)
target_token_index = -1 
source_token_index = 2  

# Extract the score for this specific connection across all layers and heads
induction_scores = attention_cache[:, 0, :, target_token_index, source_token_index]

# Plot the scores as a heatmap (Layers x Heads)
fig = px.imshow(
    # NumPy has no bfloat16 dtype, so cast to float32 in case the cache holds bfloat16
    induction_scores.float().cpu().numpy(),
    labels={"x": "Head", "y": "Layer", "color": "Attention Weight"},
    title="Attention from final 'Harry' to first 'Potter'",
    color_continuous_scale="Viridis",
)

fig.show()

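**Analysis:** Look for bright yellow cells in the heatmap above. Each one is a head that attends strongly from the final "Harry" to the first "Potter", which makes it a candidate induction head. With `gpt2-small`, expect candidates around Layer 5 (Head 5 is a commonly cited example). With Llama-3-8B, which has 32 layers and 32 heads per layer, the locations will differ. A single prompt is weak evidence, so confirm any candidate on other prompts with repeated sequences.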
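
Reading the heatmap by eye can be supplemented by ranking the heads numerically. The sketch below works on a nested list of scores indexed as `[layer][head]`; `top_heads` is a hypothetical helper and `toy_scores` is made-up data standing in for the real grid.

```python
def top_heads(score_grid, k=3):
    """Return the k highest-scoring (layer, head, score) triples."""
    flat = [
        (layer, head, score)
        for layer, row in enumerate(score_grid)
        for head, score in enumerate(row)
    ]
    return sorted(flat, key=lambda item: item[2], reverse=True)[:k]

# Made-up scores for a 3-layer, 3-head toy model.
toy_scores = [
    [0.01, 0.02, 0.00],
    [0.05, 0.71, 0.03],
    [0.64, 0.02, 0.10],
]

for layer, head, score in top_heads(toy_scores, k=2):
    print(f"Layer {layer}, Head {head}: {score:.2f}")
```

On the real tensor, `induction_scores.float().cpu().tolist()` produces a nested list in the same `[layer][head]` layout.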