# Prelims

<b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator.</b>

This is because the tokenizer will use `.cuda` to process input batches in parallel.

# Setup
(No need to read)

In [1]:
%%capture
# Janky code to do different setup when run in a Colab notebook vs VSCode
DEBUG_MODE = False
try:
    import google.colab
    IN_COLAB = True
    print("Running as a Colab notebook")
    %pip install git+https://github.com/neelnanda-io/TransformerLens.git
    # Install another version of node that makes PySvelte work way faster
    !curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -; sudo apt-get install -y nodejs
    %pip install git+https://github.com/neelnanda-io/PySvelte.git
except ImportError:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    # Code to automatically update the HookedTransformer code as it's edited without restarting the kernel
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

In [2]:
# Plotly needs a different renderer for VSCode/Notebooks vs Colab argh
import plotly.io as pio

if IN_COLAB or not DEBUG_MODE:
    # Thanks to annoying rendering issues, Plotly graphics will either show up in colab OR Vscode depending on the renderer - this is bad for developing demos! Thus creating a debug mode.
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "png"

In [3]:
# Import stuff
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import einops
from fancy_einsum import einsum
import tqdm.notebook as tqdm
import random
from pathlib import Path
import plotly.express as px
from torch.utils.data import DataLoader

from jaxtyping import Float, Int
from typing import List, Union, Optional
from functools import partial
import copy

import itertools
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import dataclasses
import datasets
from IPython.display import HTML

In [4]:
import pysvelte

import transformer_lens
import transformer_lens.utils as utils
from transformer_lens.hook_points import (
    HookedRootModule,
    HookPoint,
)  # Hooking utilities
from transformer_lens import HookedTransformer, HookedTransformerConfig, FactoredMatrix, ActivationCache

We turn automatic differentiation off, to save GPU memory, as this notebook focuses on model inference not model training.

In [5]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fc7b0b03d30>

Plotting helper functions:

In [6]:
def imshow(tensor, renderer=None, **kwargs):
    px.imshow(utils.to_numpy(tensor), color_continuous_midpoint=0.0, color_continuous_scale="RdBu", **kwargs).show(renderer)

def line(tensor, renderer=None, **kwargs):
    px.line(y=utils.to_numpy(tensor), **kwargs).show(renderer)

def scatter(x, y, xaxis="", yaxis="", caxis="", renderer=None, **kwargs):
    x = utils.to_numpy(x)
    y = utils.to_numpy(y)
    px.scatter(y=y, x=x, labels={"x":xaxis, "y":yaxis, "color":caxis}, **kwargs).show(renderer)

# Analyze GPT-2-Small
GPT-2 Small has roughly 85M non-embedding parameters (124M in total).

## Loading and Running Models

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [8]:
model = HookedTransformer.from_pretrained("gpt2-small", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-small into HookedTransformer


To try the model out, let's find the loss on this text! Models can be run on a single string or a tensor of tokens (shape: [batch, position], all integers), and the possible return types are: 
* "logits" (shape [batch, position, d_vocab], floats), 
* "loss" (the cross-entropy loss when predicting the next token), 
* "both" (a tuple of (logits, loss)) 
* None (run the model, but don't calculate the logits - this is faster when we only want to use intermediate activations)
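As a rough illustration of what the "loss" return type measures, the sketch below computes a mean next-token cross-entropy from a toy logits table in plain Python. The function names `log_softmax` and `next_token_loss` and the toy numbers are assumptions made for illustration; the library works on batched tensors and its implementation may differ in detail.

```python
import math

def log_softmax(logits):
    """Return log-probabilities for one position's logits (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits, tokens):
    """Mean negative log-likelihood of tokens[i + 1] given the logits at position i.

    logits: list of per-position logit lists, shape [position][d_vocab]
    tokens: list of token ids, shape [position]
    """
    losses = []
    for pos in range(len(tokens) - 1):
        log_probs = log_softmax(logits[pos])
        losses.append(-log_probs[tokens[pos + 1]])
    return sum(losses) / len(losses)

# Toy example: a vocabulary of 3 tokens and a sequence of 3 token ids.
toy_tokens = [0, 1, 2]
toy_logits = [
    [0.0, 2.0, 0.0],  # position 0 favours token 1 (the true next token)
    [0.0, 0.0, 2.0],  # position 1 favours token 2 (the true next token)
    [0.0, 0.0, 0.0],  # the final position has no next token, so it is unused
]
print("Toy loss:", next_token_loss(toy_logits, toy_tokens))
```

Note that the logits at the last position never enter the loss, because there is no following token to predict.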

In [None]:
model_description_text = "1 2 3 4"
loss = model(model_description_text, return_type="loss")
print("Model loss:", loss)

Model loss: tensor(2.1871)


## Test prompts

In [None]:
example_prompt = "1 2 3 4"
example_answer = " 5"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '1', ' 2', ' 3', ' 4']
Tokenized answer: [' 5']


Top 0th token. Logit: 18.76 Prob: 96.17% Token: | 5|
Top 1th token. Logit: 13.27 Prob:  0.40% Token: | Next|
Top 2th token. Logit: 13.01 Prob:  0.30% Token: |
|
Top 3th token. Logit: 12.87 Prob:  0.27% Token: | >|
Top 4th token. Logit: 12.04 Prob:  0.12% Token: | 4|
Top 5th token. Logit: 11.88 Prob:  0.10% Token: | 50|
Top 6th token. Logit: 11.83 Prob:  0.09% Token: | 6|
Top 7th token. Logit: 11.71 Prob:  0.08% Token: | <|
Top 8th token. Logit: 11.64 Prob:  0.08% Token: | $|
Top 9th token. Logit: 11.63 Prob:  0.08% Token: | 1|


This is too easy to predict. Try a more ambiguous number sequence: "2 4" could plausibly be followed by 6 or 8.

In [None]:
example_prompt = "2 4"
example_answer = " 6"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '2', ' 4']
Tokenized answer: [' 6']


Top 0th token. Logit: 12.31 Prob: 10.63% Token: |/|
Top 1th token. Logit: 11.78 Prob:  6.28% Token: | 4|
Top 2th token. Logit: 11.71 Prob:  5.85% Token: | 5|
Top 3th token. Logit: 11.62 Prob:  5.37% Token: | 3|
Top 4th token. Logit: 11.49 Prob:  4.70% Token: |.|
Top 5th token. Logit: 11.42 Prob:  4.37% Token: | 1|
Top 6th token. Logit: 11.13 Prob:  3.27% Token: |
|
Top 7th token. Logit: 11.00 Prob:  2.89% Token: | 2|
Top 8th token. Logit: 10.75 Prob:  2.24% Token: | 6|
Top 9th token. Logit: 10.75 Prob:  2.23% Token: | 0|


In [None]:
example_prompt = "Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a human. Pebbles is a cat. The pet of this family is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' a', ' human', '.', ' F', 'ido', ' is', ' a', ' dog', '.', ' The', ' pet', ' of', ' this', ' family', ' is', ' F', 'ido', '.', ' Rachel', ' is', ' a', ' human', '.', ' Peb', 'bles', ' is', ' a', ' cat', '.', ' The', ' pet', ' of', ' this', ' family', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 16.75 Prob: 70.16% Token: | Peb|
Top 1th token. Logit: 13.46 Prob:  2.60% Token: | P|
Top 2th token. Logit: 13.24 Prob:  2.09% Token: | F|
Top 3th token. Logit: 12.66 Prob:  1.17% Token: | Pear|
Top 4th token. Logit: 12.54 Prob:  1.04% Token: | Bub|
Top 5th token. Logit: 12.23 Prob:  0.76% Token: | Pe|
Top 6th token. Logit: 12.16 Prob:  0.71% Token: | Pink|
Top 7th token. Logit: 11.59 Prob:  0.40% Token: | B|
Top 8th token. Logit: 11.40 Prob:  0.33% Token: | Peg|
Top 9th token. Logit: 11.34 Prob:  0.31% Token: | the|


In [None]:
example_prompt = "The human is Mary. The dog is Fido. The pet is Fido. The human is Rachel. The dog is Pebbles. The pet is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'The', ' human', ' is', ' Mary', '.', ' The', ' dog', ' is', ' F', 'ido', '.', ' The', ' pet', ' is', ' F', 'ido', '.', ' The', ' human', ' is', ' Rachel', '.', ' The', ' dog', ' is', ' Peb', 'bles', '.', ' The', ' pet', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 13.45 Prob: 13.43% Token: | Peb|
Top 1th token. Logit: 12.98 Prob:  8.42% Token: | F|
Top 2th token. Logit: 12.44 Prob:  4.91% Token: | Rachel|
Top 3th token. Logit: 11.45 Prob:  1.82% Token: | P|
Top 4th token. Logit: 10.89 Prob:  1.04% Token: | Pear|
Top 5th token. Logit: 10.48 Prob:  0.69% Token: | the|
Top 6th token. Logit: 10.32 Prob:  0.59% Token: | Pink|
Top 7th token. Logit: 10.32 Prob:  0.59% Token: | R|
Top 8th token. Logit: 10.23 Prob:  0.54% Token: | Betty|
Top 9th token. Logit: 10.18 Prob:  0.51% Token: | Bub|


# GPT2-small number sequence analysis

## Compare Logits

For all prompts (even if there is just one), get the model's outputs and the correct outputs, then compute the logit differences.


In [None]:
prompts = [
    "1 2 3 4",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" 5", " Next"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# Keep the answer tokens on the same device as the model so that gather() works
answer_tokens = torch.tensor(answer_tokens, device=device)

tokens = model.to_tokens(prompts, prepend_bos=True)
# tokens = tokens.cuda() # Move the tokens to the GPU
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([5.4894])
Average logit difference: 5.489377021789551
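Since the probabilities come from a softmax over the logits, a logit difference is the log of the ratio of the two tokens' probabilities. Here is a quick check in plain Python, using the rounded values printed by `utils.test_prompt` earlier for " 5" and " Next" (the variable names are ours):

```python
import math

# Rounded values copied from the test_prompt output for "1 2 3 4".
logit_correct, logit_incorrect = 18.76, 13.27
prob_correct, prob_incorrect = 0.9617, 0.0040

logit_diff = logit_correct - logit_incorrect
ratio_from_logits = math.exp(logit_diff)          # p(correct) / p(incorrect) implied by the logits
ratio_from_probs = prob_correct / prob_incorrect  # the same ratio from the printed probabilities

print(f"logit diff: {logit_diff:.2f}")
print(f"ratio from logits: {ratio_from_logits:.1f}")
print(f"ratio from probs:  {ratio_from_probs:.1f}")
```

The two ratios agree only up to the rounding of the printed values.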


## Direct Layer Attribution

### Logit Lens

In [None]:
answer_residual_directions = model.tokens_to_residual_directions(answer_tokens)
print("Answer residual directions shape:", answer_residual_directions.shape)
logit_diff_directions = answer_residual_directions[:, 0] - answer_residual_directions[:, 1]
print("Logit difference directions shape:", logit_diff_directions.shape)

def residual_stack_to_logit_diff(residual_stack: Float[torch.Tensor, "components batch d_model"], cache: ActivationCache) -> float:
    scaled_residual_stack = cache.apply_ln_to_stack(residual_stack, layer = -1, pos_slice=-1)
    return einsum("... batch d_model, batch d_model -> ...", scaled_residual_stack, logit_diff_directions)/len(prompts)
accumulated_residual, labels = cache.accumulated_resid(layer=-1, incl_mid=True, pos_slice=-1, return_labels=True)
logit_lens_logit_diffs = residual_stack_to_logit_diff(accumulated_residual, cache)
line(logit_lens_logit_diffs, x=np.arange(model.cfg.n_layers*2+1)/2, hover_name=labels, title="Logit Difference From Accumulated Residual Stream")

Answer residual directions shape: torch.Size([1, 2, 768])
Logit difference directions shape: torch.Size([1, 768])
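The computation above relies on the logit difference being linear in the residual stream once the final LayerNorm scale is held fixed. Here is a sketch of that reasoning, writing $x$ for the final residual stream at the last position, $W_U$ and $b_U$ for the unembedding matrix and bias, $a$ and $b$ for the correct and incorrect tokens, and $c_i$ for the individual components that sum to $x$. The notation is ours, and the sketch assumes the LayerNorm weights have been folded into $W_U$, so that $\mathrm{LN}$ only centers and rescales:

```latex
\begin{aligned}
\operatorname{logit}(t) &= \mathrm{LN}(x) \cdot W_U[:, t] + b_U[t] \\
\operatorname{logit}(a) - \operatorname{logit}(b)
  &= \mathrm{LN}(x) \cdot \bigl(W_U[:, a] - W_U[:, b]\bigr) + \bigl(b_U[a] - b_U[b]\bigr) \\
\mathrm{LN}(x) \cdot d
  &= \sum_i \frac{c_i - \operatorname{mean}(c_i)}{\sigma(x)} \cdot d,
  \qquad x = \sum_i c_i, \quad d = W_U[:, a] - W_U[:, b]
\end{aligned}
```

Under these assumptions each component contributes its own term to the logit difference. The bias term $b_U[a] - b_U[b]$ is a constant offset that does not depend on any component.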


In [None]:
per_head_residual, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
per_head_logit_diffs = residual_stack_to_logit_diff(per_head_residual, cache)
per_head_logit_diffs = einops.rearrange(per_head_logit_diffs, "(layer head_index) -> layer head_index", layer=model.cfg.n_layers, head_index=model.cfg.n_heads)
imshow(per_head_logit_diffs, labels={"x":"Head", "y":"Layer"}, title="Logit Difference From Each Head")

### Attention Patterns

In [None]:
def visualize_attention_patterns(
    heads: Union[List[int], int, Float[torch.Tensor, "heads"]], 
    local_cache: Optional[ActivationCache]=None, 
    local_tokens: Optional[torch.Tensor]=None, 
    title: str=""):
    # Heads are given as a list of integers or a single integer in [0, n_layers * n_heads)
    if isinstance(heads, int):
        heads = [heads]
    elif isinstance(heads, list) or isinstance(heads, torch.Tensor):
        heads = utils.to_numpy(heads)
    # Cache defaults to the original activation cache
    if local_cache is None:
        local_cache = cache
    # Tokens defaults to the tokenization of the first prompt (including the BOS token)
    if local_tokens is None:
        # The tokens of the first prompt
        local_tokens = tokens[0]
    
    labels = []
    patterns = []
    batch_index = 0
    for head in heads:
        layer = head // model.cfg.n_heads
        head_index = head % model.cfg.n_heads
        # Get the attention patterns for the head
        # Attention patterns have shape [batch, head_index, query_pos, key_pos]
        patterns.append(local_cache["attn", layer][batch_index, head_index])
        labels.append(f"L{layer}H{head_index}")
    str_tokens = model.to_str_tokens(local_tokens)
    patterns = torch.stack(patterns, dim=-1)
    # Plot the attention patterns
    attention_vis = pysvelte.AttentionMulti(attention=patterns, tokens=str_tokens, head_labels=labels)
    display(HTML(f"<h3>{title}</h3>"))
    attention_vis.show()
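The function above identifies a head by a single flat index and recovers the layer and the head within the layer by integer division and remainder. A small standalone sketch of that mapping follows. The helper names are hypothetical, and 12 heads per layer is assumed, which matches GPT-2 Small; in the notebook, read the value from `model.cfg.n_heads`.

```python
def flat_to_layer_head(flat_index, n_heads):
    """Split a flat head index into (layer, head_index)."""
    return divmod(flat_index, n_heads)

def layer_head_to_flat(layer, head_index, n_heads):
    """Inverse mapping: combine (layer, head_index) into a flat index."""
    return layer * n_heads + head_index

N_HEADS = 12  # assumed number of heads per layer (GPT-2 Small)
for flat in (0, 11, 12, 117):
    layer, head = flat_to_layer_head(flat, N_HEADS)
    print(f"{flat} -> L{layer}H{head}")
```

This is the same ordering that `torch.topk` indices refer to when applied to the flattened `[layer, head_index]` tensor.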

In [None]:
top_k = 3
top_positive_logit_attr_heads = torch.topk(per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_positive_logit_attr_heads, title=f"Top {top_k} Positive Logit Attribution Heads")
top_negative_logit_attr_heads = torch.topk(-per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_negative_logit_attr_heads, title=f"Top {top_k} Negative Logit Attribution Heads")



All heads for a layer:

Attention pattern has shape [head_index, destination_position, source_position], and we use the `model.to_str_tokens` method to convert the text to a list of tokens as strings, since there is an attention weight between each pair of tokens.
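To make the `[destination_position, source_position]` layout concrete, here is a toy sketch of one head's pattern in plain Python. It assumes GPT-2 style causal attention, where each destination position only attends to itself and earlier positions. The function name and the scores are made up for illustration.

```python
import math

def causal_attention_pattern(scores):
    """Softmax each destination row over source positions <= destination.

    scores: square list of lists, indexed as scores[dest][src]
    returns: pattern[dest][src]; each row sums to 1 and is zero above the diagonal
    """
    n = len(scores)
    pattern = []
    for dest in range(n):
        visible = scores[dest][: dest + 1]  # causal mask: drop later source positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (n - dest - 1))
    return pattern

toy_tokens = ["<|endoftext|>", "1", " 2"]
toy_scores = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 2.0]]
for tok, row in zip(toy_tokens, causal_attention_pattern(toy_scores)):
    print(f"{tok!r:>16}", [round(w, 3) for w in row])
```

Each row is one destination token and each column one source token, which is why the visualization needs one string label per token.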


In [None]:
import circuitsvis as cv  # provides cv.attention; requires the circuitsvis package to be installed

gpt2_text = "1 2 3 4"
gpt2_tokens = model.to_tokens(gpt2_text)
gpt2_logits, gpt2_cache = model.run_with_cache(gpt2_tokens, remove_batch_dim=True)

attention_pattern = gpt2_cache["pattern", 9, "attn"]
gpt2_str_tokens = model.to_str_tokens(gpt2_text)

print("Layer 9 Head Attention Patterns:")
cv.attention.attention_patterns(tokens=gpt2_str_tokens, attention=attention_pattern)

Layer 9 Head Attention Patterns:


# GPT2-small analogous inputs analysis

In [9]:
prompts = [
    "Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a human. Pebbles is a cat. The pet of this family is",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" Peb", " Rachel"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# Keep the answer tokens on the same device as the model so that gather() works
answer_tokens = torch.tensor(answer_tokens, device=device)

tokens = model.to_tokens(prompts, prepend_bos=True)
# tokens = tokens.cuda() # Move the tokens to the GPU
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([5.5811])
Average logit difference: 5.58111572265625


In [10]:
answer_residual_directions = model.tokens_to_residual_directions(answer_tokens)
print("Answer residual directions shape:", answer_residual_directions.shape)
logit_diff_directions = answer_residual_directions[:, 0] - answer_residual_directions[:, 1]
print("Logit difference directions shape:", logit_diff_directions.shape)

def residual_stack_to_logit_diff(residual_stack: Float[torch.Tensor, "components batch d_model"], cache: ActivationCache) -> float:
    scaled_residual_stack = cache.apply_ln_to_stack(residual_stack, layer = -1, pos_slice=-1)
    return einsum("... batch d_model, batch d_model -> ...", scaled_residual_stack, logit_diff_directions)/len(prompts)
accumulated_residual, labels = cache.accumulated_resid(layer=-1, incl_mid=True, pos_slice=-1, return_labels=True)
logit_lens_logit_diffs = residual_stack_to_logit_diff(accumulated_residual, cache)
line(logit_lens_logit_diffs, x=np.arange(model.cfg.n_layers*2+1)/2, hover_name=labels, title="Logit Difference From Accumulated Residual Stream")

Answer residual directions shape: torch.Size([1, 2, 768])
Logit difference directions shape: torch.Size([1, 768])


In [None]:
per_head_residual, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
per_head_logit_diffs = residual_stack_to_logit_diff(per_head_residual, cache)
per_head_logit_diffs = einops.rearrange(per_head_logit_diffs, "(layer head_index) -> layer head_index", layer=model.cfg.n_layers, head_index=model.cfg.n_heads)
imshow(per_head_logit_diffs, labels={"x":"Head", "y":"Layer"}, title="Logit Difference From Each Head")

Tried to stack head results when they weren't cached. Computing head results now


In [None]:
top_k = 3
top_positive_logit_attr_heads = torch.topk(per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_positive_logit_attr_heads, title=f"Top {top_k} Positive Logit Attribution Heads")
top_negative_logit_attr_heads = torch.topk(-per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_negative_logit_attr_heads, title=f"Top {top_k} Negative Logit Attribution Heads")

The top positive heads show very little attention back to earlier tokens. The last token, "is", attends to "Peb". The first "is" in the pet sentence attends to "Mary" instead of "Fido", for reasons that are unclear.

# Analyze GPT-2-Large

## Loading and Running Models

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
model = HookedTransformer.from_pretrained("gpt2-large", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-large into HookedTransformer


To try the model out, let's find the loss on this text! Models can be run on a single string or a tensor of tokens (shape: [batch, position], all integers), and the possible return types are: 
* "logits" (shape [batch, position, d_vocab], floats), 
* "loss" (the cross-entropy loss when predicting the next token), 
* "both" (a tuple of (logits, loss)) 
* None (run the model, but don't calculate the logits - this is faster when we only want to use intermediate activations)

In [None]:
model_description_text = "1 2 3 4"
loss = model(model_description_text, return_type="loss")
print("Model loss:", loss)

Model loss: tensor(2.5091, device='cuda:0')
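A loss value can be easier to read as a perplexity. Assuming the reported loss is a mean per-token cross-entropy in nats, the perplexity is its exponential. The sketch below applies that to the two loss values printed in this notebook for the prompt "1 2 3 4" (the dictionary is ours):

```python
import math

# Loss values copied from the printed outputs for the prompt "1 2 3 4".
losses = {"gpt2-small": 2.1871, "gpt2-large": 2.5091}
for name, loss in losses.items():
    perplexity = math.exp(loss)  # exp of the mean per-token cross-entropy
    print(f"{name}: loss={loss:.4f} perplexity={perplexity:.2f}")
```

A single four-token prompt is far too little text to compare the two models, so these numbers are only a sanity check that the model runs.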


## Test prompts

### Sequences

In [None]:
example_prompt = "1 2 3 4"
example_answer = " 5"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '1', ' 2', ' 3', ' 4']
Tokenized answer: [' 5']


Top 0th token. Logit: 18.85 Prob: 96.91% Token: | 5|
Top 1th token. Logit: 14.05 Prob:  0.79% Token: |
|
Top 2th token. Logit: 12.90 Prob:  0.25% Token: | Next|
Top 3th token. Logit: 12.60 Prob:  0.19% Token: | 4|
Top 4th token. Logit: 12.15 Prob:  0.12% Token: | 1|
Top 5th token. Logit: 11.75 Prob:  0.08% Token: | 3|
Top 6th token. Logit: 11.52 Prob:  0.06% Token: | 6|
Top 7th token. Logit: 11.05 Prob:  0.04% Token: | ...|
Top 8th token. Logit: 11.05 Prob:  0.04% Token: | 7|
Top 9th token. Logit: 11.02 Prob:  0.04% Token: | 10|


This is too easy to predict. Try a more ambiguous number sequence: "2 4" could plausibly be followed by 6 or 8.

In [None]:
example_prompt = "2 4"
example_answer = " 6"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '2', ' 4']
Tokenized answer: [' 6']


Top 0th token. Logit: 12.85 Prob:  8.82% Token: | 5|
Top 1th token. Logit: 12.59 Prob:  6.75% Token: |.|
Top 2th token. Logit: 12.44 Prob:  5.83% Token: | 1|
Top 3th token. Logit: 12.40 Prob:  5.58% Token: |
|
Top 4th token. Logit: 12.35 Prob:  5.35% Token: | 4|
Top 5th token. Logit: 12.28 Prob:  4.98% Token: | 3|
Top 6th token. Logit: 12.11 Prob:  4.19% Token: | 2|
Top 7th token. Logit: 11.75 Prob:  2.92% Token: | 0|
Top 8th token. Logit: 11.72 Prob:  2.83% Token: |/|
Top 9th token. Logit: 11.71 Prob:  2.81% Token: | 7|


The token ranked 3rd is strange: it is displayed as a line break between the bars, so it appears to be a newline. Passing " ", "  " or "" as the example answer does not match it.
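Whitespace-only tokens are hard to tell apart in the `|token|` display. One way to disambiguate them is to print their `repr`, as in this small plain-Python sketch (the candidate strings are our own examples, not model output):

```python
candidates = {"newline": "\n", "one space": " ", "two spaces": "  ", "empty": ""}
for name, text in candidates.items():
    # Mimic the |token| display used by test_prompt, next to an unambiguous repr.
    print(f"{name:>10}: |{text}| repr={text!r} length={len(text)}")
```

Only the newline candidate breaks the bars across two lines, which matches how the strange token is displayed.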

In [None]:
example_prompt = "2 4 6"
example_answer = " 8"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '2', ' 4', ' 6']
Tokenized answer: [' 8']


Top 0th token. Logit: 14.16 Prob: 13.95% Token: | 7|
Top 1th token. Logit: 13.62 Prob:  8.17% Token: | 8|
Top 2th token. Logit: 13.35 Prob:  6.21% Token: | 0|
Top 3th token. Logit: 13.13 Prob:  5.00% Token: |
|
Top 4th token. Logit: 13.10 Prob:  4.86% Token: | 6|
Top 5th token. Logit: 13.05 Prob:  4.63% Token: | 5|
Top 6th token. Logit: 12.93 Prob:  4.11% Token: | 1|
Top 7th token. Logit: 12.82 Prob:  3.69% Token: | 2|
Top 8th token. Logit: 12.60 Prob:  2.96% Token: | 4|
Top 9th token. Logit: 12.52 Prob:  2.72% Token: | 10|


In [None]:
example_prompt = "2 4 6 8"
example_answer = " 10"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '2', ' 4', ' 6', ' 8']
Tokenized answer: [' 10']


Top 0th token. Logit: 15.41 Prob: 24.26% Token: | 9|
Top 1th token. Logit: 15.13 Prob: 18.25% Token: | 10|
Top 2th token. Logit: 14.95 Prob: 15.29% Token: | 8|
Top 3th token. Logit: 13.87 Prob:  5.21% Token: |
|
Top 4th token. Logit: 13.52 Prob:  3.65% Token: | 7|
Top 5th token. Logit: 13.44 Prob:  3.38% Token: | 11|
Top 6th token. Logit: 13.30 Prob:  2.93% Token: | 12|
Top 7th token. Logit: 12.97 Prob:  2.10% Token: | 1|
Top 8th token. Logit: 12.67 Prob:  1.56% Token: | 4|
Top 9th token. Logit: 12.63 Prob:  1.50% Token: | 0|


In [None]:
example_prompt = "Monday Tuesday Wednesday"
example_answer = " Thursday"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Monday', ' Tuesday', ' Wednesday']
Tokenized answer: [' Thursday']


Top 0th token. Logit: 23.07 Prob: 99.20% Token: | Thursday|
Top 1th token. Logit: 17.38 Prob:  0.34% Token: |
|
Top 2th token. Logit: 16.42 Prob:  0.13% Token: |Thursday|
Top 3th token. Logit: 15.81 Prob:  0.07% Token: | Wednesday|
Top 4th token. Logit: 15.54 Prob:  0.05% Token: | Thurs|
Top 5th token. Logit: 15.22 Prob:  0.04% Token: | Friday|
Top 6th token. Logit: 13.74 Prob:  0.01% Token: |/|
Top 7th token. Logit: 13.69 Prob:  0.01% Token: |

|
Top 8th token. Logit: 13.57 Prob:  0.01% Token: | Thu|
Top 9th token. Logit: 13.48 Prob:  0.01% Token: | and|


### Simple "analogies"

In [None]:
str_tokens = model.to_str_tokens("  Pebbles")
str_tokens

['<|endoftext|>', ' ', ' Peb', 'bles']

In [None]:
example_prompt = "Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a human. Pebbles is a cat. The pet of this family is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' a', ' human', '.', ' F', 'ido', ' is', ' a', ' dog', '.', ' The', ' pet', ' of', ' this', ' family', ' is', ' F', 'ido', '.', ' Rachel', ' is', ' a', ' human', '.', ' Peb', 'bles', ' is', ' a', ' cat', '.', ' The', ' pet', ' of', ' this', ' family', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 17.77 Prob: 88.42% Token: | Peb|
Top 1th token. Logit: 14.08 Prob:  2.21% Token: | F|
Top 2th token. Logit: 13.23 Prob:  0.95% Token: | Rachel|
Top 3th token. Logit: 13.01 Prob:  0.76% Token: | a|
Top 4th token. Logit: 12.95 Prob:  0.71% Token: | P|
Top 5th token. Logit: 12.71 Prob:  0.56% Token: | the|
Top 6th token. Logit: 11.39 Prob:  0.15% Token: | Mary|
Top 7th token. Logit: 11.29 Prob:  0.14% Token: | Bub|
Top 8th token. Logit: 10.96 Prob:  0.10% Token: | B|
Top 9th token. Logit: 10.86 Prob:  0.09% Token: | C|


In [None]:
example_prompt = "Mary is a human. Fido is a dog. The pet of this family is Fido. Pebbles is a cat. Rachel is a human. The pet of this family is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' a', ' human', '.', ' F', 'ido', ' is', ' a', ' dog', '.', ' The', ' pet', ' of', ' this', ' family', ' is', ' F', 'ido', '.', ' Peb', 'bles', ' is', ' a', ' cat', '.', ' Rachel', ' is', ' a', ' human', '.', ' The', ' pet', ' of', ' this', ' family', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 16.88 Prob: 74.65% Token: | Rachel|
Top 1th token. Logit: 13.96 Prob:  4.03% Token: | Peb|
Top 2th token. Logit: 13.60 Prob:  2.81% Token: | F|
Top 3th token. Logit: 12.61 Prob:  1.04% Token: | a|
Top 4th token. Logit: 12.35 Prob:  0.80% Token: | the|
Top 5th token. Logit: 11.79 Prob:  0.46% Token: | R|
Top 6th token. Logit: 11.66 Prob:  0.40% Token: | Mary|
Top 7th token. Logit: 11.50 Prob:  0.34% Token: | P|
Top 8th token. Logit: 11.22 Prob:  0.26% Token: | her|
Top 9th token. Logit: 10.94 Prob:  0.20% Token: | C|


If we swap the order of the last two sentences, the model no longer predicts “Peb”; it predicts “Rachel” with 74.65% probability. So it seems to rely on order rather than semantics. Why?

Try swapping the order of Mary and John:

In [None]:
example_prompt = "When John and Mary went to the shops, John gave the bag to"
example_answer = " Mary"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'When', ' John', ' and', ' Mary', ' went', ' to', ' the', ' shops', ',', ' John', ' gave', ' the', ' bag', ' to']
Tokenized answer: [' Mary']


Top 0th token. Logit: 17.82 Prob: 62.49% Token: | Mary|
Top 1th token. Logit: 16.65 Prob: 19.25% Token: | his|
Top 2th token. Logit: 14.70 Prob:  2.74% Token: | the|
Top 3th token. Logit: 14.08 Prob:  1.48% Token: | a|
Top 4th token. Logit: 14.06 Prob:  1.44% Token: | John|
Top 5th token. Logit: 13.93 Prob:  1.27% Token: | her|
Top 6th token. Logit: 12.94 Prob:  0.47% Token: | me|
Top 7th token. Logit: 12.90 Prob:  0.45% Token: | Mrs|
Top 8th token. Logit: 12.65 Prob:  0.35% Token: | their|
Top 9th token. Logit: 12.33 Prob:  0.26% Token: | Peter|


In [None]:
example_prompt = "When Mary and John went to the shops, John gave the bag to"
example_answer = " Mary"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'When', ' Mary', ' and', ' John', ' went', ' to', ' the', ' shops', ',', ' John', ' gave', ' the', ' bag', ' to']
Tokenized answer: [' Mary']


Top 0th token. Logit: 18.57 Prob: 77.37% Token: | Mary|
Top 1th token. Logit: 16.52 Prob: 10.04% Token: | his|
Top 2th token. Logit: 15.24 Prob:  2.77% Token: | her|
Top 3th token. Logit: 14.91 Prob:  2.00% Token: | the|
Top 4th token. Logit: 14.19 Prob:  0.97% Token: | a|
Top 5th token. Logit: 13.50 Prob:  0.49% Token: | me|
Top 6th token. Logit: 13.24 Prob:  0.37% Token: | John|
Top 7th token. Logit: 12.69 Prob:  0.22% Token: | their|
Top 8th token. Logit: 12.60 Prob:  0.20% Token: | Mrs|
Top 9th token. Logit: 12.42 Prob:  0.17% Token: | him|


For IOI (indirect object identification), the order of Mary and John does not matter. The pet-identification example, however, depends on sentence order rather than on the semantics of which entity is the non-human animal.
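To compare orderings systematically, the prompt variants can be generated rather than typed by hand. The sketch below rebuilds the two pet prompts used earlier. `build_pet_prompt` is a hypothetical helper written for this illustration, not part of any library.

```python
def build_pet_prompt(facts, question="The pet of this family is"):
    """Join (name, kind) facts into sentences, then append the open question."""
    sentences = [f"{name} is a {kind}." for name, kind in facts]
    return " ".join(sentences + [question])

# The worked example that precedes the query in both prompts.
example = "Mary is a human. Fido is a dog. The pet of this family is Fido."
human, pet = ("Rachel", "human"), ("Pebbles", "cat")

human_first = example + " " + build_pet_prompt([human, pet])
pet_first = example + " " + build_pet_prompt([pet, human])
print(human_first)
print(pet_first)
```

Each string can then be passed to `utils.test_prompt` exactly as in the cells above, so only the sentence order changes between runs.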

In [None]:
example_prompt = "Mary is a teacher. Fido is a student. The child is Fido. Pebbles is a student. Rachel is a teacher. The child is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' a', ' teacher', '.', ' F', 'ido', ' is', ' a', ' student', '.', ' The', ' child', ' is', ' F', 'ido', '.', ' Peb', 'bles', ' is', ' a', ' student', '.', ' Rachel', ' is', ' a', ' teacher', '.', ' The', ' child', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 16.65 Prob: 48.08% Token: | Rachel|
Top 1th token. Logit: 16.15 Prob: 29.16% Token: | Peb|
Top 2th token. Logit: 14.96 Prob:  8.83% Token: | F|
Top 3th token. Logit: 13.18 Prob:  1.49% Token: | a|
Top 4th token. Logit: 12.72 Prob:  0.94% Token: | the|
Top 5th token. Logit: 11.72 Prob:  0.35% Token: | P|
Top 6th token. Logit: 11.71 Prob:  0.34% Token: | Mary|
Top 7th token. Logit: 11.68 Prob:  0.33% Token: | R|
Top 8th token. Logit: 11.44 Prob:  0.26% Token: | her|
Top 9th token. Logit: 10.92 Prob:  0.16% Token: | Riley|


In [None]:
example_prompt = "Mary. Fido. Fido. Pebbles. Rachel. "
example_answer = " Rachel"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', '.', ' F', 'ido', '.', ' F', 'ido', '.', ' Peb', 'bles', '.', ' Rachel', '.', ' ']
Tokenized answer: [' Rachel']


Top 0th token. Logit: 16.40 Prob: 68.21% Token: | |
Top 1th token. Logit: 13.18 Prob:  2.73% Token: |????|
Top 2th token. Logit: 13.12 Prob:  2.57% Token: |________|
Top 3th token. Logit: 12.98 Prob:  2.23% Token: |Â|
Top 4th token. Logit: 12.55 Prob:  1.44% Token: |________________________________|
Top 5th token. Logit: 12.54 Prob:  1.44% Token: |~~~~~~~~|
Top 6th token. Logit: 12.45 Prob:  1.31% Token: |????????|
Top 7th token. Logit: 12.25 Prob:  1.07% Token: |?????|
Top 8th token. Logit: 12.21 Prob:  1.03% Token: |�|
Top 9th token. Logit: 12.16 Prob:  0.98% Token: |
|


In [None]:
example_prompt = "Mary is X. Fido is Y. Z is Fido. Pebbles is Y. Rachel is X. Z is "
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' X', '.', ' F', 'ido', ' is', ' Y', '.', ' Z', ' is', ' F', 'ido', '.', ' Peb', 'bles', ' is', ' Y', '.', ' Rachel', ' is', ' X', '.', ' Z', ' is', ' ']
Tokenized answer: [' Peb']


Top 0th token. Logit: 17.55 Prob: 31.26% Token: |____|
Top 1th token. Logit: 17.09 Prob: 19.70% Token: |________|
Top 2th token. Logit: 17.04 Prob: 18.66% Token: |_____|
Top 3th token. Logit: 16.87 Prob: 15.84% Token: |_______|
Top 4th token. Logit: 15.29 Prob:  3.25% Token: |????|
Top 5th token. Logit: 14.67 Prob:  1.75% Token: |?????|
Top 6th token. Logit: 14.20 Prob:  1.09% Token: | |
Top 7th token. Logit: 14.17 Prob:  1.06% Token: |~~|
Top 8th token. Logit: 14.07 Prob:  0.96% Token: |________________|
Top 9th token. Logit: 13.71 Prob:  0.67% Token: |?"|


In [None]:
example_prompt = "Mary is X. Fido is Y. Z is Fido. Pebbles is Y. Rachel is X. Z is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' X', '.', ' F', 'ido', ' is', ' Y', '.', ' Z', ' is', ' F', 'ido', '.', ' Peb', 'bles', ' is', ' Y', '.', ' Rachel', ' is', ' X', '.', ' Z', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 15.55 Prob: 37.21% Token: | F|
Top 1th token. Logit: 15.04 Prob: 22.13% Token: | Y|
Top 2th token. Logit: 14.53 Prob: 13.36% Token: | Z|
Top 3th token. Logit: 13.27 Prob:  3.77% Token: | Rachel|
Top 4th token. Logit: 13.14 Prob:  3.34% Token: | X|
Top 5th token. Logit: 11.78 Prob:  0.86% Token: | a|
Top 6th token. Logit: 11.75 Prob:  0.82% Token: | B|
Top 7th token. Logit: 11.39 Prob:  0.58% Token: | A|
Top 8th token. Logit: 11.35 Prob:  0.56% Token: | the|
Top 9th token. Logit: 11.33 Prob:  0.54% Token: | R|


In [None]:
example_prompt = "Mary is X. Fido is Y. Z is Fido. Pebbles is X. Rachel is Y. Z is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' X', '.', ' F', 'ido', ' is', ' Y', '.', ' Z', ' is', ' F', 'ido', '.', ' Peb', 'bles', ' is', ' X', '.', ' Rachel', ' is', ' Y', '.', ' Z', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 15.18 Prob: 34.41% Token: | F|
Top 1th token. Logit: 14.47 Prob: 16.85% Token: | Rachel|
Top 2th token. Logit: 14.16 Prob: 12.41% Token: | Z|
Top 3th token. Logit: 13.08 Prob:  4.21% Token: | X|
Top 4th token. Logit: 12.83 Prob:  3.29% Token: | Peb|
Top 5th token. Logit: 12.66 Prob:  2.76% Token: | Y|
Top 6th token. Logit: 11.66 Prob:  1.02% Token: | a|
Top 7th token. Logit: 11.49 Prob:  0.86% Token: | R|
Top 8th token. Logit: 11.16 Prob:  0.62% Token: | the|
Top 9th token. Logit: 11.08 Prob:  0.57% Token: | B|


In [None]:
example_prompt = "Mary is the hero. Fido is the mentor. The main character is Mary. Pebbles is the hero. Rachel is the mentor. The main character is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' the', ' hero', '.', ' F', 'ido', ' is', ' the', ' mentor', '.', ' The', ' main', ' character', ' is', ' Mary', '.', ' Peb', 'bles', ' is', ' the', ' hero', '.', ' Rachel', ' is', ' the', ' mentor', '.', ' The', ' main', ' character', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 16.46 Prob: 54.54% Token: | Rachel|
Top 1th token. Logit: 15.68 Prob: 24.94% Token: | Peb|
Top 2th token. Logit: 13.61 Prob:  3.16% Token: | the|
Top 3th token. Logit: 13.14 Prob:  1.97% Token: | Mary|
Top 4th token. Logit: 12.01 Prob:  0.64% Token: | F|
Top 5th token. Logit: 11.91 Prob:  0.58% Token: | a|
Top 6th token. Logit: 11.52 Prob:  0.39% Token: | P|
Top 7th token. Logit: 11.10 Prob:  0.26% Token: | Riley|
Top 8th token. Logit: 11.03 Prob:  0.24% Token: | her|
Top 9th token. Logit: 10.84 Prob:  0.20% Token: | R|


In [None]:
example_prompt = "Mary is X. Fido is Y. Z is Fido. John is Y. Larry is X. Z is John. Pebbles is X. Rachel is Y. Z is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' X', '.', ' F', 'ido', ' is', ' Y', '.', ' Z', ' is', ' F', 'ido', '.', ' John', ' is', ' Y', '.', ' Larry', ' is', ' X', '.', ' Z', ' is', ' John', '.', ' Peb', 'bles', ' is', ' X', '.', ' Rachel', ' is', ' Y', '.', ' Z', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 16.60 Prob: 49.74% Token: | Rachel|
Top 1th token. Logit: 15.62 Prob: 18.71% Token: | Peb|
Top 2th token. Logit: 14.53 Prob:  6.25% Token: | Z|
Top 3th token. Logit: 13.41 Prob:  2.05% Token: | Larry|
Top 4th token. Logit: 12.48 Prob:  0.81% Token: | P|
Top 5th token. Logit: 12.21 Prob:  0.62% Token: | R|
Top 6th token. Logit: 12.21 Prob:  0.62% Token: | F|
Top 7th token. Logit: 11.77 Prob:  0.40% Token: | Bub|
Top 8th token. Logit: 11.77 Prob:  0.40% Token: | Sally|
Top 9th token. Logit: 11.54 Prob:  0.32% Token: | Carly|


In [None]:
example_prompt = "The human is Mary. The dog is Fido. The pet is Fido. The dog is Pebbles. The human is Rachel. The pet is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'The', ' human', ' is', ' Mary', '.', ' The', ' dog', ' is', ' F', 'ido', '.', ' The', ' pet', ' is', ' F', 'ido', '.', ' The', ' dog', ' is', ' Peb', 'bles', '.', ' The', ' human', ' is', ' Rachel', '.', ' The', ' pet', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 15.28 Prob: 35.58% Token: | Rachel|
Top 1th token. Logit: 14.31 Prob: 13.47% Token: | Peb|
Top 2th token. Logit: 13.11 Prob:  4.04% Token: | F|
Top 3th token. Logit: 12.28 Prob:  1.76% Token: | the|
Top 4th token. Logit: 11.87 Prob:  1.17% Token: | a|
Top 5th token. Logit: 11.75 Prob:  1.04% Token: | Bub|
Top 6th token. Logit: 11.58 Prob:  0.87% Token: | P|
Top 7th token. Logit: 11.00 Prob:  0.49% Token: | R|
Top 8th token. Logit: 11.00 Prob:  0.49% Token: | her|
Top 9th token. Logit: 10.81 Prob:  0.40% Token: | Toby|


In [None]:
example_prompt = "The human is Mary. The dog is Fido. The pet is Fido. The human is Rachel. The dog is Pebbles. The pet is"
example_answer = " Peb"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'The', ' human', ' is', ' Mary', '.', ' The', ' dog', ' is', ' F', 'ido', '.', ' The', ' pet', ' is', ' F', 'ido', '.', ' The', ' human', ' is', ' Rachel', '.', ' The', ' dog', ' is', ' Peb', 'bles', '.', ' The', ' pet', ' is']
Tokenized answer: [' Peb']


Top 0th token. Logit: 17.61 Prob: 90.00% Token: | Peb|
Top 1th token. Logit: 13.56 Prob:  1.57% Token: | F|
Top 2th token. Logit: 13.18 Prob:  1.07% Token: | P|
Top 3th token. Logit: 12.28 Prob:  0.44% Token: | the|
Top 4th token. Logit: 11.82 Prob:  0.27% Token: | a|
Top 5th token. Logit: 11.39 Prob:  0.18% Token: | Rachel|
Top 6th token. Logit: 11.13 Prob:  0.14% Token: | Bub|
Top 7th token. Logit: 11.06 Prob:  0.13% Token: | Pe|
Top 8th token. Logit: 10.98 Prob:  0.12% Token: | B|
Top 9th token. Logit: 10.90 Prob:  0.11% Token: | Polly|


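So "The human is..." is simpler but doesn't work as well as "Mary is a human": when the dog sentence comes before the human sentence, it outputs Rachel first instead of Peb (35.58% vs 13.47%), whereas with the human sentence first it outputs Peb at 90.00%. Why?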
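
One way to probe this order sensitivity is to generate every ordering of the target-side sentences instead of writing each prompt by hand. The sketch below is only an illustration: `build_prompts` is a hypothetical helper (not part of TransformerLens), and it only builds the strings; each one would still have to be passed to `utils.test_prompt` as in the cells above.

```python
from itertools import permutations

def build_prompts(context, facts, query):
    """Return one prompt per ordering of `facts` (hypothetical helper)."""
    prompts_out = []
    for ordering in permutations(facts):
        prompts_out.append(" ".join([context, *ordering, query]))
    return prompts_out

context = "The human is Mary. The dog is Fido. The pet is Fido."
facts = ["The dog is Pebbles.", "The human is Rachel."]
query = "The pet is"

order_variants = build_prompts(context, facts, query)
for variant in order_variants:
    print(repr(variant))
```

With two facts this reproduces exactly the two prompts tested above; with three or more facts it grows factorially, so it is only practical for short prompts.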

### Other

In [None]:
example_prompt = ""
example_answer = ""
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

In [None]:
example_prompt = ""
example_answer = ""
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

# GPT2-large analogous inputs analysis

## Compare Logits

For all prompts (even if there is just one), get the model's outputs vs the correct outputs, and compute the logit differences.

original_average_logit_diff is used by both the direct logit attribution and the activation patching sections, so the cell below must be run before either of them. Otherwise, the two sections can be run independently of each other.


In [None]:
prompts = [
"Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a human. Pebbles is a cat. The pet of this family is",
# "Mary is a human. Fido is a dog. The pet of this family is Fido. Pebbles is a cat. Rachel is a human. The pet of this family is",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" Peb", " Rachel"),
    # (" Peb", " Rachel"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# if len(prompts) > 1:
#     answer_tokens = torch.tensor(answer_tokens).cuda()  # if many inputs
# else:
#     answer_tokens = torch.tensor(answer_tokens)
# answer_tokens = torch.tensor(answer_tokens)
answer_tokens = torch.tensor(answer_tokens).cuda()

tokens = model.to_tokens(prompts, prepend_bos=True)
tokens = tokens.cuda() # Move the tokens to the GPU
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([4.5368], device='cuda:0')
Average logit difference: 4.536792755126953
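
To get a feel for what a logit difference means: softmax probabilities are proportional to exp(logit), so a logit difference d implies the correct token is exp(d) times as likely as the incorrect one. A minimal sketch using only numbers printed in this notebook (the helper name `prob_ratio` is made up for illustration):

```python
import math

def prob_ratio(logit_diff):
    """Ratio P(correct) / P(incorrect) implied by a logit difference."""
    return math.exp(logit_diff)

# Logit difference printed by the cell above.
clean_logit_diff = 4.536792755126953
print(f"P(' Peb') / P(' Rachel') = {prob_ratio(clean_logit_diff):.1f}")

# Cross-check against a test_prompt table from earlier in the notebook, where
# ' Rachel' had logit 16.46 (54.54%) and ' Peb' had logit 15.68 (24.94%).
ratio_from_logits = prob_ratio(16.46 - 15.68)
ratio_from_probs = 54.54 / 24.94
print(ratio_from_logits, ratio_from_probs)
```

The two ratios in the cross-check agree up to the rounding of the printed logits and probabilities.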


## Direct Layer Attribution

### Logit Lens

In [None]:
answer_residual_directions = model.tokens_to_residual_directions(answer_tokens)
print("Answer residual directions shape:", answer_residual_directions.shape)
logit_diff_directions = answer_residual_directions[:, 0] - answer_residual_directions[:, 1]
print("Logit difference directions shape:", logit_diff_directions.shape)

# cache syntax - resid_post is the residual stream at the end of the layer, -1 gets the final layer. The general syntax is [activation_name, layer_index, sub_layer_type]. 
final_residual_stream = cache["resid_post", -1]
print("Final residual stream shape:", final_residual_stream.shape)
final_token_residual_stream = final_residual_stream[:, -1, :]
# Apply LayerNorm scaling
# pos_slice is the subset of the positions we take - here the final token of each prompt
scaled_final_token_residual_stream = cache.apply_ln_to_stack(final_token_residual_stream, layer = -1, pos_slice=-1)

average_logit_diff = einsum("batch d_model, batch d_model -> ", scaled_final_token_residual_stream, logit_diff_directions)/len(prompts)
print("Calculated average logit diff:", average_logit_diff.item())
print("Original logit difference:",original_average_logit_diff.item())

def residual_stack_to_logit_diff(residual_stack: Float[torch.Tensor, "components batch d_model"], cache: ActivationCache) -> Float[torch.Tensor, "..."]:
    scaled_residual_stack = cache.apply_ln_to_stack(residual_stack, layer = -1, pos_slice=-1)
    return einsum("... batch d_model, batch d_model -> ...", scaled_residual_stack, logit_diff_directions)/len(prompts)
accumulated_residual, labels = cache.accumulated_resid(layer=-1, incl_mid=True, pos_slice=-1, return_labels=True)
logit_lens_logit_diffs = residual_stack_to_logit_diff(accumulated_residual, cache)
line(logit_lens_logit_diffs, x=np.arange(model.cfg.n_layers*2+1)/2, hover_name=labels, title="Logit Difference From Accumulated Residual Stream")

Answer residual directions shape: torch.Size([2, 2, 1280])
Logit difference directions shape: torch.Size([2, 1280])
Final residual stream shape: torch.Size([2, 38, 1280])
Calculated average logit diff: 1.199303150177002
Original logit difference: 0.8092727661132812
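
A single "logit difference direction" is enough because the unembedding is linear: subtracting two logits is the same as taking one dot product with the difference of the two unembedding vectors. A toy check with made-up 3-dimensional vectors (the real vectors have d_model = 1280 and come from `model.tokens_to_residual_directions`; biases and the LayerNorm scaling are ignored in this sketch):

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Made-up numbers for illustration only.
residual = [0.5, -1.0, 2.0]            # stands in for the scaled final-token residual stream
correct_direction = [1.0, 0.0, 3.0]    # stands in for the correct token's unembedding vector
incorrect_direction = [2.0, 1.0, 1.0]  # stands in for the incorrect token's unembedding vector

logit_diff_direction = [c - i for c, i in zip(correct_direction, incorrect_direction)]

diff_of_logits = dot(residual, correct_direction) - dot(residual, incorrect_direction)
logit_via_direction = dot(residual, logit_diff_direction)
print(diff_of_logits, logit_via_direction)
```

Both routes give the same number, which is why the same direction can be reused on any component of the residual stream.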


In [None]:
per_layer_residual, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
per_layer_logit_diffs = residual_stack_to_logit_diff(per_layer_residual, cache)
line(per_layer_logit_diffs, hover_name=labels, title="Logit Difference From Each Layer")

In [None]:
per_head_residual, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
per_head_logit_diffs = residual_stack_to_logit_diff(per_head_residual, cache)
per_head_logit_diffs = einops.rearrange(per_head_logit_diffs, "(layer head_index) -> layer head_index", layer=model.cfg.n_layers, head_index=model.cfg.n_heads)
imshow(per_head_logit_diffs, labels={"x":"Head", "y":"Layer"}, title="Logit Difference From Each Head")

Tried to stack head results when they weren't cached. Computing head results now


### Attention patterns

In [None]:
top_k = 3
top_positive_logit_attr_heads = torch.topk(per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_positive_logit_attr_heads, title=f"Top {top_k} Positive Logit Attribution Heads")
top_negative_logit_attr_heads = torch.topk(-per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_negative_logit_attr_heads, title=f"Top {top_k} Negative Logit Attribution Heads")

The attention patterns seem to show more attending, but this is only a visual impression and has not yet been verified quantitatively. Note that there were 2 prompts, but only the first is displayed.
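
`torch.topk` on the flattened tensor returns flat indices, so each one has to be mapped back to a (layer, head) pair before it can be compared with the heatmap. Because the tensor was rearranged as `layer head_index`, the layer is the quotient and the head is the remainder when dividing by the number of heads. A small sketch with hypothetical indices (the helper name is made up, and the indices are not measured results):

```python
def flat_index_to_layer_head(flat_index, n_heads):
    """Map an index into the flattened (layer, head) grid back to (layer, head)."""
    return divmod(flat_index, n_heads)

# GPT-2 large has 36 layers with 20 heads each; in the notebook these values
# would come from model.cfg.n_layers and model.cfg.n_heads.
n_layers, n_heads = 36, 20

# Hypothetical flat indices, e.g. what torch.topk(...).indices.tolist() could return.
example_indices = [0, 19, 20, 719]
head_labels = [
    f"L{layer}H{head}"
    for layer, head in (flat_index_to_layer_head(i, n_heads) for i in example_indices)
]
print(head_labels)
```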

## Activation patching

### Corrupt by switching sentences

In [None]:
corrupted_prompts = ["Mary is a human. Fido is a dog. The pet of this family is Fido. Pebbles is a cat. Rachel is a human. The pet of this family is",]
corrupted_tokens = model.to_tokens(corrupted_prompts, prepend_bos=True)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens, return_type="logits")
corrupted_average_logit_diff = logits_to_ave_logit_diff(corrupted_logits, answer_tokens)
print("Corrupted Average Logit Diff", corrupted_average_logit_diff)
print("Clean Average Logit Diff", original_average_logit_diff)

Corrupted Average Logit Diff tensor(-2.9183, device='cuda:0')
Clean Average Logit Diff tensor(4.5368, device='cuda:0')


The patching loop below takes a while to run on CPU, but only about 2 minutes (for 1 prompt) on a T4 GPU.

In [None]:
def patch_residual_component(
    corrupted_residual_component: Float[torch.Tensor, "batch pos d_model"],
    hook, 
    pos, 
    clean_cache):
    corrupted_residual_component[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return corrupted_residual_component

def normalize_patched_logit_diff(patched_logit_diff):
    # Subtract corrupted logit diff to measure the improvement, divide by the total improvement from clean to corrupted to normalise
    # 0 means zero change, negative means actively made worse, 1 means totally recovered clean performance, >1 means actively *improved* on clean performance
    return (patched_logit_diff - corrupted_average_logit_diff)/(original_average_logit_diff - corrupted_average_logit_diff)

# patched_residual_stream_diff = torch.zeros(model.cfg.n_layers, tokens.shape[1], dtype=torch.float32)
patched_residual_stream_diff = torch.zeros(model.cfg.n_layers, tokens.shape[1], device="cuda", dtype=torch.float32)
for layer in range(model.cfg.n_layers):
    for position in range(tokens.shape[1]):
        hook_fn = partial(patch_residual_component, pos=position, clean_cache=cache)
        patched_logits = model.run_with_hooks(
            corrupted_tokens, 
            fwd_hooks = [(utils.get_act_name("resid_pre", layer), 
                hook_fn)], 
            return_type="logits"
        )
        patched_logit_diff = logits_to_ave_logit_diff(patched_logits, answer_tokens)

        patched_residual_stream_diff[layer, position] = normalize_patched_logit_diff(patched_logit_diff)

In [None]:
prompt_position_labels = [f"{tok}_{i}" for i, tok in enumerate(model.to_str_tokens(tokens[0]))]
imshow(patched_residual_stream_diff, x=prompt_position_labels, title="Logit Difference From Patched Residual Stream", labels={"x":"Position", "y":"Layer"})

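In the early and middle layers, patching in the clean residual stream has a negative effect at the Rachel position and a positive effect at the Pebbles positions (especially "bles"). The effect then "moves": in later layers it is concentrated on the last token, "is".

On the target side of the analogy, the early and middle layers show the same pattern for the category words: negative for "human" and positive for "cat". This does not hold for the domain side of the analogy.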
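
The values in the heatmap are on the normalized scale defined in `normalize_patched_logit_diff`. A worked example with the two logit differences printed for this corruption makes the scale concrete; the standalone `normalize` function and the patched value of 0.0 are illustrative assumptions, not measured results.

```python
def normalize(patched, clean, corrupted):
    """0 = no better than the corrupted run, 1 = clean performance fully recovered."""
    return (patched - corrupted) / (clean - corrupted)

clean_diff = 4.5368       # "Clean Average Logit Diff" printed above
corrupted_diff = -2.9183  # "Corrupted Average Logit Diff" printed above

no_change = normalize(corrupted_diff, clean_diff, corrupted_diff)     # patch changed nothing
full_recovery = normalize(clean_diff, clean_diff, corrupted_diff)     # patch restored the clean result
# Hypothetical patched run whose logit diff is exactly 0.0:
halfway_example = normalize(0.0, clean_diff, corrupted_diff)

print(no_change, full_recovery, round(halfway_example, 3))
```

Negative cells in the heatmap therefore mean the patch pushed the output even further towards the incorrect answer than the corrupted run did.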

### Corrupt by switching words

In [None]:
prompts = [
"Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a human. Pebbles is a cat. The pet of this family is",
# "Mary is a human. Fido is a dog. The pet of this family is Fido. Pebbles is a cat. Rachel is a human. The pet of this family is",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" Peb", " Rachel"),
    # (" Peb", " Rachel"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# if len(prompts) > 1:
#     answer_tokens = torch.tensor(answer_tokens).cuda()  # if many inputs
# else:
#     answer_tokens = torch.tensor(answer_tokens)
# answer_tokens = torch.tensor(answer_tokens)
answer_tokens = torch.tensor(answer_tokens).cuda()

tokens = model.to_tokens(prompts, prepend_bos=True)
tokens = tokens.cuda() # Move the tokens to the GPU
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([4.5368], device='cuda:0')
Average logit difference: 4.536792755126953


In [None]:
corrupted_prompts = ["Mary is a human. Fido is a dog. The pet of this family is Fido. Rachel is a cat. Pebbles is a human. The pet of this family is",]
corrupted_tokens = model.to_tokens(corrupted_prompts, prepend_bos=True)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens, return_type="logits")
corrupted_average_logit_diff = logits_to_ave_logit_diff(corrupted_logits, answer_tokens)
print("Corrupted Average Logit Diff", corrupted_average_logit_diff)
print("Clean Average Logit Diff", original_average_logit_diff)

Corrupted Average Logit Diff tensor(5.1553, device='cuda:0')
Clean Average Logit Diff tensor(4.5368, device='cuda:0')


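This is a bad corrupted input because "Rachel is a cat. Pebbles is a human. The pet of this family is" still yields Peb as the top result, even though it should be Rachel: the corrupted logit difference (5.1553) is positive and even higher than the clean one (4.5368).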
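
A cheap guard before running the slow patching loop is to check that the corruption actually flips the sign of the logit difference. If it does not, the denominator of `normalize_patched_logit_diff` is small or negative and the normalized values are hard to interpret. The helper below is a hypothetical sketch, fed with the values printed in this notebook:

```python
def corruption_is_usable(clean_diff, corrupted_diff):
    """A corrupted prompt is only useful if it flips the sign of the logit difference."""
    return clean_diff > 0 and corrupted_diff < 0

# (clean, corrupted) average logit diffs printed in this notebook
switching_sentences_ok = corruption_is_usable(4.5368, -2.9183)
switching_words_ok = corruption_is_usable(4.5368, 5.1553)
print(switching_sentences_ok, switching_words_ok)

# Denominator that the normalization would use for the word-switching corruption
print(round(4.5368 - 5.1553, 4))
```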




### What were the original vs corrupted prompts, and the answer tokens, in the Exploratory Analysis Demo (that this is based on)?

In [None]:
prompt_format = [
    "When John and Mary went to the shops,{} gave the bag to",
    "When Tom and James went to the park,{} gave the ball to",
    "When Dan and Sid went to the shops,{} gave an apple to",
    "After Martin and Amy went to the park,{} gave a drink to",
]
names = [
    (" Mary", " John"),
    (" Tom", " James"),
    (" Dan", " Sid"),
    (" Martin", " Amy"),
]
# List of prompts
prompts = []
# List of answers, in the format (correct, incorrect)
answers = []
# List of the token (ie an integer) corresponding to each answer, in the format (correct_token, incorrect_token)
answer_tokens = []
for i in range(len(prompt_format)):
    for j in range(2):
        answers.append((names[i][j], names[i][1 - j]))
        answer_tokens.append(
            (
                model.to_single_token(answers[-1][0]),
                model.to_single_token(answers[-1][1]),
            )
        )
        # Insert the *incorrect* answer to the prompt, making the correct answer the indirect object.
        prompts.append(prompt_format[i].format(answers[-1][1]))
answer_tokens = torch.tensor(answer_tokens).cuda()
print(prompts)
print(answers)

['When John and Mary went to the shops, John gave the bag to', 'When John and Mary went to the shops, Mary gave the bag to', 'When Tom and James went to the park, James gave the ball to', 'When Tom and James went to the park, Tom gave the ball to', 'When Dan and Sid went to the shops, Sid gave an apple to', 'When Dan and Sid went to the shops, Dan gave an apple to', 'After Martin and Amy went to the park, Amy gave a drink to', 'After Martin and Amy went to the park, Martin gave a drink to']
[(' Mary', ' John'), (' John', ' Mary'), (' Tom', ' James'), (' James', ' Tom'), (' Dan', ' Sid'), (' Sid', ' Dan'), (' Martin', ' Amy'), (' Amy', ' Martin')]


In [None]:
corrupted_prompts = []
for i in range(0, len(prompts), 2):
    corrupted_prompts.append(prompts[i+1])
    corrupted_prompts.append(prompts[i])
corrupted_tokens = model.to_tokens(corrupted_prompts, prepend_bos=True)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens, return_type="logits")
corrupted_average_logit_diff = logits_to_ave_logit_diff(corrupted_logits, answer_tokens)
print("Corrupted Average Logit Diff", corrupted_average_logit_diff)
print("Clean Average Logit Diff", original_average_logit_diff)

Corrupted Average Logit Diff tensor(-4.5051, device='cuda:0')
Clean Average Logit Diff tensor(4.5368, device='cuda:0')


In [None]:
model.to_string(corrupted_tokens)

['<|endoftext|>When John and Mary went to the shops, Mary gave the bag to',
 '<|endoftext|>When John and Mary went to the shops, John gave the bag to',
 '<|endoftext|>When Tom and James went to the park, Tom gave the ball to',
 '<|endoftext|>When Tom and James went to the park, James gave the ball to',
 '<|endoftext|>When Dan and Sid went to the shops, Dan gave an apple to',
 '<|endoftext|>When Dan and Sid went to the shops, Sid gave an apple to',
 '<|endoftext|>After Martin and Amy went to the park, Martin gave a drink to',
 '<|endoftext|>After Martin and Amy went to the park, Amy gave a drink to']

In [None]:
prompts

['When John and Mary went to the shops, John gave the bag to',
 'When John and Mary went to the shops, Mary gave the bag to',
 'When Tom and James went to the park, James gave the ball to',
 'When Tom and James went to the park, Tom gave the ball to',
 'When Dan and Sid went to the shops, Sid gave an apple to',
 'When Dan and Sid went to the shops, Dan gave an apple to',
 'After Martin and Amy went to the park, Amy gave a drink to',
 'After Martin and Amy went to the park, Martin gave a drink to']

In [None]:
model.to_string(answer_tokens)

[' Mary John',
 ' John Mary',
 ' Tom James',
 ' James Tom',
 ' Dan Sid',
 ' Sid Dan',
 ' Martin Amy',
 ' Amy Martin']

logits_to_ave_logit_diff() gets the logits for the specified answer tokens, then subtracts the logit at the second index of each answer token tuple from the logit at the first index. So the corrupted logit difference should be negative: for a corrupted prompt, the token at the second index should get a higher logit than the token at the first index, regardless of what the second token is.

Thus, the corrupted input's result just has to rank the second token, answer_tokens[i][1], above the first token, answer_tokens[i][0]; the exact ranking doesn't matter.
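
The pair swap can be checked on plain strings, without running the model. In the sketch below (using the first prompt pair from the demo and the made-up names `ioi_prompts` / `ioi_answers` to avoid overwriting notebook variables), each corrupted prompt has the first answer as the giver, so the name the model should now predict is the second answer:

```python
ioi_prompts = [
    "When John and Mary went to the shops, John gave the bag to",
    "When John and Mary went to the shops, Mary gave the bag to",
]
ioi_answers = [(" Mary", " John"), (" John", " Mary")]  # (correct, incorrect) for the clean prompts

# Swap the prompts within each pair, as in the corruption cell above.
ioi_corrupted = []
for i in range(0, len(ioi_prompts), 2):
    ioi_corrupted.append(ioi_prompts[i + 1])
    ioi_corrupted.append(ioi_prompts[i])

giver_is_first_answer = []
for prompt, (first, second) in zip(ioi_corrupted, ioi_answers):
    # If `first` is the giver in the corrupted prompt, the expected completion
    # is `second`, so (logit of first) - (logit of second) should be negative.
    giver_is_first_answer.append(f",{first} gave" in prompt)
    print(prompt, "| expected completion:", repr(second))

print(giver_is_first_answer)
```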

In [None]:
final_logits = corrupted_logits[:, -1, :]
final_logits.shape

torch.Size([8, 50257])

### Corrupt number sequences

In [None]:
prompts = [
"1 2 3 4",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" 5", " 6"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
answer_tokens = torch.tensor(answer_tokens).cuda()

tokens = model.to_tokens(prompts, prepend_bos=True)
tokens = tokens.cuda() # Move the tokens to the GPU
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([7.3265], device='cuda:0')
Average logit difference: 7.326506614685059


In [None]:
example_prompt = "1 2 3 5"
example_answer = " 6"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '1', ' 2', ' 3', ' 5']
Tokenized answer: [' 6']


Top 0th token. Logit: 15.91 Prob: 41.51% Token: | 6|
Top 1th token. Logit: 14.48 Prob:  9.93% Token: | 7|
Top 2th token. Logit: 14.02 Prob:  6.25% Token: | 4|
Top 3th token. Logit: 13.89 Prob:  5.46% Token: | 8|
Top 4th token. Logit: 13.84 Prob:  5.20% Token: | 11|
Top 5th token. Logit: 13.23 Prob:  2.82% Token: |
|
Top 6th token. Logit: 13.20 Prob:  2.75% Token: | 5|
Top 7th token. Logit: 13.00 Prob:  2.25% Token: | 10|
Top 8th token. Logit: 12.77 Prob:  1.79% Token: | 9|
Top 9th token. Logit: 12.68 Prob:  1.64% Token: | 1|


Since this corrupted sequence ranks 6 above 5, we'll use it.
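
The ranking can also be turned into a number directly from the printed table, which is a handy sanity check against the corrupted logit difference the notebook reports for this prompt (-2.7137). The regular expression below assumes the line format shown in this notebook's `test_prompt` output; it is a sketch, not a TransformerLens utility.

```python
import re

# Lines copied from the test_prompt output above.
output_lines = [
    "Top 0th token. Logit: 15.91 Prob: 41.51% Token: | 6|",
    "Top 6th token. Logit: 13.20 Prob:  2.75% Token: | 5|",
]

pattern = re.compile(r"Logit:\s*(-?[\d.]+)\s+Prob:\s*([\d.]+)%\s+Token:\s*\|(.*)\|")

logit_by_token = {}
for line in output_lines:
    match = pattern.search(line)
    if match:
        logit, _prob, token = match.groups()
        logit_by_token[token] = float(logit)

# Same convention as logits_to_ave_logit_diff: correct (" 5") minus incorrect (" 6").
table_logit_diff = logit_by_token[" 5"] - logit_by_token[" 6"]
print(f"{table_logit_diff:.2f}")
```

The result matches the reported value up to the two-decimal rounding of the table.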

In [None]:
corrupted_prompts = ["1 2 3 5",]
corrupted_tokens = model.to_tokens(corrupted_prompts, prepend_bos=True)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens, return_type="logits")
corrupted_average_logit_diff = logits_to_ave_logit_diff(corrupted_logits, answer_tokens)
print("Corrupted Average Logit Diff", corrupted_average_logit_diff)
print("Clean Average Logit Diff", original_average_logit_diff)

Corrupted Average Logit Diff tensor(-2.7137, device='cuda:0')
Clean Average Logit Diff tensor(7.3265, device='cuda:0')


In [None]:
def patch_residual_component(
    corrupted_residual_component: Float[torch.Tensor, "batch pos d_model"],
    hook, 
    pos, 
    clean_cache):
    corrupted_residual_component[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return corrupted_residual_component

def normalize_patched_logit_diff(patched_logit_diff):
    # Subtract corrupted logit diff to measure the improvement, divide by the total improvement from clean to corrupted to normalise
    # 0 means zero change, negative means actively made worse, 1 means totally recovered clean performance, >1 means actively *improved* on clean performance
    return (patched_logit_diff - corrupted_average_logit_diff)/(original_average_logit_diff - corrupted_average_logit_diff)

# patched_residual_stream_diff = torch.zeros(model.cfg.n_layers, tokens.shape[1], dtype=torch.float32)
patched_residual_stream_diff = torch.zeros(model.cfg.n_layers, tokens.shape[1], device="cuda", dtype=torch.float32)
for layer in range(model.cfg.n_layers):
    for position in range(tokens.shape[1]):
        hook_fn = partial(patch_residual_component, pos=position, clean_cache=cache)
        patched_logits = model.run_with_hooks(
            corrupted_tokens, 
            fwd_hooks = [(utils.get_act_name("resid_pre", layer), 
                hook_fn)], 
            return_type="logits"
        )
        patched_logit_diff = logits_to_ave_logit_diff(patched_logits, answer_tokens)

        patched_residual_stream_diff[layer, position] = normalize_patched_logit_diff(patched_logit_diff)

In [None]:
prompt_position_labels = [f"{tok}_{i}" for i, tok in enumerate(model.to_str_tokens(tokens[0]))]
imshow(patched_residual_stream_diff, x=prompt_position_labels, title="Logit Difference From Patched Residual Stream", labels={"x":"Position", "y":"Layer"})

At ALL layers, patching the residual stream has a large effect at the last token position. That is expected, because the last token is the only one that was changed. This is such a short sequence that we don't expect important early positions like in the more elaborate sentences.
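
There is also a structural reason for this. In a causal model the residual stream at a position depends only on tokens at or before that position, so every position before the first changed token has identical activations in the clean and corrupted runs, and patching there cannot change the output. A small sketch that finds the first differing position (the helper is hypothetical, and the clean tokenization is assumed by analogy with the corrupted one printed above):

```python
def first_differing_position(clean_tokens, corrupted_tokens):
    """Index of the first position where the two token sequences differ, or None."""
    for position, (clean, corrupted) in enumerate(zip(clean_tokens, corrupted_tokens)):
        if clean != corrupted:
            return position
    return None

# String tokens for "1 2 3 4" (assumed) and "1 2 3 5" (as printed by test_prompt).
clean_str_tokens = ["<|endoftext|>", "1", " 2", " 3", " 4"]
corrupted_str_tokens = ["<|endoftext|>", "1", " 2", " 3", " 5"]

first_diff = first_differing_position(clean_str_tokens, corrupted_str_tokens)
print(first_diff)
```

Only positions from `first_diff` onward can show a non-zero patching effect, which here leaves just the final token.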

In [None]:
example_prompt = "1 2 3"
example_answer = " 4"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', '1', ' 2', ' 3']
Tokenized answer: [' 4']


Top 0th token. Logit: 18.40 Prob: 94.97% Token: | 4|
Top 1th token. Logit: 14.30 Prob:  1.57% Token: |
|
Top 2th token. Logit: 13.08 Prob:  0.47% Token: | Next|
Top 3th token. Logit: 12.52 Prob:  0.27% Token: | 5|
Top 4th token. Logit: 11.99 Prob:  0.16% Token: | 3|
Top 5th token. Logit: 11.44 Prob:  0.09% Token: | 1|
Top 6th token. Logit: 11.14 Prob:  0.07% Token: | 6|
Top 7th token. Logit: 10.91 Prob:  0.05% Token: | View|
Top 8th token. Logit: 10.88 Prob:  0.05% Token: | >|
Top 9th token. Logit: 10.79 Prob:  0.05% Token: |

|
