# Prelims

<b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator.</b>

This is because the models are loaded onto the GPU when one is available (`device = "cuda"`); the notebook falls back to CPU otherwise.

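This task is like the name-mover task, but instead of a name the model has to detect the description that was previously given to the same subject. The goal is to check which components are involved; we expect induction heads, "adjective movers", "adjective inhibition" heads, and duplicate heads.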

# Setup
(No need to read)

In [None]:
# Janky code to do different setup when run in a Colab notebook vs VSCode
DEBUG_MODE = False
try:
    import google.colab
    IN_COLAB = True
    print("Running as a Colab notebook")
    %pip install git+https://github.com/neelnanda-io/TransformerLens.git
    # Install another version of node that makes PySvelte work way faster
    !curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -; sudo apt-get install -y nodejs
    %pip install git+https://github.com/neelnanda-io/PySvelte.git
except ImportError:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    # Code to automatically update the HookedTransformer code as it's edited, without restarting the kernel
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

In [None]:
# Plotly needs a different renderer for VSCode/Notebooks vs Colab argh
import plotly.io as pio

if IN_COLAB or not DEBUG_MODE:
    # Thanks to annoying rendering issues, Plotly graphics will either show up in colab OR Vscode depending on the renderer - this is bad for developing demos! Thus creating a debug mode.
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "png"

In [None]:
# Import stuff
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import einops
from fancy_einsum import einsum
import tqdm.notebook as tqdm
import random
from pathlib import Path
import plotly.express as px
from torch.utils.data import DataLoader

from jaxtyping import Float, Int
from typing import List, Union, Optional
from functools import partial
import copy

import itertools
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import dataclasses
import datasets
from IPython.display import HTML

In [None]:
import pysvelte

import transformer_lens
import transformer_lens.utils as utils
from transformer_lens.hook_points import (
    HookedRootModule,
    HookPoint,
)  # Hooking utilities
from transformer_lens import HookedTransformer, HookedTransformerConfig, FactoredMatrix, ActivationCache

We turn automatic differentiation off to save GPU memory, since this notebook focuses on model inference, not training.

In [None]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7f6344fb74f0>

Plotting helper functions:

In [None]:
def imshow(tensor, renderer=None, **kwargs):
    px.imshow(utils.to_numpy(tensor), color_continuous_midpoint=0.0, color_continuous_scale="RdBu", **kwargs).show(renderer)

def line(tensor, renderer=None, **kwargs):
    px.line(y=utils.to_numpy(tensor), **kwargs).show(renderer)

def scatter(x, y, xaxis="", yaxis="", caxis="", renderer=None, **kwargs):
    x = utils.to_numpy(x)
    y = utils.to_numpy(y)
    px.scatter(y=y, x=x, labels={"x":xaxis, "y":yaxis, "color":caxis}, **kwargs).show(renderer)

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Analyze GPT-2-Small
GPT-2-small has roughly 85M parameters excluding embeddings (124M in total).

In [None]:
model = HookedTransformer.from_pretrained("gpt2-small", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-small into HookedTransformer


To try the model out, we run it on some short test prompts. Models can be run on a single string or a tensor of tokens (shape: [batch, position], all integers), and the possible return types are:
* "logits" (shape [batch, position, d_vocab], floats), 
* "loss" (the cross-entropy loss when predicting the next token), 
* "both" (a tuple of (logits, loss)) 
* None (run the model, but don't calculate the logits - this is faster when we only want to use intermediate activations)
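To make the "loss" return type concrete, here is a minimal pure-Python sketch of next-token cross-entropy computed from logits. It is an illustration only, not TransformerLens's implementation; `log_softmax`, `next_token_loss`, and the list-based inputs are hypothetical names chosen for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits, tokens):
    """Average cross-entropy for predicting tokens[i + 1] from logits[i].

    logits: one list of d_vocab floats per position (a single sequence).
    tokens: the integer token id at each position.
    """
    losses = []
    for pos in range(len(tokens) - 1):
        log_probs = log_softmax(logits[pos])
        losses.append(-log_probs[tokens[pos + 1]])
    return sum(losses) / len(losses)

# Toy example: 3 positions, vocabulary of 4 tokens, uniform logits.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_tokens = [2, 0, 3]
toy_loss = next_token_loss(toy_logits, toy_tokens)
print(toy_loss)
```

With uniform logits every token is equally likely, so the loss equals log(4), the entropy of a uniform guess over four tokens.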

## Test prompts

In [None]:
example_prompt = "John is red. Mary is blue. Connor is green. Mary is"
example_answer = " blue"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' Connor', ' is', ' green', '.', ' Mary', ' is']
Tokenized answer: [' blue']


Top 0th token. Logit: 17.50 Prob: 21.44% Token: | yellow|
Top 1th token. Logit: 17.37 Prob: 18.78% Token: | blue|
Top 2th token. Logit: 16.87 Prob: 11.34% Token: | purple|
Top 3th token. Logit: 16.66 Prob:  9.21% Token: | red|
Top 4th token. Logit: 16.56 Prob:  8.35% Token: | white|
Top 5th token. Logit: 15.96 Prob:  4.59% Token: | pink|
Top 6th token. Logit: 15.81 Prob:  3.94% Token: | black|
Top 7th token. Logit: 15.41 Prob:  2.65% Token: | orange|
Top 8th token. Logit: 15.33 Prob:  2.45% Token: | brown|
Top 9th token. Logit: 15.24 Prob:  2.22% Token: | green|


In [None]:
example_prompt = "John is red. Mary is blue. Mary is"
example_answer = " blue"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' Mary', ' is']
Tokenized answer: [' blue']


Top 0th token. Logit: 16.26 Prob: 15.46% Token: | white|
Top 1th token. Logit: 15.98 Prob: 11.66% Token: | green|
Top 2th token. Logit: 15.68 Prob:  8.63% Token: | yellow|
Top 3th token. Logit: 15.53 Prob:  7.38% Token: | purple|
Top 4th token. Logit: 15.49 Prob:  7.13% Token: | red|
Top 5th token. Logit: 15.41 Prob:  6.61% Token: | pink|
Top 6th token. Logit: 15.32 Prob:  5.99% Token: | blue|
Top 7th token. Logit: 15.06 Prob:  4.64% Token: | black|
Top 8th token. Logit: 14.55 Prob:  2.80% Token: | orange|
Top 9th token. Logit: 14.33 Prob:  2.24% Token: | a|


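That's really bad: the correct answer, " blue", is only the seventh most likely token (5.99%), behind colors such as " white" and " green" that never appear in the prompt.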
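To put a number on this, the printed top-10 list can be parsed to find where the intended answer lands. This is a small sketch that assumes the output format shown above; `answer_rank` is a hypothetical helper, not part of TransformerLens.

```python
import re

# Lines copied from the output of the two-subject prompt above.
OUTPUT = """\
Top 0th token. Logit: 16.26 Prob: 15.46% Token: | white|
Top 1th token. Logit: 15.98 Prob: 11.66% Token: | green|
Top 2th token. Logit: 15.68 Prob:  8.63% Token: | yellow|
Top 3th token. Logit: 15.53 Prob:  7.38% Token: | purple|
Top 4th token. Logit: 15.49 Prob:  7.13% Token: | red|
Top 5th token. Logit: 15.41 Prob:  6.61% Token: | pink|
Top 6th token. Logit: 15.32 Prob:  5.99% Token: | blue|
"""

LINE_RE = re.compile(
    r"Top (\d+)th token\. Logit:\s*([\d.]+) Prob:\s*([\d.]+)% Token: \|(.*)\|"
)

def answer_rank(output, answer):
    """Return (rank, prob_percent) of `answer`, or None if it is not listed."""
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(4) == answer:
            return int(match.group(1)), float(match.group(3))
    return None

result = answer_rank(OUTPUT, " blue")
print(result)
```

Ranks are zero-indexed, as in the printed output, and the answer string keeps its leading space so that it matches the token exactly.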

### Opposites

In [None]:
example_prompt = "John is tall. Mary is"
example_answer = " short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' tall', '.', ' Mary', ' is']
Tokenized answer: [' short']


Top 0th token. Logit: 16.89 Prob: 21.23% Token: | short|
Top 1th token. Logit: 16.05 Prob:  9.19% Token: | thin|
Top 2th token. Logit: 15.89 Prob:  7.78% Token: | tall|
Top 3th token. Logit: 15.51 Prob:  5.34% Token: | small|
Top 4th token. Logit: 15.27 Prob:  4.22% Token: | slim|
Top 5th token. Logit: 15.15 Prob:  3.74% Token: | a|
Top 6th token. Logit: 14.93 Prob:  2.99% Token: | slender|
Top 7th token. Logit: 14.77 Prob:  2.56% Token: | skinny|
Top 8th token. Logit: 14.34 Prob:  1.66% Token: | not|
Top 9th token. Logit: 14.29 Prob:  1.58% Token: | shorter|


In [None]:
example_prompt = "John is black. Mary is"
example_answer = " white"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' black', '.', ' Mary', ' is']
Tokenized answer: [' white']


Top 0th token. Logit: 18.16 Prob: 65.14% Token: | white|
Top 1th token. Logit: 15.70 Prob:  5.58% Token: | black|
Top 2th token. Logit: 15.19 Prob:  3.36% Token: | brown|
Top 3th token. Logit: 14.79 Prob:  2.24% Token: | a|
Top 4th token. Logit: 14.33 Prob:  1.41% Token: | Hispanic|
Top 5th token. Logit: 13.97 Prob:  0.99% Token: | blue|
Top 6th token. Logit: 13.90 Prob:  0.92% Token: | Asian|
Top 7th token. Logit: 13.72 Prob:  0.77% Token: | blonde|
Top 8th token. Logit: 13.61 Prob:  0.69% Token: | red|
Top 9th token. Logit: 13.60 Prob:  0.68% Token: | female|


In [None]:
example_prompt = "John is blue. Mary is"
example_answer = " red"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' blue', '.', ' Mary', ' is']
Tokenized answer: [' red']


Top 0th token. Logit: 16.01 Prob: 14.53% Token: | green|
Top 1th token. Logit: 15.78 Prob: 11.50% Token: | white|
Top 2th token. Logit: 15.56 Prob:  9.24% Token: | red|
Top 3th token. Logit: 15.09 Prob:  5.78% Token: | pink|
Top 4th token. Logit: 15.07 Prob:  5.69% Token: | yellow|
Top 5th token. Logit: 15.05 Prob:  5.53% Token: | blue|
Top 6th token. Logit: 14.82 Prob:  4.43% Token: | black|
Top 7th token. Logit: 14.81 Prob:  4.35% Token: | purple|
Top 8th token. Logit: 14.25 Prob:  2.50% Token: | brown|
Top 9th token. Logit: 14.20 Prob:  2.38% Token: | a|


In [None]:
example_prompt = "John is big. Mary is"
example_answer = " small"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' big', '.', ' Mary', ' is']
Tokenized answer: [' small']


Top 0th token. Logit: 15.50 Prob: 21.77% Token: | big|
Top 1th token. Logit: 15.12 Prob: 14.87% Token: | small|
Top 2th token. Logit: 13.93 Prob:  4.52% Token: | a|
Top 3th token. Logit: 13.23 Prob:  2.24% Token: | not|
Top 4th token. Logit: 12.99 Prob:  1.76% Token: | tiny|
Top 5th token. Logit: 12.70 Prob:  1.32% Token: | quiet|
Top 6th token. Logit: 12.63 Prob:  1.23% Token: | short|
Top 7th token. Logit: 12.53 Prob:  1.11% Token: | bigger|
Top 8th token. Logit: 12.53 Prob:  1.11% Token: | very|
Top 9th token. Logit: 12.52 Prob:  1.10% Token: | huge|


In [None]:
example_prompt = "John is wide. Mary is"
example_answer = " thin"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' wide', '.', ' Mary', ' is']
Tokenized answer: [' thin']


Top 0th token. Logit: 15.13 Prob: 14.18% Token: | short|
Top 1th token. Logit: 14.46 Prob:  7.25% Token: | small|
Top 2th token. Logit: 13.76 Prob:  3.59% Token: | narrow|
Top 3th token. Logit: 13.61 Prob:  3.11% Token: | wide|
Top 4th token. Logit: 13.43 Prob:  2.58% Token: | tall|
Top 5th token. Logit: 13.28 Prob:  2.23% Token: | in|
Top 6th token. Logit: 13.28 Prob:  2.22% Token: | not|
Top 7th token. Logit: 13.12 Prob:  1.90% Token: | thin|
Top 8th token. Logit: 13.12 Prob:  1.90% Token: | a|
Top 9th token. Logit: 12.79 Prob:  1.36% Token: | flat|


In [None]:
example_prompt = "Tall."
example_answer = " Short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'T', 'all', '.']
Tokenized answer: [' Short']


Top 0th token. Logit: 12.93 Prob:  7.35% Token: |
|
Top 1th token. Logit: 12.61 Prob:  5.34% Token: | Tall|
Top 2th token. Logit: 11.54 Prob:  1.82% Token: | But|
Top 3th token. Logit: 11.47 Prob:  1.70% Token: | I|
Top 4th token. Logit: 11.42 Prob:  1.61% Token: | The|
Top 5th token. Logit: 11.38 Prob:  1.55% Token: | It|
Top 6th token. Logit: 11.31 Prob:  1.45% Token: | A|
Top 7th token. Logit: 11.09 Prob:  1.16% Token: | Strong|
Top 8th token. Logit: 11.07 Prob:  1.14% Token: | And|
Top 9th token. Logit: 10.91 Prob:  0.97% Token: | He|


In [None]:
example_prompt = "tall, "
example_answer = " short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'tall', ',', ' ']
Tokenized answer: [' short']


Top 0th token. Logit: 14.72 Prob: 17.99% Token: | |
Top 1th token. Logit: 14.12 Prob:  9.89% Token: |iced|
Top 2th token. Logit: 13.95 Prob:  8.30% Token: |ich|
Top 3th token. Logit: 13.83 Prob:  7.42% Token: |ive|
Top 4th token. Logit: 12.90 Prob:  2.91% Token: |ix|
Top 5th token. Logit: 12.82 Prob:  2.70% Token: |ik|
Top 6th token. Logit: 12.66 Prob:  2.30% Token: |ike|
Top 7th token. Logit: 12.62 Prob:  2.20% Token: |icky|
Top 8th token. Logit: 12.60 Prob:  2.16% Token: |ia|
Top 9th token. Logit: 12.34 Prob:  1.67% Token: |urn|


In [None]:
example_prompt = "Big."
example_answer = " Small"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Big', '.']
Tokenized answer: [' Small']


Top 0th token. Logit: 13.10 Prob: 13.40% Token: |
|
Top 1th token. Logit: 11.63 Prob:  3.09% Token: | Big|
Top 2th token. Logit: 11.47 Prob:  2.64% Token: | I|
Top 3th token. Logit: 10.94 Prob:  1.54% Token: | You|
Top 4th token. Logit: 10.93 Prob:  1.54% Token: | The|
Top 5th token. Logit: 10.82 Prob:  1.37% Token: | It|
Top 6th token. Logit: 10.56 Prob:  1.06% Token: |Big|
Top 7th token. Logit: 10.48 Prob:  0.97% Token: | That|
Top 8th token. Logit: 10.37 Prob:  0.87% Token: | We|
Top 9th token. Logit: 10.28 Prob:  0.80% Token: | This|


In [None]:
example_prompt = "Is big. Is"
example_answer = " small"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Is', ' big', '.', ' Is']
Tokenized answer: [' small']


Top 0th token. Logit: 13.57 Prob: 12.15% Token: | big|
Top 1th token. Logit: 13.52 Prob: 11.60% Token: | small|
Top 2th token. Logit: 12.84 Prob:  5.85% Token: | it|
Top 3th token. Logit: 11.31 Prob:  1.26% Token: | the|
Top 4th token. Logit: 11.23 Prob:  1.17% Token: | not|
Top 5th token. Logit: 11.21 Prob:  1.15% Token: | huge|
Top 6th token. Logit: 11.21 Prob:  1.15% Token: | a|
Top 7th token. Logit: 11.16 Prob:  1.09% Token: | good|
Top 8th token. Logit: 11.03 Prob:  0.96% Token: | easy|
Top 9th token. Logit: 10.93 Prob:  0.86% Token: | he|


In [None]:
example_prompt = "Is tall. Is"
example_answer = " short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Is', ' tall', '.', ' Is']
Tokenized answer: [' short']


Top 0th token. Logit: 14.20 Prob:  9.94% Token: | skinny|
Top 1th token. Logit: 13.60 Prob:  5.43% Token: | tall|
Top 2th token. Logit: 13.56 Prob:  5.24% Token: | short|
Top 3th token. Logit: 13.50 Prob:  4.92% Token: | thin|
Top 4th token. Logit: 12.73 Prob:  2.28% Token: | fat|
Top 5th token. Logit: 12.64 Prob:  2.09% Token: | a|
Top 6th token. Logit: 12.44 Prob:  1.71% Token: | slim|
Top 7th token. Logit: 12.40 Prob:  1.65% Token: | light|
Top 8th token. Logit: 12.29 Prob:  1.48% Token: | big|
Top 9th token. Logit: 12.27 Prob:  1.44% Token: | small|


GPT-2-small can't do this as well as GPT-2-large.

# Test GPT-2-Large

In [None]:
model = HookedTransformer.from_pretrained("gpt2-large", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-large into HookedTransformer


In [None]:
example_prompt = "John is red. Mary is blue. Connor is green. Mary is"
example_answer = " blue"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' Connor', ' is', ' green', '.', ' Mary', ' is']
Tokenized answer: [' blue']


Top 0th token. Logit: 13.31 Prob:  9.44% Token: | a|
Top 1th token. Logit: 12.60 Prob:  4.66% Token: | red|
Top 2th token. Logit: 12.49 Prob:  4.15% Token: | blue|
Top 3th token. Logit: 12.31 Prob:  3.47% Token: | white|
Top 4th token. Logit: 12.25 Prob:  3.25% Token: | green|
Top 5th token. Logit: 12.06 Prob:  2.71% Token: | purple|
Top 6th token. Logit: 12.04 Prob:  2.64% Token: | the|
Top 7th token. Logit: 11.98 Prob:  2.49% Token: | yellow|
Top 8th token. Logit: 11.69 Prob:  1.87% Token: | not|
Top 9th token. Logit: 11.68 Prob:  1.85% Token: | pink|


In [None]:
example_prompt = "John is red. Mary is blue. Mary is"
example_answer = " blue"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' Mary', ' is']
Tokenized answer: [' blue']


Top 0th token. Logit: 13.40 Prob: 10.11% Token: | a|
Top 1th token. Logit: 12.73 Prob:  5.18% Token: | pregnant|
Top 2th token. Logit: 12.14 Prob:  2.89% Token: | red|
Top 3th token. Logit: 11.92 Prob:  2.30% Token: | the|
Top 4th token. Logit: 11.70 Prob:  1.85% Token: | sick|
Top 5th token. Logit: 11.68 Prob:  1.81% Token: | in|
Top 6th token. Logit: 11.66 Prob:  1.79% Token: | not|
Top 7th token. Logit: 11.56 Prob:  1.60% Token: | white|
Top 8th token. Logit: 11.47 Prob:  1.47% Token: | green|
Top 9th token. Logit: 11.36 Prob:  1.32% Token: | blue|


In [None]:
example_prompt = "John is red. Mary is blue. John is"
example_answer = " red"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' John', ' is']
Tokenized answer: [' red']


Top 0th token. Logit: 13.54 Prob: 12.94% Token: | a|
Top 1th token. Logit: 12.19 Prob:  3.36% Token: | the|
Top 2th token. Logit: 11.95 Prob:  2.63% Token: | in|
Top 3th token. Logit: 11.73 Prob:  2.11% Token: | red|
Top 4th token. Logit: 11.70 Prob:  2.05% Token: | white|
Top 5th token. Logit: 11.47 Prob:  1.63% Token: | not|
Top 6th token. Logit: 11.44 Prob:  1.59% Token: | green|
Top 7th token. Logit: 11.41 Prob:  1.54% Token: | blue|
Top 8th token. Logit: 11.40 Prob:  1.53% Token: | an|
Top 9th token. Logit: 11.22 Prob:  1.28% Token: | black|


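It fails: the top prediction is " a" for all three prompts. As with subjects, the model seems to pick up on "order" rather than on who the color belongs to: red came first, so it ranks " red" above " blue" whether the query is Mary or John. With subjects, it outputs the most recent subject.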
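Since the same prompt shape is retyped with different queries, a small generator helps keep the variants consistent. This is a sketch only; `build_prompt` is a hypothetical helper, and it assumes every fact has the form "<subject> is <adjective>." as in the cells above.

```python
def build_prompt(facts, query):
    """Build a retrieval prompt from (subject, adjective) pairs.

    Returns (prompt, answer). The answer carries a leading space because
    GPT-2 tokenizes mid-sentence words together with the preceding space.
    """
    lookup = dict(facts)
    if query not in lookup:
        raise ValueError(f"{query!r} has no stated adjective")
    context = " ".join(f"{subject} is {adjective}." for subject, adjective in facts)
    return f"{context} {query} is", f" {lookup[query]}"

facts = [("John", "red"), ("Mary", "blue")]
prompts = [build_prompt(facts, name) for name, _ in facts]
for prompt, answer in prompts:
    print(repr(prompt), "->", repr(answer))
```

Each pair can then be passed to `utils.test_prompt(prompt, answer, model, prepend_bos=True)` exactly as in the cells above.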

In [None]:
example_prompt = "John and Mary went to the store. John gave the gift to"
example_answer = " Mary"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' and', ' Mary', ' went', ' to', ' the', ' store', '.', ' John', ' gave', ' the', ' gift', ' to']
Tokenized answer: [' Mary']


Top 0th token. Logit: 18.66 Prob: 79.38% Token: | Mary|
Top 1th token. Logit: 16.23 Prob:  7.01% Token: | his|
Top 2th token. Logit: 15.74 Prob:  4.30% Token: | the|
Top 3th token. Logit: 14.41 Prob:  1.13% Token: | her|
Top 4th token. Logit: 14.17 Prob:  0.89% Token: | a|
Top 5th token. Logit: 13.98 Prob:  0.74% Token: | John|
Top 6th token. Logit: 13.62 Prob:  0.52% Token: | me|
Top 7th token. Logit: 12.80 Prob:  0.23% Token: | Mrs|
Top 8th token. Logit: 12.41 Prob:  0.15% Token: | Jesus|
Top 9th token. Logit: 12.38 Prob:  0.15% Token: | him|


# Test GPT-2-xl

In [None]:
model = HookedTransformer.from_pretrained("gpt2-xl", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer


In [None]:
example_prompt = "John is red. Mary is blue. John is"
example_answer = " red"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' John', ' is']
Tokenized answer: [' red']


Top 0th token. Logit: 12.75 Prob:  7.00% Token: | a|
Top 1th token. Logit: 12.08 Prob:  3.57% Token: | red|
Top 2th token. Logit: 11.88 Prob:  2.92% Token: | green|
Top 3th token. Logit: 11.82 Prob:  2.76% Token: | blue|
Top 4th token. Logit: 11.75 Prob:  2.57% Token: | the|
Top 5th token. Logit: 11.73 Prob:  2.51% Token: | white|
Top 6th token. Logit: 11.50 Prob:  2.01% Token: | in|
Top 7th token. Logit: 11.36 Prob:  1.74% Token: | black|
Top 8th token. Logit: 11.35 Prob:  1.72% Token: | not|
Top 9th token. Logit: 11.08 Prob:  1.31% Token: | wearing|


In [None]:
example_prompt = "John is red. Mary is blue. Mary is"
example_answer = " blue"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' blue', '.', ' Mary', ' is']
Tokenized answer: [' blue']


Top 0th token. Logit: 13.01 Prob:  8.14% Token: | red|
Top 1th token. Logit: 12.95 Prob:  7.62% Token: | a|
Top 2th token. Logit: 12.77 Prob:  6.35% Token: | blue|
Top 3th token. Logit: 12.07 Prob:  3.18% Token: | the|
Top 4th token. Logit: 11.91 Prob:  2.70% Token: | also|
Top 5th token. Logit: 11.87 Prob:  2.58% Token: | green|
Top 6th token. Logit: 11.69 Prob:  2.17% Token: | white|
Top 7th token. Logit: 11.69 Prob:  2.16% Token: | not|
Top 8th token. Logit: 11.50 Prob:  1.79% Token: | in|
Top 9th token. Logit: 11.17 Prob:  1.28% Token: | pink|


This also fails: for the Mary prompt, " red" (8.14%) and " a" (7.62%) both outrank the correct " blue" (6.35%).

In [None]:
example_prompt = "John is red. Mary is not red. Mary is"
example_answer = " not"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' not', ' red', '.', ' Mary', ' is']
Tokenized answer: [' not']


Top 0th token. Logit: 14.27 Prob: 17.25% Token: | not|
Top 1th token. Logit: 13.69 Prob:  9.64% Token: | a|
Top 2th token. Logit: 13.15 Prob:  5.64% Token: | white|
Top 3th token. Logit: 12.99 Prob:  4.80% Token: | red|
Top 4th token. Logit: 12.89 Prob:  4.32% Token: | blue|
Top 5th token. Logit: 12.83 Prob:  4.07% Token: | green|
Top 6th token. Logit: 12.34 Prob:  2.51% Token: | the|
Top 7th token. Logit: 12.17 Prob:  2.12% Token: | black|
Top 8th token. Logit: 11.71 Prob:  1.33% Token: | brown|
Top 9th token. Logit: 11.63 Prob:  1.22% Token: | yellow|


In [None]:
example_prompt = "John is red. Mary is not red. John is"
example_answer = " red"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is', ' not', ' red', '.', ' John', ' is']
Tokenized answer: [' red']


Top 0th token. Logit: 13.98 Prob: 14.13% Token: | not|
Top 1th token. Logit: 13.42 Prob:  8.05% Token: | a|
Top 2th token. Logit: 13.21 Prob:  6.50% Token: | blue|
Top 3th token. Logit: 12.87 Prob:  4.62% Token: | white|
Top 4th token. Logit: 12.53 Prob:  3.31% Token: | black|
Top 5th token. Logit: 12.48 Prob:  3.12% Token: | green|
Top 6th token. Logit: 12.31 Prob:  2.64% Token: | red|
Top 7th token. Logit: 12.23 Prob:  2.44% Token: | the|
Top 8th token. Logit: 11.96 Prob:  1.86% Token: | brown|
Top 9th token. Logit: 11.82 Prob:  1.62% Token: | yellow|


Unlike before, it now repeats the most recent description: " not" is the top prediction for both Mary and John, even though John is red.

In [None]:
example_prompt = "John is red. Mary is"
example_answer = " red"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' red', '.', ' Mary', ' is']
Tokenized answer: [' red']


Top 0th token. Logit: 15.55 Prob: 29.17% Token: | blue|
Top 1th token. Logit: 14.50 Prob: 10.11% Token: | white|
Top 2th token. Logit: 14.01 Prob:  6.25% Token: | green|
Top 3th token. Logit: 13.80 Prob:  5.02% Token: | red|
Top 4th token. Logit: 13.53 Prob:  3.85% Token: | purple|
Top 5th token. Logit: 13.43 Prob:  3.50% Token: | black|
Top 6th token. Logit: 13.28 Prob:  3.00% Token: | a|
Top 7th token. Logit: 13.12 Prob:  2.55% Token: | yellow|
Top 8th token. Logit: 12.85 Prob:  1.95% Token: | pink|
Top 9th token. Logit: 12.72 Prob:  1.71% Token: | orange|


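Strangely, though, the model seems to have a circuit that wants to output opposites (or contrasting attributes) given this structure: after "John is red.", it predicts " blue" for Mary (29.17%) rather than " red" (5.02%). So it probably needs one head that attends to the "adjective" and another head that "finds" its opposite.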
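One way to quantify this preference is the logit difference between the contrasting token and the repeated one. Under softmax, the ratio of two token probabilities equals the exponential of their logit difference, so the printed logits and probabilities should agree up to rounding. The sketch below only reuses the numbers printed above; the variable names are illustrative.

```python
import math

# Values copied from the GPT-2-xl output for "John is red. Mary is" above.
logit_blue, prob_blue = 15.55, 29.17  # contrasting color
logit_red, prob_red = 13.80, 5.02     # repeated color

logit_diff = logit_blue - logit_red
ratio_from_logits = math.exp(logit_diff)
ratio_from_probs = prob_blue / prob_red

print(f"logit diff: {logit_diff:.2f}")
print(f"probability ratio from logits: {ratio_from_logits:.2f}")
print(f"probability ratio from printed probs: {ratio_from_probs:.2f}")
```

Both ratios come out at roughly 5.8, meaning " blue" is about six times as likely as " red" for this prompt.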

After these further tries, the plan is to stick with the opposite circuits (size difference, color, etc.).
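A possible starting point for studying the opposite circuits is a small prompt set built from the adjective pairs already tried above. This is a sketch under stated assumptions: `opposite_prompts` is a hypothetical helper, and the reversed direction of each pair (e.g. "short" to "tall") is an assumption that was not tested in this notebook.

```python
# Adjective pairs taken from the test prompts above; the template mirrors them.
OPPOSITES = [("tall", "short"), ("black", "white"), ("big", "small"), ("wide", "thin")]
TEMPLATE = "John is {adjective}. Mary is"

def opposite_prompts(pairs, both_directions=True):
    """Return (prompt, answer, distractor) triples.

    The answer is the opposite adjective; the distractor is the adjective
    repeated from the prompt. Both carry GPT-2's leading space.
    """
    examples = []
    for first, second in pairs:
        examples.append((TEMPLATE.format(adjective=first), f" {second}", f" {first}"))
        if both_directions:
            # Assumption: the reversed pair is also a sensible test case.
            examples.append((TEMPLATE.format(adjective=second), f" {first}", f" {second}"))
    return examples

dataset = opposite_prompts(OPPOSITES)
for prompt, answer, distractor in dataset:
    print(repr(prompt), repr(answer), repr(distractor))
```

Keeping the distractor alongside the answer makes it straightforward to compare the opposite against plain repetition for each prompt.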