# Prelims

<b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator.</b>

One reason is that the notebook calls `.cuda()` on the tokenized inputs so that batches are processed in parallel on the GPU.

**INTRODUCTION**

**AIM**: Investigate whether there are circuits for recognizing simple analogies that resemble those found for IOI (indirect object identification), with its duplicate-token heads, S-inhibition heads, etc. Given "source examples" in the input, the task is to test whether the model can correctly complete a target case. For example, one such input is:

    "Mary has a hat. John has a cane. The student is John. Ron has a cane. Horace has a hat. The student is Ron. Ashley has a cane. Ben has a hat. The student is"
    
And the correct answer is "Ashley" because the pattern is "the student has the cane". (The inputs are written to avoid ambiguity: if several patterns fit the examples, there could be multiple correct answers.)
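The prompt format above can be generated programmatically. The following is a minimal sketch; `make_analogy_prompt` and its argument layout are hypothetical helpers introduced here for illustration, not part of the notebook.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (name, item)


def make_analogy_prompt(
    groups: List[Tuple[List[Pair], Optional[str]]], role: str = "student"
) -> str:
    """Build a prompt from (pairs, answer) groups (hypothetical helper).

    Each group lists who has which item, followed by the name that fills the
    role. The final group uses answer=None so the prompt ends right before
    the token the model must predict.
    """
    parts = []
    for pairs, answer in groups:
        parts.extend(f"{name} has a {item}." for name, item in pairs)
        if answer is None:
            parts.append(f"The {role} is")
        else:
            parts.append(f"The {role} is {answer}.")
    return " ".join(parts)


prompt = make_analogy_prompt(
    [
        ([("Mary", "hat"), ("John", "cane")], "John"),
        ([("Ron", "cane"), ("Horace", "hat")], "Ron"),
        ([("Ashley", "cane"), ("Ben", "hat")], None),
    ]
)
print(prompt)
```

With these arguments the helper reproduces the example input shown above, and swapping names or items gives new prompts with the same underlying pattern.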

This is inspired by how ChatGPT, given a writing style it hasn't seen before, can mimic its patterns, essentially making "analogies" from its input. Smaller models may not have this ability, so I set out to test them.


# Setup

In [None]:
# Janky code to do different setup when run in a Colab notebook vs VSCode
DEBUG_MODE = False
try:
    import google.colab
    IN_COLAB = True
    print("Running as a Colab notebook")
    %pip install git+https://github.com/neelnanda-io/TransformerLens.git
    # Install another version of node that makes PySvelte work way faster
    !curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -; sudo apt-get install -y nodejs
    %pip install git+https://github.com/neelnanda-io/PySvelte.git
except ImportError:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    # Code to automatically reload the HookedTransformer code as it's edited, without restarting the kernel
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

In [None]:
# Plotly needs a different renderer for VSCode/Notebooks vs Colab argh
import plotly.io as pio

if IN_COLAB or not DEBUG_MODE:
    # Thanks to annoying rendering issues, Plotly graphics will either show up in colab OR Vscode depending on the renderer - this is bad for developing demos! Thus creating a debug mode.
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "png"

In [None]:
# Import stuff
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import einops
from fancy_einsum import einsum
import tqdm.notebook as tqdm
import random
from pathlib import Path
import plotly.express as px
from torch.utils.data import DataLoader

from jaxtyping import Float, Int
from typing import List, Union, Optional
from functools import partial
import copy

import itertools
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import dataclasses
import datasets
from IPython.display import HTML

In [None]:
import pysvelte

import transformer_lens
import transformer_lens.utils as utils
from transformer_lens.hook_points import (
    HookedRootModule,
    HookPoint,
)  # Hooking utilities
from transformer_lens import HookedTransformer, HookedTransformerConfig, FactoredMatrix, ActivationCache

We turn automatic differentiation off, to save GPU memory, as this notebook focuses on model inference not model training.

In [None]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7f7d6fed98a0>

Plotting helper functions:

In [None]:
def imshow(tensor, renderer=None, **kwargs):
    px.imshow(utils.to_numpy(tensor), color_continuous_midpoint=0.0, color_continuous_scale="RdBu", **kwargs).show(renderer)

def line(tensor, renderer=None, **kwargs):
    px.line(y=utils.to_numpy(tensor), **kwargs).show(renderer)

def scatter(x, y, xaxis="", yaxis="", caxis="", renderer=None, **kwargs):
    x = utils.to_numpy(x)
    y = utils.to_numpy(y)
    px.scatter(y=y, x=x, labels={"x":xaxis, "y":yaxis, "color":caxis}, **kwargs).show(renderer)

# Test prompts for GPT-2-Large

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2-large", device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-large into HookedTransformer


In [None]:
example_prompt = "The cat sat on the mat. The cat"
example_answer = " sat"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat', '.', ' The', ' cat']
Tokenized answer: [' sat']


Top 0th token. Logit: 15.16 Prob: 12.29% Token: | sat|
Top 1th token. Logit: 15.00 Prob: 10.49% Token: | was|
Top 2th token. Logit: 14.04 Prob:  4.01% Token: | stood|
Top 3th token. Logit: 13.74 Prob:  2.98% Token: |'s|
Top 4th token. Logit: 13.73 Prob:  2.94% Token: | looked|
Top 5th token. Logit: 13.67 Prob:  2.78% Token: | jumped|
Top 6th token. Logit: 13.30 Prob:  1.91% Token: | had|
Top 7th token. Logit: 12.78 Prob:  1.13% Token: | ran|
Top 8th token. Logit: 12.76 Prob:  1.11% Token: | walked|
Top 9th token. Logit: 12.67 Prob:  1.02% Token: | moved|


In [None]:
example_prompt = "John is big. Mary is small. John is tall. Mary is"
example_answer = " short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'John', ' is', ' big', '.', ' Mary', ' is', ' small', '.', ' John', ' is', ' tall', '.', ' Mary', ' is']
Tokenized answer: [' short']


Top 0th token. Logit: 17.18 Prob: 57.68% Token: | short|
Top 1th token. Logit: 14.84 Prob:  5.54% Token: | small|
Top 2th token. Logit: 14.37 Prob:  3.46% Token: | thin|
Top 3th token. Logit: 14.01 Prob:  2.42% Token: | skinny|
Top 4th token. Logit: 13.91 Prob:  2.19% Token: | tall|
Top 5th token. Logit: 13.56 Prob:  1.54% Token: | very|
Top 6th token. Logit: 13.49 Prob:  1.43% Token: | slim|
Top 7th token. Logit: 13.17 Prob:  1.04% Token: | a|
Top 8th token. Logit: 13.16 Prob:  1.03% Token: | not|
Top 9th token. Logit: 13.15 Prob:  1.02% Token: | slender|


In [None]:
example_prompt = "Tall"
example_answer = " short"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'T', 'all']
Tokenized answer: [' short']


Top 0th token. Logit: 13.04 Prob: 20.38% Token: |inn|
Top 1th token. Logit: 12.23 Prob:  9.03% Token: |ade|
Top 2th token. Logit: 11.88 Prob:  6.42% Token: |,|
Top 3th token. Logit: 11.22 Prob:  3.31% Token: |er|
Top 4th token. Logit: 11.03 Prob:  2.74% Token: | and|
Top 5th token. Logit: 11.02 Prob:  2.71% Token: |ul|
Top 6th token. Logit: 10.36 Prob:  1.40% Token: |grass|
Top 7th token. Logit: 10.22 Prob:  1.22% Token: | Tales|
Top 8th token. Logit: 10.16 Prob:  1.15% Token: | Ships|
Top 9th token. Logit: 10.08 Prob:  1.06% Token: |est|


# Analyze "John is tall. Mary is"

## Compare Logits

For all prompts (even if there is just one), get the model's outputs and compare them with the correct outputs by computing the logit differences.

`original_average_logit_diff` is used by both the direct logit attribution and the activation patching sections, so the cell below should be run before either of them. Otherwise, the two sections can be run independently of each other.
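To make the metric concrete before running it on the model, here is a toy sketch of the same computation on plain Python lists. The logit values and token indices are made up for illustration, and `average_logit_diff` is a hypothetical stand-in for the tensor version used in the notebook.

```python
def average_logit_diff(final_logits, answer_tokens, per_prompt=False):
    """Toy version of the metric: logit(correct) - logit(incorrect).

    final_logits: one list of vocabulary logits per prompt (final position only).
    answer_tokens: one (correct_index, incorrect_index) pair per prompt.
    """
    diffs = [
        logits[correct] - logits[incorrect]
        for logits, (correct, incorrect) in zip(final_logits, answer_tokens)
    ]
    if per_prompt:
        return diffs
    return sum(diffs) / len(diffs)


# Two made-up prompts over a four-token vocabulary
toy_logits = [
    [0.5, 2.0, 1.0, -1.0],
    [1.5, 0.0, 3.0, 0.5],
]
toy_answers = [(1, 2), (2, 0)]

print(average_logit_diff(toy_logits, toy_answers, per_prompt=True))  # [1.0, 1.5]
print(average_logit_diff(toy_logits, toy_answers))  # 1.25
```

A positive value means the correct token is favored over the incorrect one, and averaging over prompts gives a single number to track through the rest of the analysis.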


In [None]:
prompts = [
"John is tall. Mary is",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" short", " tall"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# Move the answer tokens to the same device as the model so that gather() works on CPU or GPU
answer_tokens = torch.tensor(answer_tokens).to(device)

tokens = model.to_tokens(prompts, prepend_bos=True)
tokens = tokens.to(device)  # No-op if the tokens are already on the model's device
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([2.0442])
Average logit difference: 2.044248580932617


## Direct Layer Attribution

### Logit Lens

In [None]:
answer_residual_directions = model.tokens_to_residual_directions(answer_tokens)
print("Answer residual directions shape:", answer_residual_directions.shape)
logit_diff_directions = answer_residual_directions[:, 0] - answer_residual_directions[:, 1]
print("Logit difference directions shape:", logit_diff_directions.shape)

# cache syntax - resid_post is the residual stream at the end of the layer, -1 gets the final layer. The general syntax is [activation_name, layer_index, sub_layer_type]. 
final_residual_stream = cache["resid_post", -1]
print("Final residual stream shape:", final_residual_stream.shape)
final_token_residual_stream = final_residual_stream[:, -1, :]
# Apply LayerNorm scaling
# pos_slice is the subset of the positions we take - here the final token of each prompt
scaled_final_token_residual_stream = cache.apply_ln_to_stack(final_token_residual_stream, layer = -1, pos_slice=-1)

average_logit_diff = einsum("batch d_model, batch d_model -> ", scaled_final_token_residual_stream, logit_diff_directions)/len(prompts)
print("Calculated average logit diff:", average_logit_diff.item())
print("Original logit difference:",original_average_logit_diff.item())

def residual_stack_to_logit_diff(residual_stack: Float[torch.Tensor, "components batch d_model"], cache: ActivationCache) -> float:
    scaled_residual_stack = cache.apply_ln_to_stack(residual_stack, layer = -1, pos_slice=-1)
    return einsum("... batch d_model, batch d_model -> ...", scaled_residual_stack, logit_diff_directions)/len(prompts)
accumulated_residual, labels = cache.accumulated_resid(layer=-1, incl_mid=True, pos_slice=-1, return_labels=True)
logit_lens_logit_diffs = residual_stack_to_logit_diff(accumulated_residual, cache)
line(logit_lens_logit_diffs, x=np.arange(model.cfg.n_layers*2+1)/2, hover_name=labels, title="Logit Difference From Accumulate Residual Stream")

Answer residual directions shape: torch.Size([1, 2, 1280])
Logit difference directions shape: torch.Size([1, 1280])
Final residual stream shape: torch.Size([1, 7, 1280])
Calculated average logit diff: 1.781943440437317
Original logit difference: 2.044248580932617


Values near 0 mean the model weights "tall" and "short" about equally. Negative values mean it weights "tall" more. Then, near the last layers, the difference shoots back up to positive and the model favors "short". Why? Look at the attention heads involved in the last few layers.

In [None]:
per_layer_residual, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
per_layer_logit_diffs = residual_stack_to_logit_diff(per_layer_residual, cache)
line(per_layer_logit_diffs, hover_name=labels, title="Logit Difference From Each Layer")

The plot above has about twice as many points as there are layers because it counts attention and MLP as separate components, and each layer has one attention block and one MLP.
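The layout of the component axis can be sketched as follows. The label strings are an assumption about what `decompose_resid` returns (an embedding, a positional embedding, then one attention output and one MLP output per layer); the count of 36 layers comes from the `[36, 20]` per-head shape printed in this notebook.

```python
n_layers = 36  # gpt2-large, per the [36, 20] per-head shape

# Assumed label layout: embeddings first, then attention and MLP for each layer
component_labels = ["embed", "pos_embed"]
for layer in range(n_layers):
    component_labels.append(f"{layer}_attn_out")
    component_labels.append(f"{layer}_mlp_out")

print(len(component_labels))  # 74

# The accumulated-residual plot instead uses half-integer x positions:
# an integer marks the start of a layer, and .5 marks the point after attention.
accumulated_x = [i / 2 for i in range(n_layers * 2 + 1)]
print(len(accumulated_x), accumulated_x[:3], accumulated_x[-1])  # 73 [0.0, 0.5, 1.0] 36.0
```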

In [None]:
per_head_residual, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
per_head_logit_diffs = residual_stack_to_logit_diff(per_head_residual, cache)
per_head_logit_diffs = einops.rearrange(per_head_logit_diffs, "(layer head_index) -> layer head_index", layer=model.cfg.n_layers, head_index=model.cfg.n_heads)
imshow(per_head_logit_diffs, labels={"x":"Head", "y":"Layer"}, title="Logit Difference From Each Head")

Tried to stack head results when they weren't cached. Computing head results now


Notable:
- H14, L20

Head 13, Layer 30 stands out the most. A lot of the logit for "short" seems to come from there. But does the head itself contain "short" (as an MLP might), or is it moving some information about it from elsewhere? How do we know an attention head is MOVING information? Look at its value matrix and copy score.

We want to look for the heads near the last layers that are responsible for this logit difference.

Automatically list the important heads based on their scores in the heatmap:

In [None]:
per_head_logit_diffs.shape

torch.Size([36, 20])

In [None]:
flattened_diffs = per_head_logit_diffs.flatten()

# Indices of the top_N largest per-head logit differences, in descending order.
# torch.argsort is used instead of np.argsort so this also works for tensors on the GPU.
top_N = 10
sorted_indices = torch.argsort(flattened_diffs, descending=True)[:top_N]

# Convert flat indices back to (layer, head) coordinates
n_layers, n_heads = per_head_logit_diffs.shape
row_indices = (sorted_indices // n_heads).tolist()
col_indices = (sorted_indices % n_heads).tolist()

for i, (row, col) in enumerate(zip(row_indices, col_indices), 1):
    value = per_head_logit_diffs[row, col]
    print(f"Rank {i}: Value={value:.4f}, Layer={row}, Head={col}")

Rank 1: Value=0.6968, Layer=30, Head=13
Rank 2: Value=0.4971, Layer=20, Head=14
Rank 3: Value=0.4949, Layer=25, Head=5
Rank 4: Value=0.3269, Layer=19, Head=14
Rank 5: Value=0.3089, Layer=23, Head=17
Rank 6: Value=0.2569, Layer=26, Head=0
Rank 7: Value=0.2000, Layer=24, Head=17
Rank 8: Value=0.1951, Layer=27, Head=11
Rank 9: Value=0.1836, Layer=17, Head=19
Rank 10: Value=0.1824, Layer=17, Head=0


### Attention patterns

This visualizes the top heads across all layers (as given by the previous list).
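The visualization function below identifies each head by a single flat integer in `[0, n_layers * n_heads)`. As a small sketch of that convention (the two helper names are hypothetical), the conversion between a flat index and a (layer, head) pair looks like this:

```python
def layer_head_to_flat(layer: int, head: int, n_heads: int) -> int:
    """Flat index of a head when heads are numbered layer by layer."""
    return layer * n_heads + head


def flat_to_layer_head(flat: int, n_heads: int) -> tuple:
    """Inverse mapping: recover (layer, head) from a flat index."""
    return divmod(flat, n_heads)


n_heads = 20  # gpt2-large, per the [36, 20] per-head shape

# The top-ranked head from the list above: Layer 30, Head 13
flat = layer_head_to_flat(30, 13, n_heads)
print(flat, flat_to_layer_head(flat, n_heads))  # 613 (30, 13)
```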

In [None]:
def visualize_attention_patterns(
    heads: Union[List[int], int, Float[torch.Tensor, "heads"]], 
    local_cache: Optional[ActivationCache]=None, 
    local_tokens: Optional[torch.Tensor]=None, 
    title: str=""):
    # Heads are given as a list of integers or a single integer in [0, n_layers * n_heads)
    if isinstance(heads, int):
        heads = [heads]
    elif isinstance(heads, list) or isinstance(heads, torch.Tensor):
        heads = utils.to_numpy(heads)
    # Cache defaults to the original activation cache
    if local_cache is None:
        local_cache = cache
    # Tokens defaults to the tokenization of the first prompt (including the BOS token)
    if local_tokens is None:
        # The tokens of the first prompt
        local_tokens = tokens[0]
    
    labels = []
    patterns = []
    batch_index = 0
    for head in heads:
        layer = head // model.cfg.n_heads
        head_index = head % model.cfg.n_heads
        # Get the attention patterns for the head
        # Attention patterns have shape [batch, head_index, query_pos, key_pos]
        patterns.append(local_cache["attn", layer][batch_index, head_index])
        labels.append(f"L{layer}H{head_index}")
    str_tokens = model.to_str_tokens(local_tokens)
    patterns = torch.stack(patterns, dim=-1)
    # Plot the attention patterns
    attention_vis = pysvelte.AttentionMulti(attention=patterns, tokens=str_tokens, head_labels=labels)
    display(HTML(f"<h3>{title}</h3>"))
    attention_vis.show()

In [None]:
top_k = 3
top_positive_logit_attr_heads = torch.topk(per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_positive_logit_attr_heads, title=f"Top {top_k} Positive Logit Attribution Heads")
top_negative_logit_attr_heads = torch.topk(-per_head_logit_diffs.flatten(), k=top_k).indices
visualize_attention_patterns(top_negative_logit_attr_heads, title=f"Top {top_k} Negative Logit Attribution Heads")

pysvelte components appear to be unbuilt or stale
Running npm install...
Building pysvelte components with webpack...


Both the positive and the negative heads attend from "is" to "tall". Next, we should hypothesize what these heads are doing, and test those hypotheses to rule out what they are not doing.

The positive heads may be "antonym" heads, while the negative heads may be "same" heads. Try editing them.

# Analyze "John is big. Mary is small. John is tall. Mary is"

Will adding a source system prevent the model from even considering tall over short within its layers?

## Compare Logits

For all prompts (even if there is just one), get the model's outputs and compare them with the correct outputs by computing the logit differences.

`original_average_logit_diff` is used by both the direct logit attribution and the activation patching sections, so the cell below should be run before either of them. Otherwise, the two sections can be run independently of each other.


In [None]:
prompts = [
"John is big. Mary is small. John is tall. Mary is",
]
# List of answers, in the format (correct, incorrect)
answers = [
    (" short", " tall"),
]

answer_tokens = []
for answer in answers:
    correct_token = model.to_single_token(answer[0])
    incorrect_token = model.to_single_token(answer[1])
    answer_tokens.append((correct_token, incorrect_token))
# Move the answer tokens to the same device as the model so that gather() works on CPU or GPU
answer_tokens = torch.tensor(answer_tokens).to(device)

tokens = model.to_tokens(prompts, prepend_bos=True)
tokens = tokens.to(device)  # No-op if the tokens are already on the model's device
original_logits, cache = model.run_with_cache(tokens) # Run the model and cache all activations

def logits_to_ave_logit_diff(logits, answer_tokens, per_prompt=False):
    # Only the final logits are relevant for the answer
    final_logits = logits[:, -1, :]
    answer_logits = final_logits.gather(dim=-1, index=answer_tokens)
    answer_logit_diff = answer_logits[:, 0] - answer_logits[:, 1]
    if per_prompt:
        return answer_logit_diff
    else:
        return answer_logit_diff.mean()

print("Per prompt logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens, per_prompt=True))
original_average_logit_diff = logits_to_ave_logit_diff(original_logits, answer_tokens)
print("Average logit difference:", logits_to_ave_logit_diff(original_logits, answer_tokens).item())

Per prompt logit difference: tensor([3.2732], device='cuda:0')
Average logit difference: 3.27315616607666
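As a sanity check, this logit difference can be reproduced from the `test_prompt` output shown earlier for the same prompt, where " short" has logit 17.18 (57.68%) and " tall" has logit 13.91 (2.19%). The sketch below uses only those printed, rounded numbers. It also checks the softmax identity that the probability ratio of two tokens equals the exponential of their logit difference.

```python
import math

# Rounded values copied from the test_prompt output for
# "John is big. Mary is small. John is tall. Mary is"
short_logit, short_prob = 17.18, 0.5768
tall_logit, tall_prob = 13.91, 0.0219

logit_diff = short_logit - tall_logit
print(f"logit diff: {logit_diff:.2f}")  # logit diff: 3.27

# Under softmax, p(short) / p(tall) = exp(logit(short) - logit(tall))
print(f"exp(logit diff): {math.exp(logit_diff):.1f}")  # exp(logit diff): 26.3
print(f"prob ratio: {short_prob / tall_prob:.1f}")  # prob ratio: 26.3
```

The two ratios agree up to the rounding in the printed values, so a logit difference of about 3.27 corresponds to the model finding " short" roughly 26 times more likely than " tall".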


## Direct Layer Attribution

### Logit Lens

In [None]:
answer_residual_directions = model.tokens_to_residual_directions(answer_tokens)
print("Answer residual directions shape:", answer_residual_directions.shape)
logit_diff_directions = answer_residual_directions[:, 0] - answer_residual_directions[:, 1]
print("Logit difference directions shape:", logit_diff_directions.shape)

# cache syntax - resid_post is the residual stream at the end of the layer, -1 gets the final layer. The general syntax is [activation_name, layer_index, sub_layer_type]. 
final_residual_stream = cache["resid_post", -1]
print("Final residual stream shape:", final_residual_stream.shape)
final_token_residual_stream = final_residual_stream[:, -1, :]
# Apply LayerNorm scaling
# pos_slice is the subset of the positions we take - here the final token of each prompt
scaled_final_token_residual_stream = cache.apply_ln_to_stack(final_token_residual_stream, layer = -1, pos_slice=-1)

average_logit_diff = einsum("batch d_model, batch d_model -> ", scaled_final_token_residual_stream, logit_diff_directions)/len(prompts)
print("Calculated average logit diff:", average_logit_diff.item())
print("Original logit difference:",original_average_logit_diff.item())

def residual_stack_to_logit_diff(residual_stack: Float[torch.Tensor, "components batch d_model"], cache: ActivationCache) -> float:
    scaled_residual_stack = cache.apply_ln_to_stack(residual_stack, layer = -1, pos_slice=-1)
    return einsum("... batch d_model, batch d_model -> ...", scaled_residual_stack, logit_diff_directions)/len(prompts)
accumulated_residual, labels = cache.accumulated_resid(layer=-1, incl_mid=True, pos_slice=-1, return_labels=True)
logit_lens_logit_diffs = residual_stack_to_logit_diff(accumulated_residual, cache)
line(logit_lens_logit_diffs, x=np.arange(model.cfg.n_layers*2+1)/2, hover_name=labels, title="Logit Difference From Accumulate Residual Stream")

Answer residual directions shape: torch.Size([1, 2, 1280])
Logit difference directions shape: torch.Size([1, 1280])
Final residual stream shape: torch.Size([1, 15, 1280])
Calculated average logit diff: 3.010845899581909
Original logit difference: 3.27315616607666
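In both analyses, the value calculated from the residual stream falls short of the model's actual logit difference. The arithmetic below uses only the numbers printed in this notebook and shows that the gap is nearly identical for the two prompts. One possible explanation, offered as an assumption rather than something verified here, is a prompt-independent term that the residual-direction calculation leaves out, such as a difference in unembedding biases between the two answer tokens.

```python
# (calculated from residual stream, original from logits), copied from the outputs
no_source = (1.781943440437317, 2.044248580932617)    # "John is tall. Mary is"
with_source = (3.010845899581909, 3.27315616607666)   # with "John is big. Mary is small."

gap_no_source = no_source[1] - no_source[0]
gap_with_source = with_source[1] - with_source[0]

print(f"gap without source: {gap_no_source:.4f}")  # gap without source: 0.2623
print(f"gap with source:    {gap_with_source:.4f}")  # gap with source:    0.2623
```

If the assumption holds, the per-layer and per-head attributions are each offset from the true logit difference by the same constant, so comparisons between components are unaffected.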


There's still a dip, but compare it to before: previously there were two dips, and they were deeper. The final logit difference is also bigger now. Could the source sentence be making the model "more sure" of "short", through in-context learning and induction heads? Try more examples of "no source" vs "source".

First check the final-layer logit differences in 'test prompts'. Then compare how the logit difference changes across layers: is the "with source system" prompt more sure throughout the layers? The (induction-based) hypothesis, given this one example, is that it will be.
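To compare many "no source" and "source" cases, the paired prompts can be built from the same template. This is a minimal sketch; `make_opposites_prompt` is a hypothetical helper, and the two-subject alternation is just the pattern used in the prompts analyzed so far.

```python
from typing import List, Tuple


def make_opposites_prompt(
    source_pairs: List[Tuple[str, str]],
    target_word: str,
    subjects: Tuple[str, str] = ("John", "Mary"),
) -> str:
    """Build "<A> is x. <B> is y. ... <A> is target. <B> is" (hypothetical helper)."""
    first, second = subjects
    sentences = []
    for word, opposite in source_pairs:
        sentences.append(f"{first} is {word}.")
        sentences.append(f"{second} is {opposite}.")
    sentences.append(f"{first} is {target_word}.")
    sentences.append(f"{second} is")  # the model must complete this
    return " ".join(sentences)


no_source = make_opposites_prompt([], "tall")
with_source = make_opposites_prompt([("big", "small")], "tall")
print(no_source)    # John is tall. Mary is
print(with_source)  # John is big. Mary is small. John is tall. Mary is
```

Passing different `subjects` or extra `source_pairs` makes it easy to vary one factor at a time while keeping the target sentence fixed.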

# Compare logit diffs for more examples of "no source" vs "source".

## Test prompts

Ideas: size, color shades, direction

### direction: east vs west, left vs right

In [None]:
example_prompt = "Adam is east. Helen is west. Mary is right. John is"
example_answer = " left"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Adam', ' is', ' east', '.', ' Helen', ' is', ' west', '.', ' Mary', ' is', ' right', '.', ' John', ' is']
Tokenized answer: [' left']


Top 0th token. Logit: 15.49 Prob: 28.36% Token: | left|
Top 1th token. Logit: 14.46 Prob: 10.18% Token: | east|
Top 2th token. Logit: 14.40 Prob:  9.55% Token: | south|
Top 3th token. Logit: 13.70 Prob:  4.76% Token: | north|
Top 4th token. Logit: 13.27 Prob:  3.07% Token: | west|
Top 5th token. Logit: 13.08 Prob:  2.56% Token: | right|
Top 6th token. Logit: 12.98 Prob:  2.31% Token: | down|
Top 7th token. Logit: 12.73 Prob:  1.79% Token: | up|
Top 8th token. Logit: 12.72 Prob:  1.78% Token: | in|
Top 9th token. Logit: 12.64 Prob:  1.64% Token: | wrong|


In [None]:
example_prompt = "Mary is right. John is"
example_answer = " left"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' right', '.', ' John', ' is']
Tokenized answer: [' left']


Top 0th token. Logit: 15.68 Prob: 25.23% Token: | right|
Top 1th token. Logit: 14.92 Prob: 11.79% Token: | wrong|
Top 2th token. Logit: 14.80 Prob: 10.45% Token: | a|
Top 3th token. Logit: 14.42 Prob:  7.14% Token: | not|
Top 4th token. Logit: 13.55 Prob:  3.01% Token: | the|
Top 5th token. Logit: 12.82 Prob:  1.44% Token: | an|
Top 6th token. Logit: 12.64 Prob:  1.20% Token: | in|
Top 7th token. Logit: 12.34 Prob:  0.89% Token: | just|
Top 8th token. Logit: 12.23 Prob:  0.80% Token: | going|
Top 9th token. Logit: 12.18 Prob:  0.76% Token: | still|


In [None]:
example_prompt = "Adam is right. Helen is left. Mary is east. John is"
example_answer = " west"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Adam', ' is', ' right', '.', ' Helen', ' is', ' left', '.', ' Mary', ' is', ' east', '.', ' John', ' is']
Tokenized answer: [' west']


Top 0th token. Logit: 15.53 Prob: 34.31% Token: | west|
Top 1th token. Logit: 14.65 Prob: 14.36% Token: | north|
Top 2th token. Logit: 14.47 Prob: 11.88% Token: | south|
Top 3th token. Logit: 13.27 Prob:  3.60% Token: | east|
Top 4th token. Logit: 13.09 Prob:  3.00% Token: | right|
Top 5th token. Logit: 12.91 Prob:  2.52% Token: | to|
Top 6th token. Logit: 12.64 Prob:  1.91% Token: | in|
Top 7th token. Logit: 12.39 Prob:  1.49% Token: | up|
Top 8th token. Logit: 12.06 Prob:  1.07% Token: | on|
Top 9th token. Logit: 11.93 Prob:  0.94% Token: | northwest|


In [None]:
example_prompt = "Mary is east. John is"
example_answer = " west"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Mary', ' is', ' east', '.', ' John', ' is']
Tokenized answer: [' west']


Top 0th token. Logit: 17.51 Prob: 74.02% Token: | west|
Top 1th token. Logit: 15.31 Prob:  8.13% Token: | north|
Top 2th token. Logit: 15.15 Prob:  6.97% Token: | south|
Top 3th token. Logit: 13.93 Prob:  2.06% Token: | east|
Top 4th token. Logit: 12.88 Prob:  0.72% Token: | northwest|
Top 5th token. Logit: 12.84 Prob:  0.69% Token: | southwest|
Top 6th token. Logit: 12.72 Prob:  0.61% Token: | to|
Top 7th token. Logit: 12.08 Prob:  0.32% Token: | West|
Top 8th token. Logit: 12.03 Prob:  0.31% Token: | southeast|
Top 9th token. Logit: 12.03 Prob:  0.31% Token: | up|


Without source:

Top 0th token. Logit: 17.51 Prob: 74.02% Token: | west|

With source: (1 in-context learning example)

Top 0th token. Logit: 15.53 Prob: 34.31% Token: | west|

The probability actually went down when adding an in-context learning example. Try adding 2 source examples.

In [None]:
example_prompt = "Rachel is Japanese. Ellen is American. Adam is right. Helen is left. Mary is east. John is"
example_answer = " west"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'Rachel', ' is', ' Japanese', '.', ' Ellen', ' is', ' American', '.', ' Adam', ' is', ' right', '.', ' Helen', ' is', ' left', '.', ' Mary', ' is', ' east', '.', ' John', ' is']
Tokenized answer: [' west']


Top 0th token. Logit: 15.32 Prob: 35.33% Token: | west|
Top 1th token. Logit: 14.48 Prob: 15.19% Token: | north|
Top 2th token. Logit: 14.38 Prob: 13.82% Token: | south|
Top 3th token. Logit: 13.35 Prob:  4.92% Token: | east|
Top 4th token. Logit: 12.34 Prob:  1.80% Token: | right|
Top 5th token. Logit: 11.75 Prob:  1.00% Token: | West|
Top 6th token. Logit: 11.65 Prob:  0.90% Token: | in|
Top 7th token. Logit: 11.46 Prob:  0.75% Token: | North|
Top 8th token. Logit: 11.34 Prob:  0.66% Token: | the|
Top 9th token. Logit: 11.32 Prob:  0.65% Token: | up|


This doesn't really do much. Finding synonyms for 'left vs right' is hard and requires more external knowledge than what's given in input.
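Raw logits are not directly comparable across prompts, because softmax is unchanged when every logit is shifted by the same constant. A fairer summary, sketched below with the numbers printed above, is the margin between " west" and the runner-up " north" in each run.

```python
# Logits for " west" and " north", and probability of " west" (in %),
# copied from the three test_prompt outputs above
runs = {
    "no source": {"west": 17.51, "north": 15.31, "prob": 74.02},
    "one source example": {"west": 15.53, "north": 14.65, "prob": 34.31},
    "two source examples": {"west": 15.32, "north": 14.48, "prob": 35.33},
}

margins = {name: round(r["west"] - r["north"], 2) for name, r in runs.items()}

for name, r in runs.items():
    print(f"{name:>20}: margin over ' north' = {margins[name]:.2f}, P(' west') = {r['prob']:.2f}%")
```

The margin drops from 2.20 without a source to 0.88 and 0.84 with one and two source examples, which matches the drop in probability: the extra examples make the model less sure, not more.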

Perhaps we shouldn't jump straight to studying analogies; we can study opposites first. Why does the model recognize opposites consistently for size and direction?

Other analysis:
- The subjects don't seem to matter that much? Try using the same subject for all sentences.

### opposites: color shades

In [None]:
example_prompt = "X is black. Y is"
example_answer = " white"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'X', ' is', ' black', '.', ' Y', ' is']
Tokenized answer: [' white']


Top 0th token. Logit: 16.21 Prob: 40.50% Token: | white|
Top 1th token. Logit: 14.98 Prob: 11.85% Token: | red|
Top 2th token. Logit: 14.77 Prob:  9.66% Token: | green|
Top 3th token. Logit: 14.63 Prob:  8.36% Token: | yellow|
Top 4th token. Logit: 14.27 Prob:  5.84% Token: | blue|
Top 5th token. Logit: 12.80 Prob:  1.35% Token: | gray|
Top 6th token. Logit: 12.78 Prob:  1.32% Token: | brown|
Top 7th token. Logit: 12.72 Prob:  1.24% Token: | gold|
Top 8th token. Logit: 12.49 Prob:  0.98% Token: | a|
Top 9th token. Logit: 12.42 Prob:  0.91% Token: | orange|


In [None]:
example_prompt = "X is red. Y is"
example_answer = " green"
utils.test_prompt(example_prompt, example_answer, model, prepend_bos=True)

Tokenized prompt: ['<|endoftext|>', 'X', ' is', ' red', '.', ' Y', ' is']
Tokenized answer: [' green']


Top 0th token. Logit: 16.07 Prob: 37.75% Token: | green|
Top 1th token. Logit: 15.37 Prob: 18.82% Token: | blue|
Top 2th token. Logit: 15.25 Prob: 16.60% Token: | yellow|
Top 3th token. Logit: 13.58 Prob:  3.12% Token: | black|
Top 4th token. Logit: 13.07 Prob:  1.87% Token: | white|
Top 5th token. Logit: 12.94 Prob:  1.66% Token: | orange|
Top 6th token. Logit: 12.57 Prob:  1.14% Token: | purple|
Top 7th token. Logit: 12.26 Prob:  0.84% Token: | a|
Top 8th token. Logit: 12.25 Prob:  0.83% Token: | the|
Top 9th token. Logit: 12.06 Prob:  0.69% Token: | red|


Opposites are a form of knowledge, so the guess is that they are stored somewhere in the MLPs.

If we edit what's considered opposite, what's the effect?

# Make a baseline dataset