# Bigrams in Bilinear Transformers

Bilinear transformers are even more linear in nature than the original architecture, which lets us analyze each component separately (or even jointly) with standard linear techniques. This notebook focuses on extracting 2-grams from the weights. It is meant as an introduction to the capabilities of bilinear layers and shouldn't be used to draw rigorous conclusions.

In [1]:
%load_ext autoreload
%autoreload 2

from shared.transformer import Transformer, Config
import plotly.express as px
import torch
import pandas as pd
from einops import *

torch.set_grad_enabled(False)

color = dict(color_continuous_midpoint=0, color_continuous_scale="RdBu")
name = "tdooms/TinyStories-1-256-rotary" 

config = Config.from_pretrained(name)
model = Transformer.from_pretrained(name, config=config).cuda()

model.center_unembed().fold_norms()
vocab = model.vocab

1-layer and 2-layer transformers behave slightly differently. The 1-layer transformer has a slightly more diverse MLP layer (it has to, since it is the only MLP in the model). The results shown in this notebook hold for both.

## Direct Path

Let's start with the obvious way to look at 2-grams, the direct embedding-unembedding path. 

In [2]:
direct = (model.w_u @ model.w_e).detach().cpu()
assert direct.shape == (len(vocab), len(vocab))

vocab.get_max_activations(direct.T, ["input", "output"], 10)

| | input | output | value |
|---|---|---|---|
| 0 | char | ##day | 3.201929 |
| 1 | ste | ##ache | 3.139564 |
| 2 | ste | ##ph | 3.086413 |
| 3 | each | ##oon | 2.922952 |
| 4 | someone | ##ch | 2.818082 |
| 5 | ste | ##ail | 2.767743 |
| 6 | ste | avail | 2.735258 |
| 7 | ste | ##ment | 2.728622 |
| 8 | ste | moder | 2.725459 |
| 9 | cont | ##alk | 2.720618 |


I'm a bit surprised that these pairs don't make much sense. I would have expected the direct path to show at least some recognizable structure.

TODO: why doesn't this work?

## MLP path

Now, onto the good stuff: the MLP. In an ordinary neural network, we can't study the MLP with SVD or any other linear technique. Bilinear layers, however, allow us to do exactly that. In this section, we limit ourselves to the direct MLP path: embedding -> MLP -> unembedding. To our knowledge, this hasn't been done before. To study the direct path, we can take the diagonal over the last two dimensions of the B tensor. I'll skip the math here for brevity.
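As a rough illustration of what "taking the diagonal" means, here is a minimal NumPy sketch on random toy weights. The names `E`, `W`, `V`, `P` and `U` are hypothetical, and the sketch assumes a bias-free bilinear layer of the form `P @ ((W @ x) * (V @ x))` with no normalization and no residual term, so it is not the `shared.transformer` implementation used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_hidden = 6, 4, 8

# Hypothetical toy weights (assumption: no biases, no norms, no residual term).
E = rng.normal(size=(d_model, n_vocab))    # embedding: token -> residual stream
W = rng.normal(size=(d_hidden, d_model))   # left input projection of the bilinear layer
V = rng.normal(size=(d_hidden, d_model))   # right input projection of the bilinear layer
P = rng.normal(size=(d_model, d_hidden))   # output projection of the bilinear layer
U = rng.normal(size=(n_vocab, d_model))    # unembedding: residual stream -> logits

# Third-order tensor: B[o, i, j] = sum_h (U P)[o, h] * (W E)[h, i] * (V E)[h, j]
B = np.einsum("oh,hi,hj->oij", U @ P, W @ E, V @ E)

# Diagonal over the last two dimensions: the same token fills both input slots.
diag = np.einsum("oii->oi", B)             # indexed [output, input]

# Sanity check: push a single token through embedding -> MLP -> unembedding.
token = 3
x = E[:, token]
logits = U @ (P @ ((W @ x) * (V @ x)))
assert np.allclose(diag[:, token], logits)
```

Column `t` of `diag` then holds the logits that token `t` produces through the MLP on its own, which is why a matrix of (input, output) pairs falls out of a third-order tensor.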

Before looking at the eigenvalues, let's look at the highest activations in general. This gives a map of input -> output, i.e. the pairs the model is most sure about.

In [3]:
diag = model.ube.diagonal(residual=True).cpu()

I use a helper function, ``get_max_activations``, which returns a data frame with the indices and values of the largest entries in the provided tensor. The indices are automatically converted to tokens. Let's look at the top 1000 connections in the first MLP layer.
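For readers without access to the helper, the idea can be sketched in a few lines of NumPy. This is a hypothetical stand-in (the function name `top_connections`, the toy vocabulary and the weights are all made up), not the actual ``get_max_activations`` implementation.

```python
import numpy as np

def top_connections(matrix, tokens, k=3):
    """Return the k largest entries of `matrix` as (row_token, col_token, value) tuples."""
    order = np.argsort(matrix.ravel())[::-1][:k]        # flat indices, largest value first
    rows, cols = np.unravel_index(order, matrix.shape)  # back to 2-D coordinates
    return [(tokens[r], tokens[c], float(matrix[r, c])) for r, c in zip(rows, cols)]

tokens = ["pin", "##ch", "net", "##work"]  # made-up toy vocabulary
weights = np.array([                       # made-up values, indexed [input, output]
    [0.1, 0.9, 0.0, 0.2],
    [0.0, 0.1, 0.3, 0.0],
    [0.2, 0.0, 0.1, 0.8],
    [0.0, 0.4, 0.0, 0.1],
])

top = top_connections(weights, tokens, k=2)
for row in top:
    print(row)
```

On this toy matrix the two strongest connections are `pin -> ##ch` and `net -> ##work`.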

In [4]:
df0 = vocab.get_max_activations(diag[0].T, ["input", "output"], k=1_000, largest=True)
df0

| | input | output | value |
|---|---|---|---|
| 0 | ' | s | 7.008902 |
| 1 | pin | ##ch | 6.892549 |
| 2 | net | ##work | 6.767766 |
| 3 | an | exc | 6.602790 |
| 4 | an | extra | 6.598722 |
| ... | ... | ... | ... |
| 995 | stri | ##ving | 4.149396 |
| 996 | introd | ##uce | 4.149202 |
| 997 | flut | ##ter | 4.149147 |
| 998 | cont | ##alk | 4.147686 |


Okay, so the first layer mostly just connects the obvious bigrams: pieces of words that didn't quite make it into the tokenizer as a single token. Let's quantify this.

In [5]:
px.line(df0["output"].str.startswith("##").cumsum(), title="cumulative ## tokens")

So, the first layer seems to only be bothered with learning these kinds of connections. Let's look at the second layer.

In [6]:
df1 = vocab.get_max_activations(diag[1].T, ["input", "output"], k=1_000, largest=True)
df1.head(20)

IndexError: index 1 is out of bounds for dimension 0 with size 1

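Note that the checkpoint loaded above has a single layer, so ``diag`` has size 1 along its first dimension and this cell (like the next one) raises an ``IndexError``; both cells need a 2-layer model. In the second layer, the top connections make far less sense. Let's see if there's a pattern in the tokens it activates strongly on.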

In [7]:
input_mean = diag[1].mean(1).view(64, 64)
output_mean = diag[1].mean(0).view(64, 64)

fig = px.imshow(torch.stack((input_mean, output_mean)), facet_col=0, **color)
fig.layout.annotations[0].update(text="input")
fig.layout.annotations[1].update(text="output")

fig

IndexError: index 1 is out of bounds for dimension 0 with size 1

Apart from the inputs seeming to respond positively to a handful of tokens, nothing discernible stands out. We'll leave this inspection for later.

#### Preceding and Following tokens

Given this diagonal matrix, we can also analyze which tokens are the strongest indicators for the next token, or the other way around.
For instance, we can ask:
- *"which tokens matter most for the model to predict the token 'girl'?"* (preceding token).
- *"which tokens does the token 'girl' predict most strongly?"* (following token).
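The two questions correspond to reading a row versus a column of the matrix. The toy sketch below assumes the matrix is indexed `[output, input]`, which is inferred from how the notebook cells index ``diag``; the vocabulary and the weights are made up.

```python
import numpy as np

tokens = ["little", "girl", "named", "the"]  # made-up toy vocabulary
index = {tok: i for i, tok in enumerate(tokens)}

# Made-up weights, indexed [output, input]: entry [o, i] scores "i is followed by o".
weights = np.array([
    [0.0, 0.1, 0.0, 0.6],  # output "little"
    [0.9, 0.0, 0.1, 0.4],  # output "girl"
    [0.0, 0.8, 0.0, 0.0],  # output "named"
    [0.1, 0.2, 0.3, 0.0],  # output "the"
])

idx = index["girl"]
preceding = tokens[int(np.argmax(weights[idx]))]     # row: best input for output "girl"
following = tokens[int(np.argmax(weights[:, idx]))]  # column: best output for input "girl"
print(preceding, "-> girl ->", following)
```

Here the row lookup yields `little` and the column lookup yields `named`; the two lookups are independent of each other.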

In [8]:
token = "girl"
idx = vocab[token]

preceding = vocab.tokenize(torch.topk(diag[0, idx], k=10).indices)
following = vocab.tokenize(torch.topk(diag[0, :, idx], k=10).indices)

pd.DataFrame(dict(preceding=preceding, self=[token]*10, following=following))

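| | preceding | self | following |
|---|---|---|---|
| 0 | little | girl | named |
| 1 | bald | girl | ##s |
| 2 | sweet | girl | ' |
| 3 | kind | girl | called |
| 4 | each | girl | asked |
| 5 | rude | girl | saw |
| 6 | shy | girl | , |
| 7 | polite | girl | was |
| 8 | two | girl | came |
| 9 | clever | girl | who |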


The preceding and following columns are ranked independently, so a row should not be read as a trigram. Showing them side by side is simply a concise visualization.

#### Articles

Something interesting to look at is whether the model has learned to use the correct articles. Let's study this in a bit more depth.
We can do this quite simply by taking the weights from both 'a' and 'an' to all subsequent tokens and plotting them against each other.

In [9]:
vowels = ['a', 'e', 'i', 'o', 'u']

# Weights from the input tokens "a" and "an" to every possible next token.
tokens = vocab.tokenize(torch.arange(len(vocab)))
df = pd.DataFrame(dict(x=diag[0, :, vocab["a"]].cpu(), y=diag[0, :, vocab["an"]].cpu(), token=tokens))

# Keep only tokens that start with a letter (drops punctuation and "##" continuations).
df = df[df.token.str[0].str.isalpha()].copy()

# Naive guess: "an" is expected before tokens that start with a vowel.
df["guess"] = df.token.str[0].isin(vowels)

px.scatter(df, x="x", y="y", hover_name="token", color="guess", labels=dict(x="a", y="an")).show()

The result isn't as clean as I'd hoped, but the model seems to have a strong general bias towards picking 'a', which is sensible. Hovering over the tokens makes it clear why the model is "unsure" about some of them; properly filtering out verbs and the like would probably improve the separation. Also, I'd assume this becomes clearer as models improve.

## Token Interactions
Until now, we've only looked at the direct path. This is fine, but the MLP encodes much more information than (input, output) pairs. Specifically, it encodes (input, input, output) triplets, which is one of the reasons for its effectiveness.

In essence, we have so far only looked at the interaction of each token with itself. This reduces the UBE tensor to a matrix, which we can study. Now, we perform a different reduction: we fix a single index along the first dimension of the UBE tensor, which gives the input-input interactions for one output token.
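A toy sketch of this reduction follows, assuming the tensor is indexed `[output, input1, input2]` (hypothetical random values, not the model's weights). The symmetrization step is my own addition: a quadratic form only depends on the symmetric part of its matrix, and I don't know whether ``model.ube.interaction`` does this internally. Listing only the lower triangle serves the same purpose as the ``tril()`` call in the cell below: each unordered pair appears once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab = 5

# Hypothetical toy tensor, indexed [output, input1, input2].
B = rng.normal(size=(n_vocab, n_vocab, n_vocab))

output = 2
inter = 0.5 * (B[output] + B[output].T)  # symmetric part of the slice for one output token

# Symmetrizing leaves the quadratic form x^T B[output] x unchanged.
x = rng.normal(size=n_vocab)
assert np.isclose(x @ B[output] @ x, x @ inter @ x)

# Visit each unordered input pair once via the lower triangle (diagonal included).
rows, cols = np.tril_indices(n_vocab)
values = inter[rows, cols]
order = np.argsort(values)[::-1][:3]     # three strongest interactions
pairs = [(int(rows[k]), int(cols[k]), float(values[k])) for k in order]
print(pairs)
```

Diagonal entries `(i, i)` of this matrix are exactly the self-interactions studied in the previous sections; the off-diagonal entries are the new information.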

In [10]:
idx = vocab["boy"]
inter = model.ube.interaction(idx, residual=True).cpu()[0]

topk = torch.topk(inter.tril().flatten(), k=25)
input1, input2 = torch.unravel_index(topk.indices, inter.size())
pd.DataFrame(dict(input1=vocab.tokenize(input1), input2=vocab.tokenize(input2), value=topk.values.cpu()))

| | input1 | input2 | value |
|---|---|---|---|
| 0 | little | little | 2.966776 |
| 1 | bigger | the | 2.335706 |
| 2 | a | a | 2.300054 |
| 3 | ##ma | out | 2.213748 |
| 4 | aw | came | 2.21109 |
| 5 | dull | the | 2.206932 |
| 6 | realized | the | 2.184559 |
| 7 | un | ran | 2.143701 |
| 8 | smaller | the | 2.122864 |
| 9 | tea | arri | 2.121422 |


This will mostly become useful later, once we introduce some additional inspection techniques for the attention layers.