## Interpreting the Internals of a Small Language Model's Blocks

In [None]:
# | hide
%load_ext autoreload
%autoreload 2

In [None]:
# | hide
import json
import math
from pathlib import Path
from typing import Dict, Optional, Iterable, Sequence, Tuple

In [None]:
# | hide

from fastcore.test import *
from matplotlib.axes import Axes
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
import torch
from torch.nn import functional as F
from tqdm.auto import tqdm

In [None]:
# | hide

from transformer_experiments.common.substring_generator import all_unique_substrings
from transformer_experiments.common.text_analysis import (
    build_next_token_map,
    SubstringFrequencyAnalysis,
    top_nonzero_tokens
)
from transformer_experiments.common.utils import aggregate_by_string_key, DataWrapper
from transformer_experiments.dataset_split import split_text_dataset
from transformer_experiments.datasets.tinyshakespeare import (
    TinyShakespeareDataSet,
)
from transformer_experiments.models.transformer import (
    n_layer,
    TransformerLanguageModel
)
from transformer_experiments.models.transformer_helpers import (
    unsqueeze_emb,
    EncodingHelpers,
    LogitsWrapper,
    TransformerAccessors
)
from transformer_experiments.trained_models.tinyshakespeare_transformer import (
    create_model_and_tokenizer
)
from transformer_experiments.experiments.block_internals import (
    BlockInternalsAccessors,
    BlockInternalsExperiment,
    BatchedBlockInternalsExperiment,
    BatchedBlockInternalsExperimentSlicer,
    BlockInternalsAnalysis,
)
from transformer_experiments.experiments.logit_lens import LogitLens

## Motivation
Early this past summer, I trained a small language model. I then spent the rest of the summer taking it apart and trying to figure out how it works. This post is a summary of what I've learned so far. 

I started by watching [Andrej Karpathy](https://karpathy.ai/)'s excellent video, [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY). In that video, he starts from a blank Jupyter notebook and, just under 2 hours later, ends with a functional transformer model. It's one of the best explanatory videos I've ever seen, and I highly recommend it.

I want to make one thing super clear at the start: the code for the language model I trained came entirely from the video. It's Andrej's, not mine. I typed in the code by copying what I saw on the screen as I watched the video. For things that weren't clear onscreen, I referenced the [GitHub repo for the video](https://github.com/karpathy/ng-video-lecture) and the [nanoGPT repo](https://github.com/karpathy/nanoGPT). After getting it working, I made only minor changes to fit it into the code and structure of this repository, resulting in [this implementation](https://github.com/spather/transformer-experiments/blob/master/nbs/models/transformer.ipynb). In summary: the core language model is Andrej Karpathy's work, not mine. The analysis and all the supporting code behind it are mine. I was of course inspired by many others, and I'll cite their work in the relevant places.

## The Model
The model I trained is a small, decoder-only [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)). It’s trained on the [TinyShakespeare data set](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) which contains 40,000 lines of Shakespeare’s plays. 

After about an hour of training on an A100 GPU, it is able to produce reasonable-looking fake Shakespearean text. Let’s spin it up and look at it in action:

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ts = TinyShakespeareDataSet(cache_file='../artifacts/input.txt')
m, tokenizer = create_model_and_tokenizer(
    saved_model_filename='../artifacts/shakespeare.pt',
    dataset=ts,
    device=device,
)
print(f"device is {device}")
encoding_helpers = EncodingHelpers(tokenizer, device)

device is cpu


Given a prompt, the model predicts what the next tokens will be. Let's start with an easy task: give it a prompt it has surely seen before, `ROMEO:`, and ask it to generate 500 new tokens:

In [None]:
_ = torch.manual_seed(1337) # Keep the output deterministic across runs
prompt = 'ROMEO:'
tokens = encoding_helpers.tokenize_string(prompt)
print(tokenizer.decode(m.generate(tokens, max_new_tokens=500)[0].tolist()))

ROMEO:
Thus he cannot Edward's sunse heart
That any thing hath bid His temption of seems.

BUCKINGHAM:
Nay, you are has kept another, of the queen,
Against his noble foreign them to take;
And with their I'll harm those insequents,
That honoured she not black physicians;
But what is full and a man of hoteful prince,
And to ransom on their within the fair beds did
But in same heaven limit out a clean.

Nurse:
England, by yond face!

RATCLIFF:
Thou dancest not kill'd with not budies.

RICHARD:
Thus say t


Don't try too hard to interpret it: meaning-wise, it's nonsense. But in terms of superficial structure, it looks Shakespearean:

* It looks like a script for a play: a character name, followed by a colon, followed by a line of dialog.
* Most of the words are valid English words. It's important to note that in this model, the tokens are individual characters, not words. So it's making words up from characters and mostly getting it right, though there are some notable exceptions like "sunse", "hoteful", and "insequents". But even these exceptions don't seem too far off from real words.
* Capitalization and punctuation are mostly correct: the first word of each line is capitalized, and periods and other punctuation are used plausibly. Named characters ("ROMEO", "BUCKINGHAM") are in all-caps and unnamed characters ("Nurse") are not. But to be fair, all these character names appear in the training data, so that attribute is more likely a result of copying what it's seen before than of understanding the pattern.
* The language sounds archaic, using words like "Thou", "Nay", "yond", and "dancest". It's notable that "dancest" does not actually appear in the training data, so it's not just copying words it's seen before.
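
To make the character-level point concrete, here is a minimal sketch of a character tokenizer and a one-token-at-a-time sampling loop. It is not the code from this repository: the `encode`, `decode`, and `generate` names are illustrative, the vocabulary construction (sorted unique characters) is an assumption based on the approach in the video, and `uniform_weights` is a hypothetical stand-in for the trained model.

```python
import random

# Stand-in corpus; the real model uses the full TinyShakespeare text.
sample_text = "ROMEO:\nBut soft, what light through yonder window breaks?\n"

# Assumed vocabulary construction: every distinct character, sorted.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer token ids, one per character."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)


def generate(next_token_weights, prompt_ids, max_new_tokens, rng):
    """Autoregressive sampling: sample one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        weights = next_token_weights(ids)  # one weight per vocabulary entry
        ids.append(rng.choices(range(len(weights)), weights=weights, k=1)[0])
    return ids


def uniform_weights(ids):
    """Hypothetical toy "model": every character is equally likely."""
    return [1.0] * len(chars)


prompt_ids = encode("ROMEO:")
out = generate(uniform_weights, prompt_ids, max_new_tokens=20, rng=random.Random(1337))
print(prompt_ids)         # one integer per character of the prompt
print(repr(decode(out)))  # the prompt followed by 20 sampled characters
```

With the toy stand-in, the sampled characters are uniform noise. In the real model the weights would come from the model's predicted distribution over the next character, which is why its samples form mostly valid words.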

When given a prompt it hasn't seen before, it still does a reasonable job:

In [None]:
_ = torch.manual_seed(1337) # Keep the output deterministic across runs
prompt = 'Hello'
tokens = encoding_helpers.tokenize_string(prompt)
print(tokenizer.decode(m.generate(tokens, max_new_tokens=100)[0].tolist()))

Hellows stand cause, Edward's sunse and justice:
Then, tell thou this tempest malnsters should have
Resid


"Hello" doesn't appear anywhere in the text, but you can see how it got to "Hellows" given that "fellow", "yellow", and "mellow" do. Even with a completely gibberish prompt, like `adxed3dd`, it's able to recover and produce something reasonable-looking:

In [None]:
_ = torch.manual_seed(1337) # Keep the output deterministic across runs
prompt = 'adxed3dd'
tokens = encoding_helpers.tokenize_string(prompt)
print(tokenizer.decode(m.generate(tokens, max_new_tokens=100)[0].tolist()))

adxed3ddess, and caden'd in throllows.
Now thou dost know thy horse to set;
Have I learn'd these friends and
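
Claims like "`dancest` does not appear in the training data" or "`fellow`, `yellow`, and `mellow` do" are easy to check directly against the text. The sketch below shows one way to do it, assuming the training text is available as a plain Python string (for example, read from the cached `input.txt`). The `appears_in` and `words_ending_with` helpers are hypothetical names, not part of this repository, and the `sample` string is a made-up stand-in so the example runs on its own.

```python
import re


def appears_in(text, fragment):
    """True if `fragment` occurs anywhere in `text`, ignoring case."""
    return fragment.lower() in text.lower()


def words_ending_with(text, suffix):
    """Distinct lowercase words in `text` that end with `suffix`."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(w for w in words if w.endswith(suffix))


# Made-up stand-in text; substitute the real training text to check the claims.
sample = "My good fellow, the yellow leaf is mellow. Nay, thou art kind."

print(appears_in(sample, "hello"))         # False
print(appears_in(sample, "nay"))           # True
print(words_ending_with(sample, "ellow"))  # ['fellow', 'mellow', 'yellow']
```

Because the check is a case-insensitive substring match, it is stricter than a whole-word search: a `False` result means the fragment does not occur even inside a longer word.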
