### Library Imports

In [19]:
import einops
import math
from fancy_einsum import einsum
from dataclasses import dataclass
from transformer_lens import EasyTransformer

In [20]:
import torch
print("PyTorch version: ", torch.__version__)

# Default device
device = "cpu"

# Check whether PyTorch has access to MPS (Metal Performance Shaders, Apple's GPU backend)
if torch.backends.mps.is_available():
    device = "mps"
# Check whether PyTorch has access to CUDA (takes priority over MPS if both are present)
if torch.cuda.is_available():
    device = "cuda"

# The device string is passed to .to(device) later; a bare torch.device(device) call has no effect
print("Torch device: ", device)

PyTorch version:  2.0.1
Torch device:  mps


In [21]:
gpt2_xl = EasyTransformer.from_pretrained("gpt2-xl")    

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer


In [3]:
reference_gpt2 = EasyTransformer.from_pretrained("gpt2-small", fold_ln=False, center_unembed=False, center_writing_weights=False)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-small into HookedTransformer


### What is the point of a transformer?

**Key takeaway**: Transformers are *sequence modelling engines*. A transformer applies the same processing at every sequence position in parallel and moves information between positions with attention. Conceptually it can take a sequence of arbitrary length (not actually true in practice, see later).

**Key feature**: They generate text! You feed in language, and the model produces a probability distribution over next tokens, which you can sample from repeatedly to generate more text.
There are many different types of transformers but this notebook will focus on GPT-2 style transformers.


### How is the model trained?

You give it a large amount of text and train it to predict the next token. Importantly, if you give a model 100 tokens in a sequence, it predicts the next token for *each* prefix, i.e. it produces 100 predictions. This is unintuitive at first, since at generation time we only want the prediction at the last position. In training, though, it is more efficient: each sequence gives roughly 100 pieces of feedback instead of just one. To make these per-prefix predictions valid, the transformer uses *causal attention*: information can only move forwards in the sequence, i.e. the prediction of what comes after token 50 is a function of the first 50 tokens and *not* of token 51. (Jargon: *autoregressive*: the transformer's output is fed back into its input.)
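To make the "one prediction per prefix" idea concrete, here is a minimal sketch that lists the (prefix, target) pairs a single sequence provides. The token strings are made up for illustration, and no model is involved. Note that the prediction at the final position has no target inside the sequence, so n tokens yield n - 1 labelled pairs.

```python
# Toy illustration: every prefix of a sequence yields one training example.
# The token strings below are hypothetical, not real GPT-2 tokens.
tokens = ["The", " cat", " sat", " on", " the", " mat"]

# At position i the model may only see tokens[: i + 1] (causal attention)
# and is trained to predict tokens[i + 1].
training_pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

for prefix, target in training_pairs:
    print(f"{''.join(prefix)!r} -> {target!r}")
```

Each pair is scored in the same forward pass, which is why a single sequence provides many training signals at once.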

## Tokens - Transformer Inputs

*Core point*: Input is language, e.g. a sequence of characters or strings.

### How do we convert language to vectors?

ML models operate on vectors, not raw language, so we need to convert it somehow. 

### Idea 1: integers to vectors

Convert language to integers (in a fixed range), then convert the integers to vectors. The second step is essentially a lookup table, called an embedding.

> Lookup tables are equivalent to multiplying a fixed matrix by the one-hot encoded vector: the index that is 1 in the one-hot vector picks out the corresponding column of the matrix.


> Jargon: *Encoding* is the process of converting data into a different format.

> Jargon: *Embedding* is the process of pairing objects (such as words or entities) with vectors of real numbers. More used in a machine learning context.
    
> Jargon: *One-hot encoding*: e.g. we map each integer from 1 to 100 to a 100-dimensional vector with a 1 in the position of that integer and 0s everywhere else. This lets you think about each integer independently.
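The equivalence between a lookup table and a one-hot matrix multiplication can be checked directly. The sketch below uses a small made-up matrix `W_E` (the name and values are assumptions for illustration) stored so that each column is one token's embedding, and plain Python lists rather than tensors.

```python
# Hypothetical 3-dimensional embeddings for a vocabulary of 4 tokens.
# Stored as (d_model x d_vocab) so that each COLUMN is one token's embedding.
W_E = [
    [0.1, 0.4, 0.7, 1.0],
    [0.2, 0.5, 0.8, 1.1],
    [0.3, 0.6, 0.9, 1.2],
]
d_vocab = 4


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lookup(matrix, index):
    """Read out column `index` directly, with no multiplication."""
    return [row[index] for row in matrix]


token_id = 2
via_matmul = matvec(W_E, one_hot(token_id, d_vocab))
via_lookup = lookup(W_E, token_id)
print(via_matmul, via_lookup)
```

Both routes return the same column, which is why an embedding layer can be implemented as an index operation instead of a full matrix multiplication.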

Dimensions = things that vary independently. Each input has its own dimension, so each input can be thought of independently; we don't bake in any assumptions about the relationship between inputs.

But what if we want to encode structure into the embedding? In some contexts, this structural similarity is important e.g. if you were to encode colors, you might want to encode the fact that red and orange are similar.

### Tokens: Language to sequence of integers

*Core idea*: We need a model that can deal with arbitrary text.

*Key properties*: there should be a conversion to integers *and* these integers should be in a bounded range

Idea: Form a vocabulary, i.e. a set of words known and used by some person or group of people.

Idea 1: We could form a dictionary of words and look up the index of each word in the input. The problem is that this can't cope with arbitrary text, e.g. URLs, punctuation, or misspellings.

Idea 2: We could just use characters, with a vocab of 256 characters (ASCII proper has 128; 256 covers every possible byte value). The main problem here is that it loses some structure of language: some sequences of characters are more meaningful than others.

The word *language* is more meaningful than *asdfjkl* - we want *language* to be a token and *asdfjkl* not to be. Not to mention this is a more efficient use of our vocab.  
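The trade-off between the two ideas can be sketched in a few lines. The word dictionary below is made up for illustration, and UTF-8 bytes stand in for the "256 characters" vocabulary (an assumption, since the text does not fix an encoding).

```python
# A tiny, made-up word-level dictionary (Idea 1).
word_vocab = {"the": 0, "cat": 1, "sat": 2}


def word_encode(text):
    # Raises KeyError for any word that is missing from the dictionary.
    return [word_vocab[word] for word in text.split()]


def byte_encode(text):
    # Idea 2: every string maps to UTF-8 bytes, each an integer in [0, 255].
    return list(text.encode("utf-8"))


print(word_encode("the cat sat"))

# The word-level scheme breaks on anything outside its dictionary.
try:
    word_encode("the cat asdfjkl")
    word_level_ok = True
except KeyError:
    word_level_ok = False
print("word-level handled unseen word:", word_level_ok)

# The byte-level scheme never fails, but spends one token per character
# on "language", exactly as it does on "asdfjkl".
print(byte_encode("language"))
print(byte_encode("asdfjkl"))
```

The byte-level encoder is robust but treats a common word and keyboard mash identically, which is the loss of structure that subword tokenization tries to recover.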

#### What Actually Happens?

In practice, a process called tokenization splits text into tokens drawn from a learned vocabulary. GPT-2 uses an algorithm called Byte-Pair Encoding (BPE). It's super weird.

The tokenizer has a dictionary of tokens. These tokens are sequences of characters. If we print out the vocabulary of a tokenizer we see seemingly random sequences of characters.

A leading `Ġ` means the token begins with a space. Tokens with a leading space are different from those without one.

### Byte-Pair Encoding

**Core idea**: An algorithm for segmenting text into tokens.

**Key properties**: BPE is encouraged to use frequent words and frequent subwords, e.g. -ing, -ed, -s. For example, unlikeliest -> un-, likely, -est.
Another interpretation is that the most frequent words are represented as a single token, while rarer words are broken down into two or more subword tokens.

Instead of breaking up words at every whitespace or breaking up at every character, we let the data tell us how to tokenize.

> Jargon: *Subword tokenization*: tokens can be parts of words as well as whole words. BPE belongs to this class of algorithms.

#### Training time (Building a vocabulary)

Step 1: Gather a corpus of text.

Step 2: Induce a vocabulary by operating on the corpus.

- Step 2a: Choose the two symbols that are most frequently adjacent in the training corpus.  
- Step 2b: Add new merged symbol to the vocabulary.  
- Step 2c: Replace all occurrences of the pair in the corpus with the new merged symbol.  
- Step 2d: Repeat (go back to 2a) until the vocabulary reaches a desired size.  

> Note: We should retain all the individual characters in the vocabulary. This is because we want to be able to encode arbitrary text. 
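The training steps above can be sketched on a toy corpus. This is a simplified illustration, not GPT-2's actual tokenizer: it assumes text is pre-split on whitespace so merges never cross word boundaries, starts from characters rather than bytes, and breaks ties between equally frequent pairs arbitrarily (here, by the lexicographically larger pair). The function name `train_bpe` is hypothetical.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merges from a whitespace-split toy corpus."""
    # Each word is a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for word in corpus.split())
    # Keep every individual character so arbitrary text stays encodable.
    vocab = sorted({symbol for word in words for symbol in word})
    merges = []
    for _ in range(num_merges):
        # Step 2a: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Arbitrary tie-break: highest count, then largest pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merged = best[0] + best[1]
        # Step 2b: add the merged symbol to the vocabulary.
        merges.append(best)
        vocab.append(merged)
        # Step 2c: replace every occurrence of the pair in the corpus.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 2d: loop until the requested number of merges is reached.
    return vocab, merges


vocab, merges = train_bpe("low low low lower lowest newer newer", num_merges=3)
print(merges)
print(vocab)
```

On this corpus the frequent word "low" ends up as a single symbol after two merges, while rarer words remain split into smaller pieces.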

#### Test time (Encode using trained vocabulary)

Given an input piece of text, the algorithm breaks it up into tokens from the trained vocabulary.
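One simple way to encode with a trained vocabulary is to start from characters and replay the learned merges in the order they were learned. The sketch below assumes a hypothetical merge list and a hypothetical helper `bpe_encode`; production tokenizers such as GPT-2's work on bytes and use more efficient data structures.

```python
def bpe_encode(word, merges):
    """Split `word` into subword tokens by replaying merges in training order."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever it appears adjacently.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merges, as if learned from a small corpus.
merges = [("o", "w"), ("l", "ow"), ("e", "r")]

print(bpe_encode("lower", merges))
print(bpe_encode("slow", merges))
print(bpe_encode("wet", merges))
```

Words built from learned pieces collapse into a few tokens, while unfamiliar strings fall back to single characters, so every input remains encodable.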

In [None]:
print(reference_gpt2.tokenizer.vocab_size)
print(reference_gpt2.tokenizer.vocab)

sorted_vocab = sorted(list(reference_gpt2.tokenizer.vocab.items()), key=lambda n: n[1])
print(sorted_vocab)

In [31]:
print(reference_gpt2.tokenizer.encode("<|endoftext|>"))
print(reference_gpt2.tokenizer.decode(50256))
print(sorted_vocab[-5:])

[50256]
<|endoftext|>
[('Ġregress', 50252), ('ĠCollider', 50253), ('Ġinformants', 50254), ('Ġgazed', 50255), ('<|endoftext|>', 50256)]


### Tokenization is a Headache

Whether a word begins with a capital letter or a space matters.  
Arithmetic is a mess: numbers are split into chunks of inconsistent length.

Play around with the OpenAI tokenizer [here](https://platform.openai.com/tokenizer)

In [5]:
print(reference_gpt2.to_str_tokens("Butterfly"))
print(reference_gpt2.to_str_tokens(" Butterfly"))
print(reference_gpt2.to_str_tokens(" butterfly"))
print(reference_gpt2.to_str_tokens("butterfly"))

print(reference_gpt2.to_str_tokens("1234+5678=12345678"))


['<|endoftext|>', 'But', 'ter', 'fly']
['<|endoftext|>', ' Butterfly']
['<|endoftext|>', ' butterfly']
['<|endoftext|>', 'but', 'ter', 'fly']
['<|endoftext|>', '12', '34', '+', '5', '678', '=', '123', '45', '678']


`<|endoftext|>` is a special token that is prepended to every sequence to indicate the start of the sequence.

Models tend to do some weird things with the first token in a sequence, so we use this to avoid that. A little hacky but it works. It can be disabled with `prepend_bos=False`.

### Summary

We learn a vocabulary of tokens (sub-words).

We try to losslessly convert language to integers by tokenizing it.

We convert integers to vectors using a lookup table.  

(Note: the input to the transformer is a sequence of tokens, not vectors.)

## Logits - Transformer Outputs

*Goal*: A probability distribution over next tokens, for every prefix of the sequence: given n tokens, we make n next-token predictions.

*Problem*: How to convert a vector to a probability distribution?

*Answer*: Use a softmax ($x_i \to \frac{e^{x_i}}{\sum_j e^{x_j}}$). The exponential makes everything positive and the normalization makes it sum to 1.

So the model outputs a tensor of logits, one vector of size $d_{vocab}$ for each input token.
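As a small worked example of the softmax, the sketch below applies it to a made-up vector of three logits using only the standard library. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Map a list of real-valued logits to a probability distribution."""
    # Shifting by the max avoids overflow in exp() and cancels in the ratio.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical logits for a 3-token vocabulary
probs = softmax(logits)
print(probs)
print(sum(probs))

# Adding a constant to every logit leaves the distribution unchanged.
shifted_probs = softmax([x + 100.0 for x in logits])
print(shifted_probs)
```

Negative logits still receive positive probability, and only the differences between logits matter, not their absolute values.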

### Step 1: Generation

Convert the text to tokens. The resulting tensor has shape (batch, position).

In [51]:
reference_text = "I am an autoregressive, decoder-only, GPT-2 style transformer."
tokens = reference_gpt2.to_tokens(reference_text)

# Print tokens and their string representations
print(tokens)
print(reference_gpt2.to_str_tokens(tokens))

# shape = (batch_size, sequence_length)
print(tokens.shape)

tensor([[50256,    40,   716,   281,  1960,   382, 19741,    11,   875, 12342,
            12,  8807,    11,   402, 11571,    12,    17,  3918, 47385,    13]],
       device='mps:0')
['<|endoftext|>', 'I', ' am', ' an', ' aut', 'ore', 'gressive', ',', ' dec', 'oder', '-', 'only', ',', ' G', 'PT', '-', '2', ' style', ' transformer', '.']
torch.Size([1, 20])


### Step 2: Map tokens to logits



In [52]:
# Use GPU if available
tokens = tokens.to(device)

# run_with_cache means cache all intermediate activations, we will view these later
logits, cache = reference_gpt2.run_with_cache(tokens)

# shape = (batch_size, sequence_length, vocab_size)
print(logits.shape)

torch.Size([1, 20, 50257])


### Step 3: Convert the logits to a distribution with the softmax

In [53]:
log_probs = logits.log_softmax(dim=-1)
probs = logits.softmax(dim=-1)  

What is the most likely next token for each prefix?

Output of the form:
(token, next token prediction)

In [11]:
list(zip(reference_gpt2.to_str_tokens(reference_text), reference_gpt2.tokenizer.batch_decode(logits.argmax(dim=-1)[0])))

[('<|endoftext|>', '\n'),
 ('I', "'m"),
 (' am', ' a'),
 (' an', ' avid'),
 (' aut', 'od'),
 ('ore', 'sp'),
 ('gressive', ','),
 (',', ' and'),
 (' dec', 'ently'),
 ('oder', '-'),
 ('-', 'driven'),
 ('only', ','),
 (',', ' and'),
 (' G', 'IM'),
 ('PT', '-'),
 ('-', 'only'),
 ('2', '.'),
 (' style', ','),
 (' transformer', '.'),
 ('.', ' I')]

In [73]:
# Map distribution to a token
last_token = logits[0, -1] # (batch -> 0, sequence index -> -1, vocab -> all)
next_token = last_token.argmax() # take the highest probability token

# Print the next token integer and corresponding string
print(f"{next_token.item()}: {reference_gpt2.tokenizer.decode(next_token)}")

314:  I


Append the predicted token to the end of the input and re-run the model.

In [83]:
# Clone and detach next_token so it is not connected to any computation graph
cloned_next_token = next_token.clone().detach()

# Move the clone to the same device as the original tokens and set its data type to int64
cloned_next_token_device = cloned_next_token.to(device=device, dtype=torch.int64)

# Add two singleton dimensions so the shape (1, 1) lines up with tokens of shape (batch, position)
reshaped_next_token = cloned_next_token_device.unsqueeze(0).unsqueeze(0)

# Concatenate along the last (position) dimension, giving shape (batch, n + 1)
next_tokens = torch.cat([tokens, reshaped_next_token], dim=-1)

# Run the concatenated tokens through the GPT-2 model to get new logits
new_logits = reference_gpt2(next_tokens)

# Print the new input tokens
print("New Input:", next_tokens)

# Decode the new input tokens to string using the tokenizer
print("New Input:", reference_gpt2.tokenizer.decode(next_tokens[0]))
print("New Shape: ", new_logits.shape)

# Predict the next token using the new logits 
next_token = new_logits[0, -1].argmax()

# Print the next token integer and corresponding string
print(f"{next_token.item()}: {reference_gpt2.tokenizer.decode(next_token)}")


New Input: tensor([[50256,    40,   716,   281,  1960,   382, 19741,    11,   875, 12342,
            12,  8807,    11,   402, 11571,    12,    17,  3918, 47385,    13,
           314]], device='mps:0')
New Input: <|endoftext|>I am an autoregressive, decoder-only, GPT-2 style transformer. I
New Shape:  torch.Size([1, 21, 50257])
423:  have


### Summary

**Takes in language, predicts the next token for each prefix in a causal way.**  
> Transformers are sequence operation models. They take in a sequence, do processing on each token in parallel, and then use attention to move information between tokens.

We convert language to a sequence of integers with a tokenizer.

We convert integers to vectors with a lookup table.

The output is a vector of logits for each input token. We convert these to a probability distribution with a softmax, and then convert this to a token either by taking the most probable token or by sampling from the distribution.

We append this token to the input and run the model again to generate more text.
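The whole generation loop can be sketched without loading a real model. In the example below, `toy_model` is a hypothetical stand-in that returns made-up logits for the next token, and `VOCAB` is an invented four-token vocabulary. The loop shows both decoding options: greedy (argmax) and sampling from the softmax distribution.

```python
import math
import random

VOCAB = ["<bos>", "a", "b", "c"]  # invented vocabulary for illustration


def toy_model(tokens):
    """Hypothetical stand-in for a transformer: logits for the next token only.

    It favours the token that follows the last one in the cycle a -> b -> c -> a.
    """
    preferred = tokens[-1] % (len(VOCAB) - 1) + 1
    return [3.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(tokens, steps, sample=False, rng=None):
    """Repeatedly predict a next token, append it, and run again."""
    tokens = list(tokens)
    for _ in range(steps):
        logits = toy_model(tokens)
        if sample:
            # Draw the next token in proportion to its probability.
            probs = softmax(logits)
            next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        else:
            # Greedy decoding: take the token with the largest logit.
            next_token = max(range(len(VOCAB)), key=lambda i: logits[i])
        tokens.append(next_token)
    return tokens


greedy = generate([0], steps=4)
print([VOCAB[t] for t in greedy])

sampled = generate([0], steps=4, sample=True, rng=random.Random(0))
print([VOCAB[t] for t in sampled])
```

Greedy decoding is deterministic, while sampling can pick lower-probability tokens and so produces more varied text from the same prompt.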