# GPT-2 From Scratch

This is my implementation of GPT-2 (124M) based on Andrej Karpathy's tutorial on YouTube, along with my notes on the same.

Initially, we're going to build an exact replica of GPT-2. Later, we will train the model and perhaps try to improve on it. Here, we reproduce GPT-2 as it is, though some of the class names may differ. The naming of the parameters in all the module classes follows the `transformers` library's state dictionary.

In [None]:
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

class CausalSelfAttention(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0

        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

        self.n_head = config.n_head
        self.n_embd = config.n_embd

        # 4D shape now
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size)) 

    def forward(self, x):
        B, T, C = x.size() # Batch Size, Sequence Length, Embedding Dim

        qkv = self.c_attn(x)

        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        y = self.c_proj(y)

        return y


class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()

        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class GPT(nn.Module):

    def __init__(self, config:GPTConfig):
        super().__init__()
        self.config = config
        
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd), # token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd), # positional embeddings
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_final = nn.LayerNorm(config.n_embd) # final LayerNorm (named `ln_f` in the `transformers` state dict)
        )) 

        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.size() # idx holds token ids with shape (B, T); there is no channel dim yet

        assert T <= self.config.block_size

        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = pos_emb + tok_emb

        for block in self.transformer.h:
            x = block(x)

        x = self.transformer.ln_final(x)

        logits = self.lm_head(x)

        return logits
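
As a sanity check on the "124M" label, the parameter count can be worked out by hand from the layer shapes in the classes above. This is a minimal sketch in plain Python; the helper name `linear` is mine. One assumption to flag: as far as I know, the released GPT-2 shares the token embedding matrix with the output head, so the 124M figure does not count `lm_head` separately, whereas the `GPT` class above keeps `lm_head` as its own matrix.

```python
# Parameter count for the default GPTConfig, derived from the layer shapes above.
block_size, vocab_size, n_layer, n_embd = 1024, 50257, 12, 768

def linear(n_in, n_out, bias=True):
    """Number of parameters in nn.Linear(n_in, n_out)."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd  # one weight and one bias vector

per_block = (
    layer_norm                      # ln_1
    + linear(n_embd, 3 * n_embd)    # attn.c_attn
    + linear(n_embd, n_embd)        # attn.c_proj
    + layer_norm                    # ln_2
    + linear(n_embd, 4 * n_embd)    # mlp.c_fc
    + linear(4 * n_embd, n_embd)    # mlp.c_proj
)

embeddings = vocab_size * n_embd + block_size * n_embd  # wte + wpe
lm_head = linear(n_embd, vocab_size, bias=False)

# The causal mask is a buffer, not a parameter, so it is not counted.
without_lm_head = embeddings + n_layer * per_block + layer_norm  # last term: final LayerNorm
with_lm_head = without_lm_head + lm_head

print(f"without a separate lm_head: {without_lm_head:,}")  # 124,439,808
print(f"with a separate lm_head:    {with_lm_head:,}")     # 163,037,184
```

So the model as written here carries roughly 163M parameters unless the output head is tied to `wte`.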


### A Few Notes:

1. GPT-2 used the approximate (tanh) version of GELU. The reason was that the exact version was quite slow in TensorFlow, where GPT-2 was trained. There is no real reason today to avoid the exact version of GELU. So, when we try to improve this model, we will use GELU in its exact form.

2. GPT-2 changed the placement of the `LayerNorm` layers relative to the Attention Is All You Need paper (see Shakespeare-in-Me): here they are applied before the attention and MLP sub-layers. In the original paper, `LayerNorm` was applied after the residual addition, so it sat on the residual stream itself. That is not desirable, because you want your residual stream to consist of plain additions so that the gradients flow through it.

3. Attention is an aggregation operation in which information flows across tokens. In the MLP layer, by contrast, the operation is performed on each token individually.

4. Causal self-attention is the term for attention in which each token can only look at itself and the previous tokens. This is in contrast to the bidirectional attention used with masked language modeling (as in BERT), where a token is allowed to look at both sides but some input tokens are masked out.

5. In the Shakespeare code, we had multiple modules: one for a single head, one for concatenating the multiple heads, etc. Here, we put all of it in one module by doing some smart `.view()` operations.
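
To get a feel for note 1, here is a small sketch comparing the exact GELU with its tanh approximation using only the standard library. The function names `gelu_exact` and `gelu_tanh` are mine; in PyTorch the two variants correspond to `nn.GELU()` and `nn.GELU(approximate='tanh')`.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # tanh-based approximation of the same function
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a grid on [-6, 6] and report the largest absolute difference.
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on [-6, 6]: {max_gap:.2e}")
```

The two curves stay very close on this range, which is why swapping one for the other should mostly be a question of speed and of matching the original checkpoint rather than of model quality.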

What we have above is a complete GPT-2 implementation. If we can somehow get the weights from `transformers` and load them into our model, we should get performance similar to GPT-2. Andrej does go into the details of how you can load the weights, but since we are going to train the model from scratch anyway, I haven't implemented it.

## More Efficient Attention Block

This version's attention block differs in some ways from the NanoGPT version. So we need to work out how to go from the previous code (from Shakespeare-in-Me) to the code in this notebook. In the code block below, I have copied only the parts of `CausalSelfAttention` that differ from the Shakespeare-in-Me version.

In [None]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config:GPTConfig):

        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.size() # Batch Size, Sequence Length, Embedding Dim

        qkv = self.c_attn(x)

        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        y = self.c_proj(y)

        return y

1. The first thing to notice is the output dimension of the `c_attn` linear layer: it is three times the size we would expect. For example, if we want the query, key, and value vectors to be 64-dimensional, this linear layer has an output of size $3 \times 64$. That already gives us a hint as to what we're going to do. Assuming that you're passing just one input sample `x` to this linear layer, the matrix multiplication is going to look like this:
$$
\begin{bmatrix}
W_Q \\
W_K \\
W_V
\end{bmatrix}
\cdot
\begin{bmatrix}
x
\end{bmatrix}
=
\begin{bmatrix}
q \\
k \\
v
\end{bmatrix}
$$

where `q`, `k`, `v` are the query, key, and value vectors produced from that input, and $W_Q, W_K, W_V$ are the matrices that we had in the attention layer, except that they are stacked one below the other instead of being kept as separate matrices. `qkv = self.c_attn(x)` produces this output.

2. Given that this is the output, it makes sense to split it, right? Once we split it, we can access the individual query, key, and value vectors. This is what this line does: `q, k, v = qkv.split(self.n_embd, dim=2)`. The dimension is 2 (the last dimension) because, remember, the input passed is not a single vector but a $B \times T \times C$ tensor. For each sequence in the batch and for each input token, we're producing query, key, and value vectors. Each of `q`, `k`, `v` is a tensor of size $B \times T \times 64$, where 64 is $d_{model}$ in this toy example.

3. Now, what is with the `view` and `transpose`? Think about how we would implement multi-headed attention. If we wanted four heads, we would have four sets of $W_Q, W_K, W_V$ matrices. Here, we are doing the same thing with a single matrix because it is more efficient this way. Going back to point 1, $W_Q, W_K, W_V$ are themselves not single matrices; each one stacks the separate matrices that produce the vectors for the different heads. I am going to elaborate with $W_K$; the rest is similar.

$$
W_K = \begin{bmatrix}
W_{K_1} \\
W_{K_2} \\
W_{K_3} \\
W_{K_4}
\end{bmatrix}
$$

When we multiply the input with this matrix, we are going to get something that looks like:

$$
k = \begin{bmatrix}
k_1 \\
k_2 \\
k_3 \\
k_4
\end{bmatrix}
$$

The same holds for $W_Q$ and $W_V$, which produce $q_1, \dots, q_4$ and $v_1, \dots, v_4$.

So we can see that we are implementing multi-headed attention within a single matrix multiplication. Also observe that the dimensions of the matrix and the input match. When we take dot products of keys and queries, we want to do so separately for each of the four heads: $q_1 \cdot k_1$, $q_2 \cdot k_2$, and so on (except that these are matrices and you have to check that the dimensions match, but the idea is the same). So with the `view` and `transpose` functions, we are extracting and rearranging the q, k, v vectors of each head.

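So before taking the dot product, what's the desired shape for the q, k, v tensors? It's (batch, heads, tokens, head dimension), which in this toy example is $64 \times 4 \times 8 \times 16$: a batch of 64 sequences, 4 heads, 8 tokens, and 16 dimensions per head (so $4 \times 16 = 64 = d_{model}$). So first, we `view` to get $64 \times 8 \times 4 \times 16$, then we `transpose` dimensions 1 and 2 to get the desired shape.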

4. Then it's just a matter of applying the formula for self-attention, which operates along the last two dimensions, so I won't elaborate on it here.
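
For reference, the formula from point 4 can be written for a single head with the causal mask made explicit. The symbols $M$ and $d_{head}$ are my notation and do not appear in the code; $M$ plays the role of the `masked_fill` call and $d_{head}$ is `k.size(-1)`.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_{head}}} + M\right) V,
\qquad
M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}
$$

Because the head dimension sits before the last two dimensions, the same formula is applied to all heads (and all batch elements) in parallel.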

The reason why this implementation is faster, I believe, is that it avoids moving the data between High Bandwidth Memory and the GPU's processors multiple times. This data movement is slow, and the fewer times we do it, the better.
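
The points above can be checked numerically. The sketch below uses toy sizes of my own choosing and hypothetical layer names (`w_q`, `w_k`, `w_v`). It builds three separate projections, stacks their weights into one fused `c_attn`-style layer, and confirms that splitting the fused output gives the same q, k, v, and that `view` followed by `transpose` just carves each vector into per-head slices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, n_head, head_dim = 2, 8, 4, 16   # toy sizes, assumed for illustration
C = n_head * head_dim                  # d_model = 64

# Three separate projections, as in a one-matrix-per-role implementation.
w_q, w_k, w_v = (nn.Linear(C, C) for _ in range(3))

# One fused layer: the three weight matrices stacked along the output dimension.
c_attn = nn.Linear(C, 3 * C)
with torch.no_grad():
    c_attn.weight.copy_(torch.cat([w_q.weight, w_k.weight, w_v.weight], dim=0))
    c_attn.bias.copy_(torch.cat([w_q.bias, w_k.bias, w_v.bias], dim=0))

x = torch.randn(B, T, C)
with torch.no_grad():
    q, k, v = c_attn(x).split(C, dim=2)
    # The fused output matches the separate projections up to float rounding.
    same = all(
        torch.allclose(a, b, atol=1e-5)
        for a, b in [(q, w_q(x)), (k, w_k(x)), (v, w_v(x))]
    )

    # (B, T, C) -> (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
    k_heads = k.view(B, T, n_head, head_dim).transpose(1, 2)
    # Head h is simply the h-th block of head_dim columns of k.
    head_is_slice = all(
        torch.equal(k_heads[:, h], k[:, :, h * head_dim:(h + 1) * head_dim])
        for h in range(n_head)
    )

print(same, tuple(k_heads.shape), head_is_slice)
```

With the head dimension moved to position 1, `q @ k.transpose(-2, -1)` computes the $T \times T$ score matrix for every head at once.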

## Generating Output

Now that we have the model ready, we need to implement text generation. Unlike with the Shakespeare language model, here we want to pass a prompt to the GPT-2 model and then generate from there. So the process is going to look like this:

1. Convert the prompt into GPT-2 tokens. We'll use the `tiktoken` library for this. We will also need to add a batch dimension since the forward pass expects a $B \times T$ shape.
2. With `torch.no_grad`, pass the tokens through the model to get the logits. Remember, the transformer will produce an output for each of the tokens you passed to it. We only care about the last set of logits, since these determine the probabilities for the next token. So, we throw away the earlier logits and keep only the last ones.
3. Then, the usual: pass the logits through the softmax and sample from the distribution.
4. Finally, we concatenate this newly generated token to the prompt and generate a new token again.
5. Repeat steps 2-4 a bunch of times.
6. When you are done, decode the generated tokens back into text.

In [None]:
num_return_sequences = 5 # number of sequences to generate.
max_length = 30 # Total length of each sequence in tokens, prompt included

# Instantiate the model
model = GPT(GPTConfig())
model.eval()
model.to('cuda') # This is not going to work on laptop

# Convert the prompt into tokens using the tokenizer used for GPT-2
import tiktoken
enc = tiktoken.get_encoding('gpt2') # the encoding is registered as 'gpt2' (no hyphen)
tokens = enc.encode("Hello, I'm a language model, ")
tokens = torch.tensor(tokens, dtype=torch.long)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1) # Add a batch dim, then repeat the same prompt five times -> (5, T)

x = tokens.to('cuda')

torch.manual_seed(42)
torch.cuda.manual_seed(42)

while x.size(1) < max_length: # stop once each sequence holds max_length tokens
    with torch.no_grad():
        logits = model(x)
        logits = logits[:, -1, :]

        probs = F.softmax(logits, dim=-1)

        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1) # Use Top K sampling method to select 50 tokens with highest probabilities

        ix = torch.multinomial(topk_probs, 1) # Sample one token from the top 50 tokens for each batch-> (B, 1)

        # From the top-k token indices, select only the one that was sampled for each sequence in the batch, and collect them in one tensor
        xcol = torch.gather(topk_indices, -1, ix)

        # Append the new tokens to the sequence
        x = torch.cat((x, xcol), dim=1)

# Decode and print
for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist() # Get the tokens for the ith sequence
    decoded = enc.decode(tokens)
    print("> ", decoded)
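
The `topk` / `multinomial` / `gather` step is the least obvious part of the loop, so here it is in isolation on a toy distribution. The probabilities below are made up for illustration. The key point is that `ix` holds positions within the top-k list, and `gather` maps them back to actual vocabulary ids.

```python
import torch

torch.manual_seed(0)
# Toy next-token probabilities: a batch of 2 sequences over a 6-token vocabulary.
probs = torch.tensor([[0.05, 0.40, 0.12, 0.30, 0.05, 0.08],
                      [0.25, 0.05, 0.04, 0.06, 0.50, 0.10]])

k = 3
topk_probs, topk_indices = torch.topk(probs, k, dim=-1)  # both have shape (B, k)

# multinomial treats each row as weights, so the rows need not sum to 1.
ix = torch.multinomial(topk_probs, 1)        # (B, 1), positions in 0..k-1
xcol = torch.gather(topk_indices, -1, ix)    # (B, 1), actual vocabulary ids

print(topk_indices.tolist())  # [[1, 3, 2], [4, 0, 5]]
print(ix.tolist(), xcol.tolist())
```

Tokens outside the top k can never be sampled, which is what keeps the generation from wandering off into very unlikely tokens.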

Now, this is all well and good, but we are hardcoding the device everywhere. Instead, we want PyTorch to detect the fastest device available and run on that. At the moment, if you were to run this on a laptop, it might throw an error because CUDA may not be available. So, we fix this here:

In [None]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

Now, in the code we can just move the tensors to the device that is available.
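
Here is a minimal sketch of what that looks like. The `mps` branch for Apple Silicon is my addition and assumes a PyTorch build with MPS support; the tensor `x` is a stand-in for the tokenized prompt.

```python
import torch

# Pick the fastest backend that is available and fall back to the CPU.
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = 'mps'  # assumption: Apple Silicon GPU with an MPS-enabled build
print(f"using device: {device}")

# Move tensors with `device` instead of a hardcoded 'cuda'.
x = torch.arange(8, dtype=torch.long).unsqueeze(0).to(device)  # stand-in for the prompt tokens

torch.manual_seed(42)
if device == 'cuda':
    torch.cuda.manual_seed(42)  # only relevant when CUDA is in use
```

In the generation cell above, this means replacing `model.to('cuda')` with `model.to(device)` and `tokens.to('cuda')` with `tokens.to(device)`.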