# Creating a GPT from Scratch

In this notebook, I implement a GPT from scratch following Andrej Karpathy's YouTube series, and include my notes on the lecture.

In [1]:
# imports
import torch
import torch.nn as nn
from torch.nn import functional as F

### Creating the Dataset

We create the vocabulary and the dataset as we have been doing in the other `makemore` notebooks. But here, instead of using the `names` dataset, we will be using Tiny Shakespeare. 

In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("Length of the dataset in characters: ", len(text))

Length of the dataset in characters:  1115394


Here, we are building a character-level language model. Our vocabulary is the set of all characters in the dataset, and the *tokens* in our language model are these characters mapped to integers. In LLMs, tokenization is often done at the *subword* level, though other schemes are possible too.

The larger the vocabulary, the more text each token can represent, so you can encode a given sentence using fewer tokens. Conversely, with a smaller vocabulary, you need more tokens to represent the same sentence.

For example, a character-level language model needs `len(sentence)` tokens to represent a sentence. With word-level tokenization, we would need only `len(sentence.split(" "))` tokens, which is fewer.
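As a quick check of the trade-off described above, here is a minimal sketch that counts tokens for one sentence under both schemes. The sentence is an arbitrary example and the whitespace split is only a crude stand-in for a real word-level tokenizer.

```python
# Arbitrary example sentence (not taken from the dataset).
sentence = "To be, or not to be"

char_tokens = list(sentence)        # one token per character
word_tokens = sentence.split(" ")   # one token per whitespace-separated word

print(len(char_tokens))  # 19
print(len(word_tokens))  # 6
```

The character-level vocabulary stays tiny (65 symbols here), while a word-level vocabulary would need one entry per distinct word.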

In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("Vocab Size is: ", vocab_size)

Vocab Size is:  65


In [4]:
# Create character-to-integer and integer-to-character mappings, i.e. the tokenizer that encodes and decodes tokens

stoi = { ch:i for i, ch in enumerate(chars) }
itos = { i:ch for i, ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]  # takes an input string, and outputs a list of integers. i.e. the character map
decode = lambda l: "".join([itos[i] for i in l]) # takes the token list, and produces the string for it

print(encode("Hii, there!"))
print(decode(encode("Hii, there!")))

[20, 47, 47, 6, 1, 58, 46, 43, 56, 43, 2]
Hii, there!


We now encode the text into a PyTorch tensor and split the encoded dataset into training and validation sets.

In [5]:
data = torch.tensor(encode(text), dtype=torch.long)

In [6]:
cut = int(0.9 * len(data))
train_data = data[:cut]
validation_data = data[cut:]

We define the context length first. This is the maximum number of tokens the model can look at when making a prediction. The context does not always have to be 8 tokens long; it can be shorter. A single chunk of `block_size + 1` tokens therefore yields `block_size` training examples, one for each context length from 1 to `block_size`. Note that we are now dealing with integer tokens, not characters.

In [7]:
block_size = 8 # context length: maximum 8 tokens can be taken as context

sample_x = train_data[:block_size]
sample_y = train_data[1:block_size + 1]

for t in range(block_size):
    context = sample_x[:t+1]
    target = sample_y[t]

    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


### Creating Batches

Now that the dataset is there, we need to think about how the input text can be passed as a batch. 

Before that, an important thing to note about transformers is that there is a maximum number of tokens you can pass to them. They can handle sequential inputs of varying length, but only up to a fixed cap such as 512. This cap is the context length. An input can contain at most that many tokens, and any smaller number is allowed.
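A minimal sketch of what the cap means in practice, using a plain Python list as a stand-in for token ids. Keeping only the most recent tokens is an assumption made for illustration; the text above does not prescribe how an over-long input is shortened.

```python
block_size = 8  # context length used in this notebook

# Hypothetical running sequence of token ids that has grown past the cap.
tokens = list(range(12))

# Assumption: keep only the most recent block_size tokens as context.
context = tokens[-block_size:]

# Inputs shorter than the cap pass through unchanged.
short = tokens[:3][-block_size:]

print(context)  # [4, 5, 6, 7, 8, 9, 10, 11]
print(short)    # [0, 1, 2]
```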

Now let's think about how we would create and pass a batch of sequences to the model. Our wishlist is the following:

1. We want to pick arbitrary sequences so that the model can generalize well. How do we pick random sequences? Just pick out random starting indexes.
2. How big a sequence should you pick? Well, it cannot be longer than the context length of the model. For the moment, assume you pick inputs of size `block_size`, i.e. the context length. For example, with a `batch_size` of 4 and a `block_size` of 8, you would randomly pick 4 indices in the dataset and take 8 characters starting at each of them.
3. What should be the targets? The targets are just the next character. 

As we have seen before, one sequence of 8 characters gives us 8 training examples (see the cell above). So a batch of 4 sequences, each of length 8, gives 4 * 8 = 32 training examples.
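The count can be checked with a small sketch. It uses plain Python lists and fixed start offsets (both hypothetical stand-ins for the encoded text and the random indices) so that the result is easy to verify by hand.

```python
batch_size, block_size = 4, 8

# Stand-in token ids; the notebook would use the encoded Tiny Shakespeare text.
tokens = list(range(100))
starts = [0, 10, 20, 30]  # hypothetical fixed offsets instead of random ones

x = [tokens[i:i + block_size] for i in starts]
y = [tokens[i + 1:i + block_size + 1] for i in starts]

# Every prefix of every row is one (context, target) training example.
examples = [(x[b][:t + 1], y[b][t])
            for b in range(batch_size)
            for t in range(block_size)]

print(len(examples))  # 32
print(examples[0])    # ([0], 1)
print(examples[7])    # ([0, 1, 2, 3, 4, 5, 6, 7], 8)
```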

**Important:** Each training sample can be passed independently to the transformer!

The key is going to be figuring out how to pass this to the transformer.

In [8]:
torch.manual_seed(1337)

batch_size = 4
block_size = 8

def get_batch(split:str):
    data = train_data if split == 'train' else validation_data
    ix = torch.randint(len(data) - block_size, (batch_size, )) # randomly select batch_size many indices. len(data) - block_size just handles edge case
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y

xb, yb = get_batch('train')

print(f"Shape of inputs: {xb.shape}")
print(f"Shape of outputs: {yb.shape}")

Shape of inputs: torch.Size([4, 8])
Shape of outputs: torch.Size([4, 8])


For this batched input, we can again split the training examples. But note, this is NOT relevant till we get to transformers. At the moment, we are just training a bigram model.

In [9]:
i = 0
for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension ( PyTorch convention: (B, T, C) = (Batch, Time, Channel))
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"{i}: When input is {context} the target is: {target}")
        i+=1

0: When input is tensor([24]) the target is: 43
1: When input is tensor([24, 43]) the target is: 58
2: When input is tensor([24, 43, 58]) the target is: 5
3: When input is tensor([24, 43, 58,  5]) the target is: 57
4: When input is tensor([24, 43, 58,  5, 57]) the target is: 1
5: When input is tensor([24, 43, 58,  5, 57,  1]) the target is: 46
6: When input is tensor([24, 43, 58,  5, 57,  1, 46]) the target is: 43
7: When input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) the target is: 39
8: When input is tensor([44]) the target is: 53
9: When input is tensor([44, 53]) the target is: 56
10: When input is tensor([44, 53, 56]) the target is: 1
11: When input is tensor([44, 53, 56,  1]) the target is: 58
12: When input is tensor([44, 53, 56,  1, 58]) the target is: 46
13: When input is tensor([44, 53, 56,  1, 58, 46]) the target is: 39
14: When input is tensor([44, 53, 56,  1, 58, 46, 39]) the target is: 58
15: When input is tensor([44, 53, 56,  1, 58, 46, 39, 58]) the target is: 1
...

## Bigram Model

We've built a simple bigram model in the earlier part of this series. But since the dataset is new, and there are some slight tweaks in the implementation, I am reimplementing the code.


In [10]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()

        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # vocab_size X vocab_size lookup table

    def forward(self, idx, targets=None):
        
        logits = self.token_embedding_table(idx) # logits.shape = (B, T, C) = (4, 8, 65) in our case

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits = logits.view(B*T, C)
            targets = targets.view(B*T) # targets are of shape (B, T) 

            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token indices in the current context

        for _ in range(max_new_tokens):
            logits, loss = self(idx) # logits.shape is (B, T, C); T grows by one each iteration

            # we want the row for the last token in each sequence to predict the next token, i.e. the last element along the T dimension
            logits = logits[:, -1, :] # (B, C)

            probs = F.softmax(logits, dim=-1) # (B, C)

            next_idx = torch.multinomial(probs, num_samples=1)

            idx = torch.cat((idx, next_idx), dim=1)
        
        # idx will be the sequence generated for each batch
        return idx
    
m = BigramLanguageModel(vocab_size)

logits, loss = m(xb, yb)

print(f"Loss is: {loss.item():.4f}")
print("Generated sequence: ")
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=50)[0].tolist()))

Loss is: 4.8786
Generated sequence: 

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLER


**Note on Forward Pass:** Observe that this bigram model does not use any context. So we can treat each character of each sequence in the batch as a separate training example. For this model, each training example is just a single character, as follows:

When input is `tensor([24])` the target is: 43
When input is `tensor([43])` the target is: 58

What happens in the forward pass is that, for each character of each sequence in the batch, we pluck out a row from the `token_embedding_table` and interpret that row as the `logits` for the next character. Since our `vocab_size` is 65, a batch gives us `logits` of shape `(4, 8, 65)`.
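To make the "plucking out a row" picture concrete, here is a minimal sketch showing that calling an `nn.Embedding` gives the same result as indexing rows of its weight matrix. The variable name `table` and the small `(2, 3)` index tensor are illustrative choices, not part of the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)  # illustrative stand-in for token_embedding_table

idx = torch.tensor([[24, 43, 58], [44, 53, 56]])  # (B=2, T=3) token ids
logits = table(idx)                               # (2, 3, 65)

# Calling the embedding is the same as selecting rows of its weight matrix.
same = torch.equal(logits, table.weight[idx])

print(logits.shape)  # torch.Size([2, 3, 65])
print(same)          # True
```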

But there is one issue with this. `F.cross_entropy()` expects the class (channel) dimension to come second, i.e. an input of shape `(B, C, ...)`, whereas our logits have shape `(B, T, C)`. So we use `view` to reshape the logits to `(B*T, C)` and the targets to `(B*T)`. Imagine it as a 3D cube. It helps a lot!
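The shape requirement can be checked with a short sketch. The random `logits` and `targets` are stand-ins for real model output, and the `permute` variant is shown only as an equivalent alternative to the `view` approach used in this notebook.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)       # stand-in for model output
targets = torch.randint(C, (B, T))  # stand-in for next-token ids

# Approach used in this notebook: flatten batch and time into one dimension.
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Equivalent alternative: move the channel dimension to position 1, giving (B, C, T).
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), targets)

print(torch.allclose(loss_flat, loss_perm))  # True
```

Both calls average the loss over all `B*T` positions, which is why they agree.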

**Note on the Generate Function:** What is the wishlist for the generate function? For each sequence in the batch, we want to generate the next token. This next token is based only on the last character in the sequence, and *not* on the entire sequence! We haven't added context yet.

Further, we need to apply softmax to the logits and draw a sample from the resulting distribution. We don't just want the next predicted token; we also append it to the current context, which is then used to predict the following token.

### Training the Bigram Model

**Pro Tip:** For Adam, `lr=3e-4` works quite well in practice. But for smaller datasets, you can use much higher learning rates, as we do here.

In [19]:
model = BigramLanguageModel(vocab_size=vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [22]:
epochs = 10_000

for epoch in range(epochs):
    
    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"Loss is: {loss.item():.4f}")
print("Generated sequence: ")
print(decode(model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=150)[0].tolist()))

Loss is: 2.4321
Generated sequence: 

Isow's higr tioy s ld ser.
MIZd,
Thadar bu, w fio anghab' me h g weithadwe'd ond sitolste venck.
CUSCESeast flaive chalot f metstuplmeariloubeequrall 


The outputs are certainly not Shakespeare-like, and we will never get there with a bigram model, but this is a decent improvement over the untrained model.
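To put the two loss values above in context, here is a rough sanity check. It assumes a model that spreads probability uniformly over the 65 characters as a baseline, and it reads the trained loss as a perplexity; both are interpretive aids rather than part of the training code.

```python
import math

vocab_size = 65

# Cross-entropy of a model that assigns probability 1/65 to every character.
uniform_loss = -math.log(1 / vocab_size)
print(f"{uniform_loss:.4f}")  # 4.1744

# Perplexity implied by the trained bigram loss reported above.
trained_loss = 2.4321
perplexity = math.exp(trained_loss)
print(f"{perplexity:.1f}")
```

The untrained loss of 4.8786 is above the uniform baseline because the randomly initialized logits are not uniform, while the trained loss is well below it.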

## First Self-Attention Block

The bigram model was not paying any attention to the context. It only considered the previous character. Now, we will build a self-attention block that pays attention to the context of characters.