# Creating a GPT from Scratch

In this notebook, I implement a GPT from scratch, following Andrej Karpathy's YouTube series, and include my notes on the lecture.

In [1]:
# imports
import torch

## Creating the Dataset

We create the vocabulary and the dataset as we have been doing in the other `makemore` notebooks. But here, instead of using the `names` dataset, we will be using Tiny Shakespeare. 

In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("Length of the dataset in characters: ", len(text))

Length of the dataset in characters:  1115394


Here, we are building a character-level language model. Our vocabulary is the set of all characters in the dataset, and the *tokens* in our language model are those characters mapped to integers. In LLMs, tokenization is usually done at the *subword* level, but other granularities are possible too.

The larger the vocabulary, the larger the token-to-integer mapping, and the fewer tokens you need to represent a given sentence. Conversely, with a smaller vocabulary, you need more tokens to represent the same sentence.

For example, with a character-level language model, we need `len(sentence)` tokens to represent a sentence. With word-level tokenization (splitting on spaces), we would need only `len(sentence.split(" "))` tokens, which is typically far fewer.
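To make the trade-off concrete, here is a minimal sketch that counts tokens both ways. The sentence is a made-up example, and splitting on single spaces is only a naive stand-in for real word-level tokenization.

```python
# Hypothetical example sentence; any string works here.
sentence = "To be, or not to be"

# Character-level: every character (including spaces and punctuation) is a token.
char_tokens = list(sentence)

# Naive word-level: split on single spaces, so punctuation stays attached to words.
word_tokens = sentence.split(" ")

print("character-level tokens:", len(char_tokens))
print("word-level tokens:     ", len(word_tokens))
```

The word-level sequence is much shorter, but its vocabulary would have to contain every distinct word instead of a handful of characters.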

In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("Vocab Size is: ", vocab_size)

Vocab Size is:  65


In [4]:
# Create the character <-> integer mappings, i.e. the tokenizer that encodes and decodes tokens

stoi = { ch:i for i, ch in enumerate(chars) }  # character -> integer
itos = { i:ch for i, ch in enumerate(chars) }  # integer -> character
encode = lambda s: [stoi[c] for c in s]  # takes a string and returns a list of integers, one per character
decode = lambda l: "".join([itos[i] for i in l])  # takes a list of integers and returns the corresponding string

print(encode("Hii, there!"))
print(decode(encode("Hii, there!")))

[20, 47, 47, 6, 1, 58, 46, 43, 56, 43, 2]
Hii, there!
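Two properties of this kind of tokenizer are worth checking: decoding an encoded string returns the original string, and characters that never appeared in the corpus cannot be encoded. The sketch below demonstrates both on a tiny stand-in corpus. The `toy_*` names and the corpus are assumptions for illustration, not part of the notebook.

```python
# Tiny stand-in corpus (assumption); the notebook builds its vocabulary from Tiny Shakespeare.
toy_text = "hello there"
toy_chars = sorted(set(toy_text))

toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}  # character -> integer
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}  # integer -> character

def toy_encode(s):
    """Map each character of s to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

# Round trip: decode(encode(s)) gives back s for any s made of known characters.
round_trip = toy_decode(toy_encode("hello"))
print(round_trip)

# A character outside the vocabulary ('!' here) has no id, so the lookup fails.
try:
    toy_encode("hello!")
    unseen_encoded = True
except KeyError as err:
    unseen_encoded = False
    print("cannot encode unseen character:", err)
```

For the notebook's tokenizer this is not a problem during training, since the vocabulary is built from the same text that gets encoded, but it matters when encoding arbitrary prompts later.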


We now encode the text into a PyTorch tensor and split the encoded dataset into training and validation sets.

In [5]:
data = torch.tensor(encode(text), dtype=torch.long)

In [6]:
cut = int(0.9 * len(data))
train_data = data[:cut]
validation_data = data[cut:]
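As a quick sanity check on the split, the arithmetic can be reproduced without loading the data. This sketch assumes the encoded tensor has one token per character, so its length equals the character count printed earlier. The names `n_tokens`, `n_train`, and `n_val` are introduced here for illustration.

```python
# Dataset length in characters, as printed earlier in the notebook.
# With character-level tokens, this is also the number of tokens.
n_tokens = 1115394

cut = int(0.9 * n_tokens)  # first 90% of the tokens go to training
n_train = cut
n_val = n_tokens - cut     # the remaining tail is held out for validation

print("train tokens:     ", n_train)
print("validation tokens:", n_val)
```

Note that the split is contiguous rather than shuffled: the validation set is simply the last 10% of the text.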

## Bigram Model

We built a simple bigram model in an earlier part of this series. But since the dataset is new and the implementation has some slight tweaks, I am reimplementing it here.

We define the context length first. This is the maximum number of tokens that the model can look at when making a prediction. The context doesn't always have to be the full 8 tokens (the `block_size` set below), though; it can be shorter. A single chunk of `block_size + 1` tokens therefore yields `block_size` training examples, one for each context length from 1 to `block_size`. Notice that the inputs and targets are now integer token IDs rather than characters.

In [7]:
block_size = 8 # context length: maximum 8 tokens can be taken as context

sample_x = train_data[:block_size]
sample_y = train_data[1:block_size + 1]

for t in range(block_size):
    context = sample_x[:t+1]
    target = sample_y[t]

    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58
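The same context/target pairs are easier to read as characters. The sketch below repeats the loop on a plain string instead of a tensor. The string `"First Cit"` is an assumption: it is what the IDs above appear to decode to, judging by the tokenizer output shown earlier (for example `47` for `i` and `58` for `t`).

```python
block_size = 8

# Assumed 9-character chunk (block_size + 1); see the note above.
sample = "First Cit"

x = sample[:block_size]       # inputs:  "First Ci"
y = sample[1:block_size + 1]  # targets: "irst Cit" (inputs shifted by one)

# One training example per context length, from 1 up to block_size.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]

for context, target in pairs:
    print(f"When input is {context!r} the target: {target!r}")
```

Each target is simply the character that follows its context in the text, which is why `y` is `x` shifted by one position.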
