## Training a Large Language Model from Scratch in PyTorch

### Part I: Data Preparation and Preprocessing

In this section we cover the data preparation and sampling steps that get our input data ready for the LLM. You can download the sample data here: https://en.wikisource.org/wiki/The_Verdict

In [None]:
with open("sample_data/the-verdict.txt", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(raw_text[:20])  # print the first 20 characters

Total number of characters: 20479
I HAD always thought


Next we tokenize and embed the input text for our LLM.
- First we develop a simple tokenizer based on some sample text that we then apply to the main input text above.

In [None]:
import re
# Tokenize the input by splitting on whitespace and punctuation,
# then strip whitespace from each item and filter out any empty strings
tokenized_raw_text = [item.strip() for item in re.split(r'([,.?_!"()\']|--|\s)', raw_text) if item.strip()]
print(len(tokenized_raw_text))
print(tokenized_raw_text[:20])

4649
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']
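
To see what the regular expression does in isolation, here is a minimal sketch on a short made-up sentence. The helper name `simple_tokenize` and the sentence are illustrative assumptions, not part of the notebook. The capturing group keeps punctuation marks as separate tokens, and the filter drops whitespace and empty strings.

```python
import re

def simple_tokenize(text):
    # The capturing group makes re.split return the delimiters as well
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    # Drop pure-whitespace and empty pieces
    return [p.strip() for p in pieces if p.strip()]

tokens = simple_tokenize("Hello, world. Is this-- a test?")
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```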


Next we convert the text tokens into token IDs that embedding layers can process later. To do this, we first build a vocabulary that consists of all the unique tokens.

In [None]:
words = sorted(list(set(tokenized_raw_text)))
vocab_size = len(words)
print(f"Vocab size: {vocab_size}")

Vocab size: 1159


In [None]:
vocabulary = {token:integer for integer, token in enumerate(words)}

# Let's check the first 51 entries (indices 0 through 50)
for i, item in enumerate(vocabulary.items()):
    print(item)
    if i == 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
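
The same token-to-ID mapping can be sketched on a toy token list, which makes it easy to follow by hand. The token list below is an assumption chosen for illustration; the dictionary comprehensions mirror the ones used in the notebook.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Sort the unique tokens so that every token gets a stable integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
# The inverse mapping turns IDs back into tokens
inverse = {i: tok for tok, i in vocab.items()}

ids = [vocab[t] for t in tokens]
print(ids)                          # [4, 0, 3, 2, 4, 1]
print([inverse[i] for i in ids])    # the original token list
```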


We can put these all together into our tokenizer class:

In [None]:
class TokenizerLayer:
    def __init__(self, vocabulary):
        self.token_to_int = vocabulary
        self.int_to_token = {integer:token for token, integer in vocabulary.items()}

    # The encode function turns text into token ids
    def encode(self, text):
        encoded_text = re.split(r'([,.?_!"()\']|--|\s)', text)
        encoded_text = [item.strip() for item in encoded_text if item.strip()]
        return [self.token_to_int[token] for token in encoded_text]

    # The decode function turns token ids back into text
    def decode(self, ids):
        text = " ".join([self.int_to_token[i] for i in ids])
        # Replace spaces before the specified punctuations
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Initialize and test tokenizer layer
tokenizer = TokenizerLayer(vocabulary)
print(tokenizer.encode(""""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""))
print(tokenizer.decode(tokenizer.encode("""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride.""")))

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
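
A limitation of this first version is that `encode` looks every token up directly in the dictionary, so any word that is not in the vocabulary raises a `KeyError`. The sketch below reproduces this with a tiny assumed vocabulary; the standalone `encode` function is a simplified stand-in for the method above.

```python
import re

# Toy vocabulary, assumed for illustration only
vocab = {tok: i for i, tok in enumerate(sorted({"the", "last", "he", "painted", ","}))}

def encode(text, token_to_int):
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [token_to_int[t] for t in tokens]

known_ids = encode("the last he painted", vocab)
print(known_ids)  # [4, 2, 1, 3]

try:
    encode("the last he sculpted", vocab)
    unknown_failed = False
except KeyError as err:
    # The missing token is reported as the key that was not found
    unknown_failed = True
    missing_token = err.args[0]
    print("Unknown token:", missing_token)
```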


Next we add special tokens for unknown words and to mark the end of a text.

Special tokens include:

[BOS] - Beginning of Sequence

[EOS] - End of Sequence. This marks the end of a text and is usually used to concatenate multiple unrelated texts, e.g. two different documents, Wikipedia articles, books, etc.

[PAD] - Padding. If we train an LLM with a batch size greater than 1, the batch may include multiple texts of different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length.

[UNK] - Unknown. Denotes words not included in the vocabulary.

GPT-2 uses only the <|endoftext|> token, for both end of sequence and padding, to reduce complexity; it is analogous to [EOS].
Instead of an [UNK] token for out-of-vocabulary words, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks words down into subword units.
For our application, we use <|endoftext|> tokens between two independent sources of text.
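
As a rough sketch of how these tokens are used, the snippet below joins two unrelated texts with `<|endoftext|>` and pads a batch of ID sequences to a common length. The helper `pad_batch`, the example texts, and the padding ID `0` are assumptions made for illustration.

```python
docs = ["First document.", "Second, unrelated document."]
# Mark the boundary between independent texts
joined = " <|endoftext|> ".join(docs)
print(joined)

def pad_batch(sequences, pad_id):
    # Pad every sequence on the right up to the longest length in the batch
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

padded = pad_batch([[5, 1, 3], [2]], pad_id=0)  # pad_id=0 is a hypothetical ID
print(padded)  # [[5, 1, 3], [2, 0, 0]]
```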


In [None]:
tokenized_raw_text = [item.strip() for item in re.split(r'([,.?_!"()\']|--|\s)', raw_text) if item.strip()]
all_tokens = sorted(list(set(tokenized_raw_text)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocabulary = {token:integer for integer, token in enumerate(all_tokens)}
tokenizer = TokenizerLayer(vocabulary)
print(len(tokenized_raw_text))
print(tokenized_raw_text[:20])

4649
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']


In [None]:
for i, item in enumerate(list(vocabulary.items())[-5:]):
    print(item)

# Get the new length of our vocabulary
print(len(vocabulary.items()))

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)
1161


In [None]:
class TokenizerLayer:
    def __init__(self, vocabulary):
        self.token_to_int = vocabulary
        self.int_to_token = {integer:token for token, integer in vocabulary.items()}

    # The encode function turns text into token ids
    def encode(self, text):
        encoded_text = re.split(r'([,.?_!"()\']|--|\s)', text)
        encoded_text = [item.strip() for item in encoded_text if item.strip()]
        encoded_text = [item if item in self.token_to_int else "<|unk|>" for item in encoded_text]
        return [self.token_to_int[token] for token in encoded_text]

    # The decode function turns token ids back into text
    def decode(self, ids):
        text = " ".join([self.int_to_token[i] for i in ids])
        # Replace spaces before the specified punctuations
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Initialize and test tokenizer layer
tokenizer = TokenizerLayer(vocabulary)
print(tokenizer.encode(""""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""))
print(tokenizer.decode(tokenizer.encode("""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride.""")))

print(tokenizer.encode(""""This is a test! <|endoftext|> What is your favourite movie"""))
print(tokenizer.decode(tokenizer.encode("""This is a test! <|endoftext|> What is your favourite movie""")))

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
[1, 101, 595, 119, 1160, 0, 1159, 113, 595, 1157, 1160, 1160]
This is a <|unk|>! <|endoftext|> What is your <|unk|> <|unk|>


#### Byte Pair Encoding (BPE)
GPT-2 uses BPE as its tokenizer. This allows it to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

For example, if GPT-2's vocabulary doesn't have the word "unfamiliarword", it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges.
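
To give a feel for where such merges come from, here is a simplified character-level sketch of a single BPE training step: count adjacent symbol pairs, then merge the most frequent one. This is an illustration under stated assumptions, not GPT-2's actual implementation (which operates on bytes and applies a pre-tokenization step); the function names and the toy word list are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count every adjacent pair of symbols across all words
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace each occurrence of the pair with a single merged symbol
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

words = [list("hug"), list("pug"), list("hugs"), list("bun")]
pair = most_frequent_pair(words)        # ('u', 'g') occurs three times
merged_words = merge_pair(words, pair)
print(pair, merged_words)
```

Repeating this step many times builds up a vocabulary of progressively longer subword units.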

The original BPE tokenizer can be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py


To use the BPE tokenizer, we can use OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance.

In [None]:
# pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0


In [None]:
import tiktoken
import importlib.metadata  # the metadata submodule must be imported explicitly

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.6.0


In [None]:
tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, this is a test sentence. <|endoftext|> It's the last he painted, you know,"
token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(token_ids)

# Re-construct the input text using the token_ids
print(tokenizer.decode(token_ids))

[15496, 11, 428, 318, 257, 1332, 6827, 13, 220, 50256, 632, 338, 262, 938, 339, 13055, 11, 345, 760, 11]
Hello, this is a test sentence. <|endoftext|> It's the last he painted, you know,


The BPE tokenizer breaks unknown words down into subwords and individual characters, so no <|unk|> token is needed.

#### Data sampling with sliding window
We train the LLM to generate one token at a time, so we prepare the training data accordingly: the next token in a sequence is the target to predict.

In [None]:
from IPython.display import Image
Image(url="https://drive.google.com/file/d/1-IpY_qgU0n704QJmoQYf8cAFIpeTuvTx/view?usp=sharing")

In [None]:
with open("sample_data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

encoded_text = tokenizer.encode(raw_text)
print(len(encoded_text))

5145


- For each text chunk, we want inputs and targets.
- Since we want the model to predict the next token, the targets are the inputs shifted by one position to the right.

In [None]:
sample = encoded_text[:100]
context_length = 5

for i in range(1, context_length + 1):
    context = sample[:i]
    desired_target = sample[i]
    print(context, "->", desired_target)

[40] -> 367
[40, 367] -> 2885
[40, 367, 2885] -> 1464
[40, 367, 2885, 1464] -> 1807
[40, 367, 2885, 1464, 1807] -> 3619


In [None]:
for i in range(1, context_length + 1):
    context = sample[:i]
    desired_target = sample[i]
    print(tokenizer.decode(context), "->", tokenizer.decode([desired_target]))

I ->  H
I H -> AD
I HAD ->  always
I HAD always ->  thought
I HAD always thought ->  Jack
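
The prefix-by-prefix pairs above can be packed into a single input/target pair by slicing the same token list twice, offset by one. The sketch below reuses the first six token IDs printed above; the variable names `x` and `y` are illustrative.

```python
sample = [40, 367, 2885, 1464, 1807, 3619]  # first token IDs from the output above
context_length = 4

x = sample[:context_length]         # inputs
y = sample[1:context_length + 1]    # targets: the inputs shifted by one position

print("x:", x)  # [40, 367, 2885, 1464]
print("y:", y)  # [367, 2885, 1464, 1807]

# Position i of y is the token that follows the prefix x[:i + 1]
for i in range(context_length):
    print(x[:i + 1], "->", y[i])
```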


### Data Loading
Next we implement a simple data loader that iterates over the input dataset and returns the inputs and the targets shifted by one.

In [None]:
import torch
print("PyTorch version:", importlib.metadata.version("torch"))

PyTorch version: 2.2.1+cu121


- We use a sliding window approach; sliding the window one token at a time corresponds to stride=1.
- We create dataset and dataloader objects that extract chunks from the input text dataset.

In [None]:
from torch.utils.data import Dataset, DataLoader

class LLMDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Iterate over the tokenized text
        for i in range(0, len(token_ids) - max_length, stride):
            context = token_ids[i:i+max_length]
            desired_target = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(context))
            self.target_ids.append(torch.tensor(desired_target))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_data_loader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create the dataset
    dataset = LLMDataset(txt, tokenizer, max_length, stride)

    # Create the data loader
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
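
To make the effect of `stride` concrete, here is a small pure-Python sketch of the same windowing loop applied to ten dummy token IDs. The helper `sliding_windows` is a hypothetical stand-in that mirrors the loop in `LLMDataset` without tensors or a tokenizer.

```python
def sliding_windows(token_ids, max_length, stride):
    # Same iteration pattern as in LLMDataset: inputs and targets offset by one
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

ids = list(range(10))  # dummy token IDs 0..9

overlapping = sliding_windows(ids, max_length=4, stride=1)
disjoint = sliding_windows(ids, max_length=4, stride=4)

print(len(overlapping))  # 6 windows, each overlapping the previous one by 3 tokens
print(disjoint)          # 2 windows whose inputs do not overlap
```

With `stride` equal to `max_length`, every token appears in at most one input window, at the cost of fewer training samples.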


In [None]:
with open("sample_data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_data_loader(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iterator = iter(dataloader)
batch = next(data_iterator)
print(batch)

[tensor([[319, 616, 835, 284]]), tensor([[  616,   835,   284, 22489]])]


In [None]:
batch_2 = next(data_iterator)
print(batch_2)

[tensor([[ 11, 290,  11, 355]]), tensor([[ 290,   11,  355, 9074]])]


In [None]:
# Increase the stride to remove overlap between the windows, since more overlap could lead to increased overfitting
dataloader = create_data_loader(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[41186, 39614,  1386,    11],
        [  373,  3957,   588,   262],
        [ 1169,  2994,   284,   943],
        [ 7067, 29396, 18443, 12271],
        [ 2666,   572,  1701,   198],
        [ 3666, 13674,    11,  1201],
        [ 1109,   815,   307,   900],
        [  465,  5986,   438,  1169]])

Targets:
 tensor([[39614,  1386,    11,   287],
        [ 3957,   588,   262, 26394],
        [ 2994,   284,   943, 17034],
        [29396, 18443, 12271,   290],
        [  572,  1701,   198,   198],
        [13674,    11,  1201,   314],
        [  815,   307,   900,   866],
        [ 5986,   438,  1169,  3081]])


#### Creating token embeddings
Next we embed the tokens in a continuous vector representation using an embedding layer. Usually the embedding layers are part of the LLM itself and are updated (trained) during model training.

In [None]:
# Suppose we have the following four input tokens with IDs 5, 1, 3 and 2 after tokenization
input_ids = torch.tensor([[5, 1, 3, 2]])

For simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:

In [None]:
vocab_size = 6
embedding_size = 3

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)

# This would result in a 6x3 weight matrix
print(embedding_layer.weight)

Parameter containing:
tensor([[ 1.9269,  1.4873, -0.4974],
        [ 0.4396, -0.7581,  1.0783],
        [ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516],
        [ 0.8599, -0.3097, -0.3957]], requires_grad=True)


The embedding output for our example input tensor looks as follows:

In [None]:
embedding_layer(input_ids)

tensor([[[ 0.8599, -0.3097, -0.3957],
         [ 0.4396, -0.7581,  1.0783],
         [-0.6866,  0.6105,  1.3347],
         [ 0.8008,  1.6806,  0.3559]]], grad_fn=<EmbeddingBackward0>)
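
Conceptually, the embedding layer is a lookup table: each token ID selects one row of the weight matrix. The pure-Python sketch below reuses the weight values printed above to show this, and also checks that the lookup gives the same result as multiplying a one-hot vector by the matrix. The helpers `one_hot` and `vec_mat` are illustrative, not PyTorch functions.

```python
# Weight matrix copied from the printed embedding_layer.weight above
weight = [
    [1.9269, 1.4873, -0.4974],
    [0.4396, -0.7581, 1.0783],
    [0.8008, 1.6806, 0.3559],
    [-0.6866, 0.6105, 1.3347],
    [-0.2316, 0.0418, -0.2516],
    [0.8599, -0.3097, -0.3957],
]
input_ids = [5, 1, 3, 2]

# Embedding lookup is plain row selection
embedded = [weight[i] for i in input_ids]
print(embedded[0])  # row 5 of the weight matrix

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_mat(vec, mat):
    # Multiply a row vector by a matrix
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

via_one_hot = vec_mat(one_hot(5, len(weight)), weight)
print(via_one_hot == embedded[0])  # True
```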

#### Encoding Word Positions

- An embedding layer maps each token ID to the same vector regardless of where the token is located in the input sequence.
- Positional embeddings are therefore combined with the token embedding vectors to form the input embeddings for a large language model.
- The byte-pair encoder has a vocabulary size of 50,257.
- Suppose we want to encode the input tokens into a 256-dimensional vector representation:


In [None]:
vocab_size = 50257
embedding_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
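
The position-agnostic behaviour of a token embedding layer can be checked on a toy example. In the sketch below (sizes, seed, and variable names are assumptions for illustration), token ID 3 appears at positions 0 and 2; its two token embeddings are identical, and they only become distinguishable once a per-position vector is added.

```python
import torch

torch.manual_seed(0)
token_emb = torch.nn.Embedding(6, 3)   # toy vocabulary of 6 tokens, 3-dim vectors
pos_emb = torch.nn.Embedding(3, 3)     # one vector per position in a length-3 context

ids = torch.tensor([3, 1, 3])          # token 3 appears at positions 0 and 2
tok = token_emb(ids)
# Both occurrences of token 3 map to the same row of the weight matrix
same_without_positions = torch.equal(tok[0], tok[2])

# Adding a different positional vector to each position breaks the tie
combined = tok + pos_emb(torch.arange(3))
same_with_positions = torch.equal(combined[0], combined[2])

print(same_without_positions, same_with_positions)
```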

- If we sample data from the dataloader, we embed the tokens in each batch into 256-dimensional vectors.
- If we have a batch size of 8 with 4 tokens each, this results in an 8x4x256 tensor:

In [None]:
max_length = 4
dataloader = create_data_loader(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token Ids:\n", inputs)
print("\nInputs shape:\n", inputs.shape)
print("\nEmbedding shape:\n", token_embedding_layer(inputs).shape)

Token Ids:
 tensor([[  273,  1807,   673,   750],
        [21978, 44896,    11,   290],
        [  991,  2045,   546,   329],
        [ 7808,   607, 10927,  1108],
        [ 3226,  1781,    11,  2769],
        [   11,   644,   561,   339],
        [  326,  9074,    13,   402],
        [  373, 37895,   422,   428]])

Inputs shape:
 torch.Size([8, 4])

Embedding shape:
 torch.Size([8, 4, 256])


- GPT-2 uses absolute position embeddings, so we simply create another embedding layer:


In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

position_embeddings = pos_embedding_layer(torch.arange(context_length))
print(position_embeddings.shape)

torch.Size([4, 256])


- To create the input embeddings used in an LLM, we add the token and positional embeddings

In [None]:
input_embeddings = token_embedding_layer(inputs) + position_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
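
The addition above works because PyTorch broadcasts the `[context_length, embedding_dim]` positional tensor across the batch dimension, so every sequence in the batch receives the same positional vectors. Here is a minimal sketch with small, assumed dimensions and random tensors standing in for the embedding outputs.

```python
import torch

torch.manual_seed(0)
batch_size, context_length, embedding_dim = 2, 3, 4

# Random stand-ins for the outputs of the token and positional embedding layers
token_embeddings = torch.randn(batch_size, context_length, embedding_dim)
position_embeddings = torch.randn(context_length, embedding_dim)

# Broadcasting: [2, 3, 4] + [3, 4] -> [2, 3, 4]
input_embeddings = token_embeddings + position_embeddings

# Equivalent explicit version: repeat the positional tensor for each sequence
explicit = token_embeddings + position_embeddings.unsqueeze(0).expand(batch_size, -1, -1)

print(input_embeddings.shape)
print(torch.equal(input_embeddings, explicit))
```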


The illustration below shows the end-to-end preprocessing steps of input tokens to an LLM model.