### 1. How do you prepare input text for training LLMs?

- Step 1: Split text into individual word and subword tokens
- Step 2: Convert tokens into token IDs
- Step 3: Encode token IDs into vector representations

In [None]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [2]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)    # splits on each whitespace character and keeps it as a separate item

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [3]:
# We want the punctuation as separate tokens too.
result = re.split(r'([,.]|\s)', text)  # splits on commas, periods, and whitespace
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [4]:
# Let's remove the whitespace-only and empty strings
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Whether to remove whitespace is a design choice. Removing it reduces the memory and computing requirements, but if the task you want to perform is sensitive to whitespace (for example, Python code generation), you may want to keep it. For now, we remove whitespace for simplicity; later, we'll switch to a tokenization scheme that includes it.
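
As a minimal sketch of this trade-off, the same split can either keep or drop the whitespace tokens. The snippet `code_text` below is a made-up example and not part of the dataset:

```python
import re

# Hypothetical snippet in which the newline and indentation carry meaning.
code_text = "def f():\n    return 1"

parts = re.split(r'(\s)', code_text)

# Keep whitespace tokens: drop only the empty strings produced by re.split.
with_ws = [p for p in parts if p != ""]

# Drop whitespace tokens: the newline and the indentation are lost.
without_ws = [p for p in parts if p.strip()]

print(with_ws)
print(without_ws)
print("".join(with_ws) == code_text)  # the original text can be rebuilt exactly
```

With the whitespace tokens kept, joining the tokens restores the original text; without them, the structure of the code can no longer be recovered.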

In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Since we've got a basic tokenizer, let's now apply this to the entire dataset.

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [7]:
print(len(preprocessed))

4690


### Converting Tokens into Token IDs
We need to convert each token into a token ID so that we can represent them numerically. First, we need to build a vocabulary for our dataset. The tokens are sorted and then each unique token in the vocabulary is mapped to a unique number.

In [8]:
all_words = sorted(set(preprocessed))  # remove duplicates and sort
vocab_size = len(all_words)
print("Vocabulary size:", vocab_size)

Vocabulary size: 1130


In [9]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [10]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


Later, when we want to convert token IDs back into text, we need a decoder: the inverse of the mapping above, which maps each token ID back to its token in the vocabulary.

In [11]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}
        
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  # remove space before punctuation
        return text

In [12]:
tokenizer = SimpleTokenizerV1(vocab)
encoded = tokenizer.encode(raw_text[:99])
decoded = tokenizer.decode(encoded)
print(raw_text[:99])
print("Encoded:", encoded)
print("Decoded:", decoded)

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
Encoded: [53, 44, 149, 1003, 57, 38, 818, 115, 256, 486, 6, 1002, 115, 500, 435, 392, 6, 908, 585, 1077, 709]
Decoded: I HAD always thought Jack Gisburn rather a cheap genius -- though a good fellow enough -- so it was no


### Adding special context tokens

In this section, we modify the tokenizer to handle unknown words, i.e., words that are not present in our vocabulary.

We add <|unk|> and <|endoftext|> tokens to the vocabulary.
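
To see why the fallback is needed, here is a small sketch with a toy vocabulary. The tokens and IDs in `toy_vocab` are invented for illustration and are not the ones built from the dataset:

```python
# Toy vocabulary; the tokens and IDs are made up for illustration.
toy_vocab = {"the": 0, "tea": 1, "<|unk|>": 2}

tokens = ["the", "palace"]  # "palace" is not in toy_vocab

# A plain dictionary lookup fails on the unknown word.
try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    print("KeyError:", err)

# Falling back to the <|unk|> ID avoids the error.
ids = [toy_vocab.get(t, toy_vocab["<|unk|>"]) for t in tokens]
print(ids)  # [0, 2]
```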

In [13]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token: integer for integer, token in enumerate(all_tokens)}

In [14]:
print(len(vocab.items()))

1132


In [15]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [16]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}
        
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] if s in self.str_to_int else self.str_to_int["<|unk|>"] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text        

In [17]:
tokenizer = SimpleTokenizerV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [18]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [19]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

### Some commonly used Special Context Tokens
- [BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.
- [EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts.
- [PAD] (padding token): When training LLMs with a batch size of more than one, the batch might contain sequences of unequal lengths. The shorter texts are extended or "padded" up to the length of the longest sequence. This token carries no content; it is only there to keep the sequence lengths equal and is typically ignored (masked out) by the LLM.
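
A minimal sketch of padding, assuming a made-up padding ID of 0 and hypothetical token ID sequences. The mask is one common way to mark which positions should be ignored; it is an assumption here, not something taken from the text above:

```python
# Hypothetical token ID sequences of unequal length and a made-up padding ID.
batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
pad_id = 0

max_len = max(len(seq) for seq in batch)

# Right-pad each sequence to the length of the longest one.
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# 1 marks a real token, 0 marks padding that should be ignored downstream.
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)
print(mask)
```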

The tokenizer used by GPT models does not need any of these tokens. It only needs the <|endoftext|> token.

The tokenizer used for GPT models doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte-pair encoding (BPE) tokenizer, which breaks words down into subword units.

### Byte Pair Encoding Tokenizer

BPE is a sub-word tokenization algorithm.

There are 3 kinds of tokenizers:
1. Word Tokenizer: Each word is a token. This is simple, but the vocabulary becomes large, and it is unclear what to do with out-of-vocabulary (OOV) words. Another limitation is that the same word can be used in different contexts. For example, "bank" can mean a financial institution or the side of a river, and a single token per word does not distinguish these cases.
2. Subword Tokenizer: Words are broken down into smaller subword units. Subword splitting helps the model learn that words sharing the root "token", such as "tokens" and "tokenizing", are similar in meaning. It also helps the model learn that "tokenization" and "modernization" have different roots but share the suffix "ization" and are used in similar syntactic contexts. It follows two rules:
    - Do not split frequently used words into smaller subwords.
    - Split the rare words into smaller, meaningful subwords.
    - For example, "boy" should not be split but "boys" should be split into "boy" and "s"
3. Character Tokenizer: Each character is a token. The vocabulary is very small (for example, at most 256 entries if we tokenize at the byte level), which solves the OOV problem. The drawback is that the tokenized sequence contains far more tokens than a word-level tokenization of the same text. Another problem is that the meaning carried by whole words is lost. For example, "bat" can be a flying mammal or a piece of sports equipment, and the individual characters "b", "a", and "t" do not capture this context.
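
To make the trade-off between word and character tokenizers concrete, here is a small sketch on a made-up sentence. Splitting on whitespace stands in for a real word tokenizer:

```python
# Made-up example sentence.
sentence = "the tokenizer tokenizes tokens"

word_tokens = sentence.split()   # naive word-level tokens
char_tokens = list(sentence)     # character-level tokens (spaces included)

print("word tokens:", len(word_tokens), "| word vocabulary:", len(set(word_tokens)))
print("char tokens:", len(char_tokens), "| char vocabulary:", len(set(char_tokens)))
```

The character-level sequence is several times longer, while its vocabulary stays small; the word-level vocabulary grows with every new word form, even though three of the four words share the root "token".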

### The BPE Algorithm

BPE Algorithm (1994): The most common pair of consecutive bytes in the data is replaced with a byte that does not occur in the data, and this step is repeated. This Wikipedia article has a good explanation of the algorithm: https://en.wikipedia.org/wiki/Byte_pair_encoding

Example: Let's assume that our original data is aaabdaaabac

1. Count the frequency of each pair of consecutive bytes (overlapping pairs included):
   - aa: 4
   - ab: 2
   - bd: 1
   - da: 1
   - ba: 1
   - ac: 1
2. Find the most frequent pair, which is "aa" in this case.
3. Replace the most frequent pair with a new byte, say "Z", since it does not occur in our data. The new data becomes ZabdZabac.
4. The next most common byte pair is "ab" with frequency 2 (tied with "Za"; we pick "ab"). Replace "ab" with "Y". The new data becomes ZYdZYac.
5. The next most common byte pair is "ZY" with frequency 2. Replace "ZY" with "W". The new data becomes WdWac.

Since no pair of bytes occurs more than once anymore, we stop here.
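
The walkthrough above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `compress_bpe` is hypothetical, and the tie-break (preferring the lexicographically larger pair) is an arbitrary choice made only so that the steps match the example.

```python
from collections import Counter

def compress_bpe(data, new_symbols):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    merges = []
    for symbol in new_symbols:
        # Count every adjacent (overlapping) pair of characters.
        counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not counts:
            break
        # Highest count wins; ties go to the lexicographically larger pair
        # (an arbitrary choice that reproduces the walkthrough).
        pair = max(counts, key=lambda p: (counts[p], p))
        if counts[pair] < 2:
            break  # no pair repeats, so merging gains nothing
        data = data.replace(pair, symbol)
        merges.append((pair, symbol))
    return data, merges

compressed, merges = compress_bpe("aaabdaaabac", "ZYW")
print(compressed)  # WdWac
print(merges)      # [('aa', 'Z'), ('ab', 'Y'), ('ZY', 'W')]
```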

### How is BPE used in LLMs?
BPE ensures that the most common words in the vocabulary are represented as single tokens, while less common words or rarer words are broken down into subword units. This allows the model to handle a wide range of vocabulary without needing to store every possible word in the vocabulary. 

Practical example:
* Let's consider the below dataset of words -
```
{"old":7, "older":3, "finest":9, "lowest":4}
```
* **Preprocessing**: We add an end-of-word token "</w>" at the end of each word so that word boundaries are preserved. Some BPE tokenizers used for language models mark word boundaries in a similar way. The data now becomes -
```
{"old</w>":7, "older</w>":3, "finest</w>":9, "lowest</w>":4}
```
* Let us now split words into characters and count their frequency in the table below:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 23        |
| o         | 14        |
| l         | 14        |
| d         | 10        |
| e         | 16        |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| s         | 13        |
| t         | 13        |
| w         | 4         |

* Now we find the most frequent pair of adjacent tokens, **merge it** into a single token, and repeat this step until we reach the token limit or the iteration limit.

* The most common pair in our dataset is ("e", "s") with frequency 13 (9 times in finest and 4 times in lowest). So we treat "es" as a single token. The new frequency table becomes:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 23        |
| o         | 14        |
| l         | 14        |
| d         | 10        |
| e         | 16-13 = 3 |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| s         | 13-13 = 0 |
| t         | 13        |
| w         | 4         |
| es        | 9+4 = 13  |

Now in the next iteration, the most common pair is ("es", "t") with frequency 13 (9 times in finest and 4 times in lowest). So we treat "est" as a single token. The new frequency table becomes:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 23        |
| o         | 14        |
| l         | 14        |
| d         | 10        |
| e         | 3         |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| t         | 13-13 = 0 |
| w         | 4         |
| es        | 13-13 = 0 |
| est       | 9+4 = 13  |

Again, the most common pair is ("est", "</w>") with frequency 13. So the table becomes:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 23-13 = 10|
| o         | 14        |
| l         | 14        |
| d         | 10        |
| e         | 3         |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| w         | 4         |
| est       | 13-13 = 0 |
| est\</w\> | 13        |

If we didn't merge "est" and "</w>", the tokenizer could not distinguish the "est" at the start of "estimate" from the "est" at the end of "highest". In our dataset, "est" always ends a word, and the merged token "est</w>" encodes the information that it is an ending sequence.

On running the 4th iteration, we find out that ("o", "l") is the most common pair with frequency 10. So we treat "ol" as a single token. The new frequency table becomes:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 10        |
| o         | 14-10 = 4 |
| l         | 14-10 = 4 |
| d         | 10        |
| e         | 3         |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| w         | 4         |
| est\</w\> | 13        |
| ol        | 7+3 = 10  |

On the 5th iteration, we find out ("ol", "d") has frequency 10. The new table now becomes:

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 10        |
| o         | 4         |
| l         | 4         |
| d         | 10-10 = 0 |
| e         | 3         |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| w         | 4         |
| est\</w\> | 13        |
| ol        | 10-10 = 0 |
| old       | 7+3 = 10  |

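We can also observe that the pairs ("f", "i") and ("i", "n") each occur 9 times, but only within a single word ("finest"), so we stop merging here to keep the walkthrough short; a full BPE run would keep merging the most frequent pair. We also remove the tokens that have frequency 0. Finally, we have the following tokens: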

|   Token   | Frequency |
|-----------|-----------|
| \</w\>    | 10        |
| o         | 4         |
| l         | 4         |
| e         | 3         |
| r         | 3         |
| f         | 9         |
| i         | 9         |
| n         | 9         |
| w         | 4         |
| est\</w\> | 13        |
| old       | 10        |

The tokens above are the final list that serves as our vocabulary. In general, the BPE algorithm keeps merging tokens until it reaches the desired vocabulary size, completes a set number of iterations, or no more merges are possible.
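
The merge steps from this example can be reproduced with a short sketch. The function name `learn_bpe` is hypothetical, and ties between equally frequent pairs go to the pair seen first, which is an assumption that happens to match the order used in the tables above; real implementations may break ties differently.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary (toy sketch)."""
    # Split every word into characters and append the end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # max() returns the first maximal pair, so ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges, corpus

word_freqs = {"old": 7, "older": 3, "finest": 9, "lowest": 4}
merges, corpus = learn_bpe(word_freqs, num_merges=5)
print(merges)

# Token frequencies after the five merges.
token_counts = Counter()
for symbols, freq in corpus.items():
    for symbol in symbols:
        token_counts[symbol] += freq
print(dict(token_counts))
```

Running it with `num_merges=5` yields the same five merges as the walkthrough, and the resulting token frequencies match the final table.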

Since implementing BPE from scratch is a bit complex, we will use the `tiktoken` library, which is a fast BPE tokenizer used by OpenAI's GPT models. It's very efficient since its source code is written in Rust and it has Python bindings.

In [20]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [21]:
tokenizer = tiktoken.get_encoding("gpt2")

In [22]:
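# Note: adjacent string literals are joined without a separator,
# so the text below contains "terracesof" (no space).
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)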

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print("Encoded integers:", integers)

Encoded integers: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [23]:
tokenizer.decode([617, 34680, 27271, 13])

' someunknownPlace.'

In [24]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


To handle words that are not in the vocabulary, the BPE tokenizer breaks them down into familiar subwords and characters that are present in the vocabulary. This is how a subword tokenizer like BPE solves the OOV problem.

In [25]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


### Creating Input-Target Pairs

LLMs are trained to predict the next token in a sequence given the previous tokens. To train an LLM, we need to create input-target pairs from our tokenized text. For example:

**Our sentence**: LLMs learn to predict one word at a time

**Input-Target Pairs**: (LLMs, learn) -> (LLMs learn, to) -> (LLMs learn to, predict) -> (LLMs learn to predict, one) ... (LLMs learn to predict one word at a, time)
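
As a quick sketch, the pairs for this sentence can be generated at the word level. Splitting on whitespace is a simplification used here; the actual pipeline works on token IDs.

```python
sentence = "LLMs learn to predict one word at a time"
words = sentence.split()

# Each pair is (all words so far, the next word to predict).
pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in pairs:
    print(" ".join(context), "->", target)
```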

In [None]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [27]:
enc_sample = enc_text[50:]  # Removing first 50 tokens from dataset to make results more interesting.

To create input-output pairs, we use the sliding window technique. We take a fixed-size window of tokens as input and the next token as output. The size of the window is called the context length. So, if our context length is 4 -

- X (input): [1, 2, 3, 4]
- Y (output): [2, 3, 4, 5]

If X is 1, Y is 2. If X is [1, 2], Y is 3. If X is [1, 2, 3], Y is 4. If X is [1, 2, 3, 4], Y is 5.

In [28]:
context_size = 4

X = enc_sample[:context_size]
Y = enc_sample[1:context_size + 1]

print(f"X:  {X}")
print(f"Y:       {Y}")

X:  [290, 4920, 2241, 287]
Y:       [4920, 2241, 287, 257]


In [29]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "->", desired)

[290] -> 4920
[290, 4920] -> 2241
[290, 4920, 2241] -> 287
[290, 4920, 2241, 287] -> 257


In [30]:
# So if we convert this to text, we can see the inputs and targets of LLMs.
for i in range(1, context_size+1):
    context = tokenizer.decode(enc_sample[:i])
    desired = tokenizer.decode([enc_sample[i]])
    print(context, "->", desired)

 and ->  established
 and established ->  himself
 and established himself ->  in
 and established himself in ->  a


### Creating Input-Target Pairs with PyTorch DataLoaders

We just need to perform this operation for the entire dataset to create input-target pairs. To do this, we use a PyTorch DataLoader, which helps us create batches of data for training. If our dataset has the sample text:

"In the heart of the city stood the old library, a relic from a bygone era. Its stone walls bore the marks of time, and ivy clung to its facade..."

Our input and output tensors look like -
```
x = ([["In", "the", "heart", "of"],
      ["the", "city", "stood", "the"],
      ["old", "library,", "a", "relic"],
      [...]])

y = ([["the", "heart", "of", "the"],
      ["city", "stood", "the", "old"],
      ["library,", "a", "relic", "from"],
      [...]])
```
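
The rows above can be reproduced with a small word-level sketch. Splitting on whitespace stands in for a real tokenizer, only the first sentence of the sample text is used, and `max_length = 4` with `stride = 4` are the values implied by the example rows:

```python
text = ("In the heart of the city stood the old library, "
        "a relic from a bygone era.")
tokens = text.split()  # whitespace split stands in for a real tokenizer

max_length = 4  # tokens per row
stride = 4      # step between the starts of consecutive rows

x, y = [], []
for i in range(0, len(tokens) - max_length, stride):
    x.append(tokens[i : i + max_length])          # input row
    y.append(tokens[i + 1 : i + max_length + 1])  # same row shifted by one token

for inp, tgt in zip(x, y):
    print(inp, "->", tgt)
```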


- Step 1: Tokenize the entire text
- Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length
- Step 3: Return the total number of rows in the dataset
- Step 4: Return a single row from the dataset

In [39]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        
        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
        
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
        

Now we are going to use PyTorch's DataLoader to create batches of data for training.
- Step 1: Initialize the tokenizer
- Step 2: Create the dataset
- Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training
- Step 4: num_workers sets the number of CPU processes to use for preprocessing
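
To make the effect of drop_last concrete, here is a small sketch of how the number of batches per epoch changes. The helper `num_batches` and the sample counts are illustrative and not part of PyTorch:

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches per epoch for a dataset with num_samples rows."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # An incomplete final batch is kept as a shorter batch.
    return math.ceil(num_samples / batch_size)

print(num_batches(10, 4, drop_last=True))   # 2
print(num_batches(10, 4, drop_last=False))  # 3
```

With 10 samples and a batch size of 4, the last batch would hold only 2 samples; drop_last=True discards it.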

In [40]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    # Create the dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    
    return dataloader

Let's test the dataloader with a batch_size of 1 for an LLM with a context size of 4.

This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together.

In [None]:
with open('data/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [42]:
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("First Batch:\n", first_batch)

PyTorch version: 2.7.0
First Batch:
 [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


Since max_length is set to 4, each of the two tensors contains 4 token IDs. Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256 tokens.

The meaning of stride - When building training samples, stride is the step size by which the sliding window moves across the token sequence. A larger stride results in fewer overlapping tokens between consecutive samples, while a smaller stride increases the overlap and produces more samples from the same text. More overlap means the model sees the same tokens more often, which can increase overfitting, while less overlap yields fewer samples. The choice of stride depends on the specific task and the desired balance between the amount of training data and generalization.
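
A small sketch of how the stride changes the number of input chunks and their overlap. The helper `sliding_windows` is hypothetical, uses the same loop bounds as GPTDatasetV1, and runs on placeholder IDs rather than real tokens:

```python
def sliding_windows(token_ids, max_length, stride):
    """Input chunks produced by a sliding window (same loop bounds as GPTDatasetV1)."""
    return [token_ids[i : i + max_length]
            for i in range(0, len(token_ids) - max_length, stride)]

token_ids = list(range(10))  # placeholder IDs 0..9
max_length = 4

for stride in (1, 2, 4):
    windows = sliding_windows(token_ids, max_length, stride)
    overlap = max(max_length - stride, 0)  # tokens shared by neighbouring chunks
    print(f"stride={stride}: {len(windows)} chunks, overlap={overlap}")
    print("  ", windows)
```

With a stride equal to max_length, neighbouring chunks share no tokens; with a stride of 1, each chunk shares all but one token with the next.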

To illustrate the meaning of stride=1, let's fetch another batch from this dataset:

In [43]:
second_batch = next(data_iter)
print("Second Batch:", second_batch)

Second Batch: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


Small batch sizes require less memory but lead to noisier model updates. Large batch sizes require more memory but give more stable updates. The choice of batch size depends on the available memory and the specific task. It's a hyperparameter that we should adjust carefully.

In [44]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("Targets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


Here, we increase the stride to 4. Since max_length is also 4, this utilizes the dataset fully (we don't skip a single word) and also avoids any overlap between the input sequences, since more overlap may lead to increased overfitting.

### What are Token Embeddings?

We've already converted our text into token IDs, and we maintain a vocabulary that maps each unique token to its ID. These IDs could be the input to our LLM, so why do we need to create token embeddings? The reason is that we cannot just use arbitrarily assigned numbers. The problem with using such numbers:

```
"cat" -> 34
"book" -> 2.9
"tablet" -> -20
"kitten" -> -13
```

"cat" and "kitten" are semantically related but the associated numbers 34 and -13 cannot capture this relation.