## Ch.2 Working with Text Data

- The full notebook can be accessed via: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb

In [1]:
import re

### 1. Load Data

In [4]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### 2. Vocabulary

- A vocabulary is the set of all possible tokens that the model can recognize and process.
- Let's construct a vocabulary by creating a dictionary with the unique tokens as keys and the token IDs as values.

In [19]:
# Tokenize
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
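
To see what the regular expression does on its own, here is a minimal sketch on a short made-up sentence (the sample string is an assumption for illustration and is not taken from the-verdict.txt). The capturing group keeps punctuation and `--` as separate tokens, and the filter drops whitespace-only entries:

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# Same pattern as above: split on punctuation, double dashes, and whitespace.
# The delimiters are kept because they sit inside a capturing group.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only entries
sample_tokens = [item.strip() for item in pieces if item.strip()]
print(sample_tokens)
```

Whitespace is discarded here, which keeps the vocabulary small but means the exact spacing of the original text cannot be recovered from the tokens alone.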


In [22]:
# create a vocabulary
all_words = sorted(set(preprocessed)) # sorted list of unique words
vocab_size = len(all_words)

print("Vocab size: ", vocab_size)

vocab = {token: integer for integer,token in enumerate(all_words)}

Vocab size:  1130


In [23]:
for i,item in enumerate(vocab.items()):
    print(item)
    if i >= 5:
        break  # stop after printing the first 6 entries (i = 0..5)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)


### 3. SimpleTokenizer

In [24]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
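        # Map tokens missing from the vocabulary to "<|unk|>" (unknown token).
        # Note: this only works if "<|unk|>" is itself a key in vocab; the vocabulary
        # built above does not add it, so unknown words would raise a KeyError.
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]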

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [25]:
tokenizer = SimpleTokenizer(vocab)

In [26]:
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
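
The `encode` method falls back to `"<|unk|>"`, which only works if that token is part of the vocabulary. The following is a minimal sketch of one way to handle this, assuming a toy corpus and standalone helper functions (`tokenize`, `encode`, `decode`, and `toy_vocab` are hypothetical names, not part of the notebook). The special tokens are appended to the token list before the mapping is built:

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    """Split text with the same regex as SimpleTokenizer and drop blanks."""
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]

# Hypothetical toy corpus, used only for illustration
corpus = "the painter said, you know the picture."

# Build the vocabulary and append the special tokens at the end
tokens = sorted(set(tokenize(corpus)))
tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    """Map tokens to IDs, using the <|unk|> ID for out-of-vocabulary tokens."""
    return [vocab.get(tok, vocab["<|unk|>"]) for tok in tokenize(text)]

def decode(ids, vocab):
    """Map IDs back to text and remove spaces before punctuation."""
    inv = {i: t for t, i in vocab.items()}
    text = " ".join(inv[i] for i in ids)
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

ids = encode("the painter said hello.", toy_vocab)
roundtrip = decode(ids, toy_vocab)
print(ids)
print(roundtrip)
```

The word "hello" is not in the toy corpus, so it is encoded as the `<|unk|>` ID and the round trip cannot recover it. This limitation is what motivates subword tokenizers.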


### 4. BytePair Encoding

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- In this chapter, we are using the BPE tokenizer from OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance
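
As a rough illustration of the merging idea behind BPE, the toy sketch below repeatedly merges the most frequent adjacent pair of symbols. It is a simplification under stated assumptions: it works on the characters of three made-up words rather than on the bytes of a large corpus, and it does not reflect how tiktoken is implemented. `most_frequent_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Hypothetical toy "corpus": start from single characters
words = ["low", "lower", "lowest"]
sequences = [list(w) for w in words]

# Perform two merge steps
for _ in range(2):
    pair = most_frequent_pair(sequences)
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)
```

After two merges, the shared prefix has become a single symbol while the rarer suffixes remain split into characters. This is how a word outside the vocabulary can still be represented as a sequence of known subword units.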

In [2]:
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.7.0


In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
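# Note: the two adjacent string literals are joined without a space,
# so the encoded text contains "terracesof" (see the decoded output)
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)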

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [7]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


### 5. Data sampling with a sliding window

We train LLMs to generate one token at a time, so we prepare the training data such that the next token in a sequence is the target to predict.

In [9]:
# let's encode the previous text

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145
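
The next-token prediction setup can be sketched with a handful of token IDs. The IDs below are the first six GPT-2 token IDs of the text as they appear in the dataloader outputs later in this section; `context_size` and `pairs` are names chosen for this illustration.

```python
# First six token IDs of the text (taken from the dataloader outputs in this section)
token_ids = [40, 367, 2885, 1464, 1807, 3619]
context_size = 4

# Inputs and targets are the same sequence, shifted by one position
x = token_ids[:context_size]
y = token_ids[1:context_size + 1]
print("x:", x)
print("y:", y)

# Each prefix of x is a context, and the token that follows it is the target
pairs = [(token_ids[:i], token_ids[i]) for i in range(1, context_size + 1)]
for context, target in pairs:
    print(context, "---->", target)
```

A single input window of length 4 therefore yields four prediction tasks, one per position.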


Let's implement a simple data loader that iterates over the input dataset and returns the inputs together with the targets, which are the inputs shifted by one position.


In [10]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.7.0+cpu


In [11]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [12]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.
- Context size (also known as context window or context length) is the maximum number of tokens the model can consider at once when generating or analyzing text.
- The stride setting dictates the number of positions the input window shifts from one sample to the next, emulating a sliding window approach. Try changing the stride from 1 to 2.
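
The effect of the stride can be sketched without PyTorch. The helper below uses the same slicing as `GPTDatasetV1` but works on plain lists; `sliding_windows` and `toy_ids` are hypothetical names used only for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs, sliced the same way as in GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

toy_ids = list(range(10))  # stand-in token IDs 0..9

for stride in (1, 2, 4):
    windows = sliding_windows(toy_ids, max_length=4, stride=stride)
    print(f"stride={stride}: {len(windows)} samples, inputs start at",
          [inp[0] for inp, _ in windows])
```

With a stride of 1, each input shares `max_length - 1` tokens with the next one. With a stride equal to `max_length`, consecutive inputs do not overlap, and the number of samples shrinks accordingly.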

In [34]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [35]:
dataloader = create_dataloader_v1(txt=raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)

In [38]:
first_batch = next(data_iter)
inputs, targets = first_batch
print(first_batch)

print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Inputs:
 tensor([[  40,  367, 2885, 1464]])

Targets:
 tensor([[ 367, 2885, 1464, 1807]])


In [39]:
second_batch = next(data_iter)
inputs, targets = second_batch

print(second_batch)

print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
Inputs:
 tensor([[ 367, 2885, 1464, 1807]])

Targets:
 tensor([[2885, 1464, 1807, 3619]])


- We can also create batched outputs
- Note that we increase the stride here so that we don't have overlaps between the batches, since more overlap could lead to increased overfitting

In [41]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)

new_batch = next(data_iter)

print(new_batch)

inputs, targets = new_batch
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

[tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]), tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])]
Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
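 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])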

### 6. Creating Token Embeddings

- Lastly, let us embed the tokens in a continuous vector representation using an embedding layer
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training

In [43]:
# Suppose we have the following four input examples with input ids 2, 3, 5, and 1 (after tokenization):

input_ids = torch.tensor([2, 3, 5, 1])

# For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [44]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [45]:
# To convert a token with id 3 into a 3-dimensional vector, we do the following:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- Note that the above is the 4th row (index 3, since indexing starts at 0) of the embedding_layer weight matrix
- To embed all four input_ids values above, we do:

In [46]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
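
An embedding layer is essentially a lookup table: each token ID selects one row of the weight matrix. The sketch below reproduces this with plain Python lists standing in for tensors, using the weight values printed above. It also shows that the lookup gives the same result as multiplying a one-hot vector by the weight matrix. `one_hot` and `vec_mat` are hypothetical helpers written for this illustration, not PyTorch functions.

```python
# Weight matrix copied from the printed embedding_layer.weight above
weight = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# Embedding lookup: pick row i of the weight matrix for each token ID
looked_up = [weight[i] for i in input_ids]

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix (plain Python, no libraries)."""
    return [sum(v * row[c] for v, row in zip(vec, mat))
            for c in range(len(mat[0]))]

# Same result via one-hot encoding followed by a matrix multiplication
via_one_hot = [vec_mat(one_hot(i, len(weight)), weight) for i in input_ids]

for row in looked_up:
    print(row)
```

The lookup avoids building the one-hot vectors and multiplying by mostly zeros, which is why it is the more efficient formulation.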


### 7. Encoding word positions

- In the previous section, we converted the token IDs into a continuous vector representation, the so-called token embeddings. In principle, this is a suitable input for an LLM.
- However, a minor shortcoming of LLMs is that their self-attention mechanism, which will be covered in detail in chapter 3, doesn't have a notion of position or order for the tokens within a sequence.
- The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence.
- In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM.

In [48]:
# The BytePair encoder has a vocabulary size of 50,257:
# Suppose we want to encode the input tokens into a 256-dimensional vector representation:

vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector
- If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:

In [49]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [50]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [52]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


- GPT-2 uses __absolute__ position embeddings, so we just create another embedding layer:


In [54]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# uncomment & execute the following line to see what the embedding layer weights look like
# print(pos_embedding_layer.weight)

In [56]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

# uncomment & execute the following line to see what the embeddings look like
# print(pos_embeddings)

torch.Size([4, 256])


- To create the input embeddings used in an LLM, we simply __add__ the token and the positional embeddings:


In [57]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

# uncomment & execute the following line to see what the embeddings look like
# print(input_embeddings)

torch.Size([8, 4, 256])
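
The addition above works because PyTorch broadcasts the position embeddings of shape [4, 256] across the batch dimension of the token embeddings of shape [8, 4, 256]. The sketch below spells out that computation with nested Python lists and toy sizes. All values and names (`token_table`, `pos_table`, `batch`) are made up for illustration.

```python
batch_size, seq_len, dim = 2, 3, 2  # toy sizes (the notebook uses 8, 4, 256)

# Hypothetical token embedding table for a 4-token vocabulary
token_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Hypothetical position embedding table: one vector per position
pos_table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Two sequences; token ID 1 appears at positions 0 and 2 of the first one
batch = [[1, 0, 1], [2, 3, 0]]

# Token embeddings, indexed as [batch][position][dimension]
token_emb = [[token_table[tok] for tok in seq] for seq in batch]

# Broadcasting spelled out: pos_table[t] is added at position t in every sequence
input_emb = [
    [[token_emb[b][t][d] + pos_table[t][d] for d in range(dim)]
     for t in range(seq_len)]
    for b in range(batch_size)
]

print("same token, token embeddings equal:", token_emb[0][0] == token_emb[0][2])
print("same token, input embeddings equal:", input_emb[0][0] == input_emb[0][2])
```

The two occurrences of token ID 1 have identical token embeddings but different input embeddings once the position vectors are added. This is the position information the self-attention mechanism would otherwise lack.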
