# Working with text data

Here we will look into:
- Preparing text for LLM training.
- Splitting text into word and subword tokens.
- Byte pair encoding.
- Sampling training examples using a sliding window approach.
- Converting tokens into vectors.

For learning purposes, we will work with the text of a short story by Edith Wharton called "The Verdict."

# Load data

In [1]:
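import os
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")

# Make sure the target directory exists before downloading.
os.makedirs('./data', exist_ok=True)
urllib.request.urlretrieve(url, './data/the-verdict.txt')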

('./data/the-verdict.txt', <http.client.HTTPMessage at 0x1052cc220>)

In [2]:
# Read the file.

with open('./data/the-verdict.txt', 'r', encoding='utf-8') as f_in:
    raw_text = f_in.read()

print(f'Num characters: {len(raw_text):,}')
print(f'Num words: {len(raw_text.split()):,}')

Num characters: 20,479
Num words: 3,634


# Tokenizing Text

We cannot just feed raw words as input to the Transformer model; we need to first tokenize the text. Tokens are then converted to embeddings, which can be passed as input to the Transformer model.

More specifically: `input text --> tokenized text --> token IDs --> token embeddings`

Some key notes:
- It's better not to modify the capitalization of text because it helps the LLM understand the differences between different kinds of nouns, understand sentence structure, and generate text with proper capitalization.
- Simply splitting the text by word is not enough; we also need to separate out punctuation.
- Whether or not to remove whitespace characters is an important decision. You probably don't want to do it in cases where the structure of the input matters, such as in Python code.
- Tokenizers should be designed to handle special tokens. The essential ones to consider are:
    - End of text token
    - Unknown token
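
As a small illustration of why splitting on whitespace alone is not enough, the sketch below compares a plain `split()` with a regular-expression split that also separates punctuation. The sample sentence and variable names are made up for this example.

```python
import re

# Made-up sample sentence (an assumption for illustration only).
sample = "Hello, world. Is this-- a test?"

# Naive whitespace split: punctuation stays glued to the words.
naive_tokens = sample.split()

# Split on punctuation, double dashes, and whitespace. The capture group
# makes re.split return the delimiters as separate items.
pattern = r'([,.:;?_!"()\']|--|\s)'
pieces = re.split(pattern, sample)

# Drop empty strings and whitespace-only items.
tokens = [p for p in pieces if p.strip()]

print(naive_tokens)
print(tokens)
```

With the naive split, `Hello,` and `test?` remain single tokens; with the regular expression, the comma, period, double dash, and question mark each become their own token.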

In [3]:
import re

class SimpleTokenizer:
    def __init__(self, vocab: dict = None):
        self.str_to_int = None
        self.int_to_str = None
        
        if vocab:
            self.str_to_int = vocab
            self.int_to_str = {i:s for s,i in vocab.items()}

    @staticmethod
    def add_special_tokens(tokens: list[str]) -> list[str]:
        """Append the special tokens to the end of the sorted token list."""
        special_tokens = ['<|endoftext|>', '<|unk|>']
        tokens.extend(special_tokens)

        return tokens
        
    def build_vocab(self, text: str) -> None:
        exp = r'([,.:;?_!"()\']|--|\s)'
        res = re.split(exp, text)
        res = [x for x in res if x.strip()]
        tokens = sorted(set(res))
        tokens = self.add_special_tokens(tokens)
        vocab = {token: i for i, token in enumerate(tokens)}
        
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
        
    def encode(self, text: str) -> list[int]:
        """Convert text to token ids."""
        exp = r'([,.:;?_!"()\']|--|\s)'
        res = re.split(exp, text)
        res = [x for x in res if x.strip()]
        res = [token if token in self.str_to_int else '<|unk|>' for token in res]

        ids = [self.str_to_int[i] for i in res]
        
        return ids
    
    def decode(self, ids: list[int]) -> str:
        """Convert token ids to text."""
        tokens = [self.int_to_str[i] for i in ids]
        text = ' '.join(tokens)
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        
        return text
        
        

In [4]:
text_1 = "Hello, my name is Mukul!"
text_2 = "The Porsche 911 is a great car."
text_3 = "The quick brown fox jumps over the lazy dog."

all_text = " <|endoftext|> ".join([text_1, text_2, text_3])

tokenizer = SimpleTokenizer()
tokenizer.build_vocab(raw_text)

print(tokenizer.encode(all_text))
print(tokenizer.decode(tokenizer.encode(all_text)))

[1131, 5, 697, 1131, 584, 1131, 0, 1130, 93, 1131, 1131, 584, 115, 508, 1131, 7, 1130, 93, 1131, 235, 1131, 1131, 741, 988, 1131, 1131, 7]
<|unk|>, my <|unk|> is <|unk|>! <|endoftext|> The <|unk|> <|unk|> is a great <|unk|>. <|endoftext|> The <|unk|> brown <|unk|> <|unk|> over the <|unk|> <|unk|>.


# Byte Pair Encoding (BPE)

BPE is the tokenization scheme used by GPT-2 and GPT-3.

Key notes:
- BPE can handle out-of-vocabulary words by breaking them into subword tokens that it does know.
- The GPT-2 BPE tokenizer has a vocabulary size of 50,257.
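
To make the subword idea concrete, here is a toy sketch of the core BPE training loop: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. This is a simplified illustration, not the byte-level implementation used by GPT-2 or `tiktoken`. The tiny corpus, the number of merges, and the helper names `most_frequent_pair` and `merge_pair` are assumptions made for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Toy corpus (hypothetical): each word starts as a list of characters.
corpus = ["low", "lot", "lower", "slow"]
words = [list(w) for w in corpus]

merges = []
for _ in range(2):  # the number of merges is an arbitrary choice here
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = [merge_pair(w, pair) for w in words]

# Encode an unseen word by replaying the learned merges in order.
unseen = list("lowly")
for pair in merges:
    unseen = merge_pair(unseen, pair)

print(merges)
print(unseen)
```

With this corpus, the unseen word `lowly` is represented as the learned piece `low` plus the leftover characters `l` and `y`, rather than as a single unknown token.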

In [20]:
%pip install tiktoken

Note: you may need to restart the kernel to use updated packages.


In [7]:
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.7.0


In [6]:
text_1 = "Hello, my name is Mukul!"
text_2 = "The Porsche 911 is a great car."
text_3 = "The quick brown fox jumps over the lazy dog."

all_text = " <|endoftext|> ".join([text_1, text_2, text_3])

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode(all_text, allowed_special={"<|endoftext|>"})
print(ids)

decoded_text = tokenizer.decode(ids)
print(decoded_text)

[15496, 11, 616, 1438, 318, 31509, 377, 0, 220, 50256, 383, 28367, 16679, 318, 257, 1049, 1097, 13, 220, 50256, 383, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]
Hello, my name is Mukul! <|endoftext|> The Porsche 911 is a great car. <|endoftext|> The quick brown fox jumps over the lazy dog.


In [7]:
# Example of using BPE on an unknown word.

unknown_word = "supercalifragilisticexpialidocious"

ids = tokenizer.encode(unknown_word)
print(ids)

for token_id in ids:  # avoid shadowing the built-in `id`
    print(tokenizer.decode([token_id]))

[16668, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]
super
cal
if
rag
il
ist
ice
xp
ial
id
ocious


# Data Sampling

LLMs learn by predicting the next word in a sequence. So, how do we convert our text data into training data? To do this, we need to slide over our data and generate {input, target} pairs, where the input is a chunk of text (tokens) and the target is the next word (token).
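
The sliding-window idea can be sketched with a plain list of integers standing in for token IDs. The ID values, `max_length`, and `stride` below are arbitrary choices for illustration; real IDs would come from a tokenizer.

```python
# Stand-in token IDs (hypothetical values).
token_ids = [10, 20, 30, 40, 50, 60, 70]

max_length = 4  # number of tokens in each input chunk
stride = 2      # how far the window moves between consecutive samples

pairs = []
for start in range(0, len(token_ids) - max_length, stride):
    inputs = token_ids[start:start + max_length]
    # The targets are the inputs shifted one position to the right.
    targets = token_ids[start + 1:start + max_length + 1]
    pairs.append((inputs, targets))

for inputs, targets in pairs:
    print(inputs, "-->", targets)
```

Each target chunk is the input chunk shifted by one token, so every position in the input has a "next token" to predict. A smaller stride produces more, heavily overlapping samples; a stride equal to `max_length` produces non-overlapping ones.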

In [8]:
with open('./data/the-verdict.txt', 'r', encoding='utf-8') as f_in:
    raw_text = f_in.read()

enc_text = tokenizer.encode(raw_text)
enc_sample = enc_text[50:]
print(f'Number of tokens: {len(enc_text):,}')

Number of tokens: 5,145


In [9]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size + 1]  # targets: inputs shifted by one token

for i in range(1, context_size+1):
    context = tokenizer.decode(enc_sample[:i])
    target = tokenizer.decode([enc_sample[i]])
    print(f'{context} --> {target}')

 and -->  established
 and established -->  himself
 and established himself -->  in
 and established himself in -->  a


# Implementing a data loader class

In [22]:
%pip install torch

Collecting torch
  Downloading torch-2.3.1-cp310-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting filelock (from torch)
  Downloading filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
Collecting sympy (from torch)
  Downloading sympy-1.12.1-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
  Downloading networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Collecting mpmath<1.4.0,>=1.1.0 (from sympy->torch)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.3.1-cp310-none-macosx_11_0_arm64.whl (61.0 MB)
Downloading filelock-3.15.4-py3-none-any.whl (16 kB)
Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, text, tokenizer, max_length, stride) -> None:
        super().__init__()

        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(text)

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


In [10]:
def create_dataloader(text, batch_size=4, max_length=256, stride=128, shuffle=True,
                      drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDataset(text, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [13]:
with open('./data/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()
 
dataloader = create_dataloader(
    text=raw_text,
    batch_size=1,
    max_length=4,
    stride=1,
    shuffle=False
)

data_iter = iter(dataloader)

for i in range(0, 5):
    batch = next(data_iter)
    print(batch)


[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
[tensor([[2885, 1464, 1807, 3619]]), tensor([[1464, 1807, 3619,  402]])]
[tensor([[1464, 1807, 3619,  402]]), tensor([[1807, 3619,  402,  271]])]
[tensor([[1807, 3619,  402,  271]]), tensor([[ 3619,   402,   271, 10899]])]


# Creating Token Embeddings

Now we need to figure out how to go from token IDs to token embeddings.

Key notes:
- We initialize the embedding weights with random values. The weights are later learned during the training process.
- The embedding layer is essentially a lookup table that retrieves rows from the embedding matrix corresponding to the token IDs.
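
The lookup-table view can be checked with a small pure-Python sketch (no PyTorch needed): selecting row `i` of the weight matrix gives the same vector as multiplying a one-hot vector for `i` by that matrix. The sizes, the seed, and the helper names `lookup` and `one_hot_matmul` are assumptions for illustration.

```python
import math
import random

random.seed(0)

vocab_size = 6      # toy sizes, chosen only for illustration
embedding_dim = 3

# Randomly initialized embedding matrix: one row per token ID.
weights = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]


def lookup(token_id):
    """Embedding lookup: return the row that belongs to token_id."""
    return weights[token_id]


def one_hot_matmul(token_id):
    """Same result via a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
            for j in range(embedding_dim)]


token_ids = [1, 5, 3, 2]
embedded = [lookup(t) for t in token_ids]

# Check that both routes produce the same vectors for every token ID.
matches = all(
    math.isclose(a, b)
    for t in token_ids
    for a, b in zip(lookup(t), one_hot_matmul(t))
)
print(matches)
```

Because the one-hot product only ever picks out a single row, an embedding layer can skip the multiplication and index into the matrix directly.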

In [21]:
# Simple example

token_ids = torch.tensor([1, 5, 3, 2])

vocab_size = 10
embedding_dim = 5

# Seed the RNG for reproducibility (the output below is from an unseeded run).
torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

print(embedding_layer.weight)
print(embedding_layer.weight.shape)

Parameter containing:
tensor([[ 0.9428,  0.5536,  0.6039,  0.5082, -1.1227],
        [ 0.9949,  1.2063,  0.0420, -0.8989,  2.4404],
        [-1.6505, -2.5095,  1.0637,  2.1797, -0.5502],
        [ 0.2894, -0.2886,  0.6267,  0.2252, -0.6967],
        [-0.3044,  0.2805, -0.3211, -0.7243,  0.2545],
        [-1.0978, -0.8891, -1.3649, -0.8664,  0.6973],
        [ 0.4946,  0.0484, -0.6427, -0.3659, -1.5711],
        [ 0.6534, -0.8021,  0.1398, -0.7539, -0.9315],
        [ 0.0863, -0.3487, -0.2061,  1.0613, -0.7736],
        [ 0.3538,  0.8521,  0.2637,  1.5684, -1.3176]], requires_grad=True)
torch.Size([10, 5])


In [24]:
print(embedding_layer(token_ids))

tensor([[ 0.9949,  1.2063,  0.0420, -0.8989,  2.4404],
        [-1.0978, -0.8891, -1.3649, -0.8664,  0.6973],
        [ 0.2894, -0.2886,  0.6267,  0.2252, -0.6967],
        [-1.6505, -2.5095,  1.0637,  2.1797, -0.5502]],
       grad_fn=<EmbeddingBackward0>)


# Encoding Word Positions

Now we need to figure out how to convey the position of tokens in relation to each other.

Key notes:
- There are two types of positional encoding: absolute and relative. The choice of encoding depends on your use case.
    - Absolute positional encoding:
        - Directly associates specific positions in the input sequence with unique embeddings to convey exact locations.
    - Relative positional encoding:
        - Captures the relative distance between tokens, forcing the model to learn the relationship between tokens in terms of "how far apart" they are from one another.
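
A rough way to picture the difference, as a sketch rather than any particular model's implementation: an absolute scheme works from each token's index in the sequence, while a relative scheme works from the signed distance between pairs of tokens. The variable names below are illustrative.

```python
context_length = 4

# Absolute positions: each slot in the sequence gets its own index, which an
# absolute scheme maps to a dedicated embedding vector.
absolute_positions = list(range(context_length))

# Relative positions: entry [i][j] is the signed distance from token i to
# token j, which a relative scheme uses instead of the raw index.
relative_offsets = [[j - i for j in range(context_length)]
                    for i in range(context_length)]

print(absolute_positions)
for row in relative_offsets:
    print(row)
```

Shifting the whole sequence by one slot would change every absolute index, but the offset matrix would stay the same, since it depends only on how far apart two tokens are.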

In [34]:
# Creating our embedding layer.

output_dim = 768
vocab_size = 50257

token_embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=output_dim
)

# Sample data from data loader and get token embeddings.
data_loader = create_dataloader(
    text=raw_text,
    batch_size=8,
    max_length=4,
    stride=4,
    shuffle=False
)

data_iter = iter(data_loader)
input_tokens, output_tokens = next(data_iter)

print("Input tokens, Output tokens")
print(input_tokens.shape, output_tokens.shape)

token_embeddings = token_embedding_layer(input_tokens)

print("\nToken embeddings")
print(token_embeddings.shape)

# Create absolute positional encoding layer.
context_length = 4
pos_embedding_layer = torch.nn.Embedding(
    num_embeddings=context_length,
    embedding_dim=output_dim
)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))

print('\nPositional embeddings')
print(pos_embeddings.shape)

# Add the positional info to the token embeddings.
input_embeddings = token_embeddings + pos_embeddings

print("\nInput embeddings")
print(input_embeddings.shape)

Input tokens, Output tokens
torch.Size([8, 4]) torch.Size([8, 4])

Token embeddings
torch.Size([8, 4, 768])

Positional embeddings
torch.Size([4, 768])

Input embeddings
torch.Size([8, 4, 768])


# Putting it all together

In [44]:
import torch
from torch.utils.data import DataLoader
import tiktoken


# Load the raw data.
def load_text(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


raw_text = load_text("./data/the-verdict.txt")
print(f"Num characters: {len(raw_text):,}")
print(f"Num words: {len(raw_text.split()):,}")


# Setup data loader.
class GPTDataset(torch.utils.data.Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.tokens = tokenizer.encode(text)
        self.max_length = max_length
        self.stride = stride

    def __len__(self):
        # Count only windows whose shifted target chunk is also full length.
        return max(0, (len(self.tokens) - self.max_length - 1) // self.stride + 1)

    def __getitem__(self, idx):
        start = idx * self.stride
        end = start + self.max_length
        input_tokens = self.tokens[start:end]
        output_tokens = self.tokens[start + 1 : end + 1]
        return torch.tensor(input_tokens), torch.tensor(output_tokens)


def create_dataloader(
    text,
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0,
):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDataset(text, tokenizer, max_length, stride)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )


data_loader = create_dataloader(text=raw_text)
data_iter = iter(data_loader)

input_tokens, output_tokens = next(data_iter)
print("\nInput tokens, Output tokens")
print(input_tokens.shape, output_tokens.shape)


# Token embedding
vocab_size = 50257
output_dim = 768

# Create the layer once so its weights persist and can be trained.
token_embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab_size, embedding_dim=output_dim
)

token_embeddings = token_embedding_layer(input_tokens)
print("\nToken embeddings")
print(token_embeddings.shape)

# Positional encoding
context_length = 256

# One learned vector per position, again created a single time.
pos_embedding_layer = torch.nn.Embedding(
    num_embeddings=context_length, embedding_dim=output_dim
)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print("\nPositional embeddings")
print(pos_embeddings.shape)

# Combine embeddings
input_embeddings = token_embeddings + pos_embeddings
print("\nInput embeddings")
print(input_embeddings.shape)

Num characters: 20,479
Num words: 3,634

Input tokens, Output tokens
torch.Size([4, 256]) torch.Size([4, 256])

Token embeddings
torch.Size([4, 256, 768])

Positional embeddings
torch.Size([256, 768])

Input embeddings
torch.Size([4, 256, 768])
