## Working with Text Data

This notebook covers Chapter 2 of [*Build a Large Language Model from Scratch*](https://www.manning.com/books/build-a-large-language-model-from-scratch) by Sebastian Raschka (2025).

### Understanding Word Embeddings

> "The concept of converting text into a vector format is often referred to as *embedding*" (Raschka 2025:18).

- Embeddings convert non-numerical data into continuous, dense vectors in a vector space.
- Texts can be converted to ***word***, ***sentence***, ***paragraph***, and even ***document*** embeddings.
  - Sentence and document embeddings are used for ***retrieval-augmented generation (RAG)***.
- These continuous, dense embedding vectors can then be processed by neural networks.
- The embeddings representing contextually or conceptually similar documents should be closer to each other in a vector space than those representing different contexts or concepts.
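
As a rough illustration of what "closer in a vector space" can mean, the sketch below compares hand-picked 3-dimensional vectors using cosine similarity. The vectors, their names, and the `cosine_similarity` helper are hypothetical; real embeddings are learned and have far more dimensions, and cosine similarity is only one common choice of closeness measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hand-picked toy "embeddings" (assumed values, not from a trained model):
doc_cats = [0.9, 0.1, 0.0]
doc_kittens = [0.8, 0.2, 0.1]
doc_taxes = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(doc_cats, doc_kittens)
sim_unrelated = cosine_similarity(doc_cats, doc_taxes)
print(f"cats vs. kittens: {sim_related:.3f}")
print(f"cats vs. taxes:   {sim_unrelated:.3f}")
```

With these toy values, the two conceptually related vectors score much higher than the unrelated pair.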
  
**LLM Embeddings**
- Pretrained word embedding models, like Word2Vec, can be used (or trained from scratch) to create embeddings.
- However, large language models (LLMs) "commonly produce their own embeddings that are part of the input layer and are updated during training" (Raschka 2025:20).
  - The benefit of using an LLM's own embeddings is that they are optimized for the specific task and data.
- The size of an embedding (i.e., the ***dimensionality of its hidden state***) varies by model.
  - e.g., the smallest GPT-2 model uses 768 dimensions, while the largest GPT-3 model uses 12,288.
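
To get a feel for what these dimensionalities imply, the sketch below counts the weights in a token embedding matrix. It assumes one row per vocabulary entry and, for both sizes, the 50,257-token BPE vocabulary discussed later in this notebook; `embedding_table_size` is a hypothetical helper.

```python
def embedding_table_size(vocab_size, embedding_dim):
    """Number of weights in a (vocab_size x embedding_dim) embedding matrix."""
    return vocab_size * embedding_dim

# assumption: the same 50,257-token vocabulary for both embedding sizes
bpe_vocab_size = 50_257
small = embedding_table_size(bpe_vocab_size, 768)
large = embedding_table_size(bpe_vocab_size, 12_288)
print(f"768 dims:    {small:,} weights")   # 38,597,376 weights
print(f"12,288 dims: {large:,} weights")   # 617,558,016 weights
```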

### Implementing Embeddings

Fetch the public domain text, *The Verdict*:

In [None]:
# urllib:
import urllib.request

# file url:
url = (
    "https://raw.githubusercontent.com/rasbt/" 
    "LLMs-from-scratch/main/ch02/01_main-chapter-code/" 
    "the-verdict.txt"
)

# fetch:
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# open:
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

In [None]:
print(f"N characters: {len(raw_text)}")
print(raw_text[:100])

#### Tokenization

**Simple tokenizer:**

We can start by just splitting on whitespace.

In [None]:
import re

# split on whitespace:
tokenizer_regex = re.compile(r"(\s)")

# test:
test_text = "Hello, world! This is a test of my simple tokenizer."
test_result = re.split(tokenizer_regex, test_text)
print(test_result)

This is too naive, so we can add additional rules (e.g., splitting punctuation from tokens):

In [None]:
tokenizer_regex = re.compile(r"([,.!]|\s)")
test_text = "Hello, world! This is a test of my simple tokenizer."
test_result = re.split(tokenizer_regex, test_text)
print(test_result)

Let's remove the remaining empty and whitespace-only strings:

In [None]:
test_result = [token for token in test_result if token.strip()]
print(test_result)

We need a few more rules (additional punctuation marks and double dashes) to tokenize the example text from *The Verdict*:

In [None]:
tokenizer_regex = re.compile(r"([,.:;?_!\"()\']|--|\s)")
tokens = re.split(tokenizer_regex, raw_text)
tokens = [token for token in tokens if token.strip()]
print(f"N tokens: {len(tokens)}")
print(tokens[:20])

**Map tokens to token IDs**

In [None]:
# create the vocabulary:
all_words = sorted(set(tokens))
vocab_size = len(all_words)
print(f"Vocab size: {vocab_size}")

In [None]:
# token IDs:
vocab = {token:idx for idx,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break
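
The same mapping can be traced by hand on a much smaller input. The sketch below uses a made-up token list (`toy_tokens` is hypothetical, not taken from *The Verdict*) to show how sorting the unique tokens fixes the ID assignment and how encoding and decoding reduce to dictionary lookups.

```python
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# unique tokens in sorted order define the ID assignment:
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)
# {'.': 0, 'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}

# encoding is one dictionary lookup per token:
toy_ids = [toy_vocab[token] for token in toy_tokens]
print(toy_ids)
# [5, 1, 4, 3, 5, 2, 0]

# decoding needs the inverse mapping (ID -> token):
toy_inverse = {idx: token for token, idx in toy_vocab.items()}
print([toy_inverse[i] for i in toy_ids] == toy_tokens)
# True
```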

**Convert simple tokenizer to a class:**

In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:tok for tok,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.:;?_!\"()\']|--|\s)", text)
        preprocessed = [tok.strip() for tok in preprocessed if tok.strip()]
        ids = [self.str_to_int[tok] for tok in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        # remove whitespace before punctuation:
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Test tokenizer:

In [None]:
tokenizer = SimpleTokenizerV1(vocab=vocab)

# test text:
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""

# encode:
ids = tokenizer.encode(text)
print(ids)

In [None]:
# decode:
tokenizer.decode(ids)
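
Decoding is not an exact inverse of encoding here, because the original whitespace is discarded during splitting and only approximately restored afterwards. The sketch below reuses the two regular expressions from the tokenizer in standalone helper functions (`split_tokens` and `join_tokens` are hypothetical names) to show the effect on a short string with an apostrophe.

```python
import re

SPLIT_PATTERN = r"([,.:;?_!\"()\']|--|\s)"

def split_tokens(text):
    """Split text the same way as the tokenizer's encode step."""
    return [tok for tok in re.split(SPLIT_PATTERN, text) if tok.strip()]

def join_tokens(tokens):
    """Join tokens the same way as the tokenizer's decode step."""
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

original = "It's fine."
tokens_demo = split_tokens(original)
restored = join_tokens(tokens_demo)
print(tokens_demo)           # ['It', "'", 's', 'fine', '.']
print(restored)              # It' s fine.
print(restored == original)  # False
```

The space before the apostrophe is removed, but the space after it stays, so the round trip is lossy.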

We still have the issue of out-of-vocabulary text. Encoding a word that is not in the vocabulary, such as `Hello`, raises a `KeyError`:

In [None]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

We can fix this with **special** tokens:

- `<|unk|>` can be used for out-of-vocabulary words.
- `<|endoftext|>` can be used to communicate to the LLM that a text sequence has ended and that it is unrelated to the following sequence.
  - In training, the training examples will be concatenated together.
  - Hence, we need to let the model know the boundaries of related tokens.

In [None]:
# rebuild the vocabulary with the special tokens appended:
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:idx for idx,token in enumerate(all_tokens)}
print(f"Vocab size: {len(vocab)}")

In [None]:
# show the last five vocabulary entries:
for item in list(vocab.items())[-5:]:
    print(item)

Let's update our tokenizer:

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:tok for tok,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.:;?_!\"()\']|--|\s)", text)
        preprocessed = [tok.strip() for tok in preprocessed if tok.strip()]

        # flag unknown tokens:
        preprocessed = [
            tok if tok in self.str_to_int else "<|unk|>"
            for tok in preprocessed
        ]

        ids = [self.str_to_int[tok] for tok in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        # remove whitespace before punctuation:
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

And test:

In [None]:
tokenizer = SimpleTokenizerV2(vocab=vocab)
text1 = "Hello, do you like tea?" 
text2 = "In the sunlit terraces of the palace."

# join on <|endoftext|> token:
text = " <|endoftext|> ".join((text1, text2))
print(text)

In [None]:
ids = tokenizer.encode(text)
print(ids)

In [None]:
print(tokenizer.decode(ids))

Other common special tokens include:

- `[BOS]`: beginning of sequence (start of text).
- `[EOS]`: end of sequence (end of a text).
  - Useful when concatenating multiple texts.
  - Similar to `<|endoftext|>`.
- `[PAD]`: padding (indicates padding token).
  - Padding is used when training examples have documents of different lengths.
  - When training on batched inputs, a mask is typically used so that we "don't attend to padded tokens" (p. 32).
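
The padding idea can be sketched in a few lines of plain Python. The `pad_batch` helper and the choice of `0` as the padding ID are assumptions for illustration only; they are not taken from the book or from any particular library.

```python
PAD_ID = 0  # hypothetical ID reserved for the padding token

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to a common length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

batch = [[5, 8, 2], [7, 3, 9, 4, 6], [1]]
padded, mask = pad_batch(batch)
for row, m in zip(padded, mask):
    print(row, m)
```

Every row now has the same length, and the mask records which positions should be ignored.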

**Note:** GPT models do not use `<|unk|>` tokens; instead they use a ***byte-pair encoding tokenizer*** that breaks words into sub-word units.

#### Byte-Pair Encoding

- A byte-pair encoding (BPE) tokenizer was used to train LLMs like GPT-2, GPT-3, and the original ChatGPT.
  - This BPE tokenizer has a vocabulary size of 50,257.
  - Even though the BPE tokenizer doesn't use `<|unk|>`, it breaks unfamiliar words into subword units or individual characters, avoiding out-of-vocabulary (OOV) errors.
    - If an OOV word is encountered, BPE can simply represent it as a sequence of subword tokens or characters.
  - BPE fundamentally works by "iteratively merging frequent characters into subwords and frequent subwords into words" (p. 34).
- We use the implementation in `tiktoken` here.
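
To make the "iteratively merging" idea concrete, here is a minimal character-level sketch of two BPE merge steps. It is a simplified toy, not the byte-level algorithm implemented in `tiktoken`; the corpus, its word frequencies, and the helper names `count_pairs` and `merge_pair` are all made up for illustration.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

# toy corpus: words split into characters, with made-up frequencies
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}

merges = []
for _ in range(2):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}

print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)  # {('low',): 5, ('lo', 't'): 2, ('n', 'e', 'w'): 3}
```

After two merges, the frequent word "low" has become a single symbol, while the rarer words are still represented by smaller pieces.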

In [None]:
from importlib.metadata import version 
import tiktoken 
print("tiktoken version:", version("tiktoken"))

In [None]:
# get GPT-2 tokenizer:
tokenizer = tiktoken.get_encoding("gpt2")
text = """Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."""
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

In [None]:
# decode:
decoded_strings = tokenizer.decode(ids)
print(decoded_strings)

**Sampling data with a sliding window:**

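To predict the token at position `i` of a sequence, the model can only access the tokens before position `i`; the token at position `i` itself is the target. Sliding a window over the text therefore yields input-target pairs in which the targets are the inputs shifted by one position.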

Let's see how this works:

In [None]:
raw_text[:100]

In [None]:
encoded_text = tokenizer.encode(raw_text)
print(len(encoded_text))

In [None]:
# skip the first 50 tokens:
enc_sample = encoded_text[50:]

# choose context size:
context_size = 4

# input tokens:
input_tokens = enc_sample[:context_size]

# target tokens will be the inputs shifted by one position:
target_tokens = enc_sample[1:context_size+1]

print(f"Input tokens: {input_tokens}")
print(f"Target tokens: {target_tokens}")

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    target = enc_sample[i]
    print(context, "---->", target)

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    target = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([target]))

Let's make use of PyTorch:

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

In [None]:
# more efficient data handling:
class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        """PyTorch dataset of input/target token ID windows.

        Args:
            text (str): the raw text string.
            tokenizer (Any): the tokenizer.
            max_length (int): the length of each input (and target) sequence in tokens.
            stride (int): the number of token positions the window shifts between samples.
        """
        self.input_ids = []
        self.target_ids = []

        # encode:
        token_ids = tokenizer.encode(text)

        # get inputs and targets:
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i: i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    # __len__ gets total number of rows in dataset:
    def __len__(self):
        return len(self.input_ids)
    
    # __getitem__ returns a single row from the dataset:
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

Create a data loader:

In [None]:
def create_dataloader_v1(text, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True, 
                         num_workers=0):
    
    # fetch tokenizer:
    tokenizer = tiktoken.get_encoding("gpt2")

    # create dataset:
    dataset = GPTDatasetV1(text, tokenizer, max_length, stride)

    # create data loader:
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  # drop the final batch if it is smaller than batch_size (helps avoid loss spikes)
        num_workers=num_workers
    )

    return dataloader
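
The effect of `drop_last` on the number of batches comes down to simple arithmetic. The sketch below mirrors that arithmetic in plain Python rather than calling PyTorch; `num_batches` is a hypothetical helper and the sample counts are made up.

```python
import math

def num_batches(n_samples, batch_size, drop_last):
    """Batches per epoch, with or without the final incomplete batch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

# 10 samples in batches of 4: two full batches plus a leftover batch of 2
print(num_batches(10, 4, drop_last=True))   # 2
print(num_batches(10, 4, drop_last=False))  # 3
```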

Try it out:

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

What exactly is going on with ***stride***?

- Stride controls how many token positions the input window shifts between consecutive samples (with `batch_size=1`, between consecutive batches).

For example, with `stride=1`, the second batch starts at token ID `367`, the second token of the first batch's input:

In [None]:
second_batch = next(data_iter)
print(second_batch)

But if we set, e.g., `stride=2`, the second batch starts two token positions later, at token ID `2885`:

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=2, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

In [None]:
second_batch = next(data_iter)
print(second_batch)
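
The effect of stride can also be seen without a tokenizer. The sketch below applies the same windowing loop as `GPTDatasetV1` to stand-in token IDs `0` to `9`, so each window's starting position is easy to read off; `sliding_windows` and `fake_ids` are hypothetical names used only for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) windows; targets are inputs shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

fake_ids = list(range(10))  # stand-in token IDs 0..9

for stride in (1, 2, 4):
    windows = sliding_windows(fake_ids, max_length=4, stride=stride)
    starts = [inputs[0] for inputs, _ in windows]
    print(f"stride={stride}: {len(windows)} windows, starting at {starts}")
# stride=1: 6 windows, starting at [0, 1, 2, 3, 4, 5]
# stride=2: 3 windows, starting at [0, 2, 4]
# stride=4: 2 windows, starting at [0, 4]
```

A smaller stride produces more, heavily overlapping samples; a stride equal to `max_length` produces windows that do not overlap at all.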

We can also increase the batch size. Here `stride` equals `max_length`, so the input windows do not overlap:

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("Targets:\n", targets)

#### Creating Token Embeddings

**Toy Example**

Assume we have a vocabulary size of just `6` and we want to create 3-dimensional embeddings. We can use the `nn.Embedding` layer from `torch` to initialize random embedding weights:

In [None]:
# set seed:
torch.manual_seed(123)

vocab_size = 6
output_dim = 3

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

The random values in the embedding vectors will be optimized during training.

We can grab the embedding for a specific token:

In [None]:
token_index = 3
embedding_layer(torch.tensor([token_index]))

Simulate some input IDs:

In [None]:
input_ids = torch.tensor([2, 3, 5, 1])
print(embedding_layer(input_ids))

`nn.Embedding` is implemented as a lookup table, but its output is mathematically equivalent to passing a one-hot-encoded matrix through a bias-free `nn.Linear` layer:

In [None]:
num_idx = int(input_ids.max()) + 1  # number of one-hot classes, as a plain int
linear_output = torch.nn.Linear(num_idx, output_dim, bias=False)
linear_output.weight

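The weight matrix of `nn.Linear` has shape `(output_dim, num_idx)`, so it must be transposed to match the `(num_idx, output_dim)` shape of the `nn.Embedding` weights (note: the two layers are initialized independently here, so the actual weight values will not match):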

In [None]:
linear_output.weight.T
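
To check the equivalence numerically, the weights have to be shared between the two layers. The sketch below is one way to do that, under the assumption that copying the transposed embedding weights into the linear layer is acceptable for a demonstration; the variable names are illustrative.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# bias-free linear layer that reuses the embedding weights (transposed)
linear = torch.nn.Linear(vocab_size, output_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding_layer.weight.T)

input_ids = torch.tensor([2, 3, 5, 1])
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()

lookup_result = embedding_layer(input_ids)  # row lookup
matmul_result = linear(one_hot)             # one-hot matrix multiplication
print(torch.allclose(lookup_result, matmul_result))  # True
```

The lookup avoids building the one-hot matrix and multiplying by its many zeros, which is why an embedding layer is the more efficient choice.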