In [2]:
import tiktoken
from importlib.metadata import version
import torch
from torch.utils.data import Dataset, DataLoader
from pathlib import Path

# Tokenization with tiktoken

This cell demonstrates how to use the `tiktoken` library for tokenizing and detokenizing text, which is essential for working with language models such as those from OpenAI.

## Steps Covered
1. **Install and Import tiktoken**: Ensure the `tiktoken` library is installed and import it along with the `version` utility.
2. **Check tiktoken Version**: Print the installed version of `tiktoken` to verify the setup.
3. **Initialize Tokenizer**: Load the `cl100k_base` encoding, which OpenAI models such as GPT-4 and GPT-3.5 Turbo use.
4. **Explore Vocabulary Size**: Display the size of the tokenizer's vocabulary.
5. **Tokenize Sample Text**: Encode a sample string into tokens and display the result.
6. **Decode Tokens**: Convert the tokens back to text and verify the output.

---

> **Note:** Tokenization converts text into a sequence of token IDs (integers) that a model can process. Detokenization is the reverse: it converts token IDs back into human-readable text.
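
To make the encode/decode round trip concrete without depending on any particular vocabulary, here is a minimal sketch of a toy byte-level scheme. It is not how `tiktoken` works internally (that library uses byte-pair encoding with a far larger vocabulary), and the names `toy_encode` and `toy_decode` are hypothetical helpers introduced only for this illustration.

```python
def toy_encode(text: str) -> list[int]:
    """Map text to integer IDs, one per UTF-8 byte (vocabulary size 256)."""
    return list(text.encode("utf-8"))


def toy_decode(token_ids: list[int]) -> str:
    """Invert toy_encode by reassembling the bytes into a string."""
    return bytes(token_ids).decode("utf-8")


sample = "Hello, world!"
ids = toy_encode(sample)

print(ids)                        # one integer per byte of the input
print(toy_decode(ids) == sample)  # decoding recovers the original text
```

A real tokenizer follows the same contract (text in, integers out, and back again) but merges frequent byte sequences into single IDs so that sequences are much shorter.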

In [3]:
# Use different outer quotes: reusing the same quote type inside an f-string needs Python 3.12+
print(f"Tiktoken version: {version('tiktoken')}")

tokenizer = tiktoken.get_encoding("cl100k_base")
print(f'Vocabulary size: {tokenizer.n_vocab}')

sample_text = "Hello, world! This is a test of how well the tokenizer works."
tokens = tokenizer.encode(sample_text)
print(f'Encoded tokens: {tokens}')

decoded = tokenizer.decode(tokens)
print(f'Decoded text: {decoded}')

Tiktoken version: 0.11.0
Vocabulary size: 100277
Encoded tokens: [9906, 11, 1917, 0, 1115, 374, 264, 1296, 315, 1268, 1664, 279, 47058, 4375, 13]
Decoded text: Hello, world! This is a test of how well the tokenizer works.


# Combine Text Files for LLM Training

This cell merges all `.txt` files from the `books` directory into a single file, `all_books.txt`, with each book followed by an `<EOS>` (End Of Sequence) marker. This is a common preprocessing step for language model training, because the marker gives the model a signal for document boundaries.

## What the Code Does
- **Finds all `.txt` files** in the `books` directory using `Path.glob`.
- **Opens `all_books.txt`** for writing in UTF-8 encoding.
- **Iterates through each file**, reads its content, and writes it to the output file.
- **Appends `<EOS>`** after each book to mark the end of a document.

---

> **Why use `<EOS>`?**
>
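> The `<EOS>` marker shows the language model where one document ends and the next begins. This matters for tasks like text generation, where the model should respect document boundaries. Note that `<EOS>` is not a registered special token in `cl100k_base`, so the tokenizer encodes it as several ordinary tokens rather than a single ID. The encoding's own end-of-text special token is `<|endoftext|>`, which `encode` accepts only when it is permitted through the `allowed_special` argument.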
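
The merge step can be tried safely in a temporary directory before running it on real data. The sketch below is an illustration under stated assumptions: `merge_text_files` is a hypothetical helper, and visiting the files in sorted order is an added choice for reproducibility (the notebook cell uses whatever order `glob` returns).

```python
import tempfile
from pathlib import Path


def merge_text_files(source_dir: Path, output_path: Path, separator: str = "<EOS>") -> int:
    """Concatenate every .txt file in source_dir into output_path.

    Returns the number of files merged. Files are visited in sorted
    order so repeated runs produce identical output.
    """
    paths = sorted(source_dir.glob("*.txt"))
    with open(output_path, "w", encoding="utf-8") as outfile:
        for path in paths:
            outfile.write(path.read_text(encoding="utf-8") + separator)
    return len(paths)


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the books directory.
    books = Path(tmp) / "books"
    books.mkdir()
    (books / "b.txt").write_text("second book", encoding="utf-8")
    (books / "a.txt").write_text("first book", encoding="utf-8")

    # Write the merged file outside the source directory so it is never re-read.
    merged_path = Path(tmp) / "all_books.txt"
    merged_count = merge_text_files(books, merged_path)
    merged_text = merged_path.read_text(encoding="utf-8")

print(merged_count, merged_text)
```

Keeping the output file outside the source directory avoids the merged file being picked up by the `*.txt` pattern on a later run.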

In [4]:
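# glob already yields Path objects, so each file can be read directly
files = Path('./books').glob('*.txt')
with open('all_books.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        # Append <EOS> after every book to mark the document boundary
        outfile.write(file.read_text(encoding='utf-8') + '<EOS>')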

# Tokenize the Combined Text Corpus

This cell reads the entire combined text file (`all_books.txt`) and tokenizes it using the `tiktoken` tokenizer. Tokenization is a crucial preprocessing step for training language models, as it converts raw text into a sequence of tokens (integers) that the model can understand.

## What the Code Does
- **Reads the full corpus**: Loads all text from `all_books.txt` into memory.
- **Tokenizes the text**: Uses the `tokenizer.encode()` method to convert the text into a list of tokens.
- **Prints the token count**: Displays the total number of tokens, which is useful for estimating dataset size and batching during training.

---

> **Tip:** Knowing the total number of tokens helps you plan batch sizes, training steps, and memory requirements for your LLM training pipeline.

In [5]:
with open('all_books.txt', 'r', encoding='utf-8') as textfile:
    all_text = textfile.read()

encoded_text = tokenizer.encode(all_text)

print(f'Total number of tokens: {len(encoded_text)}')




Total number of tokens: 4962540
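
The token count above translates directly into a number of training samples and batches. The following sketch assumes the corpus is cut into fixed-length windows that start every `step_size` tokens, as the dataset class later in this notebook does. The helper `count_windows` is hypothetical, and the values for `max_length`, `step_size`, and `batch_size` are illustrative choices.

```python
def count_windows(num_tokens: int, max_length: int, step_size: int) -> int:
    """Number of start positions in range(0, num_tokens - max_length, step_size)."""
    return len(range(0, num_tokens - max_length, step_size))


num_tokens = 4_962_540  # token count printed for all_books.txt
max_length, step_size, batch_size = 512, 256, 8  # illustrative values

num_samples = count_windows(num_tokens, max_length, step_size)
batches_per_epoch = num_samples // batch_size  # an incomplete final batch is dropped

print(num_samples, batches_per_epoch)  # 19383 2422
```

Halving `step_size` roughly doubles the number of samples, because the windows overlap more. The corpus itself is unchanged.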


# Creating a PyTorch Dataset for Language Model Training

This cell defines a custom `BooksDataset` class, which prepares tokenized text data for training an autoregressive language model (such as GPT). The dataset generates input and target sequences for each training example, enabling the model to learn to predict the next token in a sequence.

## What the Code Does
- **Initializes the Dataset**: Takes the full text, a tokenizer, a maximum sequence length (`max_length`), and a step size (`step_size`, the number of tokens between the starts of consecutive sequences).
- **Tokenizes the Text**: Encodes the entire text into tokens using the provided tokenizer.
- **Creates Overlapping Chunks**: Starting at every `step_size`-th position in the tokenized text, creates an input chunk of length `max_length` and a target chunk of the same length shifted by one token (the next-token prediction target). Chunks overlap whenever `step_size` is smaller than `max_length`.
- **Stores as Tensors**: Converts each chunk into a PyTorch tensor for efficient batching and training.
- **Implements `__len__` and `__getitem__`**: Standard PyTorch Dataset methods for compatibility with DataLoader.

---

> **Tip:**
> - Adjust `max_length` and `step_size` to control the size and overlap of training samples.
> - This dataset structure is ideal for next-token prediction tasks, where the model learns to generate text one token at a time.
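
The windowing logic is easier to follow on a short list of integers than on real token IDs. This is a minimal sketch of the same loop in plain Python. `make_windows` is a hypothetical helper that returns lists instead of tensors, and `toy_ids` stands in for an encoded text.

```python
def make_windows(token_ids, max_length, step_size):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, step_size):
        input_chunk = token_ids[i : i + max_length]
        target_chunk = token_ids[i + 1 : i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs: 0, 1, ..., 9
pairs = make_windows(toy_ids, max_length=4, step_size=2)

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Stopping the range at `len(token_ids) - max_length` guarantees that every target chunk is full length, since the target needs one more token than the input.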

In [6]:
class BooksDataset(Dataset):
    def __init__(self, text, tokenizer, max_length, step_size):
        self.input_ids = []
        self.target_ids = []

        encoded_text = tokenizer.encode(text)

        for i in range(0, len(encoded_text) - max_length, step_size):
            input_chunk = encoded_text[i:i + max_length]
            target_chunk = encoded_text[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


# Utility Function: Create DataLoader for LLM Training

This cell defines a helper function, `create_dataloader`, which streamlines the process of preparing a PyTorch DataLoader for language model training. It takes raw text and returns a DataLoader that yields batches of input and target token sequences, ready for model training.

## What the Code Does
- **Defines `create_dataloader`**: Accepts the full text, maximum sequence length, step size (stride), batch size, and a shuffle flag as arguments.
- **Initializes the Tokenizer**: Uses the `cl100k_base` encoding from `tiktoken` for tokenization.
- **Creates a `BooksDataset`**: Uses the custom dataset class to generate input-target pairs from the tokenized text.
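- **Builds a DataLoader**: Wraps the dataset in a PyTorch DataLoader for batching and optional shuffling. `drop_last = True` discards a final incomplete batch, and `num_workers = 0` loads data in the main process rather than in parallel workers.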
- **Returns the DataLoader**: Ready to be used in a training loop for next-token prediction tasks.

---

> **Tip:**
> - Adjust `max_length`, `step_size`, and `batch_size` to fit your model and hardware constraints.
> - This function abstracts away the repetitive setup code, making your training pipeline cleaner and more modular.
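
The effect of `batch_size` and `drop_last` can be illustrated without PyTorch. The sketch below is a simplified stand-in for how samples are grouped when shuffling is off. `batch_indices` is a hypothetical helper, not part of the DataLoader API.

```python
def batch_indices(num_samples, batch_size, drop_last=True):
    """Group sample indices 0..num_samples-1 into consecutive batches."""
    batches = [
        list(range(start, min(start + batch_size, num_samples)))
        for start in range(0, num_samples, batch_size)
    ]
    # Discard the final batch if it has fewer than batch_size samples.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_indices(10, batch_size=4, drop_last=True))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(batch_indices(10, batch_size=4, drop_last=False))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Dropping the short final batch keeps every batch the same shape, at the cost of skipping a few samples per epoch.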

In [7]:
def create_dataloader(text, max_length = 512, step_size = 256, batch_size = 8, shuffle = True):
    tokenizer = tiktoken.get_encoding('cl100k_base')
    dataset = BooksDataset(text, tokenizer, max_length, step_size)
    dataloader = DataLoader(
        dataset = dataset,
        batch_size = batch_size,
        shuffle = shuffle,
        drop_last = True,
        num_workers = 0
    )

    return dataloader



# Example: Using the DataLoader for Batching

This cell demonstrates how to use the `create_dataloader` utility to generate batches of input and target sequences for language model training. It shows how to iterate through the DataLoader and inspect the structure of each batch.

## What the Code Does
- **Creates a DataLoader**: Calls `create_dataloader` with the full text, a batch size of 2, a maximum sequence length of 8, and a step size of 4. Shuffling is disabled for demonstration purposes.
- **Iterates Through Batches**: Loops through the DataLoader and prints the first 4 batches, showing the input and target tensors for each batch.
- **Batch Structure**: Each batch contains a tuple of input and target tensors, where:
  - The input tensor has shape `(batch_size, max_length)`, with one row of token IDs per sequence.
  - The target tensor has the same shape and holds each input row shifted by one token (for next-token prediction).

---

> **Tip:**
> - Adjust `batch_size`, `max_length`, and `step_size` to match your model and hardware.
> - Inspecting batches before training helps verify that your data pipeline is working as expected.

In [8]:
dataloader = create_dataloader(all_text, batch_size = 2, max_length = 8, step_size = 4, shuffle = False)

for i, batch in enumerate(dataloader):
    print(f"Batch {i}: {batch}")
    if i == 3:
        break
    

Batch 0: [tensor([[ 3305,   791,  5907, 52686, 58610,   315,   578, 19121],
        [58610,   315,   578, 19121, 21785,   315, 12656, 42482]]), tensor([[  791,  5907, 52686, 58610,   315,   578, 19121, 21785],
        [  315,   578, 19121, 21785,   315, 12656, 42482,  7361]])]
Batch 1: [tensor([[21785,   315, 12656, 42482,  7361,  2028, 35097,   374],
        [ 7361,  2028, 35097,   374,   369,   279,  1005,   315]]), tensor([[  315, 12656, 42482,  7361,  2028, 35097,   374,   369],
        [ 2028, 35097,   374,   369,   279,  1005,   315,  5606]])]
Batch 2: [tensor([[  369,   279,  1005,   315,  5606, 12660,   304,   279],
        [ 5606, 12660,   304,   279,  3723,  4273,   323,   198]]), tensor([[  279,  1005,   315,  5606, 12660,   304,   279,  3723],
        [12660,   304,   279,  3723,  4273,   323,   198,  3646]])]
Batch 3: [tensor([[3723, 4273,  323,  198, 3646, 1023, 5596,  315],
        [3646, 1023, 5596,  315,  279, 1917,  520,  912]]), tensor([[4273,  323,  198, 3646, 1023,
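
The printed batches can be checked against the two properties the dataset is meant to have. The sketch below copies the rows of "Batch 0" from the output above into plain lists and tests them. The variable names are introduced here for illustration only.

```python
# Rows copied from "Batch 0" in the output above (batch_size=2, max_length=8, step_size=4).
inputs = [
    [3305, 791, 5907, 52686, 58610, 315, 578, 19121],
    [58610, 315, 578, 19121, 21785, 315, 12656, 42482],
]
targets = [
    [791, 5907, 52686, 58610, 315, 578, 19121, 21785],
    [315, 578, 19121, 21785, 315, 12656, 42482, 7361],
]
step_size = 4

# Each target row is its input row shifted left by one token.
shift_ok = all(inp[1:] == tgt[:-1] for inp, tgt in zip(inputs, targets))

# With shuffling off, consecutive rows start step_size tokens apart,
# so the second row begins with the tail of the first.
overlap_ok = inputs[0][step_size:] == inputs[1][: len(inputs[0]) - step_size]

print(shift_ok, overlap_ok)  # True True
```

The same two checks can be run on tensors from a real batch by converting them with `.tolist()` first.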

# Token and Positional Embeddings for LLMs

This cell demonstrates how to combine token embeddings and positional embeddings, which are essential components in transformer-based language models (LLMs) like GPT. Token embeddings represent the meaning of each token, while positional embeddings encode the position of each token in the input sequence, allowing the model to capture word order and context.

## What the Code Does
- **Defines `vocab_size` and `embedding_dim`**: Sets the vocabulary size and embedding dimension for the model.
- **Sets `context_length`**: Specifies the maximum sequence length (context window) the model can process.
- **Creates Token Embedding Layer**: Maps each token ID to a dense vector of size `embedding_dim`.
- **Creates Positional Embedding Layer**: Maps each position (from 0 to `context_length - 1`) to a dense vector of the same size.
- **Loads a Batch of Data**: Uses the DataLoader to get a batch of input and target token sequences.
- **Computes Token Embeddings**: Looks up embeddings for each token in the input batch.
- **Computes Positional Embeddings**: Looks up embeddings for each position in the sequence.
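- **Combines Embeddings**: Adds token and positional embeddings to form the final input embeddings for the model. The positional tensor of shape `(context_length, embedding_dim)` is broadcast across the batch dimension, so every sequence in the batch receives the same positional vectors.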
- **Prints Shapes**: Displays the shapes of the resulting tensors to verify correctness.

---

> **Tip:**
> - Adding token and positional embeddings is a standard practice in transformer models, enabling them to understand both content and order of tokens.
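> - Ensure that no input sequence is longer than `context_length`: the positional embedding layer only has entries for positions 0 to `context_length - 1`. In this cell every sequence has exactly `context_length` tokens, so `torch.arange(context_length)` lines up with the batch.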
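
The combination step can be written as a single formula. The notation is introduced here for illustration and does not appear in the code: $E_{\text{tok}} \in \mathbb{R}^{V \times d}$ is the token embedding table, $E_{\text{pos}} \in \mathbb{R}^{T \times d}$ is the positional embedding table, and $x_{b,t}$ is the token ID at position $t$ of sequence $b$ in a batch of $B$ sequences.

$$
X_{b,t,:} = E_{\text{tok}}[\,x_{b,t}\,] + E_{\text{pos}}[\,t\,], \qquad 0 \le b < B,\quad 0 \le t < T
$$

With the values used in this cell ($B = 8$, $T = 1024$, $d = 512$), the token lookup has shape $(8, 1024, 512)$ and the positional lookup has shape $(1024, 512)$. The sum has shape $(8, 1024, 512)$ because $E_{\text{pos}}[t]$ does not depend on $b$ and is reused for every sequence in the batch.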

In [29]:
vocab_size = tokenizer.n_vocab
embedding_dim = 512
context_length = 1024

token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
positional_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

dataloader = create_dataloader(all_text, batch_size = 8, max_length = context_length, step_size = 512, shuffle = True)
dataiter = iter(dataloader)

test_input, test_targets = next(dataiter)


token_embeddings = token_embedding_layer(test_input)
print(f'Token embeddings shape: {token_embeddings.shape}')

positional_embeddings = positional_embedding_layer(torch.arange(context_length))
print(f'Positional embeddings shape: {positional_embeddings.shape}')

input_embeddings = token_embeddings + positional_embeddings
print(f'Input embeddings shape: {input_embeddings.shape}')

Token embeddings shape: torch.Size([8, 1024, 512])
Positional embeddings shape: torch.Size([1024, 512])
Input embeddings shape: torch.Size([8, 1024, 512])
