# Tokenizing text

We'll use an off-the-shelf tokenizer, `tiktoken`, for Byte Pair Encoding (BPE).
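
To get a feel for what BPE does, here is a toy sketch of the core idea: start from single characters and repeatedly merge the most frequent adjacent pair. This is not how `tiktoken` is implemented (GPT-2's tokenizer works on bytes and applies a fixed, pre-trained merge table); the helper names `most_frequent_pair`, `merge_pair` and `toy_bpe` are made up for this illustration.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` greedy merges."""
    symbols = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge_pair(symbols, pair)
    return symbols


symbols = toy_bpe("low lower lowest", num_merges=2)
print(symbols)
```

After two merges the three occurrences of `l`, `o`, `w` have each been fused into a single `low` symbol, while the rarer endings stay as individual characters. A real tokenizer then maps each symbol to an integer ID.
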

In [27]:
#| echo: true
#| output: false
%conda install -y tiktoken


Let's load a text and tokenize it:

In [28]:
import tiktoken

filepath = '../data/dracula.txt'

def load_text(path):
    with open(path, 'r', encoding='utf-8') as f:
        raw_text = f.read()
    return raw_text

def tokens_from_text(text: str):
    tokenizer = tiktoken.get_encoding("gpt2")
    integers = tokenizer.encode(text)
    return integers

def text_from_tokens(tokens: list[int]):
    tokenizer = tiktoken.get_encoding("gpt2")
    text = tokenizer.decode(tokens)
    return text


This now allows us to load text and turn it into tokens (each identified by an integer), or the reverse: given a list of tokens, reconstruct the text from them:

In [29]:
def get_sample_text(num_chars: int = 40):
    raw_text = load_text(filepath)
    return raw_text[:num_chars]

sample_text = get_sample_text()
print(sample_text)

tokens = tokens_from_text(sample_text)
print(tokens)

text = text_from_tokens(tokens)
print(text)

The Project Gutenberg eBook of Dracula
 
[464, 4935, 20336, 46566, 286, 41142, 198, 220]
The Project Gutenberg eBook of Dracula
 


# Creating a dataset

We'll first make sure PyTorch is installed: `conda install pytorch cpuonly -c pytorch`.


In [30]:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, txt: str, tokenizer, max_length=16, stride=4):
        """
        Args:
            txt (str): The input text to tokenize and split into sequences.
            tokenizer: The tokenizer used to encode the text into token ids.
            max_length (int): The context length, i.e., the number of tokens in each input sequence.
            stride (int): Intended step size between the starts of consecutive sequences.
                It is stored, but the loop below advances one token at a time.
        """
        self.tokenizer = tokenizer
        self.max_length = max_length  # context length for each input sequence
        self.stride = stride
        self.token_ids = self.tokenizer.encode(txt)
        self.length = len(self.token_ids)

        self.input_ids = []    # list of input chunks, our "context" as input to the LLM
        self.target_ids = []   # list of target chunks to predict: each input chunk shifted right by one token

        # Slide a window over the tokens. This loop steps by 1; self.stride is not applied here.
        for i in range(0, len(self.token_ids) - self.max_length):
            input_chunk = self.token_ids[i:i + self.max_length]
            target_chunk = self.token_ids[i + 1:i + self.max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    
def create_dataloader(txt: str, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    """
    Create a DataLoader for the given text.
    Args:
        txt (str): The input text to tokenize and split into sequences.
        batch_size (int): Number of samples per batch.
        max_length (int): The context length, i.e., the number of tokens in each input sequence.
        stride (int): Intended step size between the starts of consecutive sequences; passed on to MyDataset.
        shuffle (bool): Whether to shuffle the data at every epoch.
        drop_last (bool): Whether to drop the last incomplete batch.
        num_workers (int): Number of subprocesses to use for data loading.
    """

    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = MyDataset(txt, tokenizer, max_length=max_length, stride=stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader
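
Note that `MyDataset` stores `stride` but its loop moves the window one token at a time. As a point of comparison, here is a minimal plain-Python sketch of the same windowing with the stride actually applied. The function name `sliding_windows` is hypothetical and the token IDs are stand-ins, not real tokenizer output.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that the target window still fits inside token_ids.
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start:start + max_length]
        target_chunk = token_ids[start + 1:start + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


token_ids = list(range(10))  # stand-in token IDs 0..9
pairs = sliding_windows(token_ids, max_length=4, stride=2)
for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With `stride=1` this yields the same windows as the loop in `MyDataset`; a larger stride yields fewer, less overlapping samples.
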

Let's test our dataloader now:

In [31]:
text = get_sample_text(300)
print("sample_text: ", text)
print("======")
dataloader = create_dataloader(txt=text, batch_size=2, max_length=8, stride=2, drop_last=False)
for batch in dataloader:
    input_ids, target_ids = batch
    print("Input IDs first batch: \n", input_ids)
    print("Target IDs first batch: \n", target_ids)
    break  # Just show the first batch
print("Total batches:", len(dataloader))
print("Batch size:", dataloader.batch_size)
print("Number of workers:", dataloader.num_workers)

sample_text:  The Project Gutenberg eBook of Dracula
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included 
Input IDs first batch: 
 tensor([[41142,   198,   220,   220,   220,   220,   198,  1212],
        [  220,   220,   220,   198,  1212, 47179,   318,   329]])
Target IDs first batch: 
 tensor([[  198,   220,   220,   220,   220,   198,  1212, 47179],
        [  220,   220,   198,  1212, 47179,   318,   329,   262]])
Total batches: 31
Batch size: 2
Number of workers: 0
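
The batch count can be cross-checked with a bit of arithmetic: the number of windows follows from the token count, the context length and the step size, and the number of batches follows from the batch size and `drop_last`. The sketch below assumes a sliding-window dataset like ours; `count_batches` is a hypothetical helper, and the 70-token text in the usage lines is an assumed example, not a measured value.

```python
import math


def count_batches(num_tokens, max_length, step, batch_size, drop_last):
    """Number of batches for a sliding-window dataset with the given step size."""
    # One sample per window start; the target needs one extra token at the end.
    num_samples = len(range(0, num_tokens - max_length, step))
    if drop_last:
        return num_samples // batch_size          # the incomplete last batch is discarded
    return math.ceil(num_samples / batch_size)    # the incomplete last batch is kept


# Hypothetical 70-token text, context length 8, batch size 2.
print(count_batches(70, 8, step=1, batch_size=2, drop_last=False))
print(count_batches(70, 8, step=2, batch_size=2, drop_last=False))
```

A step of 1 gives roughly twice as many batches as a step of 2, which is one way to check which step size a dataset really uses.
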


# From token IDs to Embeddings

We now need to translate our token IDs into multi-dimensional vectors that can serve as input to our neural network. The quantities involved are:
- `nr_batches`: the number of batches
- `batch_size`: the number of samples in each batch, i.e., how many samples are processed before each weight update
- `max_length`: the length of our context window, i.e., the number of tokens in each sample from which the next token is predicted
- `vocab_size`: the size of our vocabulary (50,257 tokens for the tiktoken "gpt2" tokenizer)
- `embedding_dim`: the length of the embedding vector that represents a single token (12,288 for GPT-3, for example)
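
To make these sizes concrete, here is a small back-of-the-envelope sketch. An embedding table has one row per vocabulary entry and one column per embedding dimension, so its parameter count is simply the product of the two. The helper name `embedding_parameter_count` is made up, and the 4 bytes per parameter assumes 32-bit floats.

```python
def embedding_parameter_count(vocab_size, embedding_dim):
    """An embedding table holds one vector of length embedding_dim per vocabulary entry."""
    return vocab_size * embedding_dim


# Sizes quoted in the list above: GPT-2 vocabulary, GPT-3 embedding width.
params = embedding_parameter_count(50_257, 12_288)
print(f"{params:,} parameters")

# Assumption: float32 storage, i.e. 4 bytes per parameter.
print(f"{params * 4 / 1e9:.2f} GB")
```
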

Let's work through a simple example first, assuming a single sample with four input tokens:


In [32]:
torch.manual_seed(42)                                           # For reproducibility
input_ids = torch.tensor([3, 5, 1, 4])                          # Example input tensor, four tokens

vocab_size = 6                                                  # Size of the vocabulary, here 6 tokens
embedding_dim = 8                                               # Size of the embedding vector for each token, here 8 floats
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)       # Create the embedding layer

print(embedding_layer.weight)
print("embedding_layer shape:", embedding_layer.weight.shape)   # Shape of the embedding matrix

Parameter containing:
tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806],
        [ 1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746],
        [-1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879]],
       requires_grad=True)
embedding_layer shape: torch.Size([6, 8])


This is our embedding layer: for each of the 6 tokens in the vocabulary, it holds a vector of 8 floats representing that token.
`nn.Embedding` is a lookup table that stores embeddings for a fixed-size vocabulary. When you pass it a tensor of token IDs, it returns the corresponding embedding vector for each token. NLP models commonly use it to convert token IDs into dense vector representations that a neural network can process.

Learn more in the [PyTorch documentation for nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).
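
The "lookup table" idea can be sketched in plain Python, without PyTorch: looking up token `i` just returns row `i` of the weight matrix, which is the same result you would get by multiplying a one-hot vector with that matrix. The numbers and helper names (`embed`, `one_hot`, `row_times_matrix`) below are made up for illustration and are not part of `nn.Embedding`.

```python
# A tiny hand-written embedding table: 3 tokens, 2 values per token (made-up numbers).
weight = [
    [0.5, -1.0],   # token 0
    [2.0, 0.5],    # token 1
    [-0.5, 1.5],   # token 2
]


def embed(token_ids, weight):
    """Look up one row of `weight` per token ID."""
    return [weight[token_id] for token_id in token_ids]


def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at index `token_id`."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def row_times_matrix(row, matrix):
    """Multiply a row vector by a matrix (row @ matrix)."""
    return [
        sum(row[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(len(matrix[0]))
    ]


token_ids = [2, 0, 2]
looked_up = embed(token_ids, weight)
via_one_hot = [row_times_matrix(one_hot(t, len(weight)), weight) for t in token_ids]

print(looked_up)
print(looked_up == via_one_hot)
```

The lookup avoids building the (mostly zero) one-hot vectors, which matters once the vocabulary has tens of thousands of entries.
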

In [36]:
# lookup the embeddings for the input tokens
embeddings = embedding_layer(input_ids)                        # Shape: (4, 8), 4 tokens, each with an 8-dimensional vector
print("input_ids:\n", input_ids)
print("embedding layer:\n", embedding_layer.weight)
print("embedding layer shape:\n", embedding_layer.weight.shape)
print("embeddings shape:\n", embeddings.shape)
print("embeddings:\n", embeddings)

input_ids:
 tensor([3, 5, 1, 4])
embedding layer:
 Parameter containing:
tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806],
        [ 1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746],
        [-1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879]],
       requires_grad=True)
embedding layer shape:
 torch.Size([6, 8])
embeddings shape:
 torch.Size([4, 8])
embeddings:
 tensor([[ 1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599],
        [-1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3

# Adding positional information

One downside of the self-attention mechanism that we'll work with later is that it has no notion of token position. As things stand, the neural net cannot distinguish between different orderings of the words/tokens in our input. We'll address this by creating another embedding layer, our positional embeddings, with dimension `[max_length, embedding_dim]`.

We give each position embedding the same length as our `embedding_dim`, so that each positional vector can be added to the corresponding token embedding vector.

So in terms of dimensions, we have:
- `[max_length, embedding_dim]` for the token embeddings of a single sample
- `[max_length, embedding_dim]` for the position embeddings

It's those two we'll add together to form our input to the LLM.

Let's start from our batch again, using somewhat more realistic sizes and dimensions:

In [41]:
context_length = 4          # Maximum length of the input sequences
vocab_size = 50257          # Size of the vocabulary for GPT-2
embedding_dim = 256         # Let's say we want 256-dimensional embeddings

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

dataloader = create_dataloader(txt=text, batch_size=2, max_length=context_length, stride=2, drop_last=True)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("First batch of inputs and targets:")
print("Inputs shape:", inputs.shape)            # Shape: (batch_size, max_length)
print("Targets shape:", targets.shape)          # Shape: (batch_size, max_length)

# Get the embeddings for the input tokens
embeddings = embedding_layer(inputs)            # Shape: (batch_size, max_length, embedding_dim)
print("Embeddings shape:", embeddings.shape)    # Shape: (batch_size, max_length, embedding_dim)

# Get the embedding layer for the positions
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # Shape: (context_length, embedding_dim)
print("Position embeddings shape:", pos_embeddings.shape)           # Shape: (context_length, embedding_dim)

# Combine input embeddings and position embeddings
combined_embeddings = embeddings + pos_embeddings                   # Shape: (batch_size, max_length, embedding_dim)
print("Combined embeddings shape:", combined_embeddings.shape)  # Shape: (batch_size, max_length, embedding_dim)

First batch of inputs and targets:
Inputs shape: torch.Size([2, 4])
Targets shape: torch.Size([2, 4])
Embeddings shape: torch.Size([2, 4, 256])
Position embeddings shape: torch.Size([4, 256])
Combined embeddings shape: torch.Size([2, 4, 256])


This means that in our first batch we have:
- 2 samples
- of 4 tokens each
- with each token represented by a 256-dimensional embedding vector
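
Note that the token embeddings have shape `(batch_size, max_length, embedding_dim)` while the position embeddings only have shape `(max_length, embedding_dim)`. The addition still works because PyTorch broadcasts the position embeddings across the batch dimension: every sample gets the same positional vectors added. Here is a plain-Python sketch of what that broadcast computes, using made-up numbers and a hypothetical helper `add_positions`.

```python
def add_positions(token_embeddings, position_embeddings):
    """Add position_embeddings[t] to token_embeddings[b][t] for every sample b."""
    return [
        [
            [tok + pos for tok, pos in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sample, position_embeddings)
        ]
        for sample in token_embeddings
    ]


# Made-up numbers: batch_size=2, max_length=2, embedding_dim=3.
token_embeddings = [
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],   # sample 0
    [[3.0, 3.0, 3.0], [4.0, 4.0, 4.0]],   # sample 1
]
position_embeddings = [
    [0.5, 0.5, 0.5],      # position 0
    [10.0, 20.0, 30.0],   # position 1
]

combined = add_positions(token_embeddings, position_embeddings)
for sample in combined:
    print(sample)
```

The result keeps the shape of the token embeddings, and the same token now gets a different vector depending on where it sits in the sequence.
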