# Understanding Tokens and Byte Pair Encoding (BPE) with `tiktoken`

This notebook explains the concepts of tokens and Byte Pair Encoding (BPE), which are fundamental to how large language models process text. We'll use the `tiktoken` library, developed by OpenAI, to demonstrate these concepts.

## What are Tokens?

In the context of language models, a token is a sequence of characters that the model treats as a single unit: a word, part of a word, or a punctuation mark. Models don't read raw text directly; a tokenizer first splits the text into tokens and maps each one to an integer ID.

## Why Tokenization?

Tokenization is crucial for several reasons:

* **Efficiency:** A token sequence is much shorter than the equivalent character sequence, so the model processes the same text in fewer steps.
* **Vocabulary:** Models work with a fixed vocabulary of tokens. Tokenization maps the input text to this vocabulary.
* **Handling Out-of-Vocabulary Words:** BPE, which we'll discuss next, helps handle words that are not explicitly in the model's vocabulary by breaking them down into sub-word units.
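
As a rough illustration of the last two points, here is a minimal sketch of mapping words onto a fixed vocabulary with sub-word fallback. The vocabulary `TOY_VOCAB` and the helper `greedy_tokenize` are hypothetical names made up for this example, and the splitting rule is plain greedy longest-match lookup, used as a simple stand-in rather than BPE itself.

```python
# Hypothetical toy vocabulary: a few sub-word pieces plus every lowercase letter.
TOY_VOCAB: dict[str, int] = {
    piece: idx
    for idx, piece in enumerate(
        ["token", "ization", "un", "able", "iz"] + list("abcdefghijklmnopqrstuvwxyz")
    )
}


def greedy_tokenize(word: str, vocab: dict[str, int]) -> list[str]:
    """Split a word into the longest vocabulary pieces, scanning left to right.

    This is greedy longest-match lookup, a stand-in for illustration; it is not BPE.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"No vocabulary entry covers {word[i]!r}")
    return pieces


for word in ["tokenization", "untokenizable"]:
    pieces = greedy_tokenize(word, TOY_VOCAB)
    ids = [TOY_VOCAB[p] for p in pieces]
    print(f"{word}: {pieces} -> {ids}")
```

Neither word is in the toy vocabulary as a whole, yet both can still be represented because their pieces are.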

## What is Byte Pair Encoding (BPE)?

Byte Pair Encoding (BPE) is a tokenization algorithm used by many language models, including those from OpenAI. During training, it iteratively merges the most frequent pair of adjacent symbols (bytes or characters) in a corpus until the vocabulary reaches a target size.

Here's a simplified idea of how it works:

1. Start with a vocabulary of individual characters.
2. Count the frequency of adjacent pairs of symbols in the training text.
3. Replace every occurrence of the most frequent pair with a new, unique token.
4. Repeat steps 2 and 3 until the vocabulary reaches the desired size or no pairs remain to merge.

This process creates a vocabulary of tokens that includes individual characters, common words, and sub-word units.
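
The "byte" in the name matters for byte-level variants of BPE, which start from raw UTF-8 bytes instead of characters. The sketch below uses only the Python standard library and an arbitrary sample string; it shows why a base vocabulary of 256 byte values is enough to represent any input.

```python
text = "naïve café"

# UTF-8 turns every character into one to four bytes, each in the range 0..255.
raw_bytes = list(text.encode("utf-8"))

print(f"Characters: {len(text)}")
print(f"Bytes:      {len(raw_bytes)}")
print(raw_bytes)

# A byte-level base vocabulary needs only 256 entries to cover any input text.
base_vocab = {bytes([b]): b for b in range(256)}
print(all(bytes([b]) in base_vocab for b in raw_bytes))
```

The accented letters take two bytes each, so the byte sequence is longer than the character sequence; merges learned during training are what shorten it again.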

## Simplified BPE from Scratch

To better understand BPE, let's implement a simplified version from scratch. This implementation will demonstrate the core idea of iteratively merging the most frequent pairs of characters.

In [None]:
from collections import Counter


def get_pair_counts(corpus: list[list[str]]) -> Counter:
    """Count all adjacent character pairs across the corpus."""
    pairs = [(word[i], word[i + 1]) for word in corpus for i in range(len(word) - 1)]
    return Counter(pairs)


def merge_pair(corpus: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Merge all occurrences of a pair into a single token."""
    a, b = pair
    merged_token = a + b

    new_corpus = []
    for word in corpus:
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                new_word.append(merged_token)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus.append(new_word)

    return new_corpus


# --- Training loop ---

corpus: list[list[str]] = [
    ["l", "o", "w", "</w>"],
    ["l", "o", "w", "e", "r", "</w>"],
    ["l", "o", "w", "e", "s", "t", "</w>"],
    ["s", "l", "o", "w", "</w>"],
    ["s", "l", "o", "w", "e", "r", "</w>"],
]

vocab: set[str] = set(ch for word in corpus for ch in word)

num_merges: int = 3
for step in range(num_merges):
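    pair_counts = get_pair_counts(corpus)
    if not pair_counts:
        break  # every word is already a single token; nothing left to merge
    # On ties, most_common returns the pair that was encountered first.
    best_pair = pair_counts.most_common(1)[0][0]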
    corpus = merge_pair(corpus, best_pair)
    vocab.add(best_pair[0] + best_pair[1])

    print(f"Step {step + 1}: merged {best_pair} -> {best_pair[0] + best_pair[1]}")
    print(f"Current corpus: {corpus}")
    print(f"Current vocab: {vocab}")
    print()
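
Training is only half of the picture: a trained tokenizer also has to split text it has never seen. One simple way to do that, sketched below, is to replay the learned merges in the order they were learned. The helper `apply_merges` is a hypothetical name, and the merge list is written out by hand for illustration so the cell runs on its own.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying merges in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Assumed merge list, hard-coded for illustration.
merges = [("l", "o"), ("lo", "w"), ("low", "e")]

print(apply_merges("lowly", merges))
print(apply_merges("slowest", merges))
```

Neither "lowly" nor "slowest" appears in the toy corpus, but both still break down into known pieces, with single characters as the fallback.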

Now, let's turn to the `tiktoken` library, which implements BPE for the encodings used by OpenAI models.

## Using `tiktoken`

`tiktoken` is a fast, open-source BPE tokenizer from OpenAI. It encodes text into token IDs, decodes token IDs back into text, and counts tokens for the encodings that OpenAI models use.

First, let's install `tiktoken`:

In [None]:
%pip install tiktoken

Now, let's import `tiktoken`.

Different models use different encodings. You can look up the encoding for a given model with `encoding_for_model`, or load an encoding directly by name with `get_encoding`:

In [None]:
import tiktoken

# Get the encoding for a specific model (e.g., 'gpt-4')
encoding = tiktoken.encoding_for_model("gpt-4")

# Alternatively, get an encoding by its name
# encoding = tiktoken.get_encoding("cl100k_base")

print(f"Encoding name: {encoding.name}")
print(f"Vocabulary size: {encoding.n_vocab}")

Let's see how to encode text into tokens:

In [None]:
text = "Hello, world! This is a test sentence."

# Encode the text
tokens = encoding.encode(text)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

You can also decode tokens back into text:

In [None]:
# Decode the tokens
decoded_text = encoding.decode(tokens)

print(f"Tokens: {tokens}")
print(f"Decoded text: {decoded_text}")

`tiktoken` is particularly useful for counting tokens, which is important for estimating costs when using language models or managing context window limits.

In [None]:
text_to_count = "This is a longer piece of text to demonstrate token counting."
num_tokens = len(encoding.encode(text_to_count))
print(f"Text: '{text_to_count}'")
print(f"Number of tokens: {num_tokens}")
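
Once you have a token count, the cost and context-window checks are simple arithmetic. The sketch below takes the count as a plain integer so it runs without `tiktoken`; the helper names, the price, and the context-window size are hypothetical values chosen for illustration, not real pricing or limits.

```python
def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimate the cost of processing num_tokens at a given per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens


def fits_context(num_prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Check that the prompt plus the reserved output budget fits in the window."""
    return num_prompt_tokens + max_output_tokens <= context_window


# Hypothetical numbers for illustration only; check your provider's pricing and limits.
PRICE_PER_1K = 0.01
CONTEXT_WINDOW = 8192

prompt_tokens = 6000  # in practice: len(encoding.encode(prompt))
print(f"Estimated cost: ${estimate_cost(prompt_tokens, PRICE_PER_1K):.4f}")
print(f"Fits with 1,000 output tokens: {fits_context(prompt_tokens, 1000, CONTEXT_WINDOW)}")
print(f"Fits with 3,000 output tokens: {fits_context(prompt_tokens, 3000, CONTEXT_WINDOW)}")
```

Reserving room for the output matters because the prompt and the generated tokens typically share the same context window.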

## How BPE Affects Tokenization

Let's look at some examples to see how BPE can break down words into sub-word units. Notice how common words or parts of words might be single tokens, while less common words or combinations might be split.

In [None]:
words = ["tokenization", "untokenizable", "jupyter", "notebook", "extraordinarily"]

for word in words:
    tokens = encoding.encode(word)
    print(f"Word: '{word}'")
    print(f"Tokens: {tokens}")
    print(f"Decoded tokens: {[encoding.decode([token]) for token in tokens]}")
    print("-" * 20)

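Compare the decoded pieces for each word. A word that comes back as a single piece has its own entry in the vocabulary, while the others are split into several sub-word units. The exact splits depend on the encoding and on context: a word preceded by a space often tokenizes differently from the same word at the start of a string. This is how BPE represents word variations and less common terms without needing a dedicated vocabulary entry for each.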

### Handling Special Tokens

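Special tokens are reserved strings, such as `<|endoftext|>`, that carry a specific meaning for the model. By default, `encode` raises a `ValueError` when the input contains special-token text; passing `allowed_special="all"` tells `tiktoken` to encode each such string as its single special token.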

In [None]:
text_with_special_tokens = (
    "<|endoftext|> This is a test with a special token. <|fim_middle|>"
)

# Encode the text
tokens_with_special = encoding.encode(text_with_special_tokens, allowed_special="all")

print(f"Original text with special tokens: {text_with_special_tokens}")
print(f"Tokens: {tokens_with_special}")
print(f"Number of tokens: {len(tokens_with_special)}")

# Decode the tokens
decoded_text_with_special = encoding.decode(tokens_with_special)
print(f"Decoded text with special tokens: {decoded_text_with_special}")
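
Conceptually, a tokenizer can support special tokens by cutting the text at each special-token string first and applying ordinary BPE only to the segments in between. The sketch below shows that splitting step with the standard `re` module. It is an illustration of the idea, not `tiktoken`'s actual implementation; `split_on_special` and the token IDs are hypothetical.

```python
import re

# Hypothetical special tokens and ids, chosen only for this sketch.
SPECIAL_TOKENS: dict[str, int] = {"<|endoftext|>": 1000, "<|fim_middle|>": 1001}


def split_on_special(text: str, special_tokens: dict[str, int]) -> list[tuple[str, bool]]:
    """Split text into (segment, is_special) pieces, keeping special tokens whole."""
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    segments = []
    for part in re.split(pattern, text):
        if part:  # re.split yields empty strings at the edges; skip them
            segments.append((part, part in special_tokens))
    return segments


sample = "<|endoftext|> This is a test. <|fim_middle|>"
for segment, is_special in split_on_special(sample, SPECIAL_TOKENS):
    label = "special" if is_special else "ordinary"
    print(f"{label:>8}: {segment!r}")
```

Each special segment would map directly to its reserved ID, while each ordinary segment would go through the regular BPE merges.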

### Handling Typos and Nonexistent Words

Let's see how `tiktoken` tokenizes words that are not in its vocabulary, such as typos or made-up words.

In [None]:
typo_text = "This is a typoed sentense with a nonexistentword."

# Encode the text
typo_tokens = encoding.encode(typo_text)

print(f"Original text with typo: {typo_text}")
print(f"Tokens: {typo_tokens}")
print(f"Decoded tokens: {[encoding.decode([token]) for token in typo_tokens]}")
print(f"Number of tokens: {len(typo_tokens)}")

## Conclusion

Understanding tokenization and BPE is fundamental to working with large language models. `tiktoken` provides a convenient way to work with tokenization for OpenAI models, allowing you to encode/decode text and count tokens efficiently. This knowledge is essential for managing model inputs, understanding model outputs, and estimating the computational resources required for processing text.