# GPT Tokenizer from Scratch

In this notebook, I build a tokenizer from scratch, following Andrej Karpathy's tutorial on the topic.

In [None]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."

tokens = text.encode('utf-8')
tokens = list(map(int, tokens))
len(text), len(tokens)

(533, 616)

UTF-8 is a variable-length encoding: every ASCII character takes a single byte, while other characters, such as emojis, take two to four bytes. Remember Huffman coding? The idea of giving some symbols shorter codes than others is loosely similar, although UTF-8 assigns lengths by code point range rather than by frequency. `len(text)` counts Unicode code points, whereas `len(tokens)` counts the bytes that encode them, so the byte count (616) is larger than the code point count (533).
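
To make the variable-length behaviour concrete, the sketch below encodes a few individual characters and counts the bytes each one needs. The sample characters are arbitrary picks for illustration and do not come from the text above.

```python
# A few sample characters, written as escapes so the exact code points are unambiguous
samples = [
    "a",           # U+0061, ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F604",  # U+1F604, smiling face emoji
]

# Each entry is one code point, but the UTF-8 byte count differs
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"{ch!r}: 1 code point -> {n} byte(s)")
```

The four characters take 1, 2, 3, and 4 bytes respectively, which is why `len(tokens)` can exceed `len(text)`.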

## Byte Pair Encoding

BPE is similar in spirit to Huffman coding. You iterate over the sequence, find the pair of consecutive bytes that occurs most frequently, and replace every occurrence of that pair with a new token. Here, a byte pair means two adjacent bytes in the sequence.

The function that counts the pairs is usually called `get_stats`, and that is what we call it here too.

In [None]:
from collections import Counter

def get_stats(ids):
    counts = Counter(zip(ids, ids[1:]))
    return counts

counts = get_stats(tokens)
top_pair = counts.most_common(1)[0][0]
top_pair

(101, 32)

The way to read this is that the most common pair of bytes in the sequence is 101 followed by 32, and it occurs 20 times. We can look up the corresponding characters with Python's `chr` function: they are `e` followed by a space.
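
As a small sketch of that lookup: `chr` works here because both ids are below 128, where the byte value and the code point coincide. Decoding the bytes directly, as below, also works for pairs that belong to a multi-byte character. The `sample` string is a made-up toy input, not the notebook's `text`; in the notebook itself the count is simply `counts[top_pair]`.

```python
from collections import Counter

pair = (101, 32)

# Turn the two byte values back into text
as_text = bytes(pair).decode("utf-8")
print(repr(as_text))  # 'e '

# Toy input (hypothetical) to show how the count is read off the Counter
sample = list("the eye we see ".encode("utf-8"))
sample_counts = Counter(zip(sample, sample[1:]))
print(sample_counts[pair])  # 4
```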

Now we can write a merge function that replaces every occurrence of the pair `(101, 32)` with a new token. Notice that even though some characters span multiple bytes, every integer in our token list is a single byte, so none of the numbers is greater than 255. Characters such as emojis show up as several numbers in a row, but each of those numbers is still within the range `[0, 255]`. The first free id for a merged token is therefore 256, and the token 256 will represent the pair `(101, 32)`.

In [None]:
def merge(ids, pair, new_idx):
    new_ids = [ ]
    i = 0

    while i < len(ids):
        # replace all instances of the pair with the new_idx
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(new_idx)
            i += 2
        # append all other tokens as is
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

In [None]:
tokens_after_one_merge = merge(tokens, top_pair, new_idx=256)
len(tokens_after_one_merge)

596

As we can see, the sequence is now 20 tokens shorter (616 down to 596), one token for each occurrence of the pair we merged.
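
The behaviour of `merge` is easier to see on a tiny list. The sketch below restates the same logic so that the snippet runs on its own; the toy lists and the ids 99 and 256 are made up for illustration.

```python
def merge(ids, pair, new_idx):
    """Same logic as the notebook's merge, restated so this snippet is standalone."""
    new_ids = []
    i = 0
    while i < len(ids):
        # Replace the pair when it starts at position i
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(new_idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

toy = [5, 6, 6, 7, 9, 1]
merged = merge(toy, (6, 7), 99)
print(merged)  # [5, 6, 99, 9, 1]

# Overlapping candidates are consumed left to right
print(merge([1, 1, 1], (1, 1), 256))  # [256, 1]
```

The second call shows that a run such as `1, 1, 1` produces only one merged token, because the scan skips past both elements of a pair once it has been replaced.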

This was one merge. If we repeat this iteratively, each round adds one new token to the vocabulary and shortens the sequence further, and that's that!

How many times should you do the merge operation? That's a hyperparameter, constrained by hardware among other things. The larger the vocabulary, the greater the storage and compute requirements. The smaller the vocabulary, the more tokens the same text is split into, so the sequences get longer.

GPT-4 uses around 100k tokens in the vocabulary.

In [None]:
desired_vocab_size = 276
num_of_merges = desired_vocab_size - 256 # because we already have 256 tokens in our vocab
ids = list(tokens) # make a copy of the original tokens list

merges = { } #  (int, int) -> int
for i in range(num_of_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    new_idx = 256 + i

    print(f"Merging {pair} into a new token {new_idx}")
    ids = merge(ids, pair, new_idx=new_idx)
    merges[pair] = new_idx

Notice how merged tokens can be merged even further.

If you think about it, the merges form a binary forest. Each merge takes two tokens as children and creates a new parent token, and you build this from the leaves up. Unlike Huffman coding, which ends with a single tree, the tokens here are not all merged into one tree, so you end up with several binary trees.
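
One way to see the forest is to walk a merged token back down to its leaf bytes. The sketch below assumes a small hypothetical table, `example_merges`, in the same `(int, int) -> int` format as `merges`; the helper names `children` and `expand` are made up for illustration.

```python
# Hypothetical merges table in the same format as `merges`
example_merges = {(101, 32): 256, (116, 104): 257, (257, 256): 258}

# Invert it: parent token -> its two children
children = {parent: pair for pair, parent in example_merges.items()}

def expand(token):
    """Walk down the tree rooted at `token` and return its leaf bytes."""
    if token < 256:  # leaves are raw byte values
        return [token]
    left, right = children[token]
    return expand(left) + expand(right)

leaves = expand(258)
print(leaves)                               # [116, 104, 101, 32]
print(repr(bytes(leaves).decode("utf-8")))  # 'the '
```

Here token 258 is the root of one tree whose children, 257 and 256, are themselves merged tokens. Any byte that never takes part in a merge stays a single-node tree of its own.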



**Compression Ratio:**

Initially, the vocabulary contains only the 256 possible byte values. Tokenizing with it gives a sequence at least as long as the string (longer when some characters need multiple bytes). After merging, tokenizing the same text gives a sequence that is shorter than the original byte sequence.

This reduction is measured by the compression ratio:
$$\dfrac{\mathrm{len}(\text{tokens})}{\mathrm{len}(\text{new tokens})}$$


The more merges you perform, the higher this compression ratio becomes.
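
As a worked example, the two lengths reported earlier in this notebook (616 bytes before merging, 596 tokens after the single merge) give the ratio below. After the full training loop, the same calculation would be `len(tokens) / len(ids)`.

```python
# Lengths taken from the outputs earlier in this notebook
len_before = 616  # UTF-8 bytes in `text`
len_after = 596   # tokens after merging (101, 32) once

ratio = len_before / len_after
print(f"compression ratio: {ratio:.3f}X")  # compression ratio: 1.034X
```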

The tokenizer is a **completely** separate stage from the large language model. Typically it is a preprocessing stage with its own training data and its own training step. Once it is trained on some corpus, you can use the tokenizer to encode text into tokens for the LLM and to decode the LLM's tokens back into text.

Typically, you run the tokenizer on all the raw text data that you have gathered to train your LLM on. Once you have the tokens, you can store them on disk, discard the raw text, and work with tokens from then on.

Other considerations when training the tokenizer are which languages, encodings, and so on you want your language model to support.

## Encoding and Decoding

Now that we have a training algorithm that gives us the merges, we want to encode text into tokens and decode tokens back into text.