### **"Much glory awaits someone who can delete the need for tokenization" -- (Andrej Karpathy)**

# 1. Strings in Python

According to Python's documentation, "strings are immutable *sequences* of *Unicode code points*". The function `ord()` returns the Unicode code point of a single character, and `chr()` returns the character for a given code point. Unicode text is stored and processed as binary data *using one of several encodings*: `UTF-8`, `UTF-16`, `UTF-32`, among others. Of these, `UTF-8` is the most widely used, in part due to its backwards compatibility with ASCII. The method `encode()` converts a string into binary data (`bytes`), and `decode()` converts binary data back into a string.
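
As a quick illustration of these four functions, the following sketch round-trips a single character through its code point and its `UTF-8` bytes. The choice of the character `€` and the variable names are only for illustration.

```python
ch = '€'

# Character -> code point -> character
code_point = ord(ch)            # 8364, i.e. U+20AC
assert chr(code_point) == ch

# String -> bytes -> string
data = ch.encode('utf-8')       # b'\xe2\x82\xac' (three bytes)
assert data.decode('utf-8') == ch

print(code_point, hex(code_point), data, list(data))
```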

`UTF-8` means *Unicode Transformation Format - 8 bit* and supports all valid Unicode code points using a *variable-width encoding* of one to four one-byte code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. In the following table, the characters `u` to `z` are replaced by the bits of the code point, from the positions U+uvwxyz:

<div align="center">
  <img src="../assets/utf8-encoding.jpg" width="700"/>
</div>

Examples:
- U+0041 (‘A’) → 01000001 → 01000001 (same as ASCII)
- U+00A9 (‘©’) → 10101001 → 11000010 10101001
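
The bit layout can be checked by hand. Below is a minimal sketch of a `UTF-8` encoder for a single code point, compared against Python's built-in `str.encode()`. The function name `utf8_encode_code_point` is hypothetical and exists only for this illustration; in practice `str.encode('utf-8')` should be used.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f'not an encodable code point: {cp!r}')
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups."""
    return ' '.join(f'{b:08b}' for b in data)


for ch in ['A', '©', '안', '\U0001F600']:
    encoded = utf8_encode_code_point(ord(ch))
    assert encoded == ch.encode('utf-8')   # agrees with the built-in encoder
    print(f'U+{ord(ch):04X} -> {as_bits(encoded)}')
```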

Because `UTF-8` text is a stream of bytes, tokenizing at the byte level gives a vocabulary of at most 256 possible tokens. This means a tiny embedding table, but at the cost of very long token sequences. That is a hindrance in transformer-based neural networks, where the context length is limited and each token needs to attend to all other tokens in the sequence.

In [28]:
unicode_enc = [ord(x) for x in '안녕하세요']
unicode_enc

[50504, 45397, 54616, 49464, 50836]

In [29]:
utf8_enc = '안녕하세요'.encode('utf-8')
utf8_enc, list(utf8_enc)

(b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x84\xb8\xec\x9a\x94',
 [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148])

In [30]:
print('Unicode length: ', len(unicode_enc))
print('UTF-8 length: ', len(utf8_enc))

Unicode length:  5
UTF-8 length:  15
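
For comparison, here is a small sketch measuring the same string under the three encodings mentioned above. It assumes the little-endian codec variants (`utf-16-le`, `utf-32-le`) so that no byte order mark is prepended; the `sizes` dictionary is just an illustrative name.

```python
text = '안녕하세요'

# Byte length of the same five code points under each encoding.
sizes = {name: len(text.encode(name))
         for name in ('utf-8', 'utf-16-le', 'utf-32-le')}

for name, size in sizes.items():
    print(f'{name:>10}: {size} bytes')
# utf-8: 15 bytes, utf-16-le: 10 bytes, utf-32-le: 20 bytes
```

Whatever the encoding, a byte-level view of the text is several times longer than its five code points.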


# 2. Byte Pair Encoding (BPE)

This algorithm was first described in 1994 by Philip Gage as a data compression technique: it repeatedly replaces the most frequent pair of adjacent bytes with a new symbol and records each replacement in a translation table. Applied to tokenization, it builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).
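
To make the idea concrete before working on real text, here is a toy sketch of the whole loop on a short string. The function name `bpe_train`, the starting id of 256 for new tokens, and the example string are assumptions made for this illustration, not part of the original algorithm description.

```python
from collections import Counter


def bpe_train(ids, num_merges):
    """Toy BPE loop: repeatedly merge the most frequent adjacent pair."""
    ids = list(ids)
    merges = {}       # (left, right) -> new token id
    next_id = 256     # byte values occupy 0..255
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        # Ties are broken by first occurrence in the sequence.
        pair = max(pairs, key=pairs.get)
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges[pair] = next_id
        next_id += 1
    return ids, merges


toy_ids, toy_merges = bpe_train('aaabdaaabac'.encode('utf-8'), num_merges=3)
print(toy_ids)      # [258, 100, 258, 97, 99]
print(toy_merges)   # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Three merges shrink the 11-byte input to 5 tokens, at the cost of growing the vocabulary from 256 to 259 entries.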

In [37]:
with open('../data/unicode.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print('Number of characters in the text: ', len(text))

Number of characters in the text:  1414


In [43]:
tokens = list(map(int, text.encode('utf-8')))
print('Number of single tokens in the text: ', len(tokens))

Number of single tokens in the text:  2058


In [44]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print('Number of unique bigrams: ', len(stats))
print('Most common bigrams: ', sorted(stats.items(), key=lambda x: x[1], reverse=True)[:5])

Number of unique bigrams:  617
Most common bigrams:  [((101, 32), 24), ((204, 173), 18), ((205, 153), 18), ((204, 178), 18), ((115, 32), 17)]


In [45]:
# find the most common pair, which is the first candidate to merge
top_pair = max(stats, key=stats.get)
top_pair

(101, 32)
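
The merge step itself could look like the sketch below. The function name `merge` and the choice of 256 as the first new token id are assumptions for this illustration (256 is the first value not used by a raw byte).

```python
def merge(ids, pair, idx):
    """Return a new list where every occurrence of `pair` is replaced by `idx`."""
    new_ids = []
    i = 0
    while i < len(ids):
        # Replace the pair if it starts at position i, otherwise copy one token.
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids


print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]

# With the notebook's variables, the first merge would be:
# tokens2 = merge(tokens, top_pair, 256)
```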

# 10. training the tokenizer: adding the while loop, compression ratio
# 11. tokenizer/LLM diagram: it is a completely separate stage
# 12. decoding tokens to strings
# 13. encoding strings to tokens
# 14. regex patterns to force splits across categories
# 15. tiktoken library intro, differences between GPT-2/GPT-4 regex
# 16. GPT-2 encoder.py released by OpenAI walkthrough
# 17. special tokens, tiktoken handling of, GPT-2/GPT-4 differences
# 18. minbpe exercise time! write your own GPT-4 tokenizer
# 19. sentencepiece library intro, used to train Llama 2 vocabulary
# 20. how to set vocabulary size? revisiting gpt.py transformer
# 21. training new tokens, example of prompt compression
# 22. multimodal [image, video, audio] tokenization with vector quantization
# 23. revisiting and explaining the quirks of LLM tokenization
# 24. final recommendations

# Sources

1. [Ground truth - Let's build the GPT Tokenizer, by Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE&t=38s)
2. [A programmer's introduction to Unicode, by Nathan Reed](https://www.reedbeta.com/blog/programmers-intro-to-unicode)