## Tokenization is at the heart of LLMs

<small> Simple string processing can be difficult for LLMs because of tokenization </small>

<small> LLMs are sometimes bad at arithmetic because of tokenization </small>

<small> Earlier versions of GPT had trouble with Python because of tokenization </small>

<small> LLMs do worse on non-English languages because of tokenization </small>

<small> YAML is sometimes preferred over JSON with LLMs because of tokenization </small>

<small> tiktokenizer.vercel.app is good for visualizing tokenization </small>

<small> Even a single concept such as "egg" can map to very different tokens and IDs, for example depending on capitalization or a leading space. The model has to learn that these all refer to the same concept and represent them similarly in the network. </small>
<small> The GPT-4 tokenizer groups runs of whitespace into single tokens, which helps its Python coding ability because Python code relies heavily on indentation. </small>

In [3]:
ord("h")  # gets the Unicode code point (an integer) of a character

104

In [None]:
[ord(x) for x in "hello"]  # gets the Unicode code points of all characters in a string
'''
We can't just use code points as tokens: the vocabulary would be very large (Unicode defines roughly 150,000 characters)
and the standard keeps changing, so it is not a stable representation.
UTF-8 takes every code point and encodes it into 1-4 bytes. Each byte is an integer from 0-255
'''

[104, 101, 108, 108, 111]
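To make the "1-4 bytes" point concrete, here is a small sketch using only built-ins. The four sample characters are picked for illustration and are not from the notes above.

```python
# Illustrative characters (an assumption, chosen to cover 1, 2, 3 and 4 byte cases):
# "h", e with acute accent, the euro sign, and a grinning-face emoji.
samples = ["h", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")  # bytes object; each element is an int in 0-255
    utf8_lengths.append(len(encoded))
    print(f"{ch!r}: code point {ord(ch)}, {len(encoded)} byte(s) -> {list(encoded)}")
```

Each character is a single code point, but the number of bytes UTF-8 spends on it grows with the code point's value.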

In [None]:
list("hello".encode("utf-8"))  # encodes the string into bytes using UTF-8
'''
The other Unicode encodings, UTF-16 and UTF-32, are wasteful for mostly-ASCII text because they pad characters with zero bytes.
Using raw bytes directly would give a vocabulary of only 256 tokens and very long sequences,
so to support a larger vocabulary we use byte pair encoding (BPE).
BPE works by iteratively merging the most common pair of adjacent tokens (initially bytes) into a new token.
This way we can represent common words or subwords with a single token, while still being able
to represent rare words with multiple tokens.
Feeding raw bytes to the model instead would require changing the structure of the transformer;
there's a research paper on it: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
'''

[104, 101, 108, 108, 111]
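A quick way to see why UTF-16 and UTF-32 are wasteful for text like this is to encode the same string three ways. This is a sketch: the `-le` codec variants are used here only so that no byte-order mark is prepended, and the variable names are illustrative.

```python
sample = "hello"

utf8 = list(sample.encode("utf-8"))
utf16 = list(sample.encode("utf-16-le"))  # little-endian, no byte-order mark
utf32 = list(sample.encode("utf-32-le"))  # little-endian, no byte-order mark

print(len(utf8), utf8)
print(len(utf16), utf16)
print(len(utf32), utf32)
```

For ASCII characters the extra bytes in the wider encodings are all zeros, so they add length without adding information.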

In [12]:
text = "long sentence here: The quick brown fox jumps over the lazy dog."
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert bytes to integers
print('---')
print(text)
print("length: ", len(text))
print('---')
print(tokens)
print("length: ", len(tokens))

---
long sentence here: The quick brown fox jumps over the lazy dog.
length:  64
---
[108, 111, 110, 103, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32, 104, 101, 114, 101, 58, 32, 84, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 102, 111, 120, 32, 106, 117, 109, 112, 115, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 100, 111, 103, 46]
length:  64


In [None]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]): #Pythonic way to iterate over adjacent elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print(stats)
print(sorted(((value, key) for key, value in stats.items()), reverse=True))  # all pairs as (count, pair), most frequent first

'''
So this gives us the frequency of each adjacent byte pair in the token list
It will be a dictionary where the keys are tuples of byte pairs and the values are their counts
We can use this to find the most common byte pairs and merge them iteratively to form our BPE vocabulary
'''


{(108, 111): 1, (111, 110): 1, (110, 103): 1, (103, 32): 1, (32, 115): 1, (115, 101): 1, (101, 110): 2, (110, 116): 1, (116, 101): 1, (110, 99): 1, (99, 101): 1, (101, 32): 3, (32, 104): 1, (104, 101): 3, (101, 114): 2, (114, 101): 1, (101, 58): 1, (58, 32): 1, (32, 84): 1, (84, 104): 1, (32, 113): 1, (113, 117): 1, (117, 105): 1, (105, 99): 1, (99, 107): 1, (107, 32): 1, (32, 98): 1, (98, 114): 1, (114, 111): 1, (111, 119): 1, (119, 110): 1, (110, 32): 1, (32, 102): 1, (102, 111): 1, (111, 120): 1, (120, 32): 1, (32, 106): 1, (106, 117): 1, (117, 109): 1, (109, 112): 1, (112, 115): 1, (115, 32): 1, (32, 111): 1, (111, 118): 1, (118, 101): 1, (114, 32): 1, (32, 116): 1, (116, 104): 1, (32, 108): 1, (108, 97): 1, (97, 122): 1, (122, 121): 1, (121, 32): 1, (32, 100): 1, (100, 111): 1, (111, 103): 1, (103, 46): 1}
[((122, 121), 1), ((121, 32), 1), ((120, 32), 1), ((119, 110), 1), ((118, 101), 1), ((117, 109), 1), ((117, 105), 1), ((116, 104), 1), ((116, 101), 1), ((115, 101), 1), ((115, 3

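The iterative merging described above could look roughly like the following. This is a minimal sketch, not a full tokenizer: `merge`, `num_merges`, and the choice to number new tokens from 256 upward are assumptions made for illustration, and `get_stats` is repeated so the cell runs on its own.

```python
def get_stats(ids):
    """Count how often each adjacent pair occurs in the list of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Return a new list where every occurrence of `pair` is replaced by `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip both elements of the merged pair
        else:
            merged.append(ids[i])
            i += 1
    return merged

text = "long sentence here: The quick brown fox jumps over the lazy dog."
ids = list(text.encode("utf-8"))

num_merges = 3  # illustrative; a real vocabulary would use far more merges
merges = {}     # (id, id) -> new token id
for step in range(num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # most frequent pair (ties: first seen)
    new_id = 256 + step                   # ids 0-255 are reserved for raw bytes
    ids = merge(ids, top_pair, new_id)
    merges[top_pair] = new_id

print(merges)
print("length after merges:", len(ids))
```

Each merge shortens the sequence and adds one entry to the vocabulary, which is the trade-off BPE makes between sequence length and vocabulary size.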