## Tokenization is at the heart of much weirdness of LLMs. 

You can play around with different kinds of tokenizers on [this website](https://tiktokenizer.vercel.app/).
Check out how code and foreign languages are represented in different tokenizers, and how spaces in Python are handled.


**Strings** are sequences of characters, and each character is represented by a number (a code point) defined by a standard text encoding scheme like Unicode. To get the Unicode code point of a character we can use the `ord` function in Python.

### Character-Level Tokenization

In [1]:
print("l: " + str(ord("l")))
print("🫨: " + str(ord("🫨")))
print([ord(c) for c in "hi there 👋🏼"])

l: 108
🫨: 129768
[104, 105, 32, 116, 104, 101, 114, 101, 32, 128075, 127996]


Unicode defines about 150k such characters. Giving every one of them its own token would massively increase the vocabulary our language model has to handle, and the standard itself keeps changing. In practice we primarily work with the UTF-8 encoding of Unicode, which turns each code point into one to four bytes, because it is fairly concise for English and backwards compatible with ASCII. Using the resulting values directly as tokens would also count as a form of character-level tokenization.
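To make the difference between code points and UTF-8 bytes concrete, here is a small sketch. The sample string and the variable names `sample`, `code_points` and `utf8_bytes` are only illustrative assumptions, not part of the exercise.

```python
sample = "hi 👋"

# One integer per character (Unicode code point).
code_points = [ord(c) for c in sample]

# One integer per byte after UTF-8 encoding; every value is in the range 0..255.
utf8_bytes = list(sample.encode("utf-8"))

print(code_points)  # [104, 105, 32, 128075]
print(utf8_bytes)   # [104, 105, 32, 240, 159, 145, 139]
```

The ASCII characters keep the same value in both views, while the emoji is a single large code point that UTF-8 spreads over four bytes. A byte-level vocabulary therefore never needs more than 256 base symbols, at the cost of longer sequences.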

But before implementing BPE tokenization, we want to start with simple whitespace tokenization.

### Word-Level Tokenization

In [None]:
import re
from collections import Counter

text = """ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig. One of five new AI centers in Germany funded under the federal government’s AI strategy by the Federal Ministry of Education and Research (BMBF) and the Free State of Saxony. Established as a permanent research facility at both locations with strong connections to the local universities: TU Dresden and Leipzig University. Over 60 Principal Investigators, more than 180 employees and up to 12 new AI professorships in Dresden and Leipzig"""

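# Split the text by whitespace
whitespace_tokens = text.split()
print("Tokens after whitespace split:", whitespace_tokens)

def simple_tokenizer(text):
    # Convert to lowercase, replace punctuation with spaces, then split on whitespace
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

word_tokens = simple_tokenizer(text)
print("Tokens after preprocessing:", word_tokens)

# Calculate unigram frequencies with the Counter class
unigram_counts = Counter(word_tokens)
print(unigram_counts.most_common(10))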

### Subword-Level Tokenization - Byte Pair Encoding (BPE) Algorithm

Feeding raw UTF-8 bytes to the model would be really nice, but the downside is the resulting sequence length. Remember that our model has a limited context length (and memory), so we need to compress the raw text input into a shorter sequence of (ideally variable-length) tokens that still retains the same information as the original text. Splitting text on whitespace is very easy, but it does not work well with rare, out-of-vocabulary (OOV) words.

The classic idea of text compression (Huffman coding) is to spend more effort (longer codes, more memory) on symbols that are rare in the sequence, while characters, or rather sequences of characters, that repeat a lot are represented with shorter symbols and less memory.
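As a rough illustration of that idea, the following sketch computes Huffman code lengths for a short string. The function name `huffman_code_lengths` is hypothetical and the implementation is just one minimal way to build the tree using the standard library, not something the BPE exercise depends on.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from `text`."""
    freq = Counter(text)
    # Heap entries: (total weight, unique tie-breaker, {symbol: depth in the tree}).
    heap = [(weight, i, {sym: 0}) for i, (sym, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside moves one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaabdaaabac")
print(lengths)
```

For this string, `a` occurs 7 times and gets a 1-bit code, `b` occurs twice and gets 2 bits, and the rare `c` and `d` get 3 bits each.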

The BPE algorithm follows the same idea. For example, suppose we have the following string:
```
aaabdaaabac
```
The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now we have the following data and replacement table:
```
ZabdZabac
Z=aa
```
Then this process is repeated. We keep minting new tokens (symbols) to replace pairs of existing symbols that repeat frequently:
```
ZYdZYac
Y=ab
Z=aa
```
(Much like how we expand grammar in formal languages)

And finally
```
XdXac
X=ZY
Y=ab
Z=aa
```
**Vocabulary** refers to the set of unique symbols we use to represent our text, and the vocabulary size is the number of such symbols. For the decimal number system the vocabulary size is 10 (the digits 0-9), and for English the vocabulary of lowercase letters has size 26 (a-z). Note that by introducing `Z=aa` we reduced the length of our string, but the vocabulary size has increased.
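The trade-off between sequence length and vocabulary size can be replayed on the toy string. This is only a sketch: the merge rules are hard-coded from the worked example rather than discovered by counting pairs, plain `str.replace` stands in for a real merge function, and the names `toy_data`, `toy_merges`, `history`, `encoded` and `decoded` are illustrative.

```python
toy_data = "aaabdaaabac"
# Merge rules copied from the worked example, in the order they were introduced.
toy_merges = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]

base_vocab = set(toy_data)  # the four symbols a, b, c, d
history = [(toy_data, len(base_vocab))]

encoded = toy_data
for n_merges, (pair, symbol) in enumerate(toy_merges, start=1):
    encoded = encoded.replace(pair, symbol)
    # Every merge adds exactly one new symbol to the vocabulary.
    history.append((encoded, len(base_vocab) + n_merges))

for s, vocab in history:
    print(f"{s:<12} length={len(s):>2} vocab_size={vocab}")

# Decoding undoes the merges in reverse order.
decoded = encoded
for pair, symbol in reversed(toy_merges):
    decoded = decoded.replace(symbol, pair)
print(decoded == toy_data)  # True
```

The string shrinks from 11 to 5 symbols while the vocabulary grows from 4 to 7 entries, and no information is lost because the replacement table lets us recover the original text.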

In [None]:
text = """ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig. One of five new AI centers in Germany funded under the federal government’s AI strategy by the Federal Ministry of Education and Research (BMBF) and the Free State of Saxony. Established as a permanent research facility at both locations with strong connections to the local universities: TU Dresden and Leipzig University. Over 60 Principal Investigators, more than 180 employees and up to 12 new AI professorships in Dresden and Leipzig"""

raw_bytes = text.encode('utf-8')  # named raw_bytes to avoid shadowing the built-in `bytes`
tokens = list(raw_bytes)  # each byte becomes an integer in the range 0..255
vocab_size = len(set(tokens))
print("Vocabulary Size: " + str(vocab_size))
mapping = {token: chr(token) for token in tokens}
print("length of text:", len(text))
print("Tokens from UTF-8 Values:")
print(tokens)
print("length of tokens:", len(tokens))

We want to find out which pair of adjacent tokens occurs most often, so we write a function that counts all pairs:

In [None]:
def get_stats(ids):
    # Return a dictionary with each pair of adjacent token ids as key and its frequency as value.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print(stats)
print(sorted(((v, k) for k,v in stats.items()), reverse=True)) # sort by count

In [None]:
top_pair = max(stats, key=stats.get)
top_pair

Now we want to merge this pair, so we write a merge function:

In [None]:
def merge(ids, pair, idx):
  # Replace all consecutive occurrences of 'pair' in 'ids' with the new token id 'idx'.
  newids = []
  i = 0
  while i < len(ids):
    if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
      newids.append(idx)
      i += 2  # skip both elements of the merged pair
    else:
      newids.append(ids[i])
      i += 1
  return newids

print(merge([4,3,3,8,1,5], (3,8), 999))
print(merge(tokens, top_pair, 256)) # 256 is the first id not used by a raw byte

In [None]:
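vocab_size = 800 # the desired final size of the vocabulary
base_vocab_size = 256 # the current size of the vocabulary: one id per possible byte value
num_merges = vocab_size - base_vocab_size # the number of merges we need to make
ids = list(tokens) # copy of the token sequence, so we don't destroy the original list

merges = {} # maps each merged pair to its new token id
for i in range(num_merges):
  stats = get_stats(ids)
  if not stats: # fewer than two tokens left, nothing to merge
    break
  pair = max(stats, key=stats.get) # most frequent adjacent pair
  idx = base_vocab_size + i # mint a new token id
  ids = merge(ids, pair, idx)
  merges[pair] = idx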

In [None]:
print("tokens length: ", len(tokens))
print("ids length: ", len(ids))
print(tokens)
print(ids)
print(f"Compression ratio: {len(tokens) / len(ids):.2f}X")

In [None]:
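old_vocab_size = len(set(tokens)) # distinct byte values used by the raw text
new_vocab_size = len(set(ids)) # distinct token ids used after merging
print(f"Old Vocabulary size: {old_vocab_size}")
print(f"New Vocabulary size: {new_vocab_size}")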