## Tokenization is at the heart of much of the weirdness of LLMs

You can experiment with different kinds of tokenizers on [this website](https://tiktokenizer.vercel.app/).
Check how code and non-English languages are represented by different tokenizers, and how spaces in Python code are handled.


**Strings** are just sequences of numbers defined by a standard text encoding scheme such as Unicode. To get the Unicode number (the code point) of a character, we can use the `ord` function in Python.

### Character-Level Tokenization

In [1]:
print("l: " + str(ord("l")))
print("🫨: " + str(ord("🫨")))
print([ord(c) for c in "hi there 👋🏼"])

l: 108
🫨: 129768
[104, 105, 32, 116, 104, 101, 114, 101, 32, 128075, 127996]


Unicode defines roughly 150k such code points. Giving every one of them its own entry would massively increase the size of the input vocabulary of our language model, and the standard also keeps changing. In practice, text is mostly stored in the UTF-8 encoding of Unicode, because it is fairly compact for English text and backwards compatible with ASCII. Treating each code point (or each UTF-8 byte) as one token also counts as character-level tokenization.
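
To make the size argument concrete, the sketch below compares how many bytes the same strings occupy in UTF-8, UTF-16 and UTF-32, using only Python's built-in `str.encode`. The helper name `encoded_sizes` and the sample strings are chosen here for illustration.

```python
def encoded_sizes(s):
    """Return the number of bytes `s` occupies in three Unicode encodings."""
    # The "-le" variants are used so that no byte-order mark is prepended.
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for sample in ["hello", "🫨", "hi there 👋🏼"]:
    print(repr(sample), encoded_sizes(sample))

# Pure ASCII text is byte-for-byte identical in UTF-8 and ASCII.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
print("UTF-8 equals ASCII for 'hello':", ascii_compatible)
```

English text costs one byte per character in UTF-8, while a single emoji code point such as 🫨 needs four bytes.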

But before implementing BPE tokenization, we want to start with simple whitespace tokenization.

### Word-Level Tokenization

In [2]:
import re
from collections import Counter

text = """ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig. One of five new AI centers in Germany funded under the federal government’s AI strategy by the Federal Ministry of Education and Research (BMBF) and the Free State of Saxony. Established as a permanent research facility at both locations with strong connections to the local universities: TU Dresden and Leipzig University. Over 60 Principal Investigators, more than 180 employees and up to 12 new AI professorships in Dresden and Leipzig"""

# TODO: Split the text by whitespace
word_tokens = text.split()
print("Tokens with simple whitespace tokenization:", word_tokens)

def simple_tokenizer(text):
    # TODO: Convert into lowercase and remove punctuation
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Keep alphanumeric and whitespace only
    tokens = text.split()
    return tokens

word_tokens = simple_tokenizer(text)
print("Tokens after preprocessing:", word_tokens)

# TODO: Calculate unigram frequencies - hint: you can use the Counter Class
unigram_freq = Counter(word_tokens)
print("Top 10 unigrams and their frequencies:", unigram_freq.most_common(10))

Tokens with simple whitespace tokenization: ['ScaDS.AI', '(Center', 'for', 'Scalable', 'Data', 'Analytics', 'and', 'Artificial', 'Intelligence)', 'Dresden/Leipzig', 'is', 'a', 'center', 'for', 'Data', 'Science,', 'Artificial', 'Intelligence', 'and', 'Big', 'Data', 'with', 'locations', 'in', 'Dresden', 'and', 'Leipzig.', 'One', 'of', 'five', 'new', 'AI', 'centers', 'in', 'Germany', 'funded', 'under', 'the', 'federal', 'government’s', 'AI', 'strategy', 'by', 'the', 'Federal', 'Ministry', 'of', 'Education', 'and', 'Research', '(BMBF)', 'and', 'the', 'Free', 'State', 'of', 'Saxony.', 'Established', 'as', 'a', 'permanent', 'research', 'facility', 'at', 'both', 'locations', 'with', 'strong', 'connections', 'to', 'the', 'local', 'universities:', 'TU', 'Dresden', 'and', 'Leipzig', 'University.', 'Over', '60', 'Principal', 'Investigators,', 'more', 'than', '180', 'employees', 'and', 'up', 'to', '12', 'new', 'AI', 'professorships', 'in', 'Dresden', 'and', 'Leipzig']
Tokens after preprocessing: [
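
The preprocessing above is lossy: lowercasing and stripping punctuation turns `ScaDS.AI` into `scadsai`. As a hedged alternative, the sketch below keeps punctuation marks as separate tokens instead of deleting them. The function name `punct_tokenizer` is hypothetical and not part of the notebook.

```python
import re

def punct_tokenizer(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches one or more letters, digits or underscores
    # [^\w\s]  matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

punct_tokens = punct_tokenizer("ScaDS.AI (Center for Scalable Data Analytics)")
print(punct_tokens)
```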

### Subword-Level Tokenization: Byte Pair Encoding (BPE) Algorithm

Feeding raw UTF-8 bytes to the model would be really nice, but the downside is the long sequence length. Remember that our model has a limited amount of memory for its context, so we need to compress the raw text into a shorter sequence of tokens (ideally of variable length) that still retains the same information as the original text. Also, splitting text on whitespace is very easy, but it does not work well with rare or out-of-vocabulary (OOV) words.
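
The problem with words that were never seen when the vocabulary was built can be sketched in a few lines. This is a minimal illustration, assuming a word-level vocabulary with a reserved `<unk>` id; the names `build_vocab` and `encode_words` and the sample sentences are hypothetical.

```python
def build_vocab(words):
    """Assign an integer id to every distinct word; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode_words(words, vocab):
    """Map each word to its id, falling back to the <unk> id for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]

train_words = "the center is in dresden and the center is in leipzig".split()
vocab = build_vocab(train_words)

# "saxony" never appeared in train_words, so its identity is lost.
encoded_sentence = encode_words("the center is in saxony".split(), vocab)
print(encoded_sentence)
```

Every unseen word collapses to the same id, so the model cannot tell such words apart. Subword tokenization avoids this by falling back to smaller units.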

The classic idea of text compression (Huffman coding) is to spend more effort (more bits, more memory) on symbols that are rare in our sequence, while characters, or rather sequences of characters, that repeat a lot are represented with shorter symbols and less memory.
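
The Huffman idea can be checked on a small string. The following is a minimal sketch, not a full encoder: it only computes the code length in bits for each symbol, using the standard library's `heapq` and `Counter`. The function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} of a Huffman code built from `text`."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, {symbol: depth}).
    # The unique tie_breaker ensures the dicts are never compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    # Repeatedly join the two lightest subtrees; every symbol inside them moves one level deeper.
    # (With a single distinct symbol the loop never runs and the length stays 0.)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaabdaaabac")
print(lengths)
```

In this string `a` occurs 7 times and receives a 1-bit code, while the rare symbols `c` and `d` receive 3-bit codes.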

The BPE algorithm follows the same idea. For example, suppose we have the following string:
```
aaabdaaabac
```
The byte pair "aa" occurs most often, so it is replaced by a byte that is not used in the data, such as "Z". Now we have the following data and replacement table:
```
ZabdZabac
Z=aa
```
Then this process is repeated. We keep minting new tokens (symbols) to replace pairs of existing symbols that repeat frequently:
```
ZYdZYac
Y=ab
Z=aa
```
(Much like how we expand grammar in formal languages)

And finally
```
XdXac
X=ZY
Y=ab
Z=aa
```
**Vocabulary** refers to the set of unique symbols we use to represent our text, and the vocabulary size is the number of those symbols. For the decimal number system with base 10 the vocabulary size is 10 (0-9), and for lowercase English letters it is 26 (a-z). Note that by introducing `Z=aa` we reduced the length of our string, but the vocabulary size increased.
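
The walk-through above can be replayed with plain string replacement. This is a small sketch that applies the replacement table from the example in order, rather than searching for the most frequent pair. It relies on `Z`, `Y` and `X` not occurring in the original string; the helper names `compress` and `decompress` are hypothetical.

```python
merges = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]  # replacement table from the example, in order

def compress(s, merges):
    """Apply each replacement in order and return all intermediate strings."""
    steps = [s]
    for pair, symbol in merges:
        s = s.replace(pair, symbol)  # left to right, non-overlapping
        steps.append(s)
    return steps

def decompress(s, merges):
    """Undo the replacements in reverse order."""
    for pair, symbol in reversed(merges):
        s = s.replace(symbol, pair)
    return s

original = "aaabdaaabac"
base_symbols = len(set(original))  # a, b, c, d
steps = compress(original, merges)
for k, s in enumerate(steps):
    print(f"{s:<12} length={len(s):>2}  vocabulary size={base_symbols + k}")

restored = decompress(steps[-1], merges)
print(restored)
```

Each merge shortens the string and adds one symbol to the vocabulary, and the original string can be recovered exactly from the final string plus the table.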

In [3]:
text = """ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig. One of five new AI centers in Germany funded under the federal government’s AI strategy by the Federal Ministry of Education and Research (BMBF) and the Free State of Saxony. Established as a permanent research facility at both locations with strong connections to the local universities: TU Dresden and Leipzig University. Over 60 Principal Investigators, more than 180 employees and up to 12 new AI professorships in Dresden and Leipzig"""

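raw_bytes = text.encode('utf-8') # named raw_bytes to avoid shadowing the built-in `bytes`
tokens = list(raw_bytes) # iterating over a bytes object yields ints in the range 0..255
vocab_size = len(set(tokens))
print("Vocabulary Size: " + str(vocab_size))
# NOTE: chr() is only meaningful for ASCII bytes (< 128); the bytes of a multi-byte character such as ’ are not characters on their own
mapping = {token: chr(token) for token in tokens}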
print("length of text:", len(text))
print("Tokens from UTF-8 Values:")
print(tokens)
print("length of tokens:", len(tokens))

Vocabulary Size: 54
length of text: 634
Tokens from UTF-8 Values:
[83, 99, 97, 68, 83, 46, 65, 73, 32, 40, 67, 101, 110, 116, 101, 114, 32, 102, 111, 114, 32, 83, 99, 97, 108, 97, 98, 108, 101, 32, 68, 97, 116, 97, 32, 65, 110, 97, 108, 121, 116, 105, 99, 115, 32, 97, 110, 100, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 101, 110, 99, 101, 41, 32, 68, 114, 101, 115, 100, 101, 110, 47, 76, 101, 105, 112, 122, 105, 103, 32, 105, 115, 32, 97, 32, 99, 101, 110, 116, 101, 114, 32, 102, 111, 114, 32, 68, 97, 116, 97, 32, 83, 99, 105, 101, 110, 99, 101, 44, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 101, 110, 99, 101, 32, 97, 110, 100, 32, 66, 105, 103, 32, 68, 97, 116, 97, 32, 119, 105, 116, 104, 32, 108, 111, 99, 97, 116, 105, 111, 110, 115, 32, 105, 110, 32, 68, 114, 101, 115, 100, 101, 110, 32, 97, 110, 100, 32, 76, 101, 105, 112, 122, 105, 103, 46, 32, 79, 110, 101, 32, 111, 102, 32, 102,

We want to find out which pairs of adjacent tokens occur most often, so we write a function that counts them:

In [4]:
def get_stats(ids):
    # TODO: Implement the function body. It should return a dictionary with each pair as key and the frequency as its value. 
    counts = {} # maps each adjacent pair to its number of occurrences
    for pair in zip(ids, ids[1:]): # iterate over all consecutive pairs
        counts[pair] = counts.get(pair, 0) + 1 # start from 0 if the pair is new, then add 1
    return counts

stats = get_stats(tokens)
print(stats)
print(sorted(((v, k) for k,v in stats.items()), reverse=True)) # sort by count

{(83, 99): 3, (99, 97): 6, (97, 68): 1, (68, 83): 1, (83, 46): 1, (46, 65): 1, (65, 73): 4, (73, 32): 4, (32, 40): 2, (40, 67): 1, (67, 101): 1, (101, 110): 12, (110, 116): 7, (116, 101): 7, (101, 114): 12, (114, 32): 6, (32, 102): 6, (102, 111): 2, (111, 114): 5, (32, 83): 4, (97, 108): 8, (108, 97): 1, (97, 98): 2, (98, 108): 2, (108, 101): 1, (101, 32): 11, (32, 68): 7, (68, 97): 3, (97, 116): 10, (116, 97): 5, (97, 32): 5, (32, 65): 6, (65, 110): 1, (110, 97): 1, (108, 121): 1, (121, 116): 1, (116, 105): 9, (105, 99): 3, (99, 115): 1, (115, 32): 10, (32, 97): 12, (97, 110): 11, (110, 100): 10, (100, 32): 10, (65, 114): 2, (114, 116): 2, (105, 102): 2, (102, 105): 3, (99, 105): 5, (105, 97): 2, (108, 32): 6, (32, 73): 3, (73, 110): 3, (101, 108): 2, (108, 108): 2, (108, 105): 4, (105, 103): 8, (103, 101): 2, (110, 99): 4, (99, 101): 5, (101, 41): 1, (41, 32): 2, (68, 114): 4, (114, 101): 7, (101, 115): 10, (115, 100): 4, (100, 101): 8, (110, 47): 1, (47, 76): 1, (76, 101): 4, (101, 
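
The same pair counting can also be written with `collections.Counter`, which additionally offers `most_common`. This is a sketch of an equivalent formulation; the name `get_stats_counter` is hypothetical and not part of the notebook.

```python
from collections import Counter

def get_stats_counter(ids):
    """Count adjacent pairs; equivalent to a manual dictionary loop."""
    return Counter(zip(ids, ids[1:]))

stats_example = get_stats_counter([1, 2, 3, 1, 2])
print(stats_example.most_common(1))

# Overlapping occurrences are counted separately: [7, 7, 7] contains the pair (7, 7) twice,
# although a left-to-right merge can only replace one of them.
overlap_count = get_stats_counter([7, 7, 7])[(7, 7)]
print(overlap_count)
```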

In [5]:
top_pair = max(stats, key=stats.get)
top_pair

(101, 110)
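
To see what this pair stands for, the two byte values can be decoded back to text. This is a minimal sketch, assuming the pair is the `(101, 110)` reported above and that both values are ASCII bytes; the variable names are chosen here for illustration.

```python
example_pair = (101, 110)  # the most frequent pair reported above

# Interpret the two ints as raw bytes and decode them as UTF-8 text.
pair_text = bytearray(example_pair).decode("utf-8")
print(repr(pair_text))
```

The pair corresponds to the letters `en`, which appear in words such as "Center", "Dresden" and "Intelligence".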

Now we want to merge this pair, so we write a merge function:

In [6]:
def merge(ids, pair, idx):
  # TODO: Write a function that merges every occurrence of the given pair in 'ids' into a single new token with id 'idx'.
  # It should return a list of the new token ids.
  # Occurrences are matched left to right and do not overlap.
  newids = []
  i = 0
  while i < len(ids):
    # if we are not at the end and the current and next ids match the pair, replace them
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      newids.append(idx)
      i += 2
    else:
      newids.append(ids[i])
      i += 1
  return newids

print(merge([4,3,3,8,1,5], (3,8), 999))
print(merge(tokens, top_pair, 637))

[4, 3, 999, 1, 5]
[83, 99, 97, 68, 83, 46, 65, 73, 32, 40, 67, 637, 116, 101, 114, 32, 102, 111, 114, 32, 83, 99, 97, 108, 97, 98, 108, 101, 32, 68, 97, 116, 97, 32, 65, 110, 97, 108, 121, 116, 105, 99, 115, 32, 97, 110, 100, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 637, 99, 101, 41, 32, 68, 114, 101, 115, 100, 637, 47, 76, 101, 105, 112, 122, 105, 103, 32, 105, 115, 32, 97, 32, 99, 637, 116, 101, 114, 32, 102, 111, 114, 32, 68, 97, 116, 97, 32, 83, 99, 105, 637, 99, 101, 44, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 637, 99, 101, 32, 97, 110, 100, 32, 66, 105, 103, 32, 68, 97, 116, 97, 32, 119, 105, 116, 104, 32, 108, 111, 99, 97, 116, 105, 111, 110, 115, 32, 105, 110, 32, 68, 114, 101, 115, 100, 637, 32, 97, 110, 100, 32, 76, 101, 105, 112, 122, 105, 103, 46, 32, 79, 110, 101, 32, 111, 102, 32, 102, 105, 118, 101, 32, 110, 101, 119, 32, 65, 73, 32, 99, 637, 116, 101, 114, 115, 32,

In [7]:
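vocab_size = 800 # upper bound (exclusive) for the new token ids
# Byte values are below 256, so any starting id >= 256 avoids clashes; this notebook starts at len(tokens).
token_size = len(tokens) # length of the byte sequence (636), reused here as the first new token id
num_merges = vocab_size - token_size # the number of merges we make (164)
ids = list(tokens) # copy of the token sequence, so we don't destroy the original list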

for i in range(num_merges):
  # TODO: Code the for loop through the number of merges.
  stats = get_stats(ids)
  pair = max(stats, key=stats.get) # get the most common pair by value
  idx = token_size + i # the new id
  print(f"merging {pair} to a new token {idx}")
  ids = merge(ids, pair, idx)

merging (101, 110) to a new token 636
merging (101, 114) to a new token 637
merging (32, 97) to a new token 638
merging (101, 32) to a new token 639
merging (110, 100) to a new token 640
merging (101, 115) to a new token 641
merging (97, 116) to a new token 642
merging (97, 108) to a new token 643
merging (638, 640) to a new token 644
merging (644, 32) to a new token 645
merging (105, 103) to a new token 646
merging (116, 104) to a new token 647
merging (111, 110) to a new token 648
merging (643, 32) to a new token 649
merging (32, 68) to a new token 650
merging (105, 112) to a new token 651
merging (115, 32) to a new token 652
merging (636, 116) to a new token 653
merging (32, 102) to a new token 654
merging (111, 114) to a new token 655
merging (116, 105) to a new token 656
merging (114, 641) to a new token 657
merging (105, 110) to a new token 658
merging (65, 73) to a new token 659
merging (659, 32) to a new token 660
merging (99, 105) to a new token 661
merging (650, 657) to a new
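
The loop above only prints the merges. To turn token ids back into text, the merges have to be recorded. The following self-contained sketch does this under two assumptions that differ from the notebook: new ids start at 256 (the first value that cannot be a byte), and training stops once no pair occurs at least twice. The names `train_bpe` and `decode` and the sample string are hypothetical.

```python
def get_stats(ids):
    """Count how often each adjacent pair occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every left-to-right, non-overlapping occurrence of `pair` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges, first_new_id=256):
    """Return the compressed ids and the learned merge table {pair: new id}."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        if stats[pair] < 2:  # assumption: stop when nothing repeats any more
            break
        idx = first_new_id + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

def decode(ids, merges):
    """Expand merged ids back into bytes and decode them as UTF-8."""
    parents = {idx: pair for pair, idx in merges.items()}  # new id -> the pair it replaced
    out = bytearray()
    stack = list(reversed(ids))
    while stack:
        tok = stack.pop()
        if tok in parents:
            left, right = parents[tok]
            stack.append(right)  # pushed first so that `left` is expanded first
            stack.append(left)
        else:
            out.append(tok)  # a raw byte value (0..255)
    return out.decode("utf-8")

sample = "low lower lowest, new newer newest"
encoded, learned_merges = train_bpe(sample, num_merges=10)
print(len(sample.encode("utf-8")), "bytes ->", len(encoded), "tokens")
print(decode(encoded, learned_merges) == sample)
```

Because every merged id maps back to exactly one pair, decoding is lossless: the round trip returns the original string.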

In [8]:
print("tokens length: ", len(tokens))
print("ids length: ", len(ids))
print(tokens)
print(ids)
print(f"Compression ratio: {len(tokens) / len(ids):.2f}X") # Compromise between token length and vocab size

tokens length:  636
ids length:  185
[83, 99, 97, 68, 83, 46, 65, 73, 32, 40, 67, 101, 110, 116, 101, 114, 32, 102, 111, 114, 32, 83, 99, 97, 108, 97, 98, 108, 101, 32, 68, 97, 116, 97, 32, 65, 110, 97, 108, 121, 116, 105, 99, 115, 32, 97, 110, 100, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 101, 110, 99, 101, 41, 32, 68, 114, 101, 115, 100, 101, 110, 47, 76, 101, 105, 112, 122, 105, 103, 32, 105, 115, 32, 97, 32, 99, 101, 110, 116, 101, 114, 32, 102, 111, 114, 32, 68, 97, 116, 97, 32, 83, 99, 105, 101, 110, 99, 101, 44, 32, 65, 114, 116, 105, 102, 105, 99, 105, 97, 108, 32, 73, 110, 116, 101, 108, 108, 105, 103, 101, 110, 99, 101, 32, 97, 110, 100, 32, 66, 105, 103, 32, 68, 97, 116, 97, 32, 119, 105, 116, 104, 32, 108, 111, 99, 97, 116, 105, 111, 110, 115, 32, 105, 110, 32, 68, 114, 101, 115, 100, 101, 110, 32, 97, 110, 100, 32, 76, 101, 105, 112, 122, 105, 103, 46, 32, 79, 110, 101, 32, 111, 102, 32, 102, 105, 118, 101, 32, 110, 101,

In [9]:
new_vocab_size = len(set(ids)) # distinct token ids that actually occur in the compressed sequence
print(f"Target vocabulary size: {vocab_size}")
print(f"Distinct token ids in use: {new_vocab_size}")

Target vocabulary size: 800
Distinct token ids in use: 102
