<a href="https://colab.research.google.com/github/thyarles/unb-fmc-nlp/blob/main/aula_1/notes_lets_build_the_gpt_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Let's build the GPT Tokenizer
Notes from the video https://www.youtube.com/watch?v=zduSFxRajkE.

* Many of the problems we see in LLMs trace back to the tokenizer (for example, getting simple arithmetic wrong).
* Unicode text can be encoded in three ways: UTF-8, UTF-16, and UTF-32. UTF-8 is the de facto standard because it is variable-length (1 to 4 bytes per code point) and backward compatible with ASCII. For Latin characters, UTF-16 pads every letter with one zero byte, and UTF-32 pads it with three zero bytes.
* We can't use raw Unicode code points as tokens because the code space is huge (about 150 thousand characters).
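
To make the variable-length point concrete, here is a small sketch that counts how many bytes each encoding needs per character. The sample characters are my own choice for illustration, and the `-le` codec variants are used so that no byte order mark is counted.

```python
# Sample characters chosen for illustration: ASCII letter, accented letter, euro sign, emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

widths = {}
for ch in samples:
    # Bytes needed by each encoding for this single code point.
    widths[ch] = tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )
    u8, u16, u32 = widths[ch]
    print(f"U+{ord(ch):04X}: utf-8={u8}, utf-16={u16}, utf-32={u32} bytes")
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, while UTF-32 always spends 4 bytes, even on plain ASCII.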

In [None]:
# To check the Unicode code point of each character
[ord(x) for x in "Charles."]

[67, 104, 97, 114, 108, 101, 115, 46]

In [None]:
# To compare the UTF-8, UTF-16, and UTF-32 encodings of the same string
print("%s\n%s\n%s" %
(
  list("Charles.".encode("utf-8")),
  list("Charles.".encode("utf-16")),
  list("Charles.".encode("utf-32")))
)

[67, 104, 97, 114, 108, 101, 115, 46]
[255, 254, 67, 0, 104, 0, 97, 0, 114, 0, 108, 0, 101, 0, 115, 0, 46, 0]
[255, 254, 0, 0, 67, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 114, 0, 0, 0, 108, 0, 0, 0, 101, 0, 0, 0, 115, 0, 0, 0, 46, 0, 0, 0]
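
The leading 255, 254 in the UTF-16 output (and 255, 254, 0, 0 in the UTF-32 output) is the byte order mark that Python's generic `utf-16` and `utf-32` codecs prepend; it is not part of the text. A minimal sketch, assuming the little-endian layout seen in the output above, showing that the explicit `-le` codecs leave it out:

```python
word = "Charles."

# Explicit little-endian codecs write no byte order mark.
utf16 = list(word.encode("utf-16-le"))
utf32 = list(word.encode("utf-32-le"))

print(utf16)
print(utf32)
print(len(utf16), len(utf32))  # 2 and 4 bytes per character, respectively
```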


# Byte pair encoding

In [18]:
# Example: aaabdaaabac (vocabulary size = four, eleven tokens)
#          -> find the pair that occurs most frequently and replace it with a single new token
#          Z = aa, Y = ab, X = ZY -> XdXac (vocabulary size = seven, five tokens)

# Non-ASCII characters take more than one byte in UTF-8, which is why there are fewer code points than byte tokens
text = """
  Alan Turing foi um matemático e criptógrafo inglês considerado atualmente como o
  pai da computação, uma vez que, por meio de suas ideias, foi possível desenvolver
  o que chamamos hoje de computador. Turing também ficou muito conhecido como um dos
  responsáveis por decifrar o código utilizado pelas comunicações nazistas durante
  a Segunda Guerra Mundial.

  Por meio do seu trabalho, foi desenvolvida uma máquina conhecida como “bomba
  eletromecânica” (The Bombe, em inglês), que decifrou o código da máquina Enigma
  utilizado pelos alemães, e permitiu que os Aliados tivessem acesso a informações
  privilegiadas ao longo da guerra. Turing morreu em 1954, provavelmente tendo
  cometido suicídio.
"""
tokens = text.encode("utf-8")     # raw bytes
tokens = list(map(int, tokens))   # integers from 0 to 255
print("The text has %d code points and %d tokens." % (len(text), len(tokens)))

The text has 717 code points and 741 tokens.
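
The 24 extra tokens come from characters that need more than one byte in UTF-8. A small sketch on a few words taken from the text (a short stand-in for the full passage) that lists those characters and counts the extra bytes:

```python
import unicodedata

# NFC normalization makes sure each accented letter is a single code point.
sample = unicodedata.normalize("NFC", "matemático criptógrafo inglês computação")

# Characters that need more than one byte in UTF-8, with their byte length.
multi = {ch: len(ch.encode("utf-8")) for ch in sample if len(ch.encode("utf-8")) > 1}

# Difference between the byte count and the code point count.
extra = len(sample.encode("utf-8")) - len(sample)

print(multi)
print(f"{len(sample)} code points, {len(sample.encode('utf-8'))} bytes, {extra} extra")
```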


In [19]:
# Let's find the most frequent pair of consecutive tokens

# The Pythonic way
def get_stats_pythonic(ids):
  counts = {}
  for pair in zip(ids, ids[1:]):
    counts[pair] = counts.get(pair, 0) + 1
  return counts

# The explicit, step-by-step way
def get_stats_human(ids):
    counts = {}
    for i in range(len(ids) - 1):
        pair = (ids[i], ids[i + 1])
        if pair in counts:
            counts[pair] += 1
        else:
            counts[pair] = 1
    return counts

stats = get_stats_pythonic(tokens)
print(sorted(((v, k) for k,v in stats.items()), reverse=True))

[(24, (111, 32)), (15, (97, 32)), (14, (32, 99)), (13, (115, 32)), (12, (101, 32)), (12, (99, 111)), (12, (32, 100)), (10, (111, 109)), (10, (100, 111)), (10, (32, 32)), (10, (10, 32)), (9, (32, 112)), (8, (105, 110)), (8, (101, 115)), (8, (100, 101)), (8, (44, 32)), (8, (32, 10)), (7, (114, 97)), (7, (100, 97)), (7, (32, 109)), (6, (118, 101)), (6, (117, 101)), (6, (116, 105)), (6, (113, 117)), (6, (111, 115)), (6, (111, 114)), (6, (110, 103)), (6, (109, 101)), (6, (109, 97)), (6, (109, 32)), (6, (105, 100)), (6, (97, 100)), (6, (32, 117)), (6, (32, 97)), (5, (117, 32)), (5, (116, 101)), (5, (114, 105)), (5, (114, 32)), (5, (111, 110)), (5, (109, 111)), (5, (105, 99)), (5, (102, 111)), (5, (101, 114)), (5, (101, 110)), (5, (101, 109)), (5, (101, 108)), (5, (101, 99)), (5, (97, 115)), (5, (32, 111)), (5, (32, 101)), (4, (195, 161)), (4, (117, 116)), (4, (117, 114)), (4, (117, 109)), (4, (117, 105)), (4, (116, 97)), (4, (115, 101)), (4, (112, 111)), (4, (109, 195)), (4, (105, 97)), (4, 

In [20]:
# Let's see which characters make up the most frequent pair
chr(111), chr(32) # chr() is the inverse of ord()

('o', ' ')
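
As an aside, the same pair counting can be sketched with `collections.Counter` from the standard library, which also gives the ranking directly. This is an alternative to the two functions above, not the version used in these notes, shown here on the toy string `aaabdaaabac`.

```python
from collections import Counter

ids = list("aaabdaaabac".encode("utf-8"))

# Counter consumes the iterator of consecutive pairs directly.
pair_counts = Counter(zip(ids, ids[1:]))

# most_common(1) returns a list with the single highest-count (pair, count) entry.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, (chr(top_pair[0]), chr(top_pair[1])))
```

Overlapping occurrences are counted separately, so `aaa` contributes two `('a', 'a')` pairs.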

In [29]:
# Let's create new tokens, starting from id 256 (0 to 255 are taken by the raw bytes)
top_pair = max(stats, key=stats.get)

def merge(ids, pair, idx):
  # replace every occurrence of the pair in ids with the new token idx
  new_ids = []
  i = 0
  while i < len(ids):
    # if not at the last position and the pair matches, replace it
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      new_ids.append(idx)
      i += 2
    else:
      new_ids.append(ids[i])
      i += 1
  return new_ids

# To check
# print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))
new_tokens = merge(tokens, top_pair, 256)
# The token count should drop from 741 to 717, since there were 24 occurrences of the pair ('o', ' ')
print("The text has %d code points and %d tokens." % (len(text), len(new_tokens)))

The text has 717 code points and 717 tokens.
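
One detail worth checking: `merge` scans left to right and jumps past a pair once it is replaced, so overlapping matches are merged only once. A quick sketch with made-up input lists (the function is repeated so the snippet runs on its own):

```python
def merge(ids, pair, idx):
    # Replace every non-overlapping occurrence of pair in ids with idx, left to right.
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2  # skip both elements of the matched pair
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

basic = merge([5, 6, 6, 7, 9, 1], (6, 7), 99)
odd_run = merge([1, 1, 1], (1, 1), 9)      # three in a row: one merge, one leftover
even_run = merge([1, 1, 1, 1], (1, 1), 9)  # four in a row: two merges

print(basic)
print(odd_run)
print(even_run)
```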


In [34]:
# The more merges you make, the bigger your vocabulary and the shorter your
# token sequence. You need to find the right balance.

vocabulary_size = 276 # desired vocab size
number_merges = vocabulary_size - 256 # minus the 256 raw byte values we already have
ids = list(tokens) # keep the original list intact (list() makes a copy)

merges = {}
for i in range(number_merges):
  stats = get_stats_pythonic(ids)
  pair = max(stats, key=stats.get)
  idx = 256 + i
  print(f"Merging in a token {idx} the pair {pair}...")
  ids = merge(ids, pair, idx)
  merges[pair] = idx

Merging in a token 256 the pair (111, 32)...
Merging in a token 257 the pair (97, 32)...
Merging in a token 258 the pair (115, 32)...
Merging in a token 259 the pair (101, 32)...
Merging in a token 260 the pair (99, 111)...
Merging in a token 261 the pair (10, 32)...
Merging in a token 262 the pair (261, 32)...
Merging in a token 263 the pair (105, 110)...
Merging in a token 264 the pair (44, 32)...
Merging in a token 265 the pair (100, 256)...
Merging in a token 266 the pair (260, 109)...
Merging in a token 267 the pair (109, 32)...
Merging in a token 268 the pair (116, 105)...
Merging in a token 269 the pair (114, 97)...
Merging in a token 270 the pair (100, 101)...
Merging in a token 271 the pair (100, 257)...
Merging in a token 272 the pair (118, 101)...
Merging in a token 273 the pair (113, 117)...
Merging in a token 274 the pair (111, 114)...
Merging in a token 275 the pair (263, 103)...
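
To tie the loop back to the toy example, here is a sketch that runs three merges on `aaabdaaabac`. The helper functions are repeated so the snippet runs on its own. Ties are broken by first occurrence here, so the second merge picks (Z, a) rather than ab as in the hand-worked example; the final shape XdXac comes out the same.

```python
def get_stats(ids):
    # Count how often each pair of consecutive ids occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every non-overlapping occurrence of pair in ids with idx.
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

ids = list("aaabdaaabac".encode("utf-8"))
n_before = len(ids)

merges = {}
for i in range(3):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # first pair with the highest count
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(ids)
print(merges)
print(f"{n_before} tokens -> {len(ids)} tokens, ratio {n_before / len(ids):.2f}X")
```

The sequence shrinks from eleven tokens to five while the vocabulary grows by three entries, which is the trade-off the vocabulary size setting controls.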


In [35]:
print(ids, merges)

[262, 65, 108, 97, 110, 32, 84, 117, 114, 275, 32, 102, 111, 105, 32, 117, 267, 109, 97, 116, 101, 109, 195, 161, 268, 99, 256, 259, 99, 114, 105, 112, 116, 195, 179, 103, 269, 102, 256, 275, 108, 195, 170, 258, 260, 110, 115, 105, 270, 269, 265, 97, 116, 117, 97, 108, 109, 101, 110, 116, 259, 266, 256, 256, 262, 112, 97, 105, 32, 271, 266, 112, 117, 116, 97, 195, 167, 195, 163, 111, 264, 117, 109, 257, 272, 122, 32, 273, 101, 264, 112, 274, 32, 109, 101, 105, 256, 100, 259, 115, 117, 97, 258, 105, 270, 105, 97, 115, 264, 102, 111, 105, 32, 112, 111, 115, 115, 195, 173, 272, 108, 32, 270, 115, 101, 110, 118, 111, 108, 272, 114, 32, 262, 256, 273, 259, 99, 104, 97, 109, 97, 109, 111, 258, 104, 111, 106, 259, 100, 259, 266, 112, 117, 116, 97, 100, 274, 46, 32, 84, 117, 114, 275, 32, 116, 97, 109, 98, 195, 169, 267, 102, 105, 260, 117, 32, 109, 117, 105, 116, 256, 260, 110, 104, 101, 99, 105, 265, 266, 256, 117, 267, 100, 111, 258, 262, 114, 101, 115, 112, 111, 110, 115, 195, 161, 272, 10