<a href="https://colab.research.google.com/github/taskswithcode/GPTTokenizationTutorial/blob/main/GPTTokenizationTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Notebook to examine GPT's byte level BPE tokenization process**
Byte-level BPE tokenization avoids unknown tokens by using a base vocabulary made up of all 256 possible byte values. Any character can therefore be mapped to a sequence of bytes in the base vocabulary.
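
As a small illustration of that claim, the sketch below uses only Python's built-in `str.encode`. The sample characters are arbitrary choices made for this example, and the name `utf8_bytes` is likewise just illustrative.

```python
# Every character, whatever its script, becomes a sequence of UTF-8 bytes.
# Each byte is a value in 0..255, so it always exists in a 256-entry base vocabulary.
samples = ["a", "\u00e0", "\u0b85"]  # "a", "à" and the Tamil letter "அ" (assumed examples)

utf8_bytes = {ch: list(ch.encode("utf-8")) for ch in samples}

for ch, byte_values in utf8_bytes.items():
    print(ch, [f"{b:02X}" for b in byte_values])
```

The ASCII letter needs one byte, the accented Latin letter two, and the Tamil letter three, yet none of them requires a symbol outside the 256 byte values.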

In [1]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1


In [2]:
from transformers import GPT2Tokenizer

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # downloads vocab.json, merges.txt and config.json on first use

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [4]:
resp= tokenizer("a")['input_ids']

In [5]:
resp  # the index of the token "a" in the vocab; looking up index 64 in vocab.json gives "a", as expected

[64]

In [6]:
resp = tokenizer("அ ஆ இ ஈ உ ")['input_ids']  # these are Tamil characters

In [7]:
resp

[156,
 106,
 227,
 220,
 156,
 106,
 228,
 220,
 156,
 106,
 229,
 220,
 156,
 106,
 230,
 220,
 156,
 106,
 231,
 220]

In [8]:
resp = tokenizer("அ")['input_ids']  # the UTF-8 encoding of அ is 0xE0 0xAE 0x85

In [9]:
resp # the three indices point to three symbols in the vocab that represent the bit patterns constituting the UTF-8 encoding of அ

[156, 106, 227]

[156, 106, 227] are the indices of the symbols that represent those bytes in the vocab.json file. In vocab.json, index 156 is the character "à", 106 is the character "®", and 227 is the character "ħ".
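
The index pattern can be reproduced without loading the tokenizer. The sketch below rests on two assumptions that are consistent with the outputs in this notebook but are not taken from the library documentation: the first 256 vocabulary entries are the single-byte symbols, ordered with the printable bytes first and the remaining bytes after them, and no BPE merge applies to the bytes in question. `byte_to_index` and `single_byte_ids` are hypothetical helper names.

```python
def byte_to_index():
    """Assumed position of each single-byte token among the first 256 vocab entries."""
    printable = (
        list(range(0x21, 0x7F))     # "!" .. "~"
        + list(range(0xA1, 0xAD))   # "¡" .. "¬"
        + list(range(0xAE, 0x100))  # "®" .. "ÿ"
    )
    others = [b for b in range(256) if b not in printable]
    return {b: i for i, b in enumerate(printable + others)}


def single_byte_ids(text):
    """Hypothetical helper: token ids if every UTF-8 byte stays its own token."""
    table = byte_to_index()
    return [table[b] for b in text.encode("utf-8")]


print(single_byte_ids("a"))       # [64]
print(single_byte_ids("\u0b85"))  # the Tamil letter "அ": [156, 106, 227]
```

Under these assumptions, the space byte 0x20 lands on index 220, which matches the separator that recurs in the longer Tamil output above.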

The UTF-8 encoding of à is 0xC3 0xA0, which expands to 1100 0011 1010 0000. Removing the control bits (the leading 110 of the first byte and the leading 10 of the second) and concatenating the remaining bits gives 0xE0. This is indeed the first byte of the UTF-8 encoding of அ.

Similarly, the UTF-8 encoding of ® is 0xC2 0xAE, which expands to 1100 0010 1010 1110. Removing the control bits and concatenating the rest gives 0xAE. This is indeed the second byte of the UTF-8 encoding of அ.

The third symbol, ħ, has the UTF-8 encoding 0xC4 0xA7, which expands to 1100 0100 1010 0111. Removing the control bits and concatenating the rest gives 0x127, which does not match the third byte of அ, 0x85. This is because 0x85 is a control character in ISO-8859-1 (the standard for 1-byte representations). Such bytes have no printable form, so they are mapped to visible Unicode characters starting at code point 256. In this case 0x85 is mapped to ħ (code point 295, which is 0x127).
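
The bit arithmetic in the three paragraphs above can be checked with a short sketch. `decode_two_byte_utf8` is a hypothetical helper written for this illustration; it handles only the two-byte layout `110xxxxx 10xxxxxx`, and the symbols are written as escapes so the exact code points are unambiguous.

```python
def decode_two_byte_utf8(first, second):
    """Return the code point encoded by a two-byte UTF-8 sequence.

    The layout is 110xxxxx 10xxxxxx: drop the control bits and join the x bits.
    """
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    return ((first & 0b00011111) << 6) | (second & 0b00111111)


results = {}
for symbol in "\u00e0\u00ae\u0127":  # "à", "®", "ħ"
    first, second = symbol.encode("utf-8")
    results[symbol] = decode_two_byte_utf8(first, second)
    print(symbol, f"{first:02X} {second:02X}", hex(results[symbol]))
```

The first two symbols decode to 0xE0 and 0xAE, the actual bytes of அ. The third decodes to 0x127 (295), which is the code point assigned to the remapped byte 0x85 rather than the byte value itself.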

In [10]:
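def bytes_to_unicode():
    # Bytes that already have a printable character keep that character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    missing = []      # bytes with no printable character of their own
    missing_map = []  # code points (256 + n) assigned to those bytes
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            missing.append(b)
            missing_map.append(2 ** 8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    # First dict: byte -> symbol for all 256 bytes.
    # Second dict: byte -> assigned code point, for the remapped bytes only.
    return dict(zip(bs, cs)), dict(zip(missing, missing_map))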
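
Inverting the byte-to-symbol table turns vocabulary symbols back into text. The sketch below is a compact, self-contained restatement of the table built by `bytes_to_unicode`; `byte_symbol_table` and `symbols_to_text` are hypothetical names, not functions from the `transformers` library.

```python
def byte_symbol_table():
    """Byte -> visible symbol, using the same rule as the table above."""
    printable = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, n = {}, 0
    for b in range(256):
        if b in printable:
            table[b] = chr(b)        # printable bytes keep their own character
        else:
            table[b] = chr(256 + n)  # other bytes get code points 256, 257, ...
            n += 1
    return table


symbol_to_byte = {symbol: b for b, symbol in byte_symbol_table().items()}


def symbols_to_text(symbols):
    """Map each vocab symbol back to its byte, then decode the bytes as UTF-8."""
    return bytes(symbol_to_byte[s] for s in symbols).decode("utf-8")


# "à", "®", "ħ" are the symbols for bytes E0 AE 85, the UTF-8 encoding of the Tamil letter "அ".
print(symbols_to_text("\u00e0\u00ae\u0127"))
```

Because the table is a one-to-one mapping over all 256 bytes, the inversion is lossless: any sequence of symbols that came from valid UTF-8 text decodes back to that text.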

In [11]:
# Display the mapped character for each of the 256 possible byte values. The mapping sends all control characters to visible Unicode characters.
# Note that the model only sees the mapped indices during training.
x,missing_dict = bytes_to_unicode()
count = 0 
for key in x:
    count += 1
    print(f"{count}] {key} {key:02x} {x[key]}")

1] 33 21 !
2] 34 22 "
3] 35 23 #
4] 36 24 $
5] 37 25 %
6] 38 26 &
7] 39 27 '
8] 40 28 (
9] 41 29 )
10] 42 2a *
11] 43 2b +
12] 44 2c ,
13] 45 2d -
14] 46 2e .
15] 47 2f /
16] 48 30 0
17] 49 31 1
18] 50 32 2
19] 51 33 3
20] 52 34 4
21] 53 35 5
22] 54 36 6
23] 55 37 7
24] 56 38 8
25] 57 39 9
26] 58 3a :
27] 59 3b ;
28] 60 3c <
29] 61 3d =
30] 62 3e >
31] 63 3f ?
32] 64 40 @
33] 65 41 A
34] 66 42 B
35] 67 43 C
36] 68 44 D
37] 69 45 E
38] 70 46 F
39] 71 47 G
40] 72 48 H
41] 73 49 I
42] 74 4a J
43] 75 4b K
44] 76 4c L
45] 77 4d M
46] 78 4e N
47] 79 4f O
48] 80 50 P
49] 81 51 Q
50] 82 52 R
51] 83 53 S
52] 84 54 T
53] 85 55 U
54] 86 56 V
55] 87 57 W
56] 88 58 X
57] 89 59 Y
58] 90 5a Z
59] 91 5b [
60] 92 5c \
61] 93 5d ]
62] 94 5e ^
63] 95 5f _
64] 96 60 `
65] 97 61 a
66] 98 62 b
67] 99 63 c
68] 100 64 d
69] 101 65 e
70] 102 66 f
71] 103 67 g
72] 104 68 h
73] 105 69 i
74] 106 6a j
75] 107 6b k
76] 108 6c l
77] 109 6d m
78] 110 6e n
79] 111 6f o
80] 112 70 p
81] 113 71 q
82] 114 72 r
83] 115 73

In [12]:
# Show only the bytes that were remapped (control characters, the space, and other bytes with no printable form)
count = 0 
for key in missing_dict:
    count += 1
    print(f"{count}] {key} {key:02x} {x[key]} {missing_dict[key]}")

1] 0 00 Ā 256
2] 1 01 ā 257
3] 2 02 Ă 258
4] 3 03 ă 259
5] 4 04 Ą 260
6] 5 05 ą 261
7] 6 06 Ć 262
8] 7 07 ć 263
9] 8 08 Ĉ 264
10] 9 09 ĉ 265
11] 10 0a Ċ 266
12] 11 0b ċ 267
13] 12 0c Č 268
14] 13 0d č 269
15] 14 0e Ď 270
16] 15 0f ď 271
17] 16 10 Đ 272
18] 17 11 đ 273
19] 18 12 Ē 274
20] 19 13 ē 275
21] 20 14 Ĕ 276
22] 21 15 ĕ 277
23] 22 16 Ė 278
24] 23 17 ė 279
25] 24 18 Ę 280
26] 25 19 ę 281
27] 26 1a Ě 282
28] 27 1b ě 283
29] 28 1c Ĝ 284
30] 29 1d ĝ 285
31] 30 1e Ğ 286
32] 31 1f ğ 287
33] 32 20 Ġ 288
34] 127 7f ġ 289
35] 128 80 Ģ 290
36] 129 81 ģ 291
37] 130 82 Ĥ 292
38] 131 83 ĥ 293
39] 132 84 Ħ 294
40] 133 85 ħ 295
41] 134 86 Ĩ 296
42] 135 87 ĩ 297
43] 136 88 Ī 298
44] 137 89 ī 299
45] 138 8a Ĭ 300
46] 139 8b ĭ 301
47] 140 8c Į 302
48] 141 8d į 303
49] 142 8e İ 304
50] 143 8f ı 305
51] 144 90 Ĳ 306
52] 145 91 ĳ 307
53] 146 92 Ĵ 308
54] 147 93 ĵ 309
55] 148 94 Ķ 310
56] 149 95 ķ 311
57] 150 96 ĸ 312
58] 151 97 Ĺ 313
59] 152 98 ĺ 314
60] 153 99 Ļ 315
61] 154 9a ļ 316
62] 155 9b Ľ 31