#### Tokenization: Byte Pair Encoding

Based on Andrej Karpathy's YouTube tutorial.

In [7]:
from collections import defaultdict

`Tokenization into bytes`: We can use UTF-8 encoding to convert our string of Unicode characters into a sequence of bytes. Then we could define each byte as a separate `token`.

In [24]:
# sample string of Unicode characters
s = 'café'
# convert to bytes using UTF-8 encoding
b = s.encode('utf8')
print(f"Original string: {s}, UTF-8 encoding: {b}, size of encoding: {len(b)} bytes")
# show each character and its UTF-8 byte representation
for c in s:
    print(f"{c} -> {c.encode('utf8')} --> {list(c.encode('utf8'))}, num bytes: {len(c.encode('utf8'))}")

# convert each of the 5 bytes in the utf-8 encoding of the sample string to its corresponding integer value (0-255)
byte_values = list(b)
print(f"\n UTF-8 encoding of '{s}' converted to a list of integers: {byte_values}")


Original string: café, UTF-8 encoding: b'caf\xc3\xa9', size of encoding: 5 bytes
c -> b'c' --> [99], num bytes: 1
a -> b'a' --> [97], num bytes: 1
f -> b'f' --> [102], num bytes: 1
é -> b'\xc3\xa9' --> [195, 169], num bytes: 2

 UTF-8 encoding of 'café' converted to a list of integers: [99, 97, 102, 195, 169]


Note that UTF-8 is a variable-length encoding: a single character can take anywhere from 1 to 4 bytes. The first 3 characters `c`, `a` and `f` are each represented by a single byte, while the accented character `é` is represented by 2 bytes.
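To see the full 1 to 4 byte range, here is a small sketch with four characters picked for illustration (they are not part of the sample text above); each one should need one more byte than the previous.

```python
# Characters chosen (as an illustration) to need 1, 2, 3 and 4 bytes in UTF-8.
# Escapes are used so the exact code points are unambiguous: c, é, €, 😄
samples = ['c', '\u00e9', '\u20ac', '\U0001F604']

# map each character to the number of bytes in its UTF-8 encoding
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"{ch} (U+{ord(ch):04X}) -> {list(ch.encode('utf-8'))}, num bytes: {n}")
```

Code points below U+0080 take one byte, and higher code points take progressively more, up to four bytes for characters outside the Basic Multilingual Plane such as most emoji.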

In [6]:
# longer sample text (taken from https://www.reedbeta.com/blog/programmers-intro-to-unicode/)
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious even 30 years after Unicode’s inception."

# encode text into utf-8 byte sequence
tokens = text.encode('utf-8') # byte stream
# convert bytes to integers
tokens = list(tokens) # integer tokens

print(f"Original text: {text} \nlength of text: {len(text)} characters \nUTF-8 encoded bytes (each byte converted to an integer): {tokens} \nlength of encoding: {len(tokens)} bytes")


Original text: Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious even 30 years after Unicode’s inception. 
length of text: 532 characters 
UTF-8 encoded bytes (each byte converted to an integer): [239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174,

In this simple tokenization scheme, since each byte is represented as an integer value in the range 0-255, we effectively have a vocabulary size of 256.
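A useful property of this scheme is that it is lossless: the integer tokens can always be turned back into the original string. The sketch below checks this on a short string chosen for illustration; the names `sample`, `ids` and `restored` are hypothetical and not part of the notebook.

```python
# short illustrative string mixing 1-byte, 2-byte and 4-byte characters
sample = "caf\u00e9 \U0001F604"

# encode: string -> UTF-8 bytes -> list of integers in the range 0-255
ids = list(sample.encode('utf-8'))

# decode: list of integers -> bytes -> string
restored = bytes(ids).decode('utf-8')

print(ids)
print(f"round trip ok: {restored == sample}")
print(f"all ids within 0-255: {all(0 <= i <= 255 for i in ids)}")
```

Because every id is a byte value, no input string can ever produce an out-of-vocabulary token. The cost is longer sequences, since every non-ASCII character expands into several tokens.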

We will now implement the `Byte-Pair Encoding` algorithm, which builds a new vocabulary by iteratively merging the most frequently co-occurring pair of adjacent tokens into a single new token.
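To make one merge step concrete, here is a minimal sketch on a toy string. The helper `merge_pair`, the toy input and the choice of 256 as the new token id (the first value not used by a raw byte) are assumptions made for illustration, not necessarily how the rest of this notebook implements it.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id` (hypothetical helper)."""
    out = []
    i = 0
    while i < len(ids):
        # if the current and next token form the target pair, emit the new token
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# toy byte sequence: 'a' = 97, 'b' = 98, 'c' = 99, 'd' = 100
toy = list("aaabdaaabac".encode('utf-8'))

# count adjacent pairs and pick the most frequent one
counts = Counter(zip(toy, toy[1:]))
top_pair, top_count = counts.most_common(1)[0]

# one merge step: the winning pair becomes token 256
merged = merge_pair(toy, top_pair, 256)

print(f"most common pair: {top_pair} ({top_count} occurrences)")
print(f"before: {toy} (length {len(toy)})")
print(f"after:  {merged} (length {len(merged)})")
```

Each merge shortens the sequence and grows the vocabulary by one token. Repeating the step on the merged sequence lets later merges combine tokens that are themselves the result of earlier merges.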

First, let's implement a function for finding the most commonly occurring pair of adjacent tokens.

In [8]:
def most_common_pair(tokens):
    """
    Given a list of integers, count every pair of adjacent integers and
    print the (count, pair) tuples sorted from most to least common
    """
    pair_count = defaultdict(int)
    # zip the list with itself shifted by one to iterate over adjacent pairs
    for pair in zip(tokens, tokens[1:]):
        pair_count[pair] += 1

    # convert pair_count dict into list of (value, key) tuples and sort by value, descending
    pair_count = sorted([(v, k) for k, v in pair_count.items()], reverse=True)

    print(pair_count)
    

In [9]:
most_common_pair(tokens)

[(20, (101, 32)), (15, (240, 159)), (12, (226, 128)), (12, (105, 110)), (11, (115, 32)), (10, (97, 110)), (10, (32, 97)), (9, (32, 116)), (8, (116, 104)), (7, (159, 135)), (7, (159, 133)), (7, (97, 114)), (6, (239, 189)), (6, (140, 240)), (6, (128, 140)), (6, (116, 32)), (6, (114, 32)), (6, (111, 114)), (6, (110, 103)), (6, (110, 100)), (6, (109, 101)), (6, (104, 101)), (6, (101, 114)), (6, (32, 105)), (5, (117, 115)), (5, (115, 116)), (5, (110, 32)), (5, (100, 101)), (5, (32, 115)), (4, (116, 105)), (4, (116, 101)), (4, (114, 105)), (4, (111, 117)), (4, (111, 100)), (4, (110, 116)), (4, (110, 105)), (4, (105, 99)), (4, (104, 97)), (4, (103, 32)), (4, (101, 97)), (4, (100, 32)), (4, (99, 111)), (4, (97, 109)), (4, (85, 110)), (4, (44, 32)), (4, (32, 119)), (4, (32, 111)), (4, (32, 102)), (4, (32, 85)), (3, (118, 101)), (3, (116, 115)), (3, (116, 114)), (3, (116, 111)), (3, (115, 44)), (3, (114, 116)), (3, (114, 115)), (3, (114, 101)), (3, (111, 102)), (3, (111, 32)), (3, (108, 108)), (
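The integer pairs are easier to interpret once they are turned back into bytes. The sketch below does that for the first four entries of the output above; `top_entries` is a hand-copied excerpt and `readable` is a hypothetical name used only for this illustration.

```python
# first four (count, pair) entries copied from the output above
top_entries = [(20, (101, 32)), (15, (240, 159)), (12, (226, 128)), (12, (105, 110))]

# bytes() accepts an iterable of integers in 0-255, so each pair maps back to 2 raw bytes
readable = [bytes(pair) for _, pair in top_entries]

for (count, pair), raw in zip(top_entries, readable):
    print(f"{pair} -> {raw!r}, count: {count}")
```

The most common pair `(101, 32)` is the letter `e` followed by a space, and `(105, 110)` is `in`. The pairs `(240, 159)` and `(226, 128)` are not complete characters on their own: they are the leading bytes of multi-byte UTF-8 sequences, which is why they are shown as escaped bytes rather than decoded to text.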