# Byte Pair Encoding Tokenizer

Unicode is a mapping that assigns every character an integer called its code point. Converting characters to numbers through their code points is called **character-level tokenization**.

Since the number of characters across all human writing systems is large, this one-to-one tokenization leads to a huge vocabulary. One way to reduce the vocabulary size is to adopt an encoding and tokenize its output instead, which is called **byte-level tokenization**. For example, in the UTF-8 encoding every code point is represented by a sequence of one to four bytes, so the possible tokens are integers in [0, 256). One can also adopt the UTF-16 encoding and use its 16-bit code units as tokens, for a vocabulary of size $2^{16}$. Observe the tradeoff: the smaller the vocabulary, the longer the tokenized text tends to be.
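To make the tradeoff concrete, the following minimal sketch tokenizes one short string in three ways and compares the resulting lengths. The sample string and the variable names are illustrative choices, not part of the original notebook.

```python
# Illustrative sample: "naïve" followed by a space and the emoji U+1F642.
text = "na\u00efve \U0001F642"

# Character-level: one token per code point (vocabulary = all of Unicode).
code_points = [ord(ch) for ch in text]

# Byte-level with UTF-8: every token is an integer in [0, 256).
utf8_tokens = list(text.encode("utf-8"))

# UTF-16 code units: every token is an integer in [0, 2**16).
# "utf-16-le" is used so that no byte-order mark is prepended.
utf16_bytes = text.encode("utf-16-le")
utf16_tokens = [
    int.from_bytes(utf16_bytes[i:i + 2], "little")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(code_points), len(utf8_tokens), len(utf16_tokens))  # 7 11 8
```

The emoji alone costs one code point, two UTF-16 code units (a surrogate pair), or four UTF-8 bytes, which is why the smallest vocabulary produces the longest sequence here.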

The idea of **subword tokenization** is to create tokens out of groups of characters within a word. One method for identifying such subwords is **byte-pair encoding** (BPE), which repeatedly mints a new token from the most frequently occurring pair of adjacent tokens, starting from single bytes. Frequently occurring sequences of characters thus become single tokens.
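A single merge step can be sketched as follows. The helper names `most_frequent_pair` and `merge`, the sample string, and the choice of 256 as the first new token id are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # (97, 97), i.e. the bytes of "aa"
merged = merge(ids, pair, 256)   # 256 is the first id outside the byte range
print(pair, merged)
```

After this one merge the sequence shrinks from 11 tokens to 9, at the cost of one extra vocabulary entry.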

# Training a BPE Tokenizer

At the start, we know that all 256 possible byte values will be part of the final token vocabulary.
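Under that starting point, a training loop might look like the sketch below. It is deliberately simplified: it trains directly on the raw byte stream with no pre-tokenization, and the function name `train_bpe` and its parameters are hypothetical.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Simplified BPE training: returns (vocab, merges)."""
    # The initial vocabulary is exactly the 256 single-byte tokens.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}  # (left_id, right_id) -> new_id
    ids = list(text.encode("utf-8"))

    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = len(vocab)

        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out

        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return vocab, merges

vocab, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len(vocab), vocab[256], vocab[257], vocab[258])
```

Each new token stores the concatenated bytes of the two tokens it replaces, so any token id can be decoded back to bytes by a dictionary lookup.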

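We pre-tokenize the text, splitting it into words and punctuation runs before counting pairs, so that strings such as `cat,`, `cat.`, and `cat!` all contribute to the same `cat` token instead of being learned as unrelated tokens.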
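The effect can be seen with a deliberately simplified pattern built on the standard-library `re` module. The pattern `SIMPLE_PAT` is an assumption made for illustration and is much cruder than the patterns used by production tokenizers.

```python
import re
from collections import Counter

# Simplified, illustrative pattern: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
SIMPLE_PAT = re.compile(r"\w+|[^\w\s]+")

counts = Counter()
for word in ["cat,", "cat.", "cat!"]:
    counts.update(SIMPLE_PAT.findall(word))

print(counts)
```

All three inputs contribute to the same `cat` pre-token, which is counted three times, while each punctuation mark is counted separately. If pairs are then counted only inside pre-tokens, no merge can join `cat` to the punctuation that follows it.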



### References

1. [Python Unicode Documentation](https://docs.python.org/3/howto/unicode.html)

In [1]:
'😂'.isidentifier(), 'π'.isidentifier()

(False, True)

In [49]:
# single character
π = ord('π')  # code point
chr(π) # actual char

'π'

In [5]:
for n in range(65, 123):
    print(chr(n), end='')

ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

In [48]:
s = '🙂'
utf8_s = s.encode('utf-8')
print(type(utf8_s), utf8_s, list(utf8_s), utf8_s.decode('utf-8'))   # observe four bytes

<class 'bytes'> b'\xf0\x9f\x99\x82' [240, 159, 153, 130] 🙂


In [124]:
# pre-tokenization pattern
# requires the third-party `regex` package: the standard `re` module lacks \p{...} classes
import regex

PAT = r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
corpus = r"""I'm a byte-pair tokenizer 123456789012"""
itr = regex.finditer(PAT, corpus)
for item in itr:
    print(item.group(), end='|')

I|'m| a| byte|-pair| tokenizer| |123|456|789|012|