# Byte-Pair Encoding (BPE) Tokenizer
In the first part of the assignment, we will train and implement a byte-level byte-pair encoding (BPE)
tokenizer [Sennrich et al., 2016, Wang et al., 2019]. In particular, we will represent arbitrary (Unicode)
strings as a sequence of bytes and train our BPE tokenizer on this byte sequence. Later, we will use this
tokenizer to encode text (a string) into tokens (a sequence of integers) for language modeling.

## The Unicode Standard
In Python, you can use the `ord()` function
to convert a single Unicode character into its integer representation. The `chr()` function converts an integer
Unicode code point into a string with the corresponding character.

In [1]:
ord('牛')

29275

In [2]:
chr(29275)

'牛'

- What Unicode character does `chr(0)` return?

In [3]:
chr(0)

'\x00'

- How does this character’s string representation (`__repr__()`) differ from its printed representation?

In [4]:
repr(chr(0))

"'\\x00'"

In [5]:
type(repr(chr(0)))

str

- What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:

In [6]:
chr(0)
print(chr(0))
"this is a test" + chr(0) + "string"

 


'this is a test\x00string'

In [7]:
print("this is a test" + chr(0) + "string")

this is a test string
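
The printed output above hides the null character, but it is still part of the string. A small check using only built-ins (the variable name `s` is just for illustration):

```python
s = "this is a test" + chr(0) + "string"

# The null character counts toward the length like any other character.
print(len(s))  # 21

# repr() escapes it as \x00, so it is visible in the representation.
print(repr(s))

# It sits at a definite index and has code point 0.
null_index = s.index(chr(0))
print(null_index, ord(s[null_index]))  # 14 0
```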


## Unicode Encodings

While the Unicode standard defines a mapping from characters to code points (integers), it’s impractical to
train tokenizers directly on Unicode code points, since the vocabulary would be prohibitively large (around
150K items) and sparse (since many characters are quite rare). Instead, we’ll use a Unicode encoding, which
converts a Unicode character into a sequence of bytes. The Unicode standard itself defines three encodings:
UTF-8, UTF-16, and UTF-32, with UTF-8 being the dominant encoding for the Internet (more than 98%
of all webpages).

To encode a Unicode string into UTF-8, we can use the `str.encode()` method in Python. To access the
underlying byte values of a Python `bytes` object, we can iterate over it (e.g., by calling `list()` on it).
Finally, we can use the `bytes.decode()` method to decode a UTF-8 byte string into a Unicode string.

In [8]:
test_string = "hello! こんにちは!"
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)

print(type(utf8_encoded))

# Get the byte values for the encoded string (integers from 0 to 255).
print(list(utf8_encoded))

# One byte does not necessarily correspond to one Unicode character!
print(len(test_string))
print(len(utf8_encoded))
print(utf8_encoded.decode("utf-8"))

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
<class 'bytes'>
[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
13
23
hello! こんにちは!


By converting our Unicode code points into a sequence of bytes (e.g., via the UTF-8 encoding), we
are essentially taking a sequence of code points (integers in the range 0 to 1,114,111, of which roughly
150K are assigned to characters) and transforming it into a sequence of byte values (integers in the range
0 to 255). The 256-entry byte vocabulary is much more manageable to deal with. When using byte-level
tokenization, we do not need to worry about out-of-vocabulary tokens, since we know that any input text
can be expressed as a sequence of integers from 0 to 255.
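
As a quick sanity check of the claim that any text maps to integers from 0 to 255, the sketch below round-trips a string that mixes ASCII, accented, CJK, and emoji characters. The helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and only used for illustration.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (each in the range 0..255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids by rebuilding the bytes and decoding them."""
    return bytes(ids).decode("utf-8")


# "naïve 牛 🙂" written with escapes so the code points are unambiguous.
sample = "na\u00efve \u725b \U0001F642"
ids = text_to_byte_ids(sample)

# Every id fits in the 256-entry byte vocabulary, whatever the input text.
print(all(0 <= i <= 255 for i in ids))

# The mapping is lossless: decoding the ids recovers the original string.
print(byte_ids_to_text(ids) == sample)

# 9 characters become 15 bytes (1-, 2-, 3-, and 4-byte characters are mixed).
print(len(sample), len(ids))
```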

- Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into
a Unicode string. Why is this function incorrect? Provide an example of an input byte string
that yields incorrect results.

In [9]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))

'hello'

In [13]:
"hello牛".encode("utf-8")

b'hello\xe7\x89\x9b'

In [14]:
decode_utf8_bytes_to_str_wrong("hello牛".encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
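
The error arises because `0xe7` is only the first byte of a three-byte character, so decoding it in isolation cannot succeed. One possible fix, sketched below, is to decode the whole byte string at once; if bytes really must be fed one at a time, the standard library's incremental decoder buffers incomplete sequences. The function names here are illustrative, not part of the assignment code.

```python
import codecs


def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode all bytes together so multi-byte characters stay intact.
    return bytestring.decode("utf-8")


def decode_utf8_incrementally(bytestring: bytes) -> str:
    # The incremental decoder keeps partial multi-byte sequences in an
    # internal buffer and emits a character only once it is complete.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(bytes([b])) for b in bytestring]
    # final=True raises if the input ended in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


encoded = "hello牛".encode("utf-8")
print(decode_utf8_bytes_to_str(encoded))
print(decode_utf8_incrementally(encoded))
```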

- Give a two-byte sequence that does not decode to any Unicode character(s).

In [23]:
bytes(b"\xe7\x89\x9b").decode("utf-8")

'牛'

In [24]:
bytes(b"\xe7\x89").decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
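
To test candidate byte sequences without reading tracebacks, a small validity check can wrap the decode call. This is a minimal sketch; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xe7\x89\x9b"))  # True: the full three-byte character
print(is_valid_utf8(b"\xe7\x89"))      # False: truncated three-byte sequence
print(is_valid_utf8(b"\x89\x9b"))      # False: starts with a continuation byte
print(is_valid_utf8(b"\xff\xff"))      # False: 0xff never appears in UTF-8
```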

## Experimenting with BPE Tokenizer Training

In [25]:
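# Note: \p{L} and \p{N} are not supported by the built-in `re` module; use the third-party `regex` package.
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""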

Implementation remarks:
- Run `uv run pytest tests/test_train_bpe.py` to unit-test the tokenizer training part. There are three tests:
    - `test_train_bpe_speed`: passes if the running time on `corpus.en` is under 1.5 seconds;
    - `test_train_bpe`: passes if the constructed vocab and values match the GPT-2 reference;
    - `test_train_bpe_special_tokens`: appears to pass if `b"<|"` is never merged into a token (to be confirmed);
- Modify `adapters.py`, which is designed to be a wrapper around the actual function;
- `pretokenization_example.py` serves as a (possibly slow) example of a pre-tokenization implementation.

In [7]:
list("abc牛牛".encode("utf-8"))

[97, 98, 99, 231, 137, 155, 231, 137, 155]

In [8]:
list("<|endoftext|>".encode("utf-8"))

[60, 124, 101, 110, 100, 111, 102, 116, 101, 120, 116, 124, 62]
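
BPE training repeatedly counts adjacent token pairs and merges the most frequent one into a new token. Below is a minimal sketch of a single counting-and-merge step on the first byte list above. It ignores pre-tokenization and special-token handling, the helper names `count_pairs` and `merge_pair` are hypothetical, and breaking ties by taking the larger pair is an assumption made here so the result is deterministic.

```python
from collections import Counter


def count_pairs(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of pair with new_id."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("abc牛牛".encode("utf-8"))
counts = count_pairs(ids)

# Assumption: ties in frequency are broken by preferring the larger pair.
best = max(counts, key=lambda p: (counts[p], p))

# Byte values occupy ids 0..255, so the first merged token gets id 256.
merged = merge_pair(ids, best, 256)
print(best, counts[best])  # (231, 137) 2
print(merged)              # [97, 98, 99, 256, 155, 256, 155]
```

In a full trainer this step would be repeated until the target vocabulary size is reached, with special tokens such as `<|endoftext|>` kept out of the pair counts.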