## tokenizing with byte-pair encoding

### *Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. (2015)*

[Byte-pair encoding](https://arxiv.org/pdf/1508.07909) (BPE) is a compression algorithm, originally described by Philip Gage in 1994, that Rico Sennrich, Barry Haddow and Alexandra Birch adapted to word segmentation for neural machine translation, specifically to solve the problem of translating rare words. word-level models rely on complete words, so they cannot generate or translate a word they have never seen, a problem that grows when dealing with various dialects, alphabets, etc.

Sennrich et al. showed that rare words can be translated by encoding them as subword units. The resulting vocabulary (a set of variable-length tokens) is fixed in size, yet it can handle any input given to it. the argument was that a competent translator can translate a word 'transparently', even if the word is novel or unknown, by analysing the known subword units it contains (such as morphemes or phonemes).

> a very simplified view of the algorithm:
>
> - Start with a vocabulary that is just the raw symbols (characters)
> - Repeatedly merge the most-common adjacent symbol pairs into new, longer tokens.
> - Stop when you hit a fixed vocab size (e.g. 50k)
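
The three steps above can be sketched in a few lines of Python. This is a minimal, unoptimized sketch and not the authors' reference code: the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`), the `</w>` end-of-word marker and the tie-breaking rule (first pair seen wins) are assumptions made for illustration, and the stopping rule is expressed as a number of merges rather than a target vocabulary size.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dictionary."""
    # Step 1: start from raw characters, plus an end-of-word marker (assumption).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:  # every word is already a single symbol
            break
        # Step 2: merge the most frequent pair (ties go to the first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    # Step 3: the final vocabulary size is (number of base symbols + len(merges)).
    return merges, vocab

toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(toy_corpus, num_merges=10)
print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

On this toy corpus the frequent suffix `est` is merged first, which is the kind of subword unit the paper relies on.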

### *Language Models are Unsupervised Multitask Learners, Radford et al. (2019)*

this paper adopted BPE (in a byte-level variant) as the tokenization scheme for language models, recognizing and leveraging the many properties that make the algorithm attractive for language modeling. alternatives to BPE included classic word-level tokenizers that, as Sennrich et al. suggested, choke on funky slang, creative spelling, emojis, and anything else they have never seen.

> nice compression (one token per known word) but brittle. any typo or neologism explodes into an `<unk>` or a pile of fallback characters.

character-level models fixed the coverage problem, but they blew up sequence length and lost track of higher-level patterns more easily.

> antidisestablishmentarianism -> 28 tokens. long sequences, slower training, limits how much context we can put into the model's window.
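
Both failure modes are easy to reproduce with two toy tokenizers. The sketch below is only illustrative: `word_tokenize`, `char_tokenize` and the tiny `word_vocab` are hypothetical helpers, not part of any library.

```python
def word_tokenize(text, vocab):
    """Word-level: one token per known word, `<unk>` for everything else."""
    return [word if word in vocab else "<unk>" for word in text.split()]

def char_tokenize(text):
    """Character-level: one token per character, so nothing is ever unknown."""
    return list(text)

word_vocab = {"the", "cat", "sat", "on", "mat"}

# A single typo ("matt") is enough to fall out of the word vocabulary.
word_tokens = word_tokenize("the cat sat on the matt", word_vocab)
print(word_tokens)  # ['the', 'cat', 'sat', 'on', 'the', '<unk>']

# Full coverage, but one long word already costs 28 tokens.
char_tokens = char_tokenize("antidisestablishmentarianism")
print(len(char_tokens))  # 28
```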

BPE is described as a middle ground between the two: shorter than characters, nimbler than words. It produces fewer tokens than a pure character model (shorter sequences) while staying more adaptable than a pure word model.







### implementing byte-pair encoding

Strings are sequences of Unicode code points. [Unicode](https://en.wikipedia.org/wiki/Unicode) is a character encoding standard defining more than 150,000 characters and 168 scripts.

> From the Python documentation: *Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points.*

the vast majority of text available on the internet is encoded using Unicode. The `ord()` function returns the integer Unicode code point of a given character.

In [7]:
characs = ['ლ', 'პ', '🌞', '🔥', '༂', '༅']
for char in characs:
    print(f"Character: {char} -> Unicode code: {ord(char)}")

Character: ლ -> Unicode code: 4314
Character: პ -> Unicode code: 4318
Character: 🌞 -> Unicode code: 127774
Character: 🔥 -> Unicode code: 128293
Character: ༂ -> Unicode code: 3842
Character: ༅ -> Unicode code: 3845


computers ultimately read and write bytes, not abstract code points like the ones shown above. An *encoding* is a mapping that turns a sequence of code points into a sequence of bytes (a process called serialization) and back (deserialization). The Unicode Standard defines three main encodings used for electronic communication:

- UTF-8 -> 1-byte code units, variable length (1-4 bytes per code point), typically used on the web, in APIs and on Unix.
- UTF-16 -> 2-byte code units, variable length (2 or 4 bytes per code point).
- UTF-32 -> 4-byte code units, fixed width (always 4 bytes per code point).
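
To see where the "1-4 bytes per code point" of UTF-8 comes from, here is a hand-rolled encoder checked against Python's built-in one. It is a sketch for illustration only: `utf8_encode` is a hypothetical helper, and it assumes a valid code point (no surrogate or range checks).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (assumes a valid code point)."""
    if code_point < 0x80:        # up to 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "Aéლ🔥":
    manual = utf8_encode(ord(ch))
    assert manual == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"{ch!r}: U+{ord(ch):04X} -> {len(manual)} byte(s): {manual.hex(' ')}")
```

The leading bits of the first byte announce how many bytes follow, which is what lets a decoder walk a variable-length stream without separators.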

almost every single webpage is transmitted as UTF-8, mostly due to the following reasons:
- of the three, it is the only one that is backward compatible with ASCII (another encoding standard used in a lot of legacy tooling), meaning that any text file encoded in ASCII can be decoded as UTF-8 to get exactly the same result.
- it is space efficient, at least for Latin-script corpora. English text, for instance, stays at around 1 byte per character, whereas UTF-16 and UTF-32 double or quadruple it (see below).
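
The backward-compatibility claim can be checked directly. A small sketch, using an arbitrary ASCII-only sample string:

```python
text = "Hello, world!"  # ASCII-only sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# ASCII bytes are already valid UTF-8 and decode to the same text.
roundtrip = ascii_bytes.decode("utf-8")
print(ascii_bytes == utf8_bytes, roundtrip == text)  # True True

# UTF-16 produces a different byte sequence for the same ASCII text.
print(ascii_bytes == utf16_bytes)  # False
```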

In [8]:
from sys import getsizeof

samples = {
    "ascii": "Hello, world!",
    "mutlilang": "こんにちは世界🌍", # hello world in japanese + emoji
    "emojis": "🔥🌞💧🌱" # emojis
}

# header row; getsizeof reports the in-memory size of the str object, not an encoded length
print(f"{'name':10} | {'utf-8':>5} | {'utf-16-le':>10} | {'utf-32-le':>10} | Python str (CPython 3.12) sizeof")
print("---------")
for name, s in samples.items():
    utf8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))
    u32 = len(s.encode("utf-32-le"))
    pyobj = getsizeof(s)
    print(f"{name:10} | {utf8:5d} | {u16:10d} | {u32:10d} | {pyobj:27d}")

name       | utf-8 |  utf-16-le |  utf-32-le | Python str (CPython 3.12) sizeof
---------
ascii      |    13 |         26 |         52 |                          62
mutlilang  |    25 |         18 |         32 |                         108
emojis     |    16 |         16 |         16 |                          92


Notice that, for ASCII-heavy text (Latin-script corpora), UTF-8 is 2x and 4x smaller than UTF-16 and UTF-32 respectively. let's show the proportion of zero-valued bytes and preview the first 32 bytes of each encoding:

In [9]:
def hex_preview(blob: bytes, limit: int = 32) -> str:
    """Return the first `limit` bytes as space-separated hex, plus ' ...' if truncated."""
    head = blob[:limit]
    return " ".join(f"{b:02x}" for b in head) + (" ..." if len(blob) > limit else "")

for name, s in samples.items():
    print(name.upper())
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        blob = s.encode(enc)
        zeros = blob.count(0)
        print(f"{enc:10} | {len(blob):2d} bytes | {zeros:2d} zero bytes | {100*zeros/len(blob):5.1f}% zeros")
        print("  ", hex_preview(blob))
    print()



ASCII
utf-8      | 13 bytes |  0 zero bytes |   0.0% zeros
   48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21
utf-16-le  | 26 bytes | 13 zero bytes |  50.0% zeros
   48 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00
utf-32-le  | 52 bytes | 39 zero bytes |  75.0% zeros
   48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f 00 00 00 2c 00 00 00 20 00 00 00 77 00 00 00 ...

MUTLILANG
utf-8      | 25 bytes |  0 zero bytes |   0.0% zeros
   e3 81 93 e3 82 93 e3 81 ab e3 81 a1 e3 81 af e4 b8 96 e7 95 8c f0 9f 8c 8d
utf-16-le  | 18 bytes |  0 zero bytes |   0.0% zeros
   53 30 93 30 6b 30 61 30 6f 30 16 4e 4c 75 3c d8 0d df
utf-32-le  | 32 bytes | 15 zero bytes |  46.9% zeros
   53 30 00 00 93 30 00 00 6b 30 00 00 61 30 00 00 6f 30 00 00 16 4e 00 00 4c 75 00 00 0d f3 01 00

EMOJIS
utf-8      | 16 bytes |  0 zero bytes |   0.0% zeros
   f0 9f 94 a5 f0 9f 8c 9e f0 9f 92 a7 f0 9f 8c b1
utf-16-le  | 16 bytes |  0 zero bytes |   0.0% zeros
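   3d d8 25 dd 3c d8 1e df 3d d8 a7 dc 3c d8 31 df
utf-32-le  | 16 bytes |  4 zero bytes |  25.0% zeros
   25 f5 01 00 1e f3 01 00 a7 f4 01 00 31 f3 01 00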

note that each ASCII character carries one zero byte of padding in UTF-16-LE and three in UTF-32-LE. that is 50% and 75% 'wasted' space respectively, at least for corpora dominated by Latin symbols. even for emoji-heavy text, where every code point already takes 4 bytes in UTF-8, a quarter of the UTF-32 bytes are still zeros.