Designed as a walkthrough with this [YouTube video](https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing)

# Tokenization :(

Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

- Why can't LLM spell words? **Tokenization**.
- Why can't LLM do super simple string processing tasks like reversing a string? **Tokenization**.
- Why is LLM worse at non-English languages (e.g. Japanese)? **Tokenization**.
- Why is LLM bad at simple arithmetic? **Tokenization**.
- Why did GPT-2 have more than necessary trouble coding in Python? **Tokenization**.
- Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? **Tokenization**.
- What is this weird warning I get about a "trailing whitespace"? **Tokenization**.
- Why does the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization**.
- Why should I prefer to use YAML over JSON with LLMs? **Tokenization**.
- Why is LLM not actually end-to-end language modeling? **Tokenization**.
- What is the real root of suffering? **Tokenization**.

---

Good tokenization web app: [https://tiktokenizer.vercel.app](https://tiktokenizer.vercel.app)

Example string:

```
Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

127 + 677 = 804
1275 + 6773 = 8041

Egg.
I have an Egg.
egg.
EGG.

만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요.

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

---

Much glory awaits someone who can delete the need for tokenization. But meanwhile, let's learn about it.


In [1]:
"안녕하세요 👋 (hello in Korean!)"

'안녕하세요 👋 (hello in Korean!)'

In [2]:
len([ord(x) for x in "안녕하세요 👋 (hello in Korean!)"])

26

In [3]:
len(list("안녕하세요 👋 (hello in Korean!)".encode("utf-8"))) # UTF-8 keeps the vocab at 256 byte values but lengthens the sequence

39
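
The gap between 26 code points and 39 bytes comes from UTF-8 being variable-length. A minimal sketch of where the extra bytes go; the sample characters are picked here for illustration and are not part of the original notebook:

```python
# How many UTF-8 bytes does each kind of character need?
samples = ["h", "é", "안", "👋"]  # ASCII, Latin-1 range, Hangul, emoji
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in byte_counts.items():
    print(f"{ch!r}: U+{ord(ch):04X} -> {n} byte(s)")

# Same string as above: code points vs. bytes
s = "안녕하세요 👋 (hello in Korean!)"
print(len(s), len(s.encode("utf-8")))  # 26 39
```

Each Hangul syllable costs 3 bytes and the emoji costs 4, while the ASCII part costs 1 byte per character.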

In [4]:
# text from https://www.reedbeta.com/blog/programmers-intro-to-unicode/
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(text)
print("length:", len(text))
print('---')
print(tokens)
print("length:", len(tokens))

---
Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
length: 533
---
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140

## Implement BPE

We now apply Byte Pair Encoding (BPE) to this text. See the [BPE article](https://en.wikipedia.org/wiki/Byte_pair_encoding).

The idea is to find the most commonly occurring pair of adjacent tokens (bigram) in this list, replace every occurrence with a new token, and repeat. Each merge shortens the sequence but grows the vocabulary, which in turn enlarges the embedding table and the softmax layer.
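
A minimal sketch of a single merge step on a toy string. The string and the helper name `merge_most_common` are illustrative assumptions, not part of the notebook, and the helper assumes the input has at least two tokens:

```python
from collections import Counter

def merge_most_common(ids, new_id):
    """Replace every occurrence of the most frequent adjacent pair with new_id."""
    counts = Counter(zip(ids, ids[1:]))
    pair, _ = counts.most_common(1)[0]
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)  # consume both members of the pair
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out, pair

ids = list("aaabdaaabac".encode("utf-8"))
merged, pair = merge_most_common(ids, 256)
print(pair)                   # (97, 97), i.e. "aa"
print(len(ids), len(merged))  # 11 9
print(merged)                 # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Note that the pair count includes overlapping occurrences (`"aaa"` contains `"aa"` twice), but the left-to-right replacement can only consume one of them.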

In [5]:
def get_stats(byte_list):
    freq_dict = {}

    for b1, b2 in zip(byte_list, byte_list[1:]):
        freq_dict[(b1, b2)] = freq_dict.get((b1, b2), 0) + 1

    freq_list = sorted([(v, k) for k, v in freq_dict.items()], reverse=True)
    return freq_list

In [6]:
get_stats(tokens)[:20]

[(20, (101, 32)),
 (15, (240, 159)),
 (12, (226, 128)),
 (12, (105, 110)),
 (10, (115, 32)),
 (10, (97, 110)),
 (10, (32, 97)),
 (9, (32, 116)),
 (8, (116, 104)),
 (7, (159, 135)),
 (7, (159, 133)),
 (7, (97, 114)),
 (6, (239, 189)),
 (6, (140, 240)),
 (6, (128, 140)),
 (6, (116, 32)),
 (6, (114, 32)),
 (6, (111, 114)),
 (6, (110, 103)),
 (6, (110, 100))]

In [7]:
chr(101), chr(32) # visualise the most common pair

('e', ' ')
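
`chr` only gives a faithful picture for byte values below 128. Pairs such as `(240, 159)` are fragments of a multi-byte UTF-8 sequence, not characters. A small sketch for inspecting any pair; `show_pair` is a hypothetical helper name:

```python
def show_pair(pair):
    """Render a byte pair as text; bytes that are not valid UTF-8 on their own become U+FFFD."""
    return bytes(pair).decode("utf-8", errors="replace")

print(repr(show_pair((101, 32))))   # 'e '
print(repr(show_pair((105, 110))))  # 'in'
print(repr(show_pair((240, 159))))  # incomplete emoji prefix, shown as a replacement character
```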

In [8]:
def create_new_token_list(byte_list, new_token_idx_start):
    freq_list = get_stats(byte_list)

    # the list is sorted by count (descending), so the highest count comes first
    max_freq = freq_list[0][0]

    # nothing to merge if every pair occurs only once
    if max_freq == 1:
        return byte_list, new_token_idx_start, {}

    # every pair tied for the highest count gets its own new token
    new_token_mapping = {}
    for count, pair in freq_list:
        if count == max_freq:
            new_token_mapping[pair] = new_token_idx_start
            new_token_idx_start += 1
        else:
            break

    new_byte_list = []

    i = 0
    while i < len(byte_list):
        if (i < len(byte_list) - 1) and (new_token_mapping.get((byte_list[i], byte_list[i + 1])) is not None):
            new_byte_list.append(new_token_mapping[byte_list[i], byte_list[i + 1]])
            i += 2
        else:
            new_byte_list.append(byte_list[i])
            i += 1

    return new_byte_list, new_token_idx_start, new_token_mapping

In [9]:
# now loop this until no merge happens, to get the BPE token list

new_token_idx_start = 256
bpe_tokens = tokens
i = 1
token_mapping = {i: i for i in range(256)}
while True:
    old_vocab_len = len(token_mapping)
    bpe_tokens_new, new_token_idx_start, new_token_mapping = create_new_token_list(bpe_tokens, new_token_idx_start)
    token_mapping = token_mapping | new_token_mapping
    print(f"Iteration {i:3d} | Old Tokens size: {len(bpe_tokens):4d} | Old vocab size: {old_vocab_len:4d} | New Tokens size: {len(bpe_tokens_new):4d} | | New vocab size: {len(token_mapping):4d}")

    if len(bpe_tokens) == len(bpe_tokens_new):
        break
    else:
        bpe_tokens = bpe_tokens_new
    i += 1

Iteration   1 | Old Tokens size:  616 | Old vocab size:  256 | New Tokens size:  596 | | New vocab size:  257
Iteration   2 | Old Tokens size:  596 | Old vocab size:  257 | New Tokens size:  581 | | New vocab size:  258
Iteration   3 | Old Tokens size:  581 | Old vocab size:  258 | New Tokens size:  557 | | New vocab size:  260
Iteration   4 | Old Tokens size:  557 | Old vocab size:  260 | New Tokens size:  537 | | New vocab size:  262
Iteration   5 | Old Tokens size:  537 | Old vocab size:  262 | New Tokens size:  529 | | New vocab size:  263
Iteration   6 | Old Tokens size:  529 | Old vocab size:  263 | New Tokens size:  508 | | New vocab size:  266
Iteration   7 | Old Tokens size:  508 | Old vocab size:  266 | New Tokens size:  472 | | New vocab size:  273
Iteration   8 | Old Tokens size:  472 | Old vocab size:  273 | New Tokens size:  466 | | New vocab size:  274
Iteration   9 | Old Tokens size:  466 | Old vocab size:  274 | New Tokens size:  446 | | New vocab size:  278
Iteration 

### Bring everything together

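This version follows Karpathy's implementation: each iteration merges only the single most frequent pair, and the loop runs until the vocabulary reaches a target size.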

In [10]:
def get_stats(byte_list):
    freq_dict = {}

    for b1, b2 in zip(byte_list, byte_list[1:]):
        freq_dict[(b1, b2)] = freq_dict.get((b1, b2), 0) + 1

    return freq_dict

def create_new_token_list(byte_list, new_token_idx):
    freq_dict = get_stats(byte_list)

    max_freq_pair = max(freq_dict, key=freq_dict.get)

    new_byte_list = []

    i = 0
    while i < len(byte_list):
        if (i < len(byte_list) - 1) and ((byte_list[i], byte_list[i + 1]) == max_freq_pair):
            new_byte_list.append(new_token_idx)
            i += 2
        else:
            new_byte_list.append(byte_list[i])
            i += 1

    return new_byte_list, {max_freq_pair: new_token_idx}

def bpe(tokens_list, orig_vocab_size, desired_vocab_size):
    num_compressions = desired_vocab_size - orig_vocab_size
    new_tokens_dict = {}
    print(f"Iteration {0:4d} | Sequence length = {len(tokens_list):10d}")


    for i in range(num_compressions):
        tokens_list, new_pair_dict = create_new_token_list(tokens_list, orig_vocab_size + i)
        new_tokens_dict = new_tokens_dict | new_pair_dict
        print(f"Iteration {(i + 1):4d} | Sequence length = {len(tokens_list):10d} | Pair compressed: {str(new_pair_dict):15s}")

    return tokens_list, new_tokens_dict
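
The merge table returned by `bpe` is enough to undo the compression. A minimal sketch, assuming merges are stored as `{(left, right): new_id}` as above; the `decode` helper and `toy_merges` are hypothetical names for illustration, not part of the notebook:

```python
def decode(ids, merges):
    """Expand merged token ids back into text.

    `merges` maps (left, right) -> new_id, the same shape that bpe() returns.
    """
    # Start with the 256 raw bytes, then add merged tokens in creation order,
    # so both halves of a pair are already known when the pair is expanded.
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[left] + vocab[right]
    raw = b"".join(vocab[i] for i in ids)
    # A token sequence may end mid-character, so replace invalid bytes instead of raising.
    return raw.decode("utf-8", errors="replace")

# Hand-written merges for illustration: 256 = "e ", 257 = "th", 258 = "th" + "e "
toy_merges = {(101, 32): 256, (116, 104): 257, (257, 256): 258}
print(repr(decode([258, 257, 256], toy_merges)))  # 'the the '
```

Decoding the output of `bpe` with its own merge table should reproduce the training text, which is a cheap sanity check that the merges are lossless.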

In [11]:
# making the training text longer to have more representative token statistics
# text from https://www.reedbeta.com/blog/programmers-intro-to-unicode/
text = """A Programmer’s Introduction to Unicode March 3, 2017 · Coding · 22 Comments  Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.  A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view.  I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text. However, in this article I’m not going to talk about fonts, text layout/shaping/rendering, or localization in detail—those are separate issues, beyond my scope (and knowledge) here.  Diversity and Inherent Complexity The Unicode Codespace Codespace Allocation Scripts Usage Frequency Encodings UTF-8 UTF-16 Combining Marks Canonical Equivalence Normalization Forms Grapheme Clusters And More… Diversity and Inherent Complexity As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. It’s not just that Unicode contains a much larger number of characters, although that’s part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere “character set” to be. We’ll see some of that later in this article.  When confronting all this complexity, especially as an engineer, it’s hard not to find oneself asking, “Why do we need all this? 
Is this really necessary? Couldn’t it be simplified?”  However, Unicode aims to faithfully represent the entire world’s writing systems. The Unicode Consortium’s stated goal is “enabling people around the world to use computers in any language”. And as you might imagine, the diversity of written languages is immense! To date, Unicode supports 135 different scripts, covering some 1100 languages, and there’s still a long tail of over 100 unsupported scripts, both modern and historical, which people are still working to add.  Given this enormous diversity, it’s inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesn’t make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission.  Moreover, Unicode is committed not just to supporting texts in any single language, but also to letting multiple languages coexist within one text—which introduces even more complexity.  Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, you’ll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but don’t be discouraged—think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity!  The Unicode Codespace Let’s start with some general orientation. The basic elements of Unicode—its “characters”, although that term isn’t quite right—are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta. 
Each code point also has a short name, and quite a few other properties, specified in the Unicode Character Database.  The set of all possible code points is called the codespace. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them—about 12% of the codespace—are actually assigned, to date. There’s plenty of room for growth! Unicode also reserves an additional 137,468 code points as “private use” areas, which have no standardized meaning and are available for individual applications to define for their own purposes.  Codespace Allocation To get a feel for how the codespace is laid out, it’s helpful to visualize it. Below is a map of the entire codespace, with one pixel per code point. It’s arranged in tiles for visual coherence; each small square is 16×16 = 256 code points, and each large square is a “plane” of 65,536 code points. There are 17 planes altogether.  Map of the Unicode codespace (click to zoom)  White represents unassigned space. Blue is assigned code points, green is private-use areas, and the small red area is surrogates (more about those later). As you can see, the assigned code points are distributed somewhat sparsely, but concentrated in the first three planes.  Plane 0 is also known as the “Basic Multilingual Plane”, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Cyrillic, Greek, Han (Chinese), Japanese, Korean, Arabic, Hebrew, Devanagari (Indian), and many more.  (In the past, the codespace was just the BMP and no more—Unicode was originally conceived as a straightforward 16-bit encoding, with only 65,536 code points. It was expanded to its current size in 1996. However, the vast majority of code points in modern text belong to the BMP.)  Plane 1 contains historical scripts, such as Sumerian cuneiform and Egyptian hieroglyphs, as well as emoji and various other symbols. Plane 2 contains a large block of less-common and historical Han characters. 
The remaining planes are empty, except for a small number of rarely-used formatting characters in Plane 14; planes 15–16 are reserved entirely for private use.  Scripts Let’s zoom in on the first three planes, since that’s where the action is:  Map of scripts in Unicode planes 0–2 (click to zoom)  This map color-codes the 135 different scripts in Unicode. You can see how Han () and Korean () take up most of the range of the BMP (the left large square). By contrast, all of the European, Middle Eastern, and South Asian scripts fit into the first row of the BMP in this diagram.  Many areas of the codespace are adapted or copied from earlier encodings. For example, the first 128 code points of Unicode are just a copy of ASCII. This has clear benefits for compatibility—it’s easy to losslessly convert texts from smaller encodings into Unicode (and the other direction too, as long as no characters outside the smaller encoding are used).  Usage Frequency One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.  Heat map of code point usage frequency in Unicode planes 0–2 (click to zoom)  You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.  Encodings We’ve seen that Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do code points get represented as bytes, in memory or in a file?  
The most convenient, computer-friendliest (and programmer-friendliest) thing to do would be to just store the code point index as a 32-bit integer. This works, but it consumes 4 bytes per code point, which is sort of a lot. Using 32-bit ints for Unicode will cost you a bunch of extra storage, memory, and performance in bandwidth-bound scenarios, if you work with a lot of text.  Consequently, there are several more-compact encodings for Unicode. The 32-bit integer encoding is officially called UTF-32 (UTF = “Unicode Transformation Format”), but it’s rarely used for storage. At most, it comes up sometimes as a temporary internal representation, for examining or operating on the code points in a string.  Much more commonly, you’ll see Unicode text encoded as either UTF-8 or UTF-16. These are both variable-length encodings, made up of 8-bit or 16-bit units, respectively. In these schemes, code points with smaller index values take up fewer bytes, which saves a lot of memory for typical texts. The trade-off is that processing UTF-8/16 texts is more programmatically involved, and likely slower.  UTF-8 In UTF-8, each code point is stored using 1 to 4 bytes, based on its index value.  UTF-8 uses a system of binary prefixes, in which the high bits of each byte mark whether it’s a single byte, the beginning of a multi-byte sequence, or a continuation byte; the remaining bits, concatenated, give the code point index. This table shows how it works:  UTF-8 (binary)\tCode point (binary)\tRange 0xxxxxxx\txxxxxxx\tU+0000–U+007F 110xxxxx 10yyyyyy\txxxxxyyyyyy\tU+0080–U+07FF 1110xxxx 10yyyyyy 10zzzzzz\txxxxyyyyyyzzzzzz\tU+0800–U+FFFF 11110xxx 10yyyyyy 10zzzzzz 10wwwwww\txxxyyyyyyzzzzzzwwwwww\tU+10000–U+10FFFF A handy property of UTF-8 is that code points below 128 (ASCII characters) are encoded as single bytes, and all non-ASCII code points are encoded using sequences of bytes 128–255. This has a couple of nice consequences. 
First, any strings or files out there that are already in ASCII can also be interpreted as UTF-8 without any conversion. Second, lots of widely-used string programming idioms—such as null termination, or delimiters (newlines, tabs, commas, slashes, etc.)—will just work on UTF-8 strings. ASCII bytes never occur inside the encoding of non-ASCII code points, so searching byte-wise for a null terminator or a delimiter will do the right thing.  Thanks to this convenience, it’s relatively simple to extend legacy ASCII programs and APIs to handle UTF-8 strings. UTF-8 is very widely used in the Unix/Linux and Web worlds, and many programmers argue UTF-8 should be the default encoding everywhere.  However, UTF-8 isn’t a drop-in replacement for ASCII strings in all respects. For instance, code that iterates over the “characters” in a string will need to decode UTF-8 and iterate over code points (or maybe grapheme clusters—more about those later), not bytes. When you measure the “length” of a string, you’ll need to think about whether you want the length in bytes, the length in code points, the width of the text when rendered, or something else.  UTF-16 The other encoding that you’re likely to encounter is UTF-16. It uses 16-bit words, with each code point stored as either 1 or 2 words.  Like UTF-8, we can express the UTF-16 encoding rules in the form of binary prefixes:  UTF-16 (binary)\tCode point (binary)\tRange xxxxxxxxxxxxxxxx\txxxxxxxxxxxxxxxx\tU+0000–U+FFFF 110110xxxxxxxxxx 110111yyyyyyyyyy\txxxxxxxxxxyyyyyyyyyy + 0x10000\tU+10000–U+10FFFF A more common way that people talk about UTF-16 encoding, though, is in terms of code points called “surrogates”. All the code points in the range U+D800–U+DFFF—or in other words, the code points that match the binary prefixes 110110 and 110111 in the table above—are reserved specifically for UTF-16 encoding, and don’t represent any valid characters on their own. 
They’re only meant to occur in the 2-word encoding pattern above, which is called a “surrogate pair”. Surrogate code points are illegal in any other context! They’re not allowed in UTF-8 or UTF-32 at all.  Historically, UTF-16 is a descendant of the original, pre-1996 versions of Unicode, in which there were only 65,536 code points. The original intention was that there would be no different “encodings”; Unicode was supposed to be a straightforward 16-bit character set. Later, the codespace was expanded to make room for a long tail of less-common (but still important) Han characters, which the Unicode designers didn’t originally plan for. Surrogates were then introduced, as—to put it bluntly—a kludge, allowing 16-bit encodings to access the new code points.  Today, Javascript uses UTF-16 as its standard string representation: if you ask for the length of a string, or iterate over it, etc., the result will be in UTF-16 words, with any code points outside the BMP expressed as surrogate pairs. UTF-16 is also used by the Microsoft Win32 APIs; though Win32 supports either 8-bit or 16-bit strings, the 8-bit version unaccountably still doesn’t support UTF-8—only legacy code-page encodings, like ANSI. This leaves UTF-16 as the only way to get proper Unicode support in Windows. (Update: in Win10 version 1903, they finally added UTF-8 support to the 8-bit APIs! 😊)  By the way, UTF-16’s words can be stored either little-endian or big-endian. Unicode has no opinion on that issue, though it does encourage the convention of putting U+FEFF zero width no-break space at the top of a UTF-16 file as a byte-order mark, to disambiguate the endianness. (If the file doesn’t match the system’s endianness, the BOM will be decoded as U+FFFE, which isn’t a valid code point.)  Combining Marks In the story so far, we’ve been focusing on code points. But in Unicode, a “character” can be more complicated than just an individual code point!  
Unicode includes a system for dynamically composing characters, by combining multiple code points together. This is used in various ways to gain flexibility without causing a huge combinatorial explosion in the number of code points.  In European languages, for example, this shows up in the application of diacritics to letters. Unicode supports a wide range of diacritics, including acute and grave accents, umlauts, cedillas, and many more. All these diacritics can be applied to any letter of any alphabet—and in fact, multiple diacritics can be used on a single letter.  If Unicode tried to assign a distinct code point to every possible combination of letter and diacritics, things would rapidly get out of hand. Instead, the dynamic composition system enables you to construct the character you want, by starting with a base code point (the letter) and appending additional code points, called “combining marks”, to specify the diacritics. When a text renderer sees a sequence like this in a string, it automatically stacks the diacritics over or under the base letter to create a composed character.  For example, the accented character “Á” can be expressed as a string of two code points: U+0041 “A” latin capital letter a plus U+0301 “◌́” combining acute accent. This string automatically gets rendered as a single character: “Á”.  Now, Unicode does also include many “precomposed” code points, each representing a letter with some combination of diacritics already applied, such as U+00C1 “Á” latin capital letter a with acute or U+1EC7 “ệ” latin small letter e with circumflex and dot below. I suspect these are mostly inherited from older encodings that were assimilated into Unicode, and kept around for compatibility. In practice, there are precomposed code points for most of the common letter-with-diacritic combinations in European-script languages, so they don’t use dynamic composition that much in typical text.  
Still, the system of combining marks does allow for an arbitrary number of diacritics to be stacked on any base character. The reductio-ad-absurdum of this is Zalgo text, which works by ͖͟ͅr͞aṋ̫̠̖͈̗d͖̻̹óm̪͙͕̗̝ļ͇̰͓̳̫ý͓̥̟͍ ̕s̫t̫̱͕̗̰̼̘͜a̼̩͖͇̠͈̣͝c̙͍k̖̱̹͍͘i̢n̨̺̝͇͇̟͙ģ̫̮͎̻̟ͅ ̕n̼̺͈͞u̮͙m̺̭̟̗͞e̞͓̰̤͓̫r̵o̖ṷs҉̪͍̭̬̝̤ ̮͉̝̞̗̟͠d̴̟̜̱͕͚i͇̫̼̯̭̜͡ḁ͙̻̼c̲̲̹r̨̠̹̣̰̦i̱t̤̻̤͍͙̘̕i̵̜̭̤̱͎c̵s ͘o̱̲͈̙͖͇̲͢n͘ ̜͈e̬̲̠̩ac͕̺̠͉h̷̪ ̺̣͖̱ḻ̫̬̝̹ḙ̙̺͙̭͓̲t̞̞͇̲͉͍t̷͔̪͉̲̻̠͙e̦̻͈͉͇r͇̭̭̬͖,̖́ ̜͙͓̣̭s̘̘͈o̱̰̤̲ͅ ̛̬̜̙t̼̦͕̱̹͕̥h̳̲͈͝ͅa̦t̻̲ ̻̟̭̦̖t̛̰̩h̠͕̳̝̫͕e͈̤̘͖̞͘y҉̝͙ ̷͉͔̰̠o̞̰v͈͈̳̘͜er̶f̰͈͔ḻ͕̘̫̺̲o̲̭͙͠ͅw̱̳̺ ͜t̸h͇̭͕̳͍e̖̯̟̠ ͍̞̜͔̩̪͜ļ͎̪̲͚i̝̲̹̙̩̹n̨̦̩̖ḙ̼̲̼͢ͅ ̬͝s̼͚̘̞͝p͙̘̻a̙c҉͉̜̤͈̯̖i̥͡n̦̠̱͟g̸̗̻̦̭̮̟ͅ ̳̪̠͖̳̯̕a̫͜n͝d͡ ̣̦̙ͅc̪̗r̴͙̮̦̹̳e͇͚̞͔̹̫͟a̙̺̙ț͔͎̘̹ͅe̥̩͍ a͖̪̜̮͙̹n̢͉̝ ͇͉͓̦̼́a̳͖̪̤̱p̖͔͔̟͇͎͠p̱͍̺ę̲͎͈̰̲̤̫a̯͜r̨̮̫̣̘a̩̯͖n̹̦̰͎̣̞̞c̨̦̱͔͎͍͖e̬͓͘ ̤̰̩͙̤̬͙o̵̼̻̬̻͇̮̪f̴ ̡̙̭͓͖̪̤“̸͙̠̼c̳̗͜o͏̼͙͔̮r̞̫̺̞̥̬ru̺̻̯͉̭̻̯p̰̥͓̣̫̙̤͢t̳͍̳̖ͅi̶͈̝͙̼̙̹o̡͔n̙̺̹̖̩͝ͅ”̨̗͖͚̩.̯͓  A few other places where dynamic character composition shows up in Unicode:  Vowel-pointing notation in Arabic and Hebrew. In these languages, words are normally spelled with some of their vowels left out. They then have diacritic notation to indicate the vowels (used in dictionaries, language-teaching materials, children’s books, and such). These diacritics are expressed with combining marks.  A Hebrew example, with niqqud:\tאֶת דַלְתִּי הֵזִיז הֵנִיעַ, קֶטֶב לִשְׁכַּתִּי יָשׁוֹד Normal writing (no niqqud):\tאת דלתי הזיז הניע, קטב לשכתי ישוד Devanagari, the script used to write Hindi, Sanskrit, and many other South Asian languages, expresses certain vowels as combining marks attached to consonant letters. For example, “ह” + “\u200bि” = “हि” (“h” + “i” = “hi”). Korean characters stand for syllables, but they are composed of letters called jamo that stand for the vowels and consonants in the syllable. While there are code points for precomposed Korean syllables, it’s also possible to dynamically compose them by concatenating their jamo. For example, “ᄒ” + “ᅡ” + “ᆫ” = “한” (“h” + “a” + “n” = “han”). 
Canonical Equivalence In Unicode, precomposed characters exist alongside the dynamic composition system. A consequence of this is that there are multiple ways to express “the same” string—different sequences of code points that result in the same user-perceived characters. For example, as we saw earlier, we can express the character “Á” either as the single code point U+00C1, or as the string of two code points U+0041 U+0301.  Another source of ambiguity is the ordering of multiple diacritics in a single character. Diacritic order matters visually when two diacritics apply to the same side of the base character, e.g. both above: “ǡ” (dot, then macron) is different from “ā̇” (macron, then dot). However, when diacritics apply to different sides of the character, e.g. one above and one below, then the order doesn’t affect rendering. Moreover, a character with multiple diacritics might have one of the diacritics precomposed and others expressed as combining marks.  For example, the Vietnamese letter “ệ” can be expressed in five different ways:  Fully precomposed: U+1EC7 “ệ” Partially precomposed: U+1EB9 “ẹ” + U+0302 “◌̂” Partially precomposed: U+00EA “ê” + U+0323 “◌̣” Fully decomposed: U+0065 “e” + U+0323 “◌̣” + U+0302 “◌̂” Fully decomposed: U+0065 “e” + U+0302 “◌̂” + U+0323 “◌̣” Unicode refers to set of strings like this as “canonically equivalent”. Canonically equivalent strings are supposed to be treated as identical for purposes of searching, sorting, rendering, text selection, and so on. This has implications for how you implement operations on text. For example, if an app has a “find in file” operation and the user searches for “ệ”, it should, by default, find occurrences of any of the five versions of “ệ” above!  
Normalization Forms To address the problem of “how to handle canonically equivalent strings”, Unicode defines several normalization forms: ways of converting strings into a canonical form so that they can be compared code-point-by-code-point (or byte-by-byte).  The “NFD” normalization form fully decomposes every character down to its component base and combining marks, taking apart any precomposed code points in the string. It also sorts the combining marks in each character according to their rendered position, so e.g. diacritics that go below the character come before the ones that go above the character. (It doesn’t reorder diacritics in the same rendered position, since their order matters visually, as previously mentioned.)  The “NFC” form, conversely, puts things back together into precomposed code points as much as possible. If an unusual combination of diacritics is called for, there may not be any precomposed code point for it, in which case NFC still precomposes what it can and leaves any remaining combining marks in place (again ordered by rendered position, as in NFD).  There are also forms called NFKD and NFKC. The “K” here refers to compatibility decompositions, which cover characters that are “similar” in some sense but not visually identical. However, I’m not going to cover that here.  Grapheme Clusters As we’ve seen, Unicode contains various cases where a thing that a user thinks of as a single “character” might actually be made up of multiple code points under the hood. Unicode formalizes this using the notion of a grapheme cluster: a string of one or more code points that constitute a single “user-perceived character”.  UAX #29 defines the rules for what, precisely, qualifies as a grapheme cluster. It’s approximately “a base code point followed by any number of combining marks”, but the actual definition is a bit more complicated; it accounts for things like Korean jamo, and emoji ZWJ sequences.  
The main thing grapheme clusters are used for is text editing: they’re often the most sensible unit for cursor placement and text selection boundaries. Using grapheme clusters for these purposes ensures that you can’t accidentally chop off some diacritics when you copy-and-paste text, that left/right arrow keys always move the cursor by one visible character, and so on.  Another place where grapheme clusters are useful is in enforcing a string length limit—say, on a database field. While the true, underlying limit might be something like the byte length of the string in UTF-8, you wouldn’t want to enforce that by just truncating bytes. At a minimum, you’d want to “round down” to the nearest code point boundary; but even better, round down to the nearest grapheme cluster boundary. Otherwise, you might be corrupting the last character by cutting off a diacritic, or interrupting a jamo sequence or ZWJ sequence.  And More… There’s much more that could be said about Unicode from a programmer’s perspective! I haven’t gotten into such fun topics as case mapping, collation, compatibility decompositions and confusables, Unicode-aware regexes, or bidirectional text. Nor have I said anything yet about implementation issues—how to efficiently store and look-up data about the sparsely-assigned code points, or how to optimize UTF-8 decoding, string comparison, or NFC normalization. Perhaps I’ll return to some of those things in future posts.  Unicode is a fascinating and complex system. It has a many-to-one mapping between bytes and code points, and on top of that a many-to-one (or, under some circumstances, many-to-many) mapping between code points and “characters”. It has oddball special cases in every corner. But no one ever claimed that representing all written languages was going to be easy, and it’s clear that we’re never going back to the bad old days of a patchwork of incompatible encodings.  
Further reading:  The Unicode Standard UTF-8 Everywhere Manifesto Dark corners of Unicode by Eevee ICU (International Components for Unicode)—C/C++/Java libraries implementing many Unicode algorithms and related things Python 3 Unicode Howto Google Noto Fonts—set of fonts intended to cover all assigned code points"""
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
len(tokens)

24597

In [12]:
old_tokens = list(tokens)
len(old_tokens)

24597

In [13]:
desired_vocab_size = 276  # 256 byte tokens + 20 merges
bpe_tokens, new_tokens_dict = bpe(tokens, orig_vocab_size=256, desired_vocab_size=desired_vocab_size)
len(bpe_tokens)

Iteration    0 | Sequence length =      24597
Iteration    1 | Sequence length =      23951 | Pair compressed: {(101, 32): 256}
Iteration    2 | Sequence length =      23505 | Pair compressed: {(105, 110): 257}
Iteration    3 | Sequence length =      23081 | Pair compressed: {(115, 32): 258}
Iteration    4 | Sequence length =      22744 | Pair compressed: {(116, 104): 259}
Iteration    5 | Sequence length =      22450 | Pair compressed: {(101, 114): 260}
Iteration    6 | Sequence length =      22160 | Pair compressed: {(99, 111): 261}
Iteration    7 | Sequence length =      21875 | Pair compressed: {(116, 32): 262}
Iteration    8 | Sequence length =      21621 | Pair compressed: {(226, 128): 263}
Iteration    9 | Sequence length =      21378 | Pair compressed: {(44, 32): 264}
Iteration   10 | Sequence length =      21149 | Pair compressed: {(97, 110): 265}
Iteration   11 | Sequence length =      20935 | Pair compressed: {(111, 114): 266}
Iteration   12 | Sequence length =      20722 | 

19438

In [14]:
print(f"Compression ratio: {(len(old_tokens) / len(bpe_tokens)):.2f}X")

Compression ratio: 1.27X


In [15]:
new_tokens_dict

{(101, 32): 256,
 (105, 110): 257,
 (115, 32): 258,
 (116, 104): 259,
 (101, 114): 260,
 (99, 111): 261,
 (116, 32): 262,
 (226, 128): 263,
 (44, 32): 264,
 (97, 110): 265,
 (111, 114): 266,
 (100, 32): 267,
 (97, 114): 268,
 (101, 110): 269,
 (257, 103): 270,
 (261, 100): 271,
 (121, 32): 272,
 (46, 32): 273,
 (97, 108): 274,
 (259, 256): 275}

### decoding and encoding

Decode: given `ids`, a list of integers, return the corresponding Python string.

In [16]:
vocab_dict = {idx: bytes([idx]) for idx in range(256)}
# now the merges
for k, v in new_tokens_dict.items():
    vocab_dict[v] = vocab_dict[k[0]] + vocab_dict[k[1]]

vocab_dict

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'

In [17]:
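def decode(ids):
    # concatenate the bytes of every token, then interpret them as UTF-8
    byte_array = b''.join(vocab_dict[idx] for idx in ids)
    # errors='replace': byte sequences that are not valid UTF-8 become U+FFFD instead of raising
    decoded_str = byte_array.decode('utf-8', errors='replace')
    return decoded_str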

In [None]:
print(vocab_dict[128])

b'\x80'


In [None]:
decode([97])

'a'
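
Not every list of ids decodes to valid UTF-8. A lone byte such as 128 (`b'\x80'`) is a continuation byte, so strict decoding fails, which is why `decode` passes `errors='replace'`. A minimal sketch of the difference, using only the 256 raw byte tokens (the names `byte_vocab`, `decode_strict` and `decode_lenient` are made up for this illustration):

```python
# Only the 256 raw byte tokens; merges are left out to keep the sketch self-contained.
byte_vocab = {idx: bytes([idx]) for idx in range(256)}

def decode_strict(ids):
    # Raises UnicodeDecodeError when the bytes are not valid UTF-8.
    return b"".join(byte_vocab[i] for i in ids).decode("utf-8")

def decode_lenient(ids):
    # Invalid byte sequences become U+FFFD (the replacement character).
    return b"".join(byte_vocab[i] for i in ids).decode("utf-8", errors="replace")

try:
    decode_strict([128])
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)                        # True
print(decode_lenient([128]) == "\ufffd")    # True
print(decode_lenient([104, 105]))           # hi
```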

In [18]:
def encode(text):
    # Convert a string to a list of token ids.
    # Start from the raw UTF-8 bytes, then sweep left to right, replacing every
    # adjacent pair found in new_tokens_dict; repeat until a sweep changes nothing.
    tokens = list(text.encode("utf-8"))

    while True:
        old_tokens = list(tokens)
        i = 0
        new_tokens = []
        while i < len(old_tokens):
            pair = tuple(old_tokens[i:i + 2])  # a 1-tuple at the last position, which is never a key
            if pair in new_tokens_dict:
                new_tokens.append(new_tokens_dict[pair])
                i += 2
            else:
                new_tokens.append(old_tokens[i])
                i += 1

        tokens = list(new_tokens)
        if len(old_tokens) == len(tokens):
            return tokens
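
The `encode` above applies every known merge it meets in a left-to-right sweep and repeats until nothing changes. That round-trips correctly, but it does not necessarily reproduce the segmentation seen during training, where merges were learned in a fixed order. A common alternative (a sketch, not this notebook's method) always applies the earliest-learned merge that is present. `encode_by_rank` and `toy_merges` are made-up names, and the merge table is invented for illustration; with the notebook's table you would pass `new_tokens_dict` instead.

```python
def encode_by_rank(text, merges):
    """Encode text by repeatedly applying the earliest-learned merge that is present.

    `merges` maps (id, id) -> new id; a smaller new id means the merge was learned earlier.
    """
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair whose merged id is smallest; unknown pairs rank as infinity.
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # nothing left to merge
        new_id = merges[best]
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

# Hypothetical merge table: "bc" was learned first (256), "ab" second (257).
toy_merges = {(98, 99): 256, (97, 98): 257}
print(encode_by_rank("abc", toy_merges))  # [97, 256]
```

With the same merge table, the left-to-right sweep would give `[257, 99]` instead, because it reaches the pair `(97, 98)` first.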

In [19]:
text = """The swine flu pandemic sweeping the world might not have happened without a laboratory accident in the 1970s , a new study claims .
China 's exports and imports declined in May , but this was offset by better-than-expected growth in urban fixed-asset investment - indicating the world 's third-largest economy is well on the road to recovery .
This season Coach Ben Howland has waved that white flag and gone zone .
" After Simon , I can handle Jason , " Sinitta says .
He met Burlington in Italy , and their passion for classical forms can be seen at Chiswick in a grand processional avenue lined with stone urns and sphinxes .
In New York , critics referenced a pair of big Johns , Cheever and Updike , in their rave reviews , while here the play seemed a more jagged and spelt-out version of Pinter 's dreamy Old Times , with the triangle rearranged against the distaff side .
But the overall business climate is starting to change , as investors eye Africa 's growth potential , and that could lead to more options for expatriates wanting to return home , Patel from HSBC said .
During that time , David Cameron and William Hague have repeatedly said that the undertakings were being met .
" Unpredictable " seems to be the word used most often by experts to describe the outbreak of swine flu , writes Clive Cookson .
Mercado has dodged more eliminations than any finalist in " Idol " history , so she 's no stranger to danger .
And now you know how silly , plastic-surgery addicted Jacko came to his end .
" Golf is all about performing in the majors but I 'm not going to this major and thinking it has to happen , " Harrington said of his bid for a rare third major win in a row .
If you walk down Karl Johans Gate , the main drag of central Oslo , a tree-lined promenade bordered by restaurants , cafes and upscale stores , you 'll eventually find yourself face-to-face with the Royal Palace , the mammoth , cream-colored home of the Norwegian royal family .
The title search did not find the lien .
As for Ucas points , these are used mainly as a general guide to applicants .
The 12-time All-Star was appearing in his first game since admitting he took performance-enhancing drugs while playing for the Texas Rangers from 2001 to 2003 .
The company added that it was slashing staff in its North American development division from 70 people to 10 .
Fox , 8 p.m.
Schools are legally required to arrange full-time , suitable education for pupils excluded for six days or more .
The MNF said the three were thought to be involved in the construction and distribution of IEDs in the Baghdad region .
Stay in school .
He said there was no case to answer following defence submissions at the end of two months of prosecution evidence .
The island is home to the Chincoteague National Wildlife Refuge , the Assateague Island National Seashore and Assateague State Park .
He would love to get his hands on Sea The Stars .
Casa Ferreirinha 's Fernando Nicolau de Almeida created Barca Velha in 1952 , using the port grapes to make the region 's first high-quality table wine and in five decades it has only been made in 15 vintages , the most recent being the 2000 .
Foxx , who is best-known for his performances in " Ray " and " Dreamgirls , " will play Prentice Earl Sanders , one of two black detectives determined to crack a series of racially motivated serial killings in 1973-74 San Francisco , the trade paper reported .
"""
print(len(text))
bpe_text = encode(text)
print(len(bpe_text))

3429
2698


In [20]:
text2 = decode(bpe_text)
assert text == text2

In [21]:
text = """If you are a fan of Test cricket, one of those days as an Indian Cricket fan, when waking up at 5 in the morning felt so so rewarding!!!
Alarm's set for 4.55 tomorrow with the hope that this core can collectively perform as a team, one final time!
 'Do it for Jassi Bhai'"""
assert text == decode(encode(text))

In [22]:
decode(encode("h"))

'h'

### Extra pre-processing done in GPT-2

GPT-2 does some extra preprocessing of the text. Essentially, it separates words from punctuation, spaces and other such cases before running BPE, so that merges never cross these boundaries. E.g. `dogs.`, `dogs!` and `dogs?` should not each become a single token. Instead, we want `dogs` and the punctuation kept separate.

In [24]:
import regex as re

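## Explanation:
## - The first alternatives ('s, 't, 're, 've, 'm, 'll, 'd) split common English contractions into their own tokens
## - ' ?\p{L}+' matches an optional space followed by letters. This ensures our tokens look like ' you', ' hi' (space followed by letters)
## - ' ?\p{N}+' does the same for numbers
## - ' ?[^\s\p{L}\p{N}]+' captures punctuation (anything that is not whitespace, a letter, or a number)
## - '\s+(?!\S)' matches a run of whitespace that is not followed by a non-space character; in the middle of text this leaves the last space free to attach to the next token
## - '\s+' catches any remaining whitespace, e.g. trailing whitespace
## - re.IGNORECASE is passed here so that uppercase contractions such as 'S also match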
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""", flags=re.IGNORECASE)
text = "Hello world1234 how'S the'll josh've       been???      !   "
gpt2pat.findall(text)

['Hello',
 ' world',
 '1234',
 ' how',
 "'S",
 ' the',
 "'ll",
 ' josh',
 "'ve",
 '      ',
 ' been',
 '???',
 '     ',
 ' !',
 '   ']
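
To see what this chunking buys us, here is a small sketch. It uses an ASCII-only stand-in for the pattern (an assumption made so it runs with the standard-library `re`: `\p{L}` becomes `[A-Za-z]` and `\p{N}` becomes `[0-9]`), and the names `ascii_pat` and `split_then_bytes` are made up for the example.

```python
import re

# ASCII-only stand-in for the GPT-2 pattern (assumption: \p{L} -> [A-Za-z], \p{N} -> [0-9]),
# so that the sketch runs with the standard-library `re` module.
ascii_pat = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def split_then_bytes(text):
    """Split text into chunks, then turn each chunk into its own list of UTF-8 bytes."""
    return [list(chunk.encode("utf-8")) for chunk in ascii_pat.findall(text)]

chunks = ascii_pat.findall("dogs. dogs! dogs?")
print(chunks)  # ['dogs', '.', ' dogs', '!', ' dogs', '?']

# Runs of spaces: all but the last space form one chunk; the last space joins the next word.
print(ascii_pat.findall("a   b"))  # ['a', '  ', ' b']

# Each chunk gets its own byte list.
print(split_then_bytes("hi!"))  # [[104, 105], [33]]

# Chunking is lossless: concatenating the chunks gives back the original text.
print("".join(chunks) == "dogs. dogs! dogs?")  # True
```

BPE merges are then applied inside each inner list separately, so no merge can span a chunk boundary.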

### Explore GPT-2 and GPT-4o tokenizer

GPT-2 and the newer tokenizers differ in how they handle runs of whitespace.

In [25]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [27]:
import tiktoken
text = "Hello world1234 how'S the'll josh've       been???      !   "

In [28]:
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc_gpt2 = tiktoken.encoding_for_model("gpt-2")
gpt2_enc_text = enc_gpt2.encode(text)
gpt2_enc_text, len(gpt2_enc_text)

([15496,
  995,
  1065,
  2682,
  703,
  6,
  50,
  262,
  1183,
  474,
  3768,
  1053,
  220,
  220,
  220,
  220,
  220,
  220,
  587,
  28358,
  220,
  220,
  220,
  220,
  220,
  5145,
  220,
  220,
  220],
 29)

In [29]:
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc_gpt4 = tiktoken.encoding_for_model("gpt-4")
gpt4_enc_text = enc_gpt4.encode(text)
gpt4_enc_text, len(gpt4_enc_text) # runs of whitespace are merged into single tokens

([9906,
  1917,
  4513,
  19,
  1268,
  13575,
  279,
  3358,
  503,
  9451,
  3077,
  996,
  1027,
  34115,
  415,
  758,
  262],
 17)

In [30]:
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc_gpt4o = tiktoken.encoding_for_model("gpt-4o")
gpt4o_enc_text = enc_gpt4o.encode(text)
gpt4o_enc_text, len(gpt4o_enc_text) # runs of whitespace are merged into single tokens

([13225,
  2375,
  7633,
  19,
  1495,
  31233,
  290,
  6090,
  441,
  12601,
  7341,
  1699,
  1339,
  33110,
  530,
  1073,
  271],
 17)

In [52]:
example = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
gpt2_enc_example = enc_gpt2.encode(example)
gpt4_enc_example = enc_gpt4.encode(example)
gpt4o_enc_example = enc_gpt4o.encode(example)

len(gpt2_enc_example), len(gpt4_enc_example), len(gpt4o_enc_example)

(109, 72, 72)

In [33]:
# Load the tokenizers again under new names so we can inspect their vocabularies
import tiktoken

gpt2_tokenizer = tiktoken.encoding_for_model("gpt2")
gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4")
gpt4o_tokenizer = tiktoken.encoding_for_model("gpt-4o")

In [38]:
# Access GPT-2 vocabulary
gpt2_vocab_size = gpt2_tokenizer.n_vocab
print("GPT-2 Vocabulary Size:", gpt2_vocab_size)

# Access GPT-4 vocabulary
gpt4_vocab_size = gpt4_tokenizer.n_vocab
print("GPT-4 Vocabulary Size:", gpt4_vocab_size)

# Access GPT-4o vocabulary
gpt4o_vocab_size = gpt4o_tokenizer.n_vocab
print("GPT-4o Vocabulary Size:", gpt4o_vocab_size)

GPT-2 Vocabulary Size: 50257
GPT-4 Vocabulary Size: 100277
GPT-4o Vocabulary Size: 200019


In [41]:
gpt2_vocab = {token_id: gpt2_tokenizer.decode([token_id]) for token_id in range(gpt2_vocab_size)}

import random
random.seed(42 + 1)
# Print 20 random tokens and their IDs
print("Random 20 tokens in GPT-2 vocabulary:")
for _ in range(20):
    idx = random.randrange(gpt2_vocab_size)  # randint(0, n) could return n, which is not a valid id
    print(f"Token ID: {idx}, Token: {repr(gpt2_vocab[idx])}")

Random 20 tokens in GPT-2 vocabulary:
Token ID: 2526, Token: ' risk'
Token ID: 18748, Token: 'ilitary'
Token ID: 45627, Token: ' rationality'
Token ID: 49983, Token: ' Fundamental'
Token ID: 9432, Token: ' objective'
Token ID: 30312, Token: 'ermanent'
Token ID: 24240, Token: ' tomato'
Token ID: 44017, Token: ' pend'
Token ID: 45772, Token: ' Byzantine'
Token ID: 6310, Token: 'Inst'
Token ID: 29700, Token: 'inders'
Token ID: 39243, Token: ' sensed'
Token ID: 32654, Token: ' secretive'
Token ID: 39910, Token: ' horrend'
Token ID: 1255, Token: ' result'
Token ID: 33690, Token: '510'
Token ID: 28337, Token: 'ortality'
Token ID: 37789, Token: ' mailed'
Token ID: 24403, Token: '227'
Token ID: 40927, Token: 'tile'


In [83]:
gpt4o_vocab = {}

# Some ids below n_vocab are unassigned; decoding them fails, so print and skip them
for token_id in range(gpt4o_vocab_size):
  try:
    gpt4o_vocab[token_id] = gpt4o_tokenizer.decode([token_id])
  except:
    print(token_id)

import random
random.seed(42)
# Print 20 random tokens and their IDs
valid_ids = list(gpt4o_vocab)  # sample only from ids that decoded successfully
print("Random 20 tokens in GPT-4o vocabulary:")
for _ in range(20):
    idx = random.choice(valid_ids)
    print(f"Token ID: {idx}, Token: {repr(gpt4o_vocab[idx])}")

199998
200000
200001
200002
200003
200004
200005
200006
200007
200008
200009
200010
200011
200012
200013
200014
200015
200016
200017
Random 20 tokens in GPT-4o vocabulary:
Token ID: 167621, Token: ' extracurricular'
Token ID: 29184, Token: '(boolean'
Token ID: 6556, Token: ' ger'
Token ID: 194393, Token: ' застос'
Token ID: 72097, Token: ' autism'
Token ID: 64196, Token: ' relevance'
Token ID: 58513, Token: ' directamente'
Token ID: 36579, Token: 'двэр'
Token ID: 193061, Token: ' QUICK'
Token ID: 26868, Token: 'chem'
Token ID: 177392, Token: ' deram'
Token ID: 194161, Token: ' стомат'
Token ID: 142964, Token: ' Bahnhof'
Token ID: 22790, Token: 'falls'
Token ID: 154794, Token: '-Cal'
Token ID: 110604, Token: 'Shipment'
Token ID: 8331, Token: '(res'
Token ID: 7811, Token: '�'
Token ID: 24561, Token: 'eed'
Token ID: 57314, Token: 'Kon'


In [98]:
# Get some long tokens from here
sorted(list(gpt4o_vocab.values()), key=len, reverse=True)[200:300]

[' สำนักเลขานุการองค์กร',
 '\t\t                   ',
 '                    ',
 '\t                   ',
 '                   \n',
 '        \r\n        \r\n',
 '    \n    \n    \n    \n',
 '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '                   ',
 '                  \n',
 ' telecommunications',
 ' //////////////////',
 ' __________________',
 ' selbstverständlich',
 ' วิเคราะห์บอลวันนี้',
 ' //----------------',
 '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '.onreadystatechange',
 '__________________\n',
 ' significativamente',
 ' Telecommunications',
 ' Wahrscheinlichkeit',
 ' disproportionately',
 '                  ',
 '        \n        \n',
 '                \r\n',
 '\n                \n',
 ' unterschiedlichen',
 '                 \n',
 ' interdisciplinary',
 '("----------------',
 '.githubusercontent',
 '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 ' responsabilidades',
 ' Herausforderungen',
 ' multidisciplinary',
 ' STDMETHODCALLTYPE',
 ' _________________',
 '             

In [100]:
list(gpt4o_vocab.values()).index(" ")

220

### Explore the `encoder.json` and `vocab.bpe` files

In [53]:
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json

--2025-01-04 08:55:45--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 57.150.97.129
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|57.150.97.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [application/octet-stream]
Saving to: ‘vocab.bpe’


2025-01-04 08:55:46 (1.23 MB/s) - ‘vocab.bpe’ saved [456318/456318]

--2025-01-04 08:55:46--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 57.150.97.129
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|57.150.97.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘encoder.json’


2025-01-04 08:55:47 (1.89 MB/s) - ‘encoder.json’ saved [104

In [54]:
import json

with open('encoder.json', 'r') as f:
    encoder = json.load(f) # <--- ~equivalent to our "vocab"

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    bpe_data = f.read()

In [55]:
encoder # maps string to token id

{'!': 0,
 '"': 1,
 '#': 2,
 '$': 3,
 '%': 4,
 '&': 5,
 "'": 6,
 '(': 7,
 ')': 8,
 '*': 9,
 '+': 10,
 ',': 11,
 '-': 12,
 '.': 13,
 '/': 14,
 '0': 15,
 '1': 16,
 '2': 17,
 '3': 18,
 '4': 19,
 '5': 20,
 '6': 21,
 '7': 22,
 '8': 23,
 '9': 24,
 ':': 25,
 ';': 26,
 '<': 27,
 '=': 28,
 '>': 29,
 '?': 30,
 '@': 31,
 'A': 32,
 'B': 33,
 'C': 34,
 'D': 35,
 'E': 36,
 'F': 37,
 'G': 38,
 'H': 39,
 'I': 40,
 'J': 41,
 'K': 42,
 'L': 43,
 'M': 44,
 'N': 45,
 'O': 46,
 'P': 47,
 'Q': 48,
 'R': 49,
 'S': 50,
 'T': 51,
 'U': 52,
 'V': 53,
 'W': 54,
 'X': 55,
 'Y': 56,
 'Z': 57,
 '[': 58,
 '\\': 59,
 ']': 60,
 '^': 61,
 '_': 62,
 '`': 63,
 'a': 64,
 'b': 65,
 'c': 66,
 'd': 67,
 'e': 68,
 'f': 69,
 'g': 70,
 'h': 71,
 'i': 72,
 'j': 73,
 'k': 74,
 'l': 75,
 'm': 76,
 'n': 77,
 'o': 78,
 'p': 79,
 'q': 80,
 'r': 81,
 's': 82,
 't': 83,
 'u': 84,
 'v': 85,
 'w': 86,
 'x': 87,
 'y': 88,
 'z': 89,
 '{': 90,
 '|': 91,
 '}': 92,
 '~': 93,
 '¡': 94,
 '¢': 95,
 '£': 96,
 '¤': 97,
 '¥': 98,
 '¦': 99,
 '§': 100

In [56]:
len(encoder)

50257

In [59]:
len(bpe_data)

420572

In [60]:
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
# ^---- ~equivalent to our "merges"

In [61]:
bpe_merges # cases where we merge

[('Ġ', 't'),
 ('Ġ', 'a'),
 ('h', 'e'),
 ('i', 'n'),
 ('r', 'e'),
 ('o', 'n'),
 ('Ġt', 'he'),
 ('e', 'r'),
 ('Ġ', 's'),
 ('a', 't'),
 ('Ġ', 'w'),
 ('Ġ', 'o'),
 ('e', 'n'),
 ('Ġ', 'c'),
 ('i', 't'),
 ('i', 's'),
 ('a', 'n'),
 ('o', 'r'),
 ('e', 's'),
 ('Ġ', 'b'),
 ('e', 'd'),
 ('Ġ', 'f'),
 ('in', 'g'),
 ('Ġ', 'p'),
 ('o', 'u'),
 ('Ġa', 'n'),
 ('a', 'l'),
 ('a', 'r'),
 ('Ġt', 'o'),
 ('Ġ', 'm'),
 ('Ġo', 'f'),
 ('Ġ', 'in'),
 ('Ġ', 'd'),
 ('Ġ', 'h'),
 ('Ġan', 'd'),
 ('i', 'c'),
 ('a', 's'),
 ('l', 'e'),
 ('Ġt', 'h'),
 ('i', 'on'),
 ('o', 'm'),
 ('l', 'l'),
 ('en', 't'),
 ('Ġ', 'n'),
 ('Ġ', 'l'),
 ('s', 't'),
 ('Ġ', 're'),
 ('v', 'e'),
 ('Ġ', 'e'),
 ('r', 'o'),
 ('l', 'y'),
 ('Ġb', 'e'),
 ('Ġ', 'g'),
 ('Ġ', 'T'),
 ('c', 't'),
 ('Ġ', 'S'),
 ('i', 'd'),
 ('o', 't'),
 ('Ġ', 'I'),
 ('u', 't'),
 ('e', 't'),
 ('Ġ', 'A'),
 ('Ġ', 'is'),
 ('Ġ', 'on'),
 ('i', 'm'),
 ('a', 'm'),
 ('o', 'w'),
 ('a', 'y'),
 ('a', 'd'),
 ('s', 'e'),
 ('Ġth', 'at'),
 ('Ġ', 'C'),
 ('i', 'g'),
 ('Ġf', 'or'),
 ('a', 'c'),
 ('Ġ

In [62]:
len(bpe_merges) # they do 50000 merges

50000
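
The order of `bpe_merges` matters: pairs that appear earlier were learned earlier and take priority when encoding. One way to make that explicit is to map each pair to its position in the list. The sketch below does this for just the first seven merges shown above; `first_merges`, `bpe_ranks` and `candidates` are names chosen for the example.

```python
# The first seven merges from the output above (the real file has 50,000 of them).
first_merges = [("Ġ", "t"), ("Ġ", "a"), ("h", "e"), ("i", "n"), ("r", "e"), ("o", "n"), ("Ġt", "he")]

# Lower rank = learned earlier = applied first.
bpe_ranks = {pair: rank for rank, pair in enumerate(first_merges)}

# Adjacent pairs in the symbol sequence ['Ġ', 't', 'h', 'e'].
candidates = [("Ġ", "t"), ("t", "h"), ("h", "e")]

# Unknown pairs get rank infinity, so they are never chosen over a known merge.
best = min(candidates, key=lambda pair: bpe_ranks.get(pair, float("inf")))
print(best, bpe_ranks[best])  # ('Ġ', 't') 0
```

So for `Ġthe`, the pair `('Ġ', 't')` is merged before `('h', 'e')`, and only then does `('Ġt', 'he')` become available.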

In [65]:
encoder['Ġwe'] # 'Ġ' is GPT-2's printable stand-in for the space byte, so this is the merged token ' we'

356
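
The `Ġ` in these keys is not a character from the training text. GPT-2's `encoder.py` remaps every byte to a printable Unicode character before running BPE, and the space byte (32) ends up as `Ġ`. The function below follows the `bytes_to_unicode` helper from that file, reproduced from memory, so treat it as a sketch rather than the exact source.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted to code points 256 and up.
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

print(byte_encoder[32])                        # Ġ
print(bytes(byte_decoder[ch] for ch in "Ġwe"))  # b' we'
```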

In [66]:
# but where does 1 extra token come from ??
# 256 (byte token) + 50,000 merges + 1 special token :)
encoder["<|endoftext|>"]

50256

In [None]:
# GPT-4o: 2 special tokens + 199,998 regular tokens (256 byte tokens + 199,742 merges)

## Sentencepiece tokenizer

(From Karpathy's notebook)

Commonly used because (unlike tiktoken) it can efficiently both train BPE tokenizers and run inference with them. It is used in both the Llama and Mistral series.

[sentencepiece on GitHub](https://github.com/google/sentencepiece).

**The big difference**: sentencepiece runs BPE on the Unicode code points directly! It then has an option `character_coverage` for what to do with very rare code points, and it either maps them onto an UNK token, or, if `byte_fallback` is turned on, it encodes them with UTF-8 and then encodes the raw bytes instead.

TLDR:

- tiktoken encodes to utf-8 and then BPEs bytes
- sentencepiece BPEs the code points and optionally falls back to utf-8 bytes for rare code points (rarity is determined by character_coverage hyperparameter), which then get translated to byte tokens.

(Personally I think the tiktoken way is a lot cleaner...)
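
To make the difference concrete, here is a toy illustration of the byte-fallback idea. It is not how sentencepiece is implemented, and it ignores merges and whitespace handling. `toy_byte_fallback` and `known_chars` are hypothetical names, and the `<0xEC>`-style piece names are meant to mirror how byte pieces usually appear in sentencepiece vocabularies.

```python
def toy_byte_fallback(text, known_chars):
    """Toy illustration only: known code points stay as pieces, others become byte pieces."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            # Rare code point: fall back to one piece per UTF-8 byte.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces

text = "hi 안"
print(len(text))                  # 4 (code points)
print(len(text.encode("utf-8")))  # 6 (bytes)
print(toy_byte_fallback(text, known_chars={"h", "i", " "}))
# ['h', 'i', ' ', '<0xEC>', '<0x95>', '<0x88>']
```

A byte-level tokenizer in the tiktoken style would start from all 6 bytes instead, with no notion of "known" versus "rare" characters.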

In [67]:
!pip install sentencepiece



In [68]:
with open("toy.txt", 'w') as file:
  file.write("""It is in her mind and within her calculated realm of possibilities .
Clicking to one of these sites can trigger ads selling fake anti-spyware or turn the visitor 's PC into a hub for clicking on Web ads , while routing the ad payment to the intruder .
The Housing Federation 's Belinda Porich said : " Even by London standards , these are astronomical prices and many people - especially young , first time buyers - can only dream of owning a home .
Wachovia Corp. , the bank purchased by Wells Fargo & Co . , will pay more than $ 4.5 million to settle a brokerage regulator 's claims that it failed to ensure that clients received discounts on investment trusts and mutual funds .
" This is the archive 's coming of age , in a way , because it 's now so accessible , " said Robert Browning , director of the archives .
" It 's got much worse in the last five years .
That eliminates the messy middle step where Carr gets sacked six times , then benched .
Opposing the passage of the Senate bill --Obama 's bill-- from the left means dealing with accusations like these .
For one thing , there are significant differences between the House and Senate proposals to address the plight of the jobless , differences that would have to be worked out before a bill could be sent to President Bush .
There can 't be any more .
Officer Rob Gibbs , a spokesman for the Albuquerque Police Department , said he was not aware of any consultations with the department regarding security for Boyd 's clinic .
But the department had also said that it supported many of the benefits deriving from the agreement , including the potential to make millions of books more accessible .
The millionaire activist , who will be a candidate for the Tories in the next General Election , told BBC1 's The Andrew Marr Show : " We can achieve massive reductions from very little investment , and from the home-owners ' point of view that does pretty quickly lead to savings on your bills .
What gives Democrats hope is that McDonnell 's numbers haven 't moved much over several months , while Deeds is weak in voter groups where he has room to grow : among young Democrats and African Americans .
Their rooting section was celebrating early in the rout of Germany after Mexico took a 6-0 lead after two innings .
27 , 1995 , at the massive Army base in eastern North Carolina when he opened fire with a rifle from a concealed position .
It was a tough crowd Tuesday night at the senior center in Dartmouth , Mass .
While investors may be demanding higher yields for their US bonds , there is slim evidence of an impending buyers ' strike , he notes .
Use a glass or ceramic bowl , or even a stout plastic bag .
This is the thing - all that party infighting , who said what when or why - THAT DOESN 'T MATTER EXCEPT TO THEM AND YOU GUYS .
Mr. Long most recently served as Manager of Acquisitions and Evaluations for Phoenix Exploration Company .
This implicit sadness sets Cyrano apart even more than usual from the soldiers of his Gascon regiment and the bons vivants of Paris .
Black remains incensed at Radler for " double-crossing " his friends .
The Progressive Liberal Party issued a statement apologizing for Forbes 's outburst , saying his announcement of an acquittal was " incorrect . "
After eight months to 10 months of gradual dose increases , most can eat the peanut-flour equivalent of 15 peanuts daily , said Burks , who two years ago began reporting these signs of desensitization as long as children took their daily medicine .
The Pleasanton-based company said Thursday that it earned $ 194.6 million , or 44 cents per share , during the three months ended Sept .
parties - without jeopardizing network security. the address into your Internet browser. requirements and even cause a data breach .
The Department of Food Science and Human Nutrition has a nationally recognized food stamp education program that helps food stamp recipients learn how to maximize their food stamp benefits by getting items with the best nutritional quality for their family .
He said the dogs were later rounded up and were doing fine .
If you threw Kings of Leon , Scissor Sisters and Arcade Fire into a blender , this -- with an obligatory shot of wheatgrass -- is exactly what you would get in your musical glass .
Most medical experts agree that there is no evidence to suggest that children who take multivitamins encounter any negative consequences from ingesting high levels of nutrients .
The incident was included in the post-match report of Alan Wiley , the referee , even though he took no action at the time .
MILLETS-owner Blacks Leisure yesterday said it was on the recovery trail after like-for-like sales nudged up 3 per cent in the 26 weeks to the start of September .
Trading houses rose on higher crude oil prices , with crude oil futures edging up in New York to $ 81.30 a barrel overnight .
Last month the United Nations urged Russia and ex-Soviet Central Asia to stem drug trafficking from Afghanistan to Europe , saying the proceeds from a record opium crop were funding global terrorism . ( c ) Reuters 2007 .
Henin collected $ 95,500 for the win while Kuznetsova settled for $ 50,500 .
" This is a solid slate of candidates with extensive banking and financial services experience , a deep understanding of international credit and equity markets , and first-hand knowledge of the governing regulatory system , " said Richard Parsons , chairman of Citi 's board .
The pair had been penalised five places on the grid for allegedly blocking rival drivers during qualifying , following complaints from Renault and the improving BMW Sauber team .
Wateridge , of St Clement , Jersey , was the first person to be charged in connection with a multimillion-pound historic abuse investigation on the Channel Island .
Burley insisted he now had the " full support " of the SFA , despite ambiguous comments from their president , George Peat , in recent weeks .
The LCCC however , boasts the Old Trafford Lodge , with hotel rooms that have balconies looking out over the ground , which could be ideal for those looking to catch a game of cricket or the Green Day or Muse concerts that the ground will host later this year .
I mean , that is the agreement when you are a NATO ally , is if another country is attacked , you 're going to be expected to be called upon and help .
" Increasing access to AEDs and CPR / AED training is an essential part of our mission and we hope the Senate will embrace this issue as well , " said Scott Conner , senior vice president for the American Red Cross 's Preparedness , Health and Safety Services .
Middle East , Asia ( excluding Japan ) and Central / South America countries. which the company is ranked as the sixth largest in the world market .
On the wider RPI measure , which includes mortgage repayments and is often used in wage negotiations , inflation also rose sharply , to 3.7 % in January , from 2.4 % in December .
Maybe some of you even remember sitting in one of those giant , shaking Afterburner arcade cabinets , firing off missiles , yelling " THERE 'S A BOGEY ON MAH TAIL ! " while the mums on the way to the candyfloss stand gave you a wide berth .
It 's not a problem that can be solved by TV commercials featuring GM 's government-appointed auto-industry-newbie chairman , former AT & T chief executive Ed Whitacre , telling us how much he loves GM .
The bottle was thrown into the ocean 24 years ago as part of a contest in Ocean city that promised a prize to the person whose note went the farthest .
Describing his daily routine , he said he gets up just after 7am , except on weekends and holidays .
Panic ensued at a Fort Worth , Texas , Bank of America call center Wednesday after a few workers complained of headaches , dizziness and shortness of breath .
A small 2006 study from a University of Washington researcher found that young Indians living in Bangalore used cell phones to get to know partners introduced to them by their parents .
COLUMBUS , Ohio - For the second day in a row , rock star Bruce Springsteen sang a few songs and urged thousands of potential voters in a battleground state to register and support Democrat Barack Obama .
At events I regularly meet young Muslims and non-Muslims who have simply never heard arguments put for why liberal democracy is , though not perfect , our only achievable , messy , hope .
" I 'm hanging on by my fingernails . ... You can never get ahead , " she said .
It was a quiet finale otherwise for Flintoff , hero of the 2005 success , but England will find it hard to cover for such an inspirational figure .
Thatʼs bad news for the party that controls the White House and Congress at a time of near 10 percent unemployment and the slow economic recovery .
" All our attention is focused on making sure this thing never , never , ever becomes law , " said John Boehner , Republican House leader .
The downturn is helping Intel Capital 's portfolio .
Police are waiting for further forensic evidence before deciding whether to charge them .
He wanted all three of them to live together .
Richard Leakey , who was given a government brief to clean up the crooked national wildlife department , bemoans the lack of progress .
Mohamed Elibiary is a counter terrorism advisor who has advised President Barack Obama 's Homeland Security Council on home grown terror .
It pointed out that neither the CoT study nor other investigations have ever found a causal link " between the presence of cabin air contamination and the symptoms complained of by a very small minority of cabin and flight deck crew . "
Apart from his ornate Western military dress , pith helmet and monocle , he also dresses in Savile Row suits , wears a sword and is driven around the island in a London taxi .
The new study is based on an online survey of parents with children 17 and younger .
The left foot of Diamanti was also a constant threat for the hosts and Hart had to be alive midway through the half to palm over a stinging drive .
Scotland 's airports were warning flights may be disrupted due to the bad weather .
Geffen plans to meet with GE Chairman Jeffrey Immelt next week , with Jeff Zucker , NBC Universal 's president and chief executive , and Universal Studios President Ron Meyer also expected to attend , the newspaper reported .
Jimmy Kimmel has never been on the show before , or if he has I don 't remember it .
FOR years , when the artist Steven Parrino wasn 't jamming power chords on his electric guitar or tinkering with his motorcycle in his garagelike studio in Brooklyn , he was recycling his unsold paintings : twisting them into eccentric new shapes , smashing their stretcher bars or stabbing them repeatedly with scissors .
Many of these risks and uncertainties are beyond Oncothyreon 's control .
Beckham received a frosty reception at the match -- his first home game since returning from a five-month spell at Milan .
Tivoli ( 00 45 3315 1001 , www.tivoli.dk ) is open from April 8 to September 20 and then for Hallowe 'en and Christmas .
Clinton has indicated that she might take the fight to the convention in August .
Her three pooches also wore matching Dolce & Gabbana floral lace collars .
This article was first published on guardian.co.uk at 13.07 BST on Monday 20 April 2009 .
Priestland 's conversion ensured the visitors were unable to even claim a bonus point , leaving them with an uphill task against Pool 6 leaders Leinster next weekend while the Scarlets travel to Brive .
Russert was married to Maureen Orth , a writer for Vanity Fair magazine .
It is true , however , that having worked so hard to keep Ronaldo last summer , United 's disinclination to offer him a new contract , given his current one expires in three years and does not even put the World Player of the Year among the top five-paid players in England , never mind the world , is unusual .
America 's gold-medal chasing swimmers say teammate Eric Shanteau 's battle with cancer has inspired them and made them realise there is more to life than the Olympic Games .
We will do it for Christina-Taylor Green , Dorothy Morris , Phyllis Schenk ( sic / Schneck ) , Dorwan Stoddard : ordinary citizens who died participating in their democracy .
Dallas had a 13-point lead heading into the fourth quarter , but Sacramento closed to within 102-100 and had a chance to tie it as the final minute began .
A : That he did it all .
And now he was faced with the fact that his new executive compensation policy , which only applied to a narrow subset of executives at a few institutions , had been powerless to stop the worst violators at AIG from getting their undeserved payday .
With its proposed spending cuts , revealed Thursday as the White House released a more detailed picture of its $ 3.6-trillion budget , the administration is aiming to cast itself as a careful custodian of tax dollars .
British Foreign Secretary David Miliband has urged the Afghan government to exploit the military success to reconcile with moderate Taliban guerrillas .
If this proves insufficient , technical measures will be introduced - including the powers to disconnect pirates .
That solved his problem in Algeria .
The DPJ opposes the refuelling , claiming it violates Japan 's pacifist constitution .
Wright will go to prison and Riseborough will go to a young offenders ' institution .
X Factor : Should we switch off ? 10The ultimate herbal remedy : Can cannabis improve autism ?
A reconciliation of net income applicable to common shareholders to FFO is contained in the table accompanying this release .
Because the plant is an attempt to develop new technology with potentially significant public benefits , the Energy Department is supposed to pay 74 percent of the costs , but because of growing costs , the government is now seeking to have the industry pick up a larger share of expenses , about $ 1.3 billion .
Reporting from Altar , Mexico -- On a cloudless afternoon in northern Sonora , migrants and drug runners lounge in equal numbers under scattered mesquite trees , playing cards or sipping water .
The unions representing Hollywood 's actors have reached tentative agreement with advertisers on a new contract covering work in commercials .
Such a buildup within a country 's own banking system makes it all the more difficult for a government to propose a debt restructuring , given that the country 's local banks and its many unionized employees will suffer , not just faceless hedge funds abroad .
But the Beetlejuice legend had some concerns about his character .
Nowadays that 's no longer the case .
Despite how the season has gone so far , on and off the court , Wizards players Antawn Jamison and Brendan Haywood both spoke Thursday about holding out hope of contending for a playoff spot .
The New York Times obituary of Maryanne Amacher .
" We have to make sure we 're at the races and put South Africa under pressure again , " he said .
Xinhua said , according to Ma , China is fully prepared to discuss human rights with other countries as long as such talks take place in a state of mutual respect among all participants .
The agency said the approval of Kari 's pay was done after consulting with the Treasury Department .
Smith | 14.10.08 , 08 : 43 GMT - I 'm afraid that this may well be the case .
An ecumenical service in memory of the victims of 32 different nationalities is to be held in Notre Dame Cathedral in Paris today .
Croatia coach Slaven Bilic has emerged as a major contender for the vacant manager 's job at Schalke .
Based not on a show or even a song or a game , " Kit Kittredge : An American Girl " is based on a doll .
" Everyone who purchased a card received a letter giving them a temporary code , and explaining that they could still get their discount by using the code and who to contact .
However cliquey they might be , the highest achievers can still be friends with the worst bullies .
" You don 't realize there 's lead in it , you eat a cookie , you eat something without washing your hands , that exposure builds up in your body over time , " said Dr. James Menoutis , who runs the lab at Quantex .
Google 's Chrome browser will hog your computer 's processor more than the new Internet Explorer 8 .
Shouldn 't the father be thrown in jail ?
Hence the fierce backlash in Ms Hussein 's favour .
Consumer advocacy blog The Consumerist phrased Facebook 's fresh policy as " We Can Do Anything We Want With Your Content .
But even with the discount , mountain rescue teams and the RNLI said the increase would be significant and put added pressures on funds .
Crowds look closely and slowly .
The reductions would be in addition to the 44,000 jobs already cut through widespread buyouts .
House Democrats said they would not approve any money until President Obama has presented a detailed plan for how the shutdown would be handled .
Formal agreements will be made with El Salvador , Australia , Romania and Estonia once a long-awaited security pact with the United States , which was approved by Parliament on Thursday , becomes law .
But those wanting to go will also have had to pre-register their details , a system introduced in 2007 to stop ticket touting .
The mass street demonstrations , polished campaign slogans and televised debates more closely resembled Western elections than the scripted campaigns in most other Middle Eastern countries .
A senior US official said " human rights , of course , will be discussed , " citing Tibet and unrest in China .
This is hardly a surprise to those of us who have worked the steroid beat over the years .
" It 's an emotional thing .
Those fees , designed as a disincentive for firms to get too large and complex , would also serve as working capital for a government mechanism to wind down troubled systemic firms .
According to the IOC Blogging Guidelines for the 2010 Games , athletes and other accredited people must keep their posts confined to their personal experiences .
Persistently cold feet and hands are very common , and even healthy non-smokers can experience the symptoms .
Here in the US it seems that only now are we beginning to take it seriously .
" The rioters , armed with weapons from the U.S. , Israel and England , opened fire on people in a futile attempt to accuse the police and the Basiji , with the cooperation of foreign media , " Firouzabadi said in an open letter addressed to Imam Mahdi , a venerated Shiite Muslim who disappeared hundreds of years ago and whose messianic return , it is believed , will herald a new age .
The median household income for a family of four in New Jersey is $ 94,441 , according to the U.S. Department of Health and Human Services .
NEW YORK ( AP ) - Ron Gardenhire threw a couple of lefties at the New York Yankees and nothing went right .
In September 2006 , Mr Johnson went one better than irritating the people of a city by managing to annoy an entire country .
LONDON - Reviled by the public and spurned in private , bankers have been looking for solace in adultery , according to a dating Web site for people seeking affairs .
McChrystal and General David Petraeus , head of the U.S. Central Command , were participating in the talks .
The Pennsylvania Senator has endured two bouts of Hodgkin 's lymphoma and the chemotherapy that goes with it , a couple of procedures for a recurrent benign brain tumor , and heart-bypass surgery that sent him into cardiac arrest .
Homegrown signs like Johnny Cueto ( Dominican ) and Joey Votto ( Canada ) should be regulars in the Queen City for some time , and they soon could be joined by the likes of Cuban Yonder Alonso , Puerto Rican Neftali Soto , Venezuelan Yorman Rodriguez , and Dominicans , Juan Francisco and Juan Duran .
When he maintained that Vidal 's writing was " no more interesting than the contents of the stomach of an intellectual cow , " they booed heartily .
Three policemen and three assailants were killed in a gunfight in front of the consulate .
( AP ) Police say a large explosion has killed at least three people and heavily damaged a building housing a federal investigative agency in Pakistan 's east .
He beat them 6-3 in his season debut , then got a no-decision on May 12 .
A This all goes back to your lease and how well you read through it !
In some cases , shirt and skirt were conjoined , no longer separates , which looked like a speedy solution for the time-poor getting up in the morning .
Reformist Web sites said he had been held at Kahrizak prison , where much of the alleged prisoner abuse took place , and his jaw was broken when his father received his body .
3 ( UPI ) -- Authorities in Georgia said a man dressed as an elf was arrested for allegedly telling a mall Santa that he was carrying explosives .
Swedish midfielder Ljungberg had already given Arsenal the lead in this FA Cup quarterfinal but , with Bolton steaming forward for an equalizer , he then astonishingly scooped the ball over the crossbar from two yards out with the goal gaping .
" Democrats have realized if there is any time they can steal this seat , now is the time to do it , " Hoffman said .
Ferrell took notes as the messaging gurus outlined their options : focus on opinion-makers inside the Beltway with high-frequency ads in Capitol Hill papers .
The deal should become a firm order shortly .
England will carefully watch the early game today -- between Ireland and Sri Lanka -- and consider whether to play an additional pace bowler , Ryan Sidebottom .
COLUMBUS , Ohio ( AP ) - Penn State , unbeaten and unbowed , proved it belongs in the middle of any national championship talk .
The field , once developed , will have 70 wells -- half production and half injection , two-thirds of which will be drilled from an artificially constructed island 2.8 miles from the coast .
That is the first development from the release of the Bowl Championship Series standings yesterday , as Ohio State holds the top spot and South Florida is No. 2 thanks to strong numbers in the computer polls .
Especially given that spaffing is very much frowned upon in the classroom .
That said , the outlook remains broadly encouraging .
LOUIS - Wicked thunderstorms with wind reaching speeds of 120 mph pushed through parts of the Midwest on Friday , leaving four people dead , collapsing a church and knocking out power to thousands , authorities said .
A long-troubled $ 14 billion program to build a landing craft for the Marine Corps is destined for the chopping block , defense officials and analysts said Wednesday , part of $ 100 billion in savings that Defense Secretary Robert M. Gates has pledged to squeeze from the Pentagon 's budget .
Has become disappointing after showing early promise , but he would hardly be the first to turn over a new leaf after entering the care of Tim Vaughan .
Marina rarely leaves her two-room home in northern Israel these days .
Lopez acknowledged that votes for Cruz were nullified , but claims they added up to only 8 ballots of about 100 cast in this largely unpaved village of about 1,500 people .
The map remained in the Wolfegg collection for the next hundred years - until 2003 , when the US Library of Congress announced , with great fanfare , that it had acquired the map from the castle 's owner for the staggering sum of $ 10m .
He appeared in four games before being optioned back to Triple-A Louisville .
We support action that would lead to more places being made available in February .
17 / PRNewswire-USNewswire / -- Quitting is hard and staying quit is even harder .
Also present were some of Africa 's most controversial leaders .
Why did you ever leave ?
But that doesn 't mean they won 't have anything to say .
He took part in a charity ' guess the weight ' contest , where organisers had to use specialist gear from a stone landscaping company usually used to weigh lorries to calculate his immense mass .
On Thursday , employees at Quinn Insurance said a meeting with administrators in Enniskillen earlier was " positive . "
The service is being upgraded to give more homes channel Five , but 460,000 households are expected to lose access to ITV3 and ITV4 instead .
For outdoorsy dads , think beyond the standard barbecue accessories .
The strong sequential increase in NI instrument control revenue indicates that the overall test and measurement industry may have bottomed out in Q2 and begun a sequential recovery in Q3 . Product revenue was $ 152 million , down 24 percent from Q3 2008 , and software maintenance revenue was $ 13 million , down 9 percent year-over-year .
He said NASA 's funding plans would put NASA back on track as a " big-picture innovator " in technology development that could create future growth .
Jane Seymour also recounted for WebMD her struggles conceiving twin boys at age 44 .
Investors sold off shares , unimpressed by the economic stimulus plan President Bush announced Friday .
There is also a certain innocence and sweetness about Moretz 's performance .
Morgan Stanley said it had $ 198.2 billion in capital as of Feb .
The dead man was wearing a wetsuit when his remains were found on Tuesday afternoon at Camasunary on Loch Slapin .
The FTSE 100 index had fallen nearly 17 per cent since September 30 by Wednesday 's close , and the market opened a further 5 per cent lower on Thursday .
Then they found and beat another white man so bad that over a month later , he is still in the hospital .
In a March 2006 attack in Mahmoudiya , about 20 miles south of Baghdad , Green and three other soldiers went to the home of 14-year-old Abeer Qassim al-Janabi .
If others don 't step in quickly , Washington will need to twist their arms or do even more itself .
" We had the old brass bell from the previous lifeboat on board , so we upended it to use as the christening font .
Yet , the more effective the present clean-up seems , the more likely is it that central bankers will draw that lesson .
Playing the outsourcing card ( " It 's not our fault -- it 's this company we work with " ) ?
" The substance of life doesn 't change much from one culture to another , " she observes , " but the human soul requires a beautiful wrapper . "
Alaibi and her family , meanwhile , paid a heavy price for helping Americans .
The debate tells the story : " Now that Hoffman has emerged as the GOP 's best bet for holding the Republican seat , the Democratic candidate , Bill Owens , used a Thursday debate to tie Hoffman to the Club for Growth , an anti-tax group which has backed Hoffman , and ignored the Republican candidate , Dede Scozzafava , " ABC 's Teddy Davis writes .
CAVALIERS ' WALLACE DOUBTFUL Cavaliers forward Ben Wallace is doubtful for Game 3 of Cleveland 's playoff series against the Boston Celtics because of allergies and a left inner ear infection .
Instead , NASA would be asked to monitor climate change and develop a new rocket .
Speaking in oratorical cadences that evoke Sean Connery by way of John Huston , Mr. Day-Lewis 's oil baron , Daniel Plainview , dripping with black slime in the early scenes , is a fearsome creature whose soul rots as his income multiplies .
So far , AT & T , Verizon Communications and Sprint Nextel , which were embarrassed in 2005 after it was discovered they were cooperating with the National Security Agency in a warrantless surveillance scheme by the Bush administration , have not signed on .
Lincoln Cycling Grand Prix paid £ 10,000 for its police costs .
Emphasizing that it doesn 't need American firepower , a Pakistani general said an offensive along the frontier has killed more than 1,000 militants and predicted the region would be " stabilized " within two months .
14 ( UPI ) -- Barack Obama license plates are now available to Illinois residents who can 't get enough of the new U.S. president , state officials say .
Here in Vero Beach , at least there was , there is , sometimes , Koufax .
Reject portions of writings you don 't like and compliment others ; if you can. joint efforts .
Although Andrew Flintoff was periodically a handful , Hilfenhaus has been the best speedster in the contest .
On Saturday , the man - who was rolled into court for a remand hearing on a stretcher with his arm in bandages and leg in a splint - denied all charges against him , although he admitted to having been in the area .
It has also offered a $ 100,000 grant for animal habitat improvements at the Alaska Zoo if the relocation deal goes through .
And I think it would eventually drive the Taliban out of these areas , much the way it 's been done in some of the urban areas in Iraq through the inkblot strategy .
""")

In [79]:
# train a sentencepiece model on it
# the settings here are (best effort) those used for training Llama 2
import os
import sentencepiece as spm

options = dict(
  # input spec
  input="toy.txt",
  input_format="text",
  # output spec
  model_prefix="tok400", # output filename prefix
  # algorithm spec
  # BPE alg
  model_type="bpe",
  vocab_size=400,
  # normalization
  normalization_rule_name="identity", # ew, turn off normalization
  remove_extra_whitespaces=True,
  input_sentence_size=200000000, # max number of training sentences
  max_sentence_length=4192, # max number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # rare word treatment
  character_coverage=0.99995,
  byte_fallback=True, # encode characters outside the vocab as UTF-8 byte tokens instead of <unk>
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=False, # if True, prepends a space to every "sentence" so that 'world' and ' world' tokenize the same; off here
  allow_whitespace_only_pieces=True,
  # special tokens
  unk_id=0, # the UNK token MUST exist
  bos_id=1, # the others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # systems
  num_threads=os.cpu_count(), # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)
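
A rough budget for what `vocab_size=400` leaves for actual BPE merges. This is only a sketch: it assumes each enabled special token and each of the 256 byte-fallback pieces takes one id out of `vocab_size`, which is consistent with the vocab listing that follows (`<0x00>` sits at id 3, right after `<unk>`, `<s>`, `</s>`). The helper `remaining_vocab_slots` is hypothetical, not part of sentencepiece.

```python
# Hypothetical helper (not a sentencepiece API): how many ids are left for
# BPE merges and raw character pieces once the reserved pieces are counted.
def remaining_vocab_slots(vocab_size, special_ids, byte_fallback):
    # an id of -1 means that special token is turned off (e.g. pad_id=-1)
    n_special = sum(1 for i in special_ids if i >= 0)
    # byte_fallback adds one piece per possible byte value
    n_bytes = 256 if byte_fallback else 0
    return vocab_size - n_special - n_bytes

# unk_id=0, bos_id=1, eos_id=2, pad_id=-1, as in the options above
slots = remaining_vocab_slots(400, special_ids=[0, 1, 2, -1], byte_fallback=True)
print(slots)  # 141
```

So with these settings most of the 400 ids are spoken for before any merging happens, which is why a toy vocab this small falls back to bytes so readily.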


In [80]:
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab

[['<unk>', 0],
 ['<s>', 1],
 ['</s>', 2],
 ['<0x00>', 3],
 ['<0x01>', 4],
 ['<0x02>', 5],
 ['<0x03>', 6],
 ['<0x04>', 7],
 ['<0x05>', 8],
 ['<0x06>', 9],
 ['<0x07>', 10],
 ['<0x08>', 11],
 ['<0x09>', 12],
 ['<0x0A>', 13],
 ['<0x0B>', 14],
 ['<0x0C>', 15],
 ['<0x0D>', 16],
 ['<0x0E>', 17],
 ['<0x0F>', 18],
 ['<0x10>', 19],
 ['<0x11>', 20],
 ['<0x12>', 21],
 ['<0x13>', 22],
 ['<0x14>', 23],
 ['<0x15>', 24],
 ['<0x16>', 25],
 ['<0x17>', 26],
 ['<0x18>', 27],
 ['<0x19>', 28],
 ['<0x1A>', 29],
 ['<0x1B>', 30],
 ['<0x1C>', 31],
 ['<0x1D>', 32],
 ['<0x1E>', 33],
 ['<0x1F>', 34],
 ['<0x20>', 35],
 ['<0x21>', 36],
 ['<0x22>', 37],
 ['<0x23>', 38],
 ['<0x24>', 39],
 ['<0x25>', 40],
 ['<0x26>', 41],
 ['<0x27>', 42],
 ['<0x28>', 43],
 ['<0x29>', 44],
 ['<0x2A>', 45],
 ['<0x2B>', 46],
 ['<0x2C>', 47],
 ['<0x2D>', 48],
 ['<0x2E>', 49],
 ['<0x2F>', 50],
 ['<0x30>', 51],
 ['<0x31>', 52],
 ['<0x32>', 53],
 ['<0x33>', 54],
 ['<0x34>', 55],
 ['<0x35>', 56],
 ['<0x36>', 57],
 ['<0x37>', 58],
 ['<0x38>', 5

In [81]:
ids = sp.encode("hello 안녕하세요")
print(ids)

[262, 330, 330, 326, 320, 239, 152, 139, 238, 136, 152, 240, 152, 155, 239, 135, 187, 239, 157, 151]


In [82]:
print([sp.id_to_piece(idx) for idx in ids])

['he', 'l', 'l', 'o', '▁', '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>']
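
The Korean part never made it into the vocab, so it came out as byte-fallback pieces. To see that nothing was lost, here is a small sketch that turns the pieces above back into text by hand. It does not call sentencepiece; `pieces_to_text` is a hypothetical helper, and the `+3` offset is an assumption read off the vocab listing (`<0x00>` is id 3).

```python
# Pieces and ids copied from the two cells above.
pieces = ['he', 'l', 'l', 'o', '▁',
          '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>',
          '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>',
          '<0xEC>', '<0x9A>', '<0x94>']
ids = [262, 330, 330, 326, 320, 239, 152, 139, 238, 136, 152,
       240, 152, 155, 239, 135, 187, 239, 157, 151]

def pieces_to_text(pieces):
    """Hypothetical helper: rebuild the string from sentencepiece-style pieces."""
    out = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            out.append(int(p[3:5], 16))  # byte-fallback piece -> raw byte
        else:
            # '▁' (U+2581) is how sentencepiece writes a space
            out.extend(p.replace("▁", " ").encode("utf-8"))
    return out.decode("utf-8")

text = pieces_to_text(pieces)
print(text)  # hello 안녕하세요

# The byte tokens sit right after the 3 special tokens, so id = byte value + 3.
BYTE_OFFSET = 3
korean = bytes(i - BYTE_OFFSET for i in ids[5:]).decode("utf-8")
print(korean)  # 안녕하세요
```

Each Korean character costs 3 tokens here (its 3 UTF-8 bytes), so the 5 characters take 15 tokens while "hello" takes 4.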
