# Introduction

This notebook follows Andrej Karpathy's [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE) YouTube video, with some interesting links and plenty of explanatory notes generated by ChatGPT.

[Andrej Karpathy's colab](https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing#scrollTo=pkAPaUCXOhvW)

As always, you should watch the video and work through it yourself for maximum benefit.


Other Links:


[GPT-2 paper: Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)

[Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)

# Tokenization webapp

Good tokenization web app: [https://tiktokenizer.vercel.app](https://tiktokenizer.vercel.app)


# Unicode

[A Programmer’s Introduction to Unicode](https://www.reedbeta.com/blog/programmers-intro-to-unicode/)

[UTF-8 Everywhere](https://utf8everywhere.org/)

In [1]:
[ord(x) for x in "hello"]

[104, 101, 108, 108, 111]

In [30]:
chr(104)

'h'
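
`ord` and `chr` are inverses over Unicode code points, not just ASCII. A minimal sketch of that round trip; the sample string and variable names are illustrative and not part of the original notebook:

```python
# Sample string: 'h', 'e' with acute accent (U+00E9), and the grinning emoji (U+1F604).
sample = "h\u00e9\U0001F604"

code_points = [ord(ch) for ch in sample]
for ch, cp in zip(sample, code_points):
    # chr() undoes ord() for every code point, including those above 255.
    assert chr(cp) == ch
    print(repr(ch), cp, f"U+{cp:04X}")
```

Code points above 255 do not fit in a single byte, which is why an encoding such as UTF-8 is needed before the text can be treated as a byte sequence.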

In [9]:
print(list("h".encode("utf-8")))
print(list("h".encode("utf-16")))
print(list("h".encode("utf-32")))
print(list("hello".encode("utf-8")))
print(list("hello".encode("utf-16")))
print(list("hello".encode("utf-32")))
print(list("😄".encode("utf-8")))

[104]
[255, 254, 104, 0]
[255, 254, 0, 0, 104, 0, 0, 0]
[104, 101, 108, 108, 111]
[255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0]
[255, 254, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0]
[240, 159, 152, 132]
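
The leading `255, 254` in the UTF-16 and UTF-32 outputs above is a byte-order mark (BOM), which Python's plain `utf-16` and `utf-32` codecs prepend. A small sketch, assuming you want to see the encoded units without the BOM by naming the byte order explicitly (variable names are illustrative):

```python
import codecs

# The little-endian UTF-16 BOM is the two bytes 0xFF 0xFE.
bom_le = list(codecs.BOM_UTF16_LE)

# Naming the byte order in the codec suppresses the BOM.
h_le = list("h".encode("utf-16-le"))
h_be = list("h".encode("utf-16-be"))
h32_le = list("h".encode("utf-32-le"))

print(bom_le)   # [255, 254]
print(h_le)     # [104, 0]
print(h_be)     # [0, 104]
print(h32_le)   # [104, 0, 0, 0]
```

The zero padding is what makes UTF-16 and UTF-32 wasteful for mostly-ASCII text, and it is one reason the rest of the notebook works on UTF-8 bytes.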


In [8]:
# a short string mixing an emoji (4 bytes in UTF-8) with ASCII characters (1 byte each)
text = "😄 hello"
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(list(tokens))
print('---')
print(text)
print("length:", len(text))
print('---')
print(tokens)
print("length:", len(tokens))

---
[240, 159, 152, 132, 32, 104, 101, 108, 108, 111]
---
😄 hello
length: 7
---
[240, 159, 152, 132, 32, 104, 101, 108, 108, 111]
length: 10
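
The gap between 7 and 10 comes from the emoji, which takes 4 bytes in UTF-8 while each ASCII character takes 1. A quick sketch that makes the per-character cost explicit and checks that the byte list decodes back to the original string (names are illustrative):

```python
text = "\U0001F604 hello"  # the same string as above: emoji, space, "hello"
tokens = list(text.encode("utf-8"))

# Number of UTF-8 bytes each character contributes.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
print(bytes_per_char)        # [4, 1, 1, 1, 1, 1, 1]
print(sum(bytes_per_char))   # 10

# The byte list is lossless: decoding it recovers the text exactly.
roundtrip = bytes(tokens).decode("utf-8")
assert roundtrip == text
```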


In [11]:
# text from https://www.reedbeta.com/blog/programmers-intro-to-unicode/
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(text)
print("length:", len(text))
print('---')
print(tokens)
print("length:", len(tokens))

---
Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
length: 533
---
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140

# Byte-pair encoding

[https://en.wikipedia.org/wiki/Byte_pair_encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)

In [12]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]): # Pythonic way to iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# print(stats)
print(sorted(((v,k) for k,v in stats.items()), reverse=True))

[(20, (101, 32)), (15, (240, 159)), (12, (226, 128)), (12, (105, 110)), (10, (115, 32)), (10, (97, 110)), (10, (32, 97)), (9, (32, 116)), (8, (116, 104)), (7, (159, 135)), (7, (159, 133)), (7, (97, 114)), (6, (239, 189)), (6, (140, 240)), (6, (128, 140)), (6, (116, 32)), (6, (114, 32)), (6, (111, 114)), (6, (110, 103)), (6, (110, 100)), (6, (109, 101)), (6, (104, 101)), (6, (101, 114)), (6, (32, 105)), (5, (117, 115)), (5, (115, 116)), (5, (110, 32)), (5, (100, 101)), (5, (44, 32)), (5, (32, 115)), (4, (116, 105)), (4, (116, 101)), (4, (115, 44)), (4, (114, 105)), (4, (111, 117)), (4, (111, 100)), (4, (110, 116)), (4, (110, 105)), (4, (105, 99)), (4, (104, 97)), (4, (103, 32)), (4, (101, 97)), (4, (100, 32)), (4, (99, 111)), (4, (97, 109)), (4, (85, 110)), (4, (32, 119)), (4, (32, 111)), (4, (32, 102)), (4, (32, 85)), (3, (118, 101)), (3, (116, 115)), (3, (116, 114)), (3, (116, 111)), (3, (114, 116)), (3, (114, 115)), (3, (114, 101)), (3, (111, 102)), (3, (111, 32)), (3, (108, 108)), (

In [13]:
top_pair = max(stats, key=stats.get)
top_pair

(101, 32)
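
To see what a pair of byte values means as text, it can be decoded directly. A minimal sketch with a hypothetical helper `show_pair`; it only applies to raw byte pairs (values below 256), and pairs that are fragments of a multi-byte character decode to the replacement character U+FFFD:

```python
def show_pair(pair):
    """Render a pair of raw byte values as text (hypothetical helper).

    Incomplete UTF-8 sequences are shown as U+FFFD instead of raising.
    """
    return bytes(pair).decode("utf-8", errors="replace")

# Three pairs taken from the sorted statistics printed above.
readable = {pair: show_pair(pair) for pair in [(101, 32), (105, 110), (240, 159)]}
for pair, shown in readable.items():
    print(pair, repr(shown))
```

So the most frequent pair, `(101, 32)`, is the letter "e" followed by a space, while `(240, 159)` is the first half of a 4-byte emoji encoding and has no meaning on its own.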

In [14]:
def merge(ids, pair, idx):
  # in the list of ints (ids), replace all consecutive occurrences of pair with the new token idx
  newids = []
  i = 0
  while i < len(ids):
    # if we are not at the very last position AND the pair matches, replace it
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      newids.append(idx)
      i += 2
    else:
      newids.append(ids[i])
      i += 1
  return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))

tokens2 = merge(tokens, top_pair, 256)
print(tokens2)
print("length:", len(tokens2))

[5, 6, 99, 9, 1]
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140, 240, 159, 135, 169, 226, 128, 140, 240, 159, 135, 170, 33, 32, 240, 159, 152, 132, 32, 84, 104, 256, 118, 101, 114, 121, 32, 110, 97, 109, 256, 115, 116, 114, 105, 107, 101, 115, 32, 102, 101, 97, 114, 32, 97, 110, 100, 32, 97, 119, 256, 105, 110, 116, 111, 32, 116, 104, 256, 104, 101, 97, 114, 116, 115, 32, 111, 102, 32, 112, 114, 111, 103, 114, 97, 109, 109, 101, 114, 115, 32, 119, 111, 114, 108, 100, 119, 105, 100, 101, 46, 32, 87, 256, 97, 108, 108, 32, 107, 110, 111, 119, 32, 119, 256, 111, 117, 103, 104, 116, 32, 116, 111, 
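
One subtlety worth checking on a toy input: `get_stats` counts overlapping occurrences, while `merge` scans left to right and replaces non-overlapping ones. The sketch below repeats the two functions so that it runs on its own; the toy list is an illustrative assumption:

```python
def get_stats(ids):
    # Count every adjacent pair, including overlapping ones.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace each left-to-right, non-overlapping occurrence of pair with idx.
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = [1, 1, 1, 2]
stats = get_stats(ids)
merged = merge(ids, (1, 1), 99)
print(stats)   # {(1, 1): 2, (1, 2): 1}
print(merged)  # [99, 1, 2]
```

Here the pair `(1, 1)` is counted twice but merged only once, so the length drops by 1 rather than 2. For pairs of two different values, the count and the number of replacements agree.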

In [15]:
# @title Full tokenizer training text

# making the training text longer to have more representative token statistics
# text from https://www.reedbeta.com/blog/programmers-intro-to-unicode/
text = """A Programmer’s Introduction to Unicode March 3, 2017 · Coding · 22 Comments  Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.  A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view.  I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text. However, in this article I’m not going to talk about fonts, text layout/shaping/rendering, or localization in detail—those are separate issues, beyond my scope (and knowledge) here.  Diversity and Inherent Complexity The Unicode Codespace Codespace Allocation Scripts Usage Frequency Encodings UTF-8 UTF-16 Combining Marks Canonical Equivalence Normalization Forms Grapheme Clusters And More… Diversity and Inherent Complexity As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. It’s not just that Unicode contains a much larger number of characters, although that’s part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere “character set” to be. We’ll see some of that later in this article.  When confronting all this complexity, especially as an engineer, it’s hard not to find oneself asking, “Why do we need all this? 
Is this really necessary? Couldn’t it be simplified?”  However, Unicode aims to faithfully represent the entire world’s writing systems. The Unicode Consortium’s stated goal is “enabling people around the world to use computers in any language”. And as you might imagine, the diversity of written languages is immense! To date, Unicode supports 135 different scripts, covering some 1100 languages, and there’s still a long tail of over 100 unsupported scripts, both modern and historical, which people are still working to add.  Given this enormous diversity, it’s inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesn’t make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission.  Moreover, Unicode is committed not just to supporting texts in any single language, but also to letting multiple languages coexist within one text—which introduces even more complexity.  Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, you’ll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but don’t be discouraged—think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity!  The Unicode Codespace Let’s start with some general orientation. The basic elements of Unicode—its “characters”, although that term isn’t quite right—are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta. 
Each code point also has a short name, and quite a few other properties, specified in the Unicode Character Database.  The set of all possible code points is called the codespace. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them—about 12% of the codespace—are actually assigned, to date. There’s plenty of room for growth! Unicode also reserves an additional 137,468 code points as “private use” areas, which have no standardized meaning and are available for individual applications to define for their own purposes.  Codespace Allocation To get a feel for how the codespace is laid out, it’s helpful to visualize it. Below is a map of the entire codespace, with one pixel per code point. It’s arranged in tiles for visual coherence; each small square is 16×16 = 256 code points, and each large square is a “plane” of 65,536 code points. There are 17 planes altogether.  Map of the Unicode codespace (click to zoom)  White represents unassigned space. Blue is assigned code points, green is private-use areas, and the small red area is surrogates (more about those later). As you can see, the assigned code points are distributed somewhat sparsely, but concentrated in the first three planes.  Plane 0 is also known as the “Basic Multilingual Plane”, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Cyrillic, Greek, Han (Chinese), Japanese, Korean, Arabic, Hebrew, Devanagari (Indian), and many more.  (In the past, the codespace was just the BMP and no more—Unicode was originally conceived as a straightforward 16-bit encoding, with only 65,536 code points. It was expanded to its current size in 1996. However, the vast majority of code points in modern text belong to the BMP.)  Plane 1 contains historical scripts, such as Sumerian cuneiform and Egyptian hieroglyphs, as well as emoji and various other symbols. Plane 2 contains a large block of less-common and historical Han characters. 
The remaining planes are empty, except for a small number of rarely-used formatting characters in Plane 14; planes 15–16 are reserved entirely for private use.  Scripts Let’s zoom in on the first three planes, since that’s where the action is:  Map of scripts in Unicode planes 0–2 (click to zoom)  This map color-codes the 135 different scripts in Unicode. You can see how Han () and Korean () take up most of the range of the BMP (the left large square). By contrast, all of the European, Middle Eastern, and South Asian scripts fit into the first row of the BMP in this diagram.  Many areas of the codespace are adapted or copied from earlier encodings. For example, the first 128 code points of Unicode are just a copy of ASCII. This has clear benefits for compatibility—it’s easy to losslessly convert texts from smaller encodings into Unicode (and the other direction too, as long as no characters outside the smaller encoding are used).  Usage Frequency One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.  Heat map of code point usage frequency in Unicode planes 0–2 (click to zoom)  You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.  Encodings We’ve seen that Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do code points get represented as bytes, in memory or in a file?  
The most convenient, computer-friendliest (and programmer-friendliest) thing to do would be to just store the code point index as a 32-bit integer. This works, but it consumes 4 bytes per code point, which is sort of a lot. Using 32-bit ints for Unicode will cost you a bunch of extra storage, memory, and performance in bandwidth-bound scenarios, if you work with a lot of text.  Consequently, there are several more-compact encodings for Unicode. The 32-bit integer encoding is officially called UTF-32 (UTF = “Unicode Transformation Format”), but it’s rarely used for storage. At most, it comes up sometimes as a temporary internal representation, for examining or operating on the code points in a string.  Much more commonly, you’ll see Unicode text encoded as either UTF-8 or UTF-16. These are both variable-length encodings, made up of 8-bit or 16-bit units, respectively. In these schemes, code points with smaller index values take up fewer bytes, which saves a lot of memory for typical texts. The trade-off is that processing UTF-8/16 texts is more programmatically involved, and likely slower.  UTF-8 In UTF-8, each code point is stored using 1 to 4 bytes, based on its index value.  UTF-8 uses a system of binary prefixes, in which the high bits of each byte mark whether it’s a single byte, the beginning of a multi-byte sequence, or a continuation byte; the remaining bits, concatenated, give the code point index. This table shows how it works:  UTF-8 (binary)\tCode point (binary)\tRange 0xxxxxxx\txxxxxxx\tU+0000–U+007F 110xxxxx 10yyyyyy\txxxxxyyyyyy\tU+0080–U+07FF 1110xxxx 10yyyyyy 10zzzzzz\txxxxyyyyyyzzzzzz\tU+0800–U+FFFF 11110xxx 10yyyyyy 10zzzzzz 10wwwwww\txxxyyyyyyzzzzzzwwwwww\tU+10000–U+10FFFF A handy property of UTF-8 is that code points below 128 (ASCII characters) are encoded as single bytes, and all non-ASCII code points are encoded using sequences of bytes 128–255. This has a couple of nice consequences. 
First, any strings or files out there that are already in ASCII can also be interpreted as UTF-8 without any conversion. Second, lots of widely-used string programming idioms—such as null termination, or delimiters (newlines, tabs, commas, slashes, etc.)—will just work on UTF-8 strings. ASCII bytes never occur inside the encoding of non-ASCII code points, so searching byte-wise for a null terminator or a delimiter will do the right thing.  Thanks to this convenience, it’s relatively simple to extend legacy ASCII programs and APIs to handle UTF-8 strings. UTF-8 is very widely used in the Unix/Linux and Web worlds, and many programmers argue UTF-8 should be the default encoding everywhere.  However, UTF-8 isn’t a drop-in replacement for ASCII strings in all respects. For instance, code that iterates over the “characters” in a string will need to decode UTF-8 and iterate over code points (or maybe grapheme clusters—more about those later), not bytes. When you measure the “length” of a string, you’ll need to think about whether you want the length in bytes, the length in code points, the width of the text when rendered, or something else.  UTF-16 The other encoding that you’re likely to encounter is UTF-16. It uses 16-bit words, with each code point stored as either 1 or 2 words.  Like UTF-8, we can express the UTF-16 encoding rules in the form of binary prefixes:  UTF-16 (binary)\tCode point (binary)\tRange xxxxxxxxxxxxxxxx\txxxxxxxxxxxxxxxx\tU+0000–U+FFFF 110110xxxxxxxxxx 110111yyyyyyyyyy\txxxxxxxxxxyyyyyyyyyy + 0x10000\tU+10000–U+10FFFF A more common way that people talk about UTF-16 encoding, though, is in terms of code points called “surrogates”. All the code points in the range U+D800–U+DFFF—or in other words, the code points that match the binary prefixes 110110 and 110111 in the table above—are reserved specifically for UTF-16 encoding, and don’t represent any valid characters on their own. 
They’re only meant to occur in the 2-word encoding pattern above, which is called a “surrogate pair”. Surrogate code points are illegal in any other context! They’re not allowed in UTF-8 or UTF-32 at all.  Historically, UTF-16 is a descendant of the original, pre-1996 versions of Unicode, in which there were only 65,536 code points. The original intention was that there would be no different “encodings”; Unicode was supposed to be a straightforward 16-bit character set. Later, the codespace was expanded to make room for a long tail of less-common (but still important) Han characters, which the Unicode designers didn’t originally plan for. Surrogates were then introduced, as—to put it bluntly—a kludge, allowing 16-bit encodings to access the new code points.  Today, Javascript uses UTF-16 as its standard string representation: if you ask for the length of a string, or iterate over it, etc., the result will be in UTF-16 words, with any code points outside the BMP expressed as surrogate pairs. UTF-16 is also used by the Microsoft Win32 APIs; though Win32 supports either 8-bit or 16-bit strings, the 8-bit version unaccountably still doesn’t support UTF-8—only legacy code-page encodings, like ANSI. This leaves UTF-16 as the only way to get proper Unicode support in Windows. (Update: in Win10 version 1903, they finally added UTF-8 support to the 8-bit APIs! 😊)  By the way, UTF-16’s words can be stored either little-endian or big-endian. Unicode has no opinion on that issue, though it does encourage the convention of putting U+FEFF zero width no-break space at the top of a UTF-16 file as a byte-order mark, to disambiguate the endianness. (If the file doesn’t match the system’s endianness, the BOM will be decoded as U+FFFE, which isn’t a valid code point.)  Combining Marks In the story so far, we’ve been focusing on code points. But in Unicode, a “character” can be more complicated than just an individual code point!  
Unicode includes a system for dynamically composing characters, by combining multiple code points together. This is used in various ways to gain flexibility without causing a huge combinatorial explosion in the number of code points.  In European languages, for example, this shows up in the application of diacritics to letters. Unicode supports a wide range of diacritics, including acute and grave accents, umlauts, cedillas, and many more. All these diacritics can be applied to any letter of any alphabet—and in fact, multiple diacritics can be used on a single letter.  If Unicode tried to assign a distinct code point to every possible combination of letter and diacritics, things would rapidly get out of hand. Instead, the dynamic composition system enables you to construct the character you want, by starting with a base code point (the letter) and appending additional code points, called “combining marks”, to specify the diacritics. When a text renderer sees a sequence like this in a string, it automatically stacks the diacritics over or under the base letter to create a composed character.  For example, the accented character “Á” can be expressed as a string of two code points: U+0041 “A” latin capital letter a plus U+0301 “◌́” combining acute accent. This string automatically gets rendered as a single character: “Á”.  Now, Unicode does also include many “precomposed” code points, each representing a letter with some combination of diacritics already applied, such as U+00C1 “Á” latin capital letter a with acute or U+1EC7 “ệ” latin small letter e with circumflex and dot below. I suspect these are mostly inherited from older encodings that were assimilated into Unicode, and kept around for compatibility. In practice, there are precomposed code points for most of the common letter-with-diacritic combinations in European-script languages, so they don’t use dynamic composition that much in typical text.  
Still, the system of combining marks does allow for an arbitrary number of diacritics to be stacked on any base character. The reductio-ad-absurdum of this is Zalgo text, which works by ͖͟ͅr͞aṋ̫̠̖͈̗d͖̻̹óm̪͙͕̗̝ļ͇̰͓̳̫ý͓̥̟͍ ̕s̫t̫̱͕̗̰̼̘͜a̼̩͖͇̠͈̣͝c̙͍k̖̱̹͍͘i̢n̨̺̝͇͇̟͙ģ̫̮͎̻̟ͅ ̕n̼̺͈͞u̮͙m̺̭̟̗͞e̞͓̰̤͓̫r̵o̖ṷs҉̪͍̭̬̝̤ ̮͉̝̞̗̟͠d̴̟̜̱͕͚i͇̫̼̯̭̜͡ḁ͙̻̼c̲̲̹r̨̠̹̣̰̦i̱t̤̻̤͍͙̘̕i̵̜̭̤̱͎c̵s ͘o̱̲͈̙͖͇̲͢n͘ ̜͈e̬̲̠̩ac͕̺̠͉h̷̪ ̺̣͖̱ḻ̫̬̝̹ḙ̙̺͙̭͓̲t̞̞͇̲͉͍t̷͔̪͉̲̻̠͙e̦̻͈͉͇r͇̭̭̬͖,̖́ ̜͙͓̣̭s̘̘͈o̱̰̤̲ͅ ̛̬̜̙t̼̦͕̱̹͕̥h̳̲͈͝ͅa̦t̻̲ ̻̟̭̦̖t̛̰̩h̠͕̳̝̫͕e͈̤̘͖̞͘y҉̝͙ ̷͉͔̰̠o̞̰v͈͈̳̘͜er̶f̰͈͔ḻ͕̘̫̺̲o̲̭͙͠ͅw̱̳̺ ͜t̸h͇̭͕̳͍e̖̯̟̠ ͍̞̜͔̩̪͜ļ͎̪̲͚i̝̲̹̙̩̹n̨̦̩̖ḙ̼̲̼͢ͅ ̬͝s̼͚̘̞͝p͙̘̻a̙c҉͉̜̤͈̯̖i̥͡n̦̠̱͟g̸̗̻̦̭̮̟ͅ ̳̪̠͖̳̯̕a̫͜n͝d͡ ̣̦̙ͅc̪̗r̴͙̮̦̹̳e͇͚̞͔̹̫͟a̙̺̙ț͔͎̘̹ͅe̥̩͍ a͖̪̜̮͙̹n̢͉̝ ͇͉͓̦̼́a̳͖̪̤̱p̖͔͔̟͇͎͠p̱͍̺ę̲͎͈̰̲̤̫a̯͜r̨̮̫̣̘a̩̯͖n̹̦̰͎̣̞̞c̨̦̱͔͎͍͖e̬͓͘ ̤̰̩͙̤̬͙o̵̼̻̬̻͇̮̪f̴ ̡̙̭͓͖̪̤“̸͙̠̼c̳̗͜o͏̼͙͔̮r̞̫̺̞̥̬ru̺̻̯͉̭̻̯p̰̥͓̣̫̙̤͢t̳͍̳̖ͅi̶͈̝͙̼̙̹o̡͔n̙̺̹̖̩͝ͅ”̨̗͖͚̩.̯͓  A few other places where dynamic character composition shows up in Unicode:  Vowel-pointing notation in Arabic and Hebrew. In these languages, words are normally spelled with some of their vowels left out. They then have diacritic notation to indicate the vowels (used in dictionaries, language-teaching materials, children’s books, and such). These diacritics are expressed with combining marks.  A Hebrew example, with niqqud:\tאֶת דַלְתִּי הֵזִיז הֵנִיעַ, קֶטֶב לִשְׁכַּתִּי יָשׁוֹד Normal writing (no niqqud):\tאת דלתי הזיז הניע, קטב לשכתי ישוד Devanagari, the script used to write Hindi, Sanskrit, and many other South Asian languages, expresses certain vowels as combining marks attached to consonant letters. For example, “ह” + “\u200bि” = “हि” (“h” + “i” = “hi”). Korean characters stand for syllables, but they are composed of letters called jamo that stand for the vowels and consonants in the syllable. While there are code points for precomposed Korean syllables, it’s also possible to dynamically compose them by concatenating their jamo. For example, “ᄒ” + “ᅡ” + “ᆫ” = “한” (“h” + “a” + “n” = “han”). 
Canonical Equivalence In Unicode, precomposed characters exist alongside the dynamic composition system. A consequence of this is that there are multiple ways to express “the same” string—different sequences of code points that result in the same user-perceived characters. For example, as we saw earlier, we can express the character “Á” either as the single code point U+00C1, or as the string of two code points U+0041 U+0301.  Another source of ambiguity is the ordering of multiple diacritics in a single character. Diacritic order matters visually when two diacritics apply to the same side of the base character, e.g. both above: “ǡ” (dot, then macron) is different from “ā̇” (macron, then dot). However, when diacritics apply to different sides of the character, e.g. one above and one below, then the order doesn’t affect rendering. Moreover, a character with multiple diacritics might have one of the diacritics precomposed and others expressed as combining marks.  For example, the Vietnamese letter “ệ” can be expressed in five different ways:  Fully precomposed: U+1EC7 “ệ” Partially precomposed: U+1EB9 “ẹ” + U+0302 “◌̂” Partially precomposed: U+00EA “ê” + U+0323 “◌̣” Fully decomposed: U+0065 “e” + U+0323 “◌̣” + U+0302 “◌̂” Fully decomposed: U+0065 “e” + U+0302 “◌̂” + U+0323 “◌̣” Unicode refers to set of strings like this as “canonically equivalent”. Canonically equivalent strings are supposed to be treated as identical for purposes of searching, sorting, rendering, text selection, and so on. This has implications for how you implement operations on text. For example, if an app has a “find in file” operation and the user searches for “ệ”, it should, by default, find occurrences of any of the five versions of “ệ” above!  
Normalization Forms To address the problem of “how to handle canonically equivalent strings”, Unicode defines several normalization forms: ways of converting strings into a canonical form so that they can be compared code-point-by-code-point (or byte-by-byte).  The “NFD” normalization form fully decomposes every character down to its component base and combining marks, taking apart any precomposed code points in the string. It also sorts the combining marks in each character according to their rendered position, so e.g. diacritics that go below the character come before the ones that go above the character. (It doesn’t reorder diacritics in the same rendered position, since their order matters visually, as previously mentioned.)  The “NFC” form, conversely, puts things back together into precomposed code points as much as possible. If an unusual combination of diacritics is called for, there may not be any precomposed code point for it, in which case NFC still precomposes what it can and leaves any remaining combining marks in place (again ordered by rendered position, as in NFD).  There are also forms called NFKD and NFKC. The “K” here refers to compatibility decompositions, which cover characters that are “similar” in some sense but not visually identical. However, I’m not going to cover that here.  Grapheme Clusters As we’ve seen, Unicode contains various cases where a thing that a user thinks of as a single “character” might actually be made up of multiple code points under the hood. Unicode formalizes this using the notion of a grapheme cluster: a string of one or more code points that constitute a single “user-perceived character”.  UAX #29 defines the rules for what, precisely, qualifies as a grapheme cluster. It’s approximately “a base code point followed by any number of combining marks”, but the actual definition is a bit more complicated; it accounts for things like Korean jamo, and emoji ZWJ sequences.  
The main thing grapheme clusters are used for is text editing: they’re often the most sensible unit for cursor placement and text selection boundaries. Using grapheme clusters for these purposes ensures that you can’t accidentally chop off some diacritics when you copy-and-paste text, that left/right arrow keys always move the cursor by one visible character, and so on.  Another place where grapheme clusters are useful is in enforcing a string length limit—say, on a database field. While the true, underlying limit might be something like the byte length of the string in UTF-8, you wouldn’t want to enforce that by just truncating bytes. At a minimum, you’d want to “round down” to the nearest code point boundary; but even better, round down to the nearest grapheme cluster boundary. Otherwise, you might be corrupting the last character by cutting off a diacritic, or interrupting a jamo sequence or ZWJ sequence.  And More… There’s much more that could be said about Unicode from a programmer’s perspective! I haven’t gotten into such fun topics as case mapping, collation, compatibility decompositions and confusables, Unicode-aware regexes, or bidirectional text. Nor have I said anything yet about implementation issues—how to efficiently store and look-up data about the sparsely-assigned code points, or how to optimize UTF-8 decoding, string comparison, or NFC normalization. Perhaps I’ll return to some of those things in future posts.  Unicode is a fascinating and complex system. It has a many-to-one mapping between bytes and code points, and on top of that a many-to-one (or, under some circumstances, many-to-many) mapping between code points and “characters”. It has oddball special cases in every corner. But no one ever claimed that representing all written languages was going to be easy, and it’s clear that we’re never going back to the bad old days of a patchwork of incompatible encodings.  
Further reading:  The Unicode Standard UTF-8 Everywhere Manifesto Dark corners of Unicode by Eevee ICU (International Components for Unicode)—C/C++/Java libraries implementing many Unicode algorithms and related things Python 3 Unicode Howto Google Noto Fonts—set of fonts intended to cover all assigned code points"""
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience

# Training the tokenizer using Byte-Pair Encoding

In [16]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
  newids = []
  i = 0
  while i < len(ids):
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      newids.append(idx)
      i += 2
    else:
      newids.append(ids[i])
      i += 1
  return newids

# ---
vocab_size = 276 # the desired final vocabulary size
num_merges = vocab_size - 256
ids = list(tokens) # copy so we don't destroy the original list

merges = {} # (int, int) -> int
for i in range(num_merges):
  stats = get_stats(ids)
  pair = max(stats, key=stats.get)
  idx = 256 + i
  print(f"merging {pair} into a new token {idx}")
  ids = merge(ids, pair, idx)
  merges[pair] = idx

merging (101, 32) into a new token 256
merging (105, 110) into a new token 257
merging (115, 32) into a new token 258
merging (116, 104) into a new token 259
merging (101, 114) into a new token 260
merging (99, 111) into a new token 261
merging (116, 32) into a new token 262
merging (226, 128) into a new token 263
merging (44, 32) into a new token 264
merging (97, 110) into a new token 265
merging (111, 114) into a new token 266
merging (100, 32) into a new token 267
merging (97, 114) into a new token 268
merging (101, 110) into a new token 269
merging (257, 103) into a new token 270
merging (261, 100) into a new token 271
merging (121, 32) into a new token 272
merging (46, 32) into a new token 273
merging (97, 108) into a new token 274
merging (259, 256) into a new token 275


In [17]:
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

tokens length: 24597
ids length: 19438
compression ratio: 1.27X


# Vocab

In [26]:
print(merges)

{(101, 32): 256, (105, 110): 257, (115, 32): 258, (116, 104): 259, (101, 114): 260, (99, 111): 261, (116, 32): 262, (226, 128): 263, (44, 32): 264, (97, 110): 265, (111, 114): 266, (100, 32): 267, (97, 114): 268, (101, 110): 269, (257, 103): 270, (261, 100): 271, (121, 32): 272, (46, 32): 273, (97, 108): 274, (259, 256): 275}


In [18]:
# Given a sequence of integers in the range [0, vocab_size), what is the text?

vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]






In [25]:
print(vocab)

{0: b'\x00', 1: b'\x01', 2: b'\x02', 3: b'\x03', 4: b'\x04', 5: b'\x05', 6: b'\x06', 7: b'\x07', 8: b'\x08', 9: b'\t', 10: b'\n', 11: b'\x0b', 12: b'\x0c', 13: b'\r', 14: b'\x0e', 15: b'\x0f', 16: b'\x10', 17: b'\x11', 18: b'\x12', 19: b'\x13', 20: b'\x14', 21: b'\x15', 22: b'\x16', 23: b'\x17', 24: b'\x18', 25: b'\x19', 26: b'\x1a', 27: b'\x1b', 28: b'\x1c', 29: b'\x1d', 30: b'\x1e', 31: b'\x1f', 32: b' ', 33: b'!', 34: b'"', 35: b'#', 36: b'$', 37: b'%', 38: b'&', 39: b"'", 40: b'(', 41: b')', 42: b'*', 43: b'+', 44: b',', 45: b'-', 46: b'.', 47: b'/', 48: b'0', 49: b'1', 50: b'2', 51: b'3', 52: b'4', 53: b'5', 54: b'6', 55: b'7', 56: b'8', 57: b'9', 58: b':', 59: b';', 60: b'<', 61: b'=', 62: b'>', 63: b'?', 64: b'@', 65: b'A', 66: b'B', 67: b'C', 68: b'D', 69: b'E', 70: b'F', 71: b'G', 72: b'H', 73: b'I', 74: b'J', 75: b'K', 76: b'L', 77: b'M', 78: b'N', 79: b'O', 80: b'P', 81: b'Q', 82: b'R', 83: b'S', 84: b'T', 85: b'U', 86: b'V', 87: b'W', 88: b'X', 89: b'Y', 90: b'Z', 91: b'[',

In [29]:
print(b'a'+b'c')

b'ac'


# Decoder

In [31]:
def decode(ids):
  # given ids (list of integers), return Python string
  tokens = b"".join(vocab[idx] for idx in ids)
  text = tokens.decode("utf-8", errors="replace")
  return text

print(decode([269, 265]))

enan


# Encoder

In [32]:
merges

{(101, 32): 256,
 (105, 110): 257,
 (115, 32): 258,
 (116, 104): 259,
 (101, 114): 260,
 (99, 111): 261,
 (116, 32): 262,
 (226, 128): 263,
 (44, 32): 264,
 (97, 110): 265,
 (111, 114): 266,
 (100, 32): 267,
 (97, 114): 268,
 (101, 110): 269,
 (257, 103): 270,
 (261, 100): 271,
 (121, 32): 272,
 (46, 32): 273,
 (97, 108): 274,
 (259, 256): 275}

In [40]:
def encode(text):
    # Encode the input text as UTF-8 and convert the bytes to a list of integers.
    # Each integer (0..255) is one initial token, i.e. a single byte, not a single character.
    tokens = list(text.encode("utf-8"))

    # Continue attempting to merge tokens as long as there are at least two tokens left.
    # This allows for the possibility of further merging.
    while len(tokens) >= 2:
        # Calculate the frequency of adjacent pairs of tokens in the current list of tokens.
        stats = get_stats(tokens)

        # Find the pair with the lowest index in the 'merges' dictionary, indicating the next merge.
        # If a pair doesn't exist in 'merges', assign it an infinite index to prioritize pairs that can be merged.
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))

        # If the chosen pair does not exist in the merges dictionary, no further merging is possible.
        # Exit the loop to prevent infinite iteration.
        if pair not in merges:
            break

        # Retrieve the index that represents the merged pair from the merges dictionary.
        idx = merges[pair]

        # Merge the identified pair throughout the tokens, replacing occurrences of the pair with the new token.
        tokens = merge(tokens, pair, idx)

    # Return the final list of tokens after all possible merges have been applied.
    return tokens



In [41]:
print(decode(encode("hello world")))

hello world


In [42]:
text2 = decode(encode(text))
print(text2 == text)

True


# Examining GPT-2 tokenizer implementation


[GPT-2 paper: Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)

[GPT-2 Tokenizer implementation](https://github.com/openai/gpt-2/blob/master/src/encoder.py)



In [43]:
import regex as re
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))

['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']


In [44]:
example = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
print(re.findall(gpt2pat, example))

['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']


# Tiktoken library from OpenAI

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

In [2]:
!pip install tiktoken # added for colab

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0


In [46]:
import tiktoken

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!!"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("    hello world!!!"))

[220, 220, 220, 23748, 995, 10185]
[262, 24748, 1917, 12340]


# GPT-2 Vocab and Merges



In [1]:
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json

--2024-03-22 09:52:01--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 20.209.18.33
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|20.209.18.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [application/octet-stream]
Saving to: ‘vocab.bpe’


2024-03-22 09:52:01 (1.59 MB/s) - ‘vocab.bpe’ saved [456318/456318]

--2024-03-22 09:52:01--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 20.209.18.33
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|20.209.18.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘encoder.json’


2024-03-22 09:52:02 (2.35 MB/s) - ‘encoder.json’ saved [1042301

In [2]:
import os, json

with open('encoder.json', 'r') as f:
    encoder = json.load(f) # <--- ~equivalent to our "vocab"

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    bpe_data = f.read()
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
# ^---- ~equivalent to our "merges"

# Special Tokens

In [3]:
len(encoder) # 256 raw byte tokens. 50,000 merges. +1 special token

50257

In [4]:
encoder['<|endoftext|>'] # the only special token in use for the GPT-2 base model

50256

In [None]:
# https://github.com/openai/tiktoken/blob/main/src/lib.rs


# minbpe

[https://github.com/karpathy/minbpe](https://github.com/karpathy/minbpe)

In [3]:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 tokenizer
print(enc.encode("안녕하세요 👋 (hello in Korean!)"))
print(enc.decode(enc.encode("안녕하세요 👋 (hello in Korean!)")) == "안녕하세요 👋 (hello in Korean!)")
# match the above for your own tokenizer, and also implement a train() function

[31495, 230, 75265, 243, 92245, 62904, 233, 320, 15339, 304, 16526, 16715]
True


# Sentence Piece

[https://github.com/google/sentencepiece](https://github.com/google/sentencepiece)



In [4]:
import sentencepiece as spm

In [5]:
# write a toy.txt file with some random text
with open("toy.txt", "w", encoding="utf-8") as f:
  f.write("SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.")

In [6]:
# train a sentencepiece model on it
# the settings here are (best effort) those used for training Llama 2
import os

options = dict(
  # input spec
  input="toy.txt",
  input_format="text",
  # output spec
  model_prefix="tok400", # output filename prefix
  # algorithm spec
  # BPE alg
  model_type="bpe",
  vocab_size=400,
  # normalization
  normalization_rule_name="identity", # ew, turn off normalization
  remove_extra_whitespaces=False,
  input_sentence_size=200000000, # max number of training sentences
  max_sentence_length=4192, # max number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # rare word treatment
  character_coverage=0.99995,
  byte_fallback=True,
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True,
  allow_whitespace_only_pieces=True,
  # special tokens
  unk_id=0, # the UNK token MUST exist
  bos_id=1, # the others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # systems
  num_threads=os.cpu_count(), # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)


In [7]:
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab

[['<unk>', 0],
 ['<s>', 1],
 ['</s>', 2],
 ['<0x00>', 3],
 ['<0x01>', 4],
 ['<0x02>', 5],
 ['<0x03>', 6],
 ['<0x04>', 7],
 ['<0x05>', 8],
 ['<0x06>', 9],
 ['<0x07>', 10],
 ['<0x08>', 11],
 ['<0x09>', 12],
 ['<0x0A>', 13],
 ['<0x0B>', 14],
 ['<0x0C>', 15],
 ['<0x0D>', 16],
 ['<0x0E>', 17],
 ['<0x0F>', 18],
 ['<0x10>', 19],
 ['<0x11>', 20],
 ['<0x12>', 21],
 ['<0x13>', 22],
 ['<0x14>', 23],
 ['<0x15>', 24],
 ['<0x16>', 25],
 ['<0x17>', 26],
 ['<0x18>', 27],
 ['<0x19>', 28],
 ['<0x1A>', 29],
 ['<0x1B>', 30],
 ['<0x1C>', 31],
 ['<0x1D>', 32],
 ['<0x1E>', 33],
 ['<0x1F>', 34],
 ['<0x20>', 35],
 ['<0x21>', 36],
 ['<0x22>', 37],
 ['<0x23>', 38],
 ['<0x24>', 39],
 ['<0x25>', 40],
 ['<0x26>', 41],
 ['<0x27>', 42],
 ['<0x28>', 43],
 ['<0x29>', 44],
 ['<0x2A>', 45],
 ['<0x2B>', 46],
 ['<0x2C>', 47],
 ['<0x2D>', 48],
 ['<0x2E>', 49],
 ['<0x2F>', 50],
 ['<0x30>', 51],
 ['<0x31>', 52],
 ['<0x32>', 53],
 ['<0x33>', 54],
 ['<0x34>', 55],
 ['<0x35>', 56],
 ['<0x36>', 57],
 ['<0x37>', 58],
 ['<0x38>', 5

In [8]:
ids = sp.encode("hello 안녕하세요")
print(ids)

[362, 378, 361, 372, 358, 362, 239, 152, 139, 238, 136, 152, 240, 152, 155, 239, 135, 187, 239, 157, 151]


In [9]:
print([sp.id_to_piece(idx) for idx in ids])

['▁', 'h', 'e', 'l', 'lo', '▁', '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>']


# Notes

## a feature of UTF-8 that is particularly beneficial for software development

The author is explaining a feature of UTF-8, a character encoding system, that is particularly beneficial for software development, especially when dealing with strings (sequences of characters) and files. Let's break down the points made with examples for clarity.

### 1. ASCII Compatibility

- **UTF-8 Property**: In UTF-8, characters from the ASCII set (which includes English letters, digits, and some common symbols) are represented as single bytes, just like in the ASCII encoding. These characters have values ranging from 0 to 127. Meanwhile, characters that are not part of the ASCII set are encoded using sequences of two or more bytes, each within the range of 128 to 255.

- **Example**: The character 'A' is represented in ASCII (and thus also in UTF-8) as the decimal value 65. In contrast, the character '€' (Euro sign) is not in the ASCII set and is represented in UTF-8 by a sequence of bytes, specifically three bytes with values 226, 130, and 172.

### 2. No Conversion Required for ASCII

- **Point Explained**: Since UTF-8 and ASCII represent characters from the ASCII set in the same way, any file or string that is encoded in ASCII does not need to be converted when it is treated as UTF-8. This ensures compatibility and simplifies handling text data that only contains these characters.

- **Example**: If you have a text file containing only English letters and punctuation (all within the ASCII set), it can be opened, read, and processed by a program expecting UTF-8 encoded text without any issues or special handling.

### 3. Programming Idioms Work as Expected

- **Idioms Mentioned**: Programming idioms related to strings like null termination (using a byte with value 0 to mark the end of a string) and delimiters (using specific characters like newlines, tabs, commas, etc., to separate data) work seamlessly with UTF-8.

- **Why They Work**: In UTF-8, the bytes representing ASCII characters (including control characters used as delimiters or terminators) cannot appear as part of the sequence representing a non-ASCII character. This means that when a program searches a UTF-8 string for a specific ASCII character (like a newline or a null byte), it can be sure that it's found an actual delimiter or end of the string, not just part of a different, multi-byte character.

- **Example**: Consider a UTF-8 encoded text file with a mix of English text and emoji. Even though emojis are encoded with multiple bytes, you can still reliably use a single byte representing the newline character (value 10) to find the end of lines, because this byte value will not be part of any multi-byte sequence representing an emoji.

In summary, the author highlights UTF-8's backward compatibility with ASCII and how this compatibility ensures that traditional string handling techniques continue to work as expected, even in a system that supports a much wider range of characters.
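
The three properties above can be checked directly in Python. This is a minimal sketch; the sample string is made up for illustration and is not taken from the article or the video.

```python
# Sample text mixing ASCII, a non-ASCII symbol, an emoji, and a newline delimiter.
sample = "price: 5€ 😄\nnext line"
data = sample.encode("utf-8")

# 1. ASCII characters keep their ASCII byte values in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii") == b"A"
assert list("A".encode("utf-8")) == [65]

# 2. Non-ASCII characters become multi-byte sequences whose bytes are all in 128..255.
euro_bytes = list("€".encode("utf-8"))
print(euro_bytes)  # [226, 130, 172]
assert all(b >= 128 for b in euro_bytes)

# 3. A byte below 128 is therefore always a real ASCII character, so splitting
#    on the newline byte (value 10) is safe even when emoji are present.
lines = data.split(b"\n")
decoded_lines = [line.decode("utf-8") for line in lines]
print(decoded_lines)
```

Splitting the raw bytes and decoding each piece afterwards works only because byte 10 can never occur inside a multi-byte sequence.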

## Byte-Pair Encoding (BPE) as a solution to the challenges of tokenizing text data for language models

The lecturer introduces Byte-Pair Encoding (BPE) as a solution to the challenges of tokenizing text data for language models. The intuition behind this approach stems from a few key considerations:

1. **Need for Tokenization**: Language models, particularly those based on the Transformer architecture, require input in the form of integers, not raw strings. These integers correspond to tokens derived from the text, which in turn map to vectors in a lookup table. The vectors are what actually get fed into the Transformer.

2. **Support for Multilingual and Special Characters**: The goal is to accommodate a wide range of languages and special characters (including emojis), not just the English alphabet. This requires a system that can handle the vast array of Unicode characters, which are standardized representations of text characters globally.

3. **Challenges with Direct Unicode Utilization**: While Unicode provides a comprehensive set of over 150,000 characters across various languages and symbols, using their code points (unique numbers assigned to each character) directly as tokens is impractical. The reasons are twofold: the Unicode standard is continuously evolving, adding new characters and thus making the set of code points unstable over time; and directly using Unicode code points would result in an excessively large vocabulary size for the model to handle efficiently.

4. **Encoding Methods**: UTF-8, UTF-16, and UTF-32 are different methods of encoding Unicode characters into binary data. UTF-8 is preferred for its efficiency and compatibility with ASCII, but using its raw bytes directly as tokens presents issues. Treating each byte as a separate token caps the vocabulary at just 256 entries, and because UTF-8 uses one to four bytes per character, even relatively short strings turn into long token sequences. A Transformer can only attend over a finite context length, so these long sequences make processing inefficient.

5. **Byte-Pair Encoding as a Solution**: BPE addresses these challenges by compressing sequences of bytes into tokens in a way that balances the size of the vocabulary with the length of the tokenized sequences. It operates by iteratively merging the most frequent pairs of bytes (or byte sequences) in the text data into single new tokens. This method allows for efficient encoding of common sequences (including those representing multilingual text and special characters) into a manageable number of tokens, thereby facilitating efficient processing by language models without sacrificing the ability to represent a wide range of text.

In summary, the lecturer's intuition for introducing BPE is to find a pragmatic middle ground that enables language models to process diverse text efficiently, by overcoming the limitations of direct Unicode utilization and taking advantage of the flexibility and efficiency of UTF-8, all while maintaining a manageable vocabulary size and sequence length.
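
To make the trade-off between vocabulary size and sequence length concrete, the sketch below compares a few ways of turning the same string into integers. The sample string is an arbitrary choice for illustration; only built-in Python functionality is used.

```python
import sys

text = "hello 안녕하세요 😄"

# Option 1: one token per Unicode code point (short sequence, huge id space).
code_points = [ord(ch) for ch in text]

# Option 2: one token per encoded byte (tiny id space, longer sequence).
utf8_bytes = list(text.encode("utf-8"))
utf16_bytes = list(text.encode("utf-16-le"))  # "-le" avoids the byte-order mark
utf32_bytes = list(text.encode("utf-32-le"))

print("code points :", len(code_points), "largest id:", max(code_points))
print("id space for code points:", sys.maxunicode + 1)
print("utf-8 bytes :", len(utf8_bytes))
print("utf-16 bytes:", len(utf16_bytes))
print("utf-32 bytes:", len(utf32_bytes))
```

For this string the code-point sequence has 13 entries but needs ids up to 128516, while the UTF-8 byte sequence has 26 entries that all fit in 0..255. BPE starts from the byte sequence and shortens it by growing the vocabulary.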

## The lecturer's explanation on Byte-Pair Encoding (BPE)

The lecturer's explanation of Byte-Pair Encoding (BPE) and its implementation for compressing text sequences into a manageable size for language models highlights several main goals and intuitions:

### 1. **Efficiency in Text Representation**:
- **Goal**: To represent a lengthy sequence of characters or bytes in a more compact form without losing information.
- **Intuition**: By merging frequently occurring pairs of characters (or bytes) into single new tokens and adding these to the vocabulary, the original sequence can be significantly compressed. This process reduces the length of the input sequence while slightly expanding the vocabulary.

### 2. **Adaptability to Language Variability**:
- **Goal**: To handle the vast diversity of languages and special characters (including emojis) efficiently in text processing for language models.
- **Intuition**: BPE's dynamic vocabulary construction through the iterative merging process is adaptable to the peculiarities of different languages and symbols. This flexibility ensures that the encoding scheme can efficiently represent varied linguistic elements without overly inflating the sequence length or vocabulary size.

### 3. **Optimization of Computational Resources**:
- **Goal**: To manage the trade-off between vocabulary size and sequence length for the sake of computational efficiency in training and using language models.
- **Intuition**: A balanced approach to tokenization, where the vocabulary size is not too large to handle but the sequence length is not excessively long, optimizes the use of computational resources. It ensures that language models can process data efficiently, benefiting from both a rich vocabulary and manageable sequence lengths.

### 4. **Preparation for Language Model Training**:
- **Goal**: To preprocess text data in a way that makes it suitable for training state-of-the-art language models, particularly those based on Transformer architectures.
- **Intuition**: By converting raw text into a sequence of tokens that a language model can understand and process, BPE facilitates the efficient training and operation of these models. This preprocessing stage is crucial for enabling models to learn from and generate human-like text.

### 5. **Support for Dynamic and Evolving Language Use**:
- **Goal**: To ensure the tokenization scheme can evolve with the language, accommodating new characters, symbols, or usage patterns that emerge over time.
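- **Intuition**: Because byte-level BPE starts from the 256 possible byte values, any string that can be encoded as UTF-8 can still be tokenized, including characters or symbols that did not exist when the tokenizer was trained; such text simply falls back to more, shorter tokens. The learned merges themselves are fixed once training is done, so benefiting from new usage patterns means retraining the tokenizer on newer data.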

In conclusion, the lecturer's exploration of BPE and its implementation underscores a strategic approach to text tokenization aimed at enhancing the performance and efficiency of language models. This methodology is pivotal for dealing with the complexity and diversity of human language in computational contexts.
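
The trade-off between vocabulary size and sequence length can be seen on a toy input. The sketch below reuses the idea of the notebook's `get_stats` and `merge` functions; the `train` helper and the input string `aaabdaaabac` are assumptions made for illustration only.

```python
def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train(ids, num_merges):
    """Hypothetical helper: apply `num_merges` BPE merges, return (ids, merges)."""
    ids = list(ids)  # copy so the caller's list is untouched
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        pair = max(stats, key=stats.get)
        merges[pair] = 256 + i
        ids = merge(ids, pair, 256 + i)
    return ids, merges

raw = list("aaabdaaabac".encode("utf-8"))
lengths = []
for n in range(4):
    ids, merges = train(raw, n)
    lengths.append(len(ids))
    print(f"vocab size {256 + len(merges)} -> sequence length {len(ids)}")
```

Each extra merge adds one vocabulary entry and, on this input, removes two tokens from the sequence.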

## a byte string in Python

The notation `b'the '` represents a byte string in Python. A byte string is a sequence of bytes – essentially, a sequence of 8-bit values – that is used for storing binary data, including text that is encoded in a specific encoding scheme like UTF-8, ASCII, etc.

### Byte-String Explained

- **Syntax**: In Python, a byte string is denoted by a leading `b` before the opening quote of the string literal, e.g., `b'the '`. This indicates that what follows is not a standard string (which in Python 3 is Unicode) but a sequence of bytes.
- **Content**: For the example `b'the '`, it contains the bytes that represent the characters `t`, `h`, `e`, and a space in ASCII encoding. The ASCII encoding represents each character as a single byte, so this byte string is composed of 4 bytes in total.

### Comparison with Standard Strings

- **Standard String (`'the '`)**: In Python 3, a standard string is a sequence of Unicode characters (code points). It can include characters from practically any writing system in the world, from Latin letters to Chinese characters and emojis. CPython stores these characters internally using 1, 2, or 4 bytes per character, chosen per string according to its widest character; this is an implementation detail that is not visible from Python code.
- **Byte String (`b'the '`)**: A byte string is explicitly a sequence of bytes, not characters. Each element in a byte string is a byte (an integer in the range 0-255), and it does not inherently carry any encoding for converting it into characters. When you use a byte string to store text, you must know what encoding was used (e.g., ASCII, UTF-8) to interpret it correctly back into text.

### Key Differences

- **Type**: Standard strings are of type `str`, while byte strings are of type `bytes` in Python 3.
- **Usage**: Standard strings are used for text processing and manipulation in Python, taking advantage of Python's rich set of text-handling features. Byte strings are used when dealing with raw binary data, encoding-specific data, or interfacing with systems and files that require a specific byte format.
- **Encoding**: Standard strings can represent text in any language supported by Unicode. They are sequences of code points whose in-memory layout Python manages internally; an encoding such as UTF-8 or UTF-16 only comes into play when you call `.encode()` or read and write files. Byte strings, on the other hand, represent raw bytes, which can be used to store encoded text (in UTF-8, ASCII, etc.) but are also used for binary data that is not text at all.

In summary, `b'the '` is a byte string containing the ASCII-encoded bytes for the text "the ", and it differs from the standard string `'the '` in that it explicitly represents raw bytes rather than Unicode text characters.
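
The differences between `str` and `bytes` described above can be observed in a few lines. This is a small sketch; the word `héllo` is an arbitrary example chosen because it contains one non-ASCII character.

```python
s = "the "
b = b"the "

print(type(s), type(b))  # <class 'str'> <class 'bytes'>
print(list(b))           # [116, 104, 101, 32]
print(s[0], b[0])        # indexing str gives a str, indexing bytes gives an int

# For pure ASCII text, the ASCII and UTF-8 encodings produce identical bytes.
assert s.encode("ascii") == s.encode("utf-8") == b

# With non-ASCII text, the number of characters and the number of bytes differ.
word = "héllo"
encoded = word.encode("utf-8")
print(len(word), len(encoded))  # 5 characters, 6 bytes

# Bytes carry no encoding information: decoding with the wrong one gives garbage.
right = encoded.decode("utf-8")
wrong = encoded.decode("latin-1")
print(right, wrong)
```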

## adjacent byte pairs

Yes, in the context of a byte-pair encoding (BPE) vocabulary, it's entirely expected to see merged values that consist of varying numbers of characters, which correspondingly vary in the number of bytes. This is indeed related to UTF-8 and the nature of byte-pair encoding itself. Here’s why:

### Understanding UTF-8 Encoding

UTF-8 is a variable-width character encoding for Unicode. It can encode a single character using one to four bytes, depending on the character:

- **ASCII characters** (the most common in English text) are encoded using a single byte.
- **Characters from other alphabets and symbols** might require two, three, or even four bytes.

### Byte-Pair Encoding Merges

In BPE, merges are based on the frequency of adjacent byte pairs. Initially, each byte in the text is considered a separate token, and BPE doesn't differentiate between single-byte characters and multi-byte character sequences:

- **Single-byte characters**: These are typically ASCII characters, which can be directly represented by a single byte (e.g., `b'e'`, `b's'`).
- **Multi-byte characters or sequences**: These result from characters outside the basic ASCII range or from combining multiple bytes (characters) into a single token through the BPE process. For example, `b'\xe2\x80'` represents a two-byte sequence that is part of a character or punctuation in UTF-8 encoding that requires more than one byte.

### Examples of Merges

- **Two ASCII characters**: For example, `b'in'`, `b'th'`. Each character is a single byte, so these merges result in two-byte sequences.
- **Combination involving non-ASCII**: The merge `b'\xe2\x80'` involves bytes that individually do not represent ASCII characters but together might represent a single character or part of a character in UTF-8. For instance, `\xe2\x80` are the beginning bytes of characters like the em dash or quotation marks in their UTF-8 encoding.
- **Longer sequences**: Merges like `b'ing'`, `b'the '` involve more than two characters because, as the BPE algorithm progresses, it doesn't just merge single bytes but also previously merged sequences. This is why you see some merges resulting in three or more bytes, as they combine longer sequences of characters that were frequent in the dataset.

### Is This Expected?

Absolutely. The variability in the number of bytes per merge reflects the algorithm’s adaptation to the specific frequencies of character (byte) sequences in the text data. It's optimizing the encoding to be more efficient for common sequences, which can include both simple character pairs and more complex combinations resulting from UTF-8's variable-length encoding of characters.

This variability is a key feature of UTF-8 and BPE working together: UTF-8 allows for the efficient encoding of a vast range of characters using variable-length bytes, while BPE leverages this by dynamically merging frequent sequences, regardless of their original length in bytes.
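
The `b'\xe2\x80'` merge becomes less mysterious once you look at the bytes of some common typographic punctuation. The selection of characters below is an assumption for illustration; the training text may contain others with the same prefix.

```python
# A few punctuation characters, written as escapes so the code points are explicit.
chars = {
    "em dash": "\u2014",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "right single quote": "\u2019",
}

for name, ch in chars.items():
    encoded = ch.encode("utf-8")
    print(f"{name:20s} {encoded!r} {list(encoded)}")

# All of them share the same first two bytes, 226 and 128.
shared_prefixes = {ch.encode("utf-8")[:2] for ch in chars.values()}
print(shared_prefixes)

# On its own the two-byte prefix is not a complete character.
partial = b"\xe2\x80".decode("utf-8", errors="replace")
print(repr(partial))
```

Because the pair `(226, 128)` occurs every time any of these characters appears, it can be frequent enough to earn an early merge even though it never decodes to a character by itself.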

## Is the vocab a mapping from integers to byte strings?

Yes, in the context of the byte-pair encoding (BPE) implementation you've provided, the vocabulary (`vocab`) is indeed a mapping from integers to byte strings. Let's break down your questions for clarity:

### Is the vocab a mapping from integers to byte strings?

- **Yes.** The vocabulary (`vocab`) is a dictionary where each key is an integer representing a token's unique ID, and the associated value is a byte string representing the token's actual content. Initially, the vocabulary is populated with single-byte tokens (for bytes 0 through 255), directly mapping each possible byte value to a corresponding byte string. As the BPE process merges pairs of tokens, new entries are added to the vocabulary, mapping newly created token IDs (integers starting from 256) to their corresponding byte strings, which represent merged sequences of bytes.

### Are these byte strings also known as tokens?

- **Yes.** In the context of this implementation and in general terms within text processing and NLP, the byte strings in the `vocab` can indeed be considered tokens. A token, in this sense, is a unit of text, which can range from a single character to more complex sequences of characters. Initially, tokens correspond directly to single bytes (ASCII characters, or the individual bytes of multi-byte UTF-8 sequences), but as the algorithm progresses, tokens represent longer sequences of bytes that reflect merged character sequences.

### Are the tokens byte strings rather than bytes?

- **Yes, with clarification.** Initially, each token corresponds to a single byte and is represented as a byte string of length 1. However, as the BPE algorithm merges tokens, new tokens represent sequences of bytes, thus becoming byte strings of length greater than 1. It's important to note the distinction between a single byte and a byte string:
    - A **single byte** is an 8-bit value.
    - A **byte string** is a sequence of bytes. In Python, this is represented by the `bytes` type, which can contain one or more bytes.

The process of encoding text into tokens involves converting the original text into a sequence of token IDs, each of which stands for one of these byte strings and can therefore represent one or more original bytes/characters, especially after merges. The decoding process, as implemented in the `decode` function, reverses this by concatenating the byte strings corresponding to a sequence of token IDs and then decoding the resulting byte string as UTF-8 into a Python string.

In summary, the BPE process here creates a dynamic mapping of integers to byte strings (tokens), where these tokens can represent increasingly complex sequences of characters as the algorithm progresses, starting from single bytes and potentially growing to represent entire syllables or commonly occurring word parts.
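
The mapping from integers to byte strings can be rebuilt from a merge table in a few lines. The sketch below uses a made-up three-entry `merges` dictionary rather than the one trained in the notebook, and it relies on dictionaries preserving insertion order (Python 3.7+), so that each merge only refers to tokens that already exist.

```python
# Toy merge table, listed in the order the merges were created.
merges = {
    (116, 104): 256,  # b't' + b'h'   -> b'th'
    (101, 32): 257,   # b'e' + b' '   -> b'e '
    (256, 257): 258,  # b'th' + b'e ' -> b'the '
}

# Start with one entry per byte value, then add one entry per merge.
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    """Concatenate the byte strings for `ids` and decode them as UTF-8."""
    raw = b"".join(vocab[idx] for idx in ids)
    # errors="replace" turns invalid byte sequences into U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")

print(vocab[258])                   # b'the '
print(decode([258, 99, 97, 116]))   # the cat
print(repr(decode([128])))          # a lone continuation byte is not valid UTF-8
```

Token 128 is a valid id but not valid UTF-8 on its own, which is why the decoder uses `errors="replace"`.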

## Discussion: GPT-2 tokenizer

The transcript discusses the nuances of how the GPT-2 tokenizer handles text and why specific design choices were made to optimize its performance. Here are the key points to note, especially in the context of Byte-Pair Encoding (BPE) work:

1. **Use of BPE on Byte-Level Representation**: The transcript confirms that GPT-2 employs the Byte-Pair Encoding algorithm but operates at the byte level of UTF-8 encoded text. This approach allows GPT-2 to handle a wide variety of languages and symbols efficiently.

2. **Avoiding Suboptimal Merges**: A critical insight shared is the potential suboptimality of merging common words with adjacent punctuation (e.g., "dog," "dog!"), leading to an inflated vocabulary with semantically similar but technically different tokens. The GPT-2 team sought to avoid such merges, which could otherwise cluster tokens in a way that mixes semantic content with punctuation, potentially hampering the model's ability to understand and generate text accurately.

3. **Manual Enforcement of Merging Rules**: To address the issue of suboptimal merges, GPT-2 introduces manual rules that prevent certain types of characters from being merged together. This approach aims to maintain a clear distinction between different types of tokens (e.g., letters, numbers, punctuation) and ensure that merges are meaningful and contribute positively to the model's performance.

4. **Use of Regex Patterns**: The transcript dives into the implementation details, showing how regular expressions (regex) are used to enforce the aforementioned rules. The pattern matching is designed to segment the text into chunks that can be independently processed by the tokenizer, ensuring that merges do not occur across predefined boundaries (e.g., between letters and punctuation).

5. **Use of the `regex` Module Instead of `re`**: It's noted that GPT-2 uses the third-party `regex` package (imported as `re` for familiarity), a largely drop-in replacement for Python's standard `re` module that offers more powerful pattern matching capabilities, such as the Unicode property classes `\p{L}` (letters) and `\p{N}` (numbers) used in the GPT-2 pattern. This choice allows for more complex regex patterns that are crucial for implementing GPT-2's specific tokenization rules.

6. **Tokenization and Encoding Details**: The process described in the transcript involves first breaking down the input text into chunks based on the regex pattern, then encoding these chunks into tokens using the BPE algorithm, and finally concatenating the results. This method ensures that certain merges, deemed suboptimal, are explicitly prevented.

7. **Inference vs. Training Code**: The transcript mentions that the code discussed pertains to the inference phase of the tokenizer, not the training phase. The training code, which would detail exactly how the tokenizer was initially built and trained, was not released by OpenAI. This distinction highlights that the inference code applies pre-determined merges to new text, while the training code, which determined those merges, remains proprietary.

In summary, the transcript sheds light on the thoughtful considerations and technical strategies behind GPT-2's tokenizer design, particularly how it balances the flexibility of BPE with the need for structured and meaningful tokenization through manual rules and advanced regex patterns.
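
The chunk-then-encode flow from points 4 and 6 can be sketched with the standard library alone. The pattern below is only an approximation of the GPT-2 one: the original relies on `\p{L}` and `\p{N}` from the `regex` package, so `[^\W\d_]` and `\d` stand in for them here, and `split_chunks` is a hypothetical helper name.

```python
import re

# Stdlib approximation of the GPT-2 split pattern (assumption: the real
# pattern uses \p{L} and \p{N}, which need the third-party `regex` package).
APPROX_GPT2_PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + letters
    r"| ?\d+"                    # optional space + digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + punctuation/symbols
    r"|\s+(?!\S)|\s+"            # runs of whitespace
)

def split_chunks(text):
    """Split text into chunks; BPE merges would run inside each chunk only."""
    return APPROX_GPT2_PAT.findall(text)

chunks = split_chunks("dog, dog! dog?")
print(chunks)

# Each chunk is turned into bytes on its own, so no merge can ever join
# "dog" with the punctuation that follows it.
print([list(chunk.encode("utf-8")) for chunk in chunks])
```

Because the word and the punctuation land in different chunks, a merge such as "dog" + "!" cannot be learned or applied, which is the effect the manual rules are after.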

## Discussion: from GPT-2 to GPT-4 tokenizer

The transcript discusses the transition from GPT-2 to GPT-4 tokenizer implementations by OpenAI, highlighting the changes made in the latter to improve tokenization. Here are the key points:

1. **TikToken Library Introduction**: OpenAI's TikToken library is introduced as the official tool for tokenization with their language models, particularly GPT-4. The library provides inference only: it converts text to tokens (and back) using vocabularies that have already been trained, and it does not include code for training a new tokenizer.

2. **Whitespace Handling Changes**: A notable difference between the GPT-2 and GPT-4 tokenizers is how they handle whitespace. In GPT-4, runs of whitespace are merged into single tokens far more aggressively, whereas in GPT-2 whitespace remained largely unmerged. This changes how text is segmented, most visibly for indented text such as code.

3. **Regular Expression (Regex) Modifications**: The transcript points out that GPT-4's tokenizer utilizes a modified regex pattern for chunking text compared to GPT-2. This change is central to the new tokenizer's performance and affects how text is preprocessed for tokenization.

4. **Case Insensitivity and Apostrophe Handling**: An important update in GPT-4's regex pattern is case-insensitive matching for the apostrophe group, indicated by the `i` flag. GPT-2's pattern lists only lowercase suffixes such as `'s` and `'re`, so an uppercase contraction like `HOW'S` was chunked differently from `how's`. With the flag, contractions and possessives are recognized consistently regardless of letter casing.

5. **Numeric Sequences Treatment**: GPT-4's tokenizer limits the merging of numeric sequences to those with up to three digits. This decision prevents the formation of tokens from longer numeric sequences, which could be less useful and more sparsely represented in the training data. This change suggests an optimization aimed at improving token efficiency and model performance on numerical data.

6. **Documentation and Transparency**: The transcript expresses a concern over the lack of detailed documentation and rationale behind specific changes and decisions in the tokenizer's development. It underscores the challenges in understanding the motivations and implications of these updates without official explanations from OpenAI.

7. **Vocabulary Size Increase**: GPT-4's tokenizer expands the vocabulary size from approximately 50,000 to around 100,000 tokens. This significant increase allows for a richer representation of text, potentially enabling more nuanced understanding and generation capabilities in the GPT-4 model.

8. **Exploration and Analysis Approach**: The transcript conveys a hands-on approach to exploring the tokenizer's behavior and changes through examination of the code and regex patterns. However, it also highlights the complexity and nuanced understanding required to fully grasp the tokenizer's inner workings, suggesting the use of tools like ChatGPT and regex documentation for deeper analysis.

In summary, the transition from GPT-2 to GPT-4 tokenizers involves strategic adjustments to whitespace handling, regex patterns, and numeric sequence treatment, alongside an increase in vocabulary size. These changes reflect OpenAI's ongoing efforts to refine and optimize their models' language processing capabilities.
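
Two of the regex changes above (points 4 and 5) are easy to see in isolation. The fragments below are modeled on the contraction and digit parts of the two patterns, not the full patterns; `\d` stands in for `\p{N}` so the sketch runs with the standard `re` module, which is an assumption made for portability.

```python
import re

# Contractions: GPT-2 lists lowercase suffixes only; GPT-4 wraps them in a
# scoped case-insensitive group.
GPT2_CONTRACTIONS = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
GPT4_CONTRACTIONS = re.compile(r"'(?i:[sdmt]|ll|ve|re)")

print(GPT2_CONTRACTIONS.findall("HOW'S"))  # uppercase 'S is not recognized
print(GPT4_CONTRACTIONS.findall("HOW'S"))  # uppercase 'S is recognized

# Digits: \d stands in for \p{N} here (assumption, to stay stdlib-only).
GPT2_DIGITS = re.compile(r"\d+")
GPT4_DIGITS = re.compile(r"\d{1,3}")

print(GPT2_DIGITS.findall("1234567"))  # one long chunk
print(GPT4_DIGITS.findall("1234567"))  # chunks of at most three digits
```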

## Discussion: GPT-2 vocab and merges

The transcript provides a walkthrough of the GPT-2 `encoder.py` implementation by OpenAI, focusing on how the encoding and decoding processes work within the GPT-2 framework. Here are the key points to note:

1. **File Structure and Initial Setup**: The GPT-2 encoder setup involves loading two critical files: `encoder.json` and `vocab.bpe`. These files are essential for constructing the tokenizer. The `encoder.json` maps tokens to their corresponding integers, similar to a vocabulary object, while `vocab.bpe` contains the merge operations used during Byte Pair Encoding (BPE) tokenization.

2. **Encoder and Decoder Objects**: The core of the tokenization process is encapsulated in the `Encoder` class. This class handles the conversion between text and a sequence of token IDs (encoding) and vice versa (decoding). It employs a dictionary for encoding and its reverse for decoding, facilitating efficient translation between text and token IDs.

3. **Byte Encoder and Decoder**: Apart from the standard tokenizer, GPT-2 employs an additional layer of byte encoding and decoding. This layer maps each of the 256 possible byte values to a printable Unicode character and back, so that whitespace and control bytes can be represented as ordinary visible characters inside the vocabulary and merges files. The mapping is one-to-one, which keeps the process fully reversible.

4. **BPE Mechanism**: The Byte Pair Encoding process is implemented in the `bpe` function within the `Encoder` class. Starting from individual symbols, it repeatedly finds the adjacent pair with the lowest rank in the merge table (that is, the merge learned earliest during training) and replaces it with a single symbol, stopping when no remaining pair appears in the table. Pair frequencies matter only during training, where they determine the ranks; at inference the function simply replays those merges in order. This shortens the token sequence by combining commonly occurring sequences into single tokens.

5. **Regex Pattern for Tokenization**: GPT-2 uses a complex regex pattern to identify tokens within the text. This pattern aims to segment text into manageable pieces, including handling apostrophes, letters, numbers, and various punctuation marks, ensuring that the tokenization process respects linguistic and structural boundaries.

6. **Encoding and Decoding Functions**: The `encode` and `decode` functions illustrate how text is converted into a sequence of token IDs and how these IDs are mapped back to text, respectively. The encoding process involves finding matches for the regex pattern, byte encoding the text, and then applying the BPE process. The decoding process reverses this, turning token IDs back into text.

7. **Implementation Nuances**: The transcript points out that while the GPT-2 code might appear somewhat "messy" and includes some implementation details that might not seem immediately relevant (such as the additional layer of byte encoding/decoding), understanding the core algorithmic components—particularly the BPE process and how encoding/decoding operates—is crucial for grasping how GPT-2's tokenizer functions.
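
The rank-based merging from point 4 can be sketched in a few lines. This is a simplified illustration rather than the code in `encoder.py`: it works on plain characters instead of byte-encoded symbols, and the merge table `ranks` is made up for the example.

```python
def get_pairs(symbols):
    """Return the set of adjacent symbol pairs in a sequence."""
    return set(zip(symbols, symbols[1:]))

def bpe(symbols, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank in `ranks`."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = get_pairs(symbols)
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        first, second = best
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                merged.append(first + second)  # replace the pair with one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table: lower rank = learned earlier = applied first.
ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}
print(bpe("hello", ranks))
```

With this table, `l l` is merged first, then `h e`, then `he ll`, leaving two symbols.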

In summary, the GPT-2 `encoder.py` file showcases a sophisticated approach to text tokenization, leveraging Byte Pair Encoding and additional encoding layers to handle a wide range of text data. Understanding this implementation provides valuable insights into the mechanisms behind GPT-2's language processing capabilities.
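
The byte encoder layer from point 3 can be sketched as follows. The function is modeled on the `bytes_to_unicode` helper in OpenAI's released code, but treat it as an illustration under that assumption rather than a verbatim copy; `byte_decoder` and the sample text are added here for the round trip.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that already correspond to printable characters keep their own
    # code point: "!".."~", then the two printable Latin-1 ranges.
    bs = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Whitespace/control bytes are shifted to code points 256 and up.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

text = "hi there"
encoded = "".join(byte_encoder[b] for b in text.encode("utf-8"))
print(encoded)  # the space appears as a visible stand-in character
decoded = bytes(byte_decoder[ch] for ch in encoded).decode("utf-8")
print(decoded == text)
```

The visible stand-in for the space byte is why GPT-2 vocabulary entries often begin with an unusual-looking character.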

## Discussion: Special Tokens

[https://github.com/openai/tiktoken/blob/main/src/lib.rs](https://github.com/openai/tiktoken/blob/main/src/lib.rs)

[Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/pdf/2207.14255.pdf)

The transcript elaborates on the role and handling of special tokens in GPT-2 and GPT-4, detailing the intricacies of their implementation and their significance in language modeling and tokenization. Here are the essential takeaways:

1. **Purpose of Special Tokens**: Special tokens serve to delimit different parts of data or to introduce a special structure within the token streams. They play a crucial role in signaling to the language model the boundaries between distinct sections of text, such as the end of a document or the delineation of conversational turns.

2. **Special Token in GPT-2**: GPT-2 has a single special token, `<|endoftext|>` ("end of text"), which is used to mark the boundary between separate documents in the training dataset. It tells the model where one document ends and the next begins, so that the two are treated as distinct and unrelated.

3. **Handling of Special Tokens**: Special tokens are managed outside the regular byte pair encoding (BPE) process. They are recognized and inserted explicitly into the token stream at appropriate places, such as between documents or conversational messages. The handling of these tokens involves special case instructions in the tokenization code, which can identify and replace specific string patterns with the corresponding special token IDs.

4. **Special Tokens in GPT-4**: The GPT-4 tokenizer adds special tokens beyond the "end of text" token found in GPT-2. These include three "fill in the middle" (FIM) tokens marking the prefix, middle, and suffix of a document, plus an "end of prompt" token. They reflect a broader range of uses for special tokens, giving more control over how the model's input and output are structured.

5. **Extending Tokenizers with Special Tokens**: The TikToken library and related tooling from OpenAI allow for the extension of base tokenizers by adding new special tokens. This flexibility is crucial for fine-tuning and adapting pre-trained models to specific tasks or conversational formats, where additional structural markers can significantly enhance performance.

6. **Model Surgery for Special Tokens**: Incorporating new special tokens into a pre-trained model requires adjustments to the model's architecture, specifically the embedding matrix and the final classifier layer. This process, often referred to as "model surgery," involves extending these components to accommodate the additional tokens, typically initializing them with small random values to integrate smoothly into the existing model structure.

In summary, special tokens are a fundamental aspect of modern language models like GPT-2 and GPT-4, enabling more sophisticated handling of textual data and structural nuances. Their implementation and the ability to extend tokenizers with new special tokens provide significant flexibility in customizing models for a wide range of applications, from basic language understanding to complex conversational AI.
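
The special-case handling from point 3 can be sketched as a split step that runs before ordinary encoding. In this minimal sketch `encode_ordinary` is a stand-in that returns raw UTF-8 bytes instead of running BPE, and the function names are hypothetical; the id 50256 is the one GPT-2 assigns to `<|endoftext|>`.

```python
import re

# Special-token table; 50256 is the id GPT-2 uses for <|endoftext|>.
SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_ordinary(text):
    """Stand-in for the BPE encoder: here just the raw UTF-8 bytes."""
    return list(text.encode("utf-8"))

def encode(text, special_tokens=SPECIAL_TOKENS):
    """Encode text, mapping special-token strings straight to their ids."""
    # The capturing group keeps the special strings in the split result.
    pattern = "(" + "|".join(re.escape(s) for s in special_tokens) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            ids.append(special_tokens[part])   # bypasses BPE entirely
        elif part:
            ids.extend(encode_ordinary(part))  # regular path
    return ids

print(encode("hi<|endoftext|>yo"))
```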

---

The discussion of the GPT-2 tokenizer, byte-pair encoding (BPE), and the vocabulary, including special tokens, raises several further points:

1. **Complex Vocabulary Structure**: The GPT-2 tokenizer's vocabulary size is 50,257. This includes 256 raw byte tokens (one per possible byte value, of which only the first 128 correspond to ASCII characters), 50,000 tokens created by BPE merges, and one special token, the "end of text" token: 256 + 50,000 + 1 = 50,257. This structure balances raw byte-level representation against merged sequences that efficiently cover the patterns observed in the training data.

2. **Role of Special Tokens**: Special tokens, particularly the "end of text" token in GPT-2, play a pivotal role in delineating boundaries within the training data. By inserting this token between documents, it signals to the model the conclusion of one document and the start of another, unrelated piece of text. This organization helps the model learn contextual boundaries and manage sequence dependencies effectively, improving its ability to generate coherent and contextually appropriate text.

3. **Encoding and Decoding Process**: The encoding process involves converting text to a sequence of integers representing tokens, while decoding reverses this process. The addition of byte encoding and decoding layers introduces an additional step in both processes, converting between UTF-8 bytes and unicode strings. This layer helps manage the wide array of characters and symbols, ensuring reversible and robust encoding.

4. **Special Case Instructions for Handling Special Tokens**: Beyond the typical BPE process, the tokenizer includes special case instructions for handling special tokens. These instructions ensure that tokens like "end of text" are recognized and handled correctly during tokenization, maintaining their intended function as delimiters or structural markers within the text data.

5. **Extensibility with Special Tokens in GPT-4**: Moving to GPT-4, the tokenizer evolves to include more special tokens and changes in regex patterns for chunking text. This evolution indicates ongoing efforts to refine the tokenizer's capability to understand and manage text structure more effectively. The ability to add and handle new special tokens allows for customization and fine-tuning of the tokenizer to specific tasks or conversational formats, enhancing the model's applicability and performance.

6. **Model Surgery for Adding Special Tokens**: Incorporating new special tokens into a pre-trained model necessitates adjustments to the model's architecture, such as extending the embedding matrix and the final classifier layer. This "model surgery" underscores the dynamic nature of LLMs and their tokenizers, where modifications can be made to tailor the model's understanding and generation capabilities for specific applications.

In conclusion, the intricacies of the GPT-2 and GPT-4 tokenizers, from their comprehensive vocabularies to the handling of special tokens, reflect a nuanced approach to text tokenization. This complexity enables the models to capture and generate human language with remarkable coherence and relevance, showcasing the depth of technology behind OpenAI's language models.
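
The vocabulary arithmetic in point 1 and the model surgery in point 6 can be illustrated with a toy example. This is a minimal sketch, not how any framework does it: plain Python lists stand in for the embedding tensor, and `extend_rows`, the `0.02` scale, and the seed are assumptions chosen for illustration.

```python
import random

def extend_rows(matrix, num_new, scale=0.02, seed=0):
    """Return a copy of `matrix` with `num_new` extra rows of small random values."""
    rng = random.Random(seed)
    width = len(matrix[0])
    new_rows = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(num_new)]
    return [row[:] for row in matrix] + new_rows

# GPT-2 vocabulary arithmetic: raw bytes + BPE merges + one special token.
vocab_size = 256 + 50_000 + 1
print(vocab_size)

# Toy embedding table: 4 existing tokens, embedding width 3.
embedding = [[0.1 * i] * 3 for i in range(4)]
extended = extend_rows(embedding, num_new=2)
print(len(embedding), len(extended))
```

The same extension would apply to the final classifier layer, which also has one entry per token; the existing rows are left untouched so the pre-trained behavior is preserved.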

## Special Tokens

Special tokens in tokenizers and Large Language Models (LLMs) serve as crucial markers or signals that help the models understand and manage the structure and context of the text they process. These tokens are not part of the natural language text itself but are inserted into the text data to convey specific meanings or instructions to the model. Here's a breakdown of their roles:

1. **Segmentation and Delimitation**: Special tokens often mark the beginning or end of a segment of text. For example, an "end of text" token can signal that a document or an input sequence has concluded, helping the model distinguish between separate pieces of text. This is particularly important in training and inference to prevent the model from blending the context of unrelated text segments.

2. **Providing Contextual Cues**: In conversational AI or task-specific models (like question-answering systems), special tokens can indicate the start of a question, the end of an answer, or transitions between turns in dialogue. This helps the model understand the structure of the conversation and respond appropriately.

3. **Control and Instructions**: Some special tokens are used to control the behavior of the model or instruct it on how to generate text. For instance, tokens might be used to switch between languages, dictate a certain style or tone, or prompt the model to perform a specific task like summarization.

4. **Enhancing Model Performance**: By clearly delineating different parts of the data, special tokens can improve a model's ability to learn from and generate coherent and contextually appropriate responses. They help in managing sequence lengths and ensuring that the model's attention mechanism focuses on relevant parts of the input.

5. **Facilitating Fine-tuning and Customization**: Special tokens allow developers to fine-tune pre-trained models for specific applications. By adding new special tokens and adjusting the model accordingly, developers can tailor the model's understanding and generation capabilities to suit particular needs or domains.

6. **Supporting Meta-information**: In some cases, special tokens embed meta-information within the text data, such as speaker IDs in dialogues or tags indicating the genre of text. This meta-information can be leveraged by the model to generate more nuanced and context-aware outputs.

In essence, special tokens are a versatile tool in the toolkit of NLP practitioners, offering a way to enhance and direct the capabilities of tokenizers and LLMs. Their strategic use enables more sophisticated handling of text data, supporting a wide range of applications from general-purpose language understanding to specialized conversational agents and beyond.

## Aside: Fill in the Middle

The concept of "Fill in the Middle" (FIM) in Large Language Models (LLMs) refers to a training and inference strategy where the model generates text that connects two given segments, a prefix and a suffix. Standard left-to-right language modeling conditions only on the preceding context; FIM additionally conditions on what comes after the gap. In the paper linked above, this is achieved without changing the architecture: a span is cut from the middle of a training document and moved to the end, with special sentinel tokens marking the pieces, so that an ordinary autoregressive model learns to infill.

### Key Points of Fill in the Middle:

1. **Contextual Bridging**: The model receives two pieces of text as input: an initial segment and a concluding segment. Its task is to generate one or more sentences that bridge these segments in a logical, coherent, and contextually appropriate manner. This requires a deep understanding of the content and context of both input segments.

2. **Training Methodology**: To train a model on the FIM task, the training data must be structured to support this kind of generation. This could involve taking longer passages of text and segmenting them into beginning and ending parts, with the middle portion serving as the target for generation during training. The model learns to predict this middle section based on the surrounding context.

3. **Applications**: FIM can be particularly useful in scenarios where generating contextually rich and coherent content is necessary. This includes creative writing assistance, where a writer needs to connect two plot points; content generation, where an article requires a detailed explanation between an introduction and conclusion; and conversational AI, where the model needs to maintain the flow of dialogue.

4. **Enhanced Contextual Understanding**: FIM tasks encourage models to develop a more nuanced understanding of context and narrative structure. Instead of focusing solely on what comes next, the model must consider both what came before and what comes after the target generation, ensuring that the generated text serves as a natural bridge.

5. **Challenges**: Successfully implementing a FIM approach in LLMs presents several challenges, including ensuring the relevance and coherence of the generated text, managing the potential for increased computational complexity, and developing training datasets that effectively support learning this task.

In summary, "Fill in the Middle" represents an advanced use case for LLMs, pushing beyond straightforward predictive text generation to embrace more complex narrative and contextual understanding tasks. By focusing on bridging two text segments, FIM tasks help develop models that can generate more contextually rich, coherent, and nuanced content, offering promising applications in various domains of natural language processing and generation.
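
The training-data transformation described in point 2 can be sketched as below. This is a character-level illustration under stated assumptions: `to_fim_example` is a hypothetical helper, the cut points are chosen uniformly at random, and the sentinel strings (named after the GPT-4 special tokens) are written as plain text, whereas a real tokenizer would map each to a single special token id.

```python
import random

def to_fim_example(document, rng):
    """Cut a document into prefix/middle/suffix and move the middle to the end."""
    # Two random cut points split the text into three spans.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees prefix and suffix first, then learns to produce the middle.
    return (
        "<|fim_prefix|>" + prefix
        + "<|fim_suffix|>" + suffix
        + "<|fim_middle|>" + middle
    )

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```

At inference time the same layout is used with the middle left empty, so that generating the continuation amounts to filling the gap.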

## `replace_control_characters` function

The `replace_control_characters` function iterates over a given string `s` and replaces each control character with its Unicode escape sequence, while leaving all other characters unchanged. Control characters are non-printing characters that control or format the display of text, such as newline (`\n`), carriage return (`\r`), tab (`\t`), and others defined in the Unicode standard under various categories starting with "C" (e.g., "Cc" for control characters, "Cf" for format characters).

### How It Works:

- **Iterate Through Each Character**: The function loops through each character in the input string `s`.
- **Check Character Category**: It uses the `unicodedata.category(ch)` function to determine the Unicode category of each character. If the first letter of the category is "C", the character is identified as a control character.
- **Replace Control Characters**: For control characters, instead of adding them directly to the `chars` list, the function converts them into a Unicode escape sequence (`\uxxxx`, where `xxxx` represents the Unicode code point of the character in hexadecimal format). This conversion uses the `ord(ch)` function to get the character's Unicode code point and formats it as a hexadecimal value zero-padded to at least four digits.
- **Build and Return the Modified String**: Non-control characters are added to the `chars` list unchanged. The function then returns a new string composed of the processed characters, effectively escaping all control characters while preserving the rest of the text.

### Example Issue Addressed by This Function:

Consider the string `Hello, world!\nThis is a test.\r\n` containing both visible characters and control characters like newline (`\n`) and carriage return plus newline (`\r\n`). If this string were to be displayed in a context where control characters could cause formatting issues (e.g., a single-line text box, logging system, or file names), it might lead to unwanted line breaks or other display problems.

By applying the `replace_control_characters` function to this string, all control characters would be replaced with their Unicode escape sequences, resulting in a string like `Hello, world!\u000aThis is a test.\u000d\u000a`. This transformed string can then be safely displayed or processed in environments where control characters could cause issues, as all control characters have been converted to a visible representation that doesn't affect text formatting.
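
A version of the function consistent with the description above might look like this. It is a sketch reconstructed from the steps listed under "How It Works", so details such as the type hints are assumptions.

```python
import unicodedata

def replace_control_characters(s: str) -> str:
    """Escape every character whose Unicode category starts with "C"."""
    chars = []
    for ch in s:
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch)  # printable character, keep as-is
        else:
            chars.append(f"\\u{ord(ch):04x}")  # e.g. newline -> \u000a
    return "".join(chars)

print(replace_control_characters("Hello, world!\nThis is a test.\r\n"))
```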

## Hugging Face tokenizers

In the context of Hugging Face tokenizers, several files are specifically associated with the tokenizer's configuration and operation. These files enable the tokenizer to encode input text into tokens understood by the model and decode tokens back into human-readable text. Here are the files related to the tokenizer and their purposes:

1. **`vocab.json`**: This file contains the mapping from tokens to their integer IDs. It's essential for both encoding (text to tokens) and decoding (tokens to text) processes.

2. **`merges.txt`**: In tokenizers that use Byte Pair Encoding (BPE) or similar algorithms (e.g., GPT-2), this file stores the merge operations that progressively combine individual characters or character sequences into tokens. It's crucial for the tokenizer to know which sequences of characters should be merged to form tokens.

3. **`tokenizer.json`**: This is a comprehensive file that can include the tokenizer's vocabulary, merges, and additional configurations in a single JSON file. It's especially useful for portability and ease of use, as it encapsulates the tokenizer's entire configuration.

4. **`tokenizer_config.json`**: This file contains configuration details specific to the tokenizer, such as whether to lower case the input text, the special tokens used, and other tokenizer settings. It helps in setting up the tokenizer correctly for encoding and decoding tasks.

5. **`special_tokens_map.json`**: This file maps special tokens (like padding token, unknown token, etc.) to their string representations. Special tokens are used for specific purposes, such as separating sentences, marking the beginning or end of a text, and padding sequences to a uniform length.

6. **`config.json`**: While primarily associated with the model's configuration, this file can also contain tokenizer-related settings, especially when the model and tokenizer are tightly integrated.

The files `vocab.json`, `merges.txt`, `tokenizer.json`, `tokenizer_config.json`, and `special_tokens_map.json` are specifically associated with the tokenizer. They work together to ensure the tokenizer can accurately and efficiently convert text to and from the token representations needed for the model to process.

Other files listed, like model weights (`pytorch_model.bin`, `tf_model.h5`, `flax_model.msgpack`, `rust_model.ot`), ONNX models, and TFLite models, are related to the model itself rather than the tokenizer. The `README.md` file provides documentation, and files like `config.json` and `generation_config.json` might contain both model and tokenizer configurations but are primarily associated with the model's operational parameters.
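
To make the roles of `vocab.json` and `merges.txt` concrete, here is a minimal sketch that parses tiny in-memory stand-ins for the two files using only the standard library. The contents are made up, and the layout (a `#version` header line followed by one space-separated pair per line) is assumed to match GPT-2-style checkpoints.

```python
import json

# Tiny in-memory stand-ins for the two files (contents are made up).
vocab_json = '{"h": 0, "e": 1, "l": 2, "o": 3, "ll": 4, "he": 5}'
merges_txt = "#version: 0.2\nl l\nh e\n"

# vocab.json: token string -> integer id (and the reverse for decoding).
vocab = json.loads(vocab_json)
id_to_token = {i: tok for tok, i in vocab.items()}

# merges.txt: one merge per line, two symbols separated by a space;
# the line order is the merge priority.
lines = merges_txt.splitlines()
if lines and lines[0].startswith("#version"):
    lines = lines[1:]  # skip the header line
merges = [tuple(line.split()) for line in lines if line]
ranks = {pair: rank for rank, pair in enumerate(merges)}

print(vocab["ll"], id_to_token[5])
print(ranks)
```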

---

When training your own tokenizer or reusing an existing tokenizer with Hugging Face, the required files depend on the complexity of your tokenizer and whether you're making modifications or just leveraging it as-is. If you're fine-tuning a model (for example with a parameter-efficient method such as QLoRA) and plan to use an existing tokenizer, the minimal set of tokenizer-specific files you would need includes:

1. **`vocab.json`**: This is essential as it provides the mapping of tokens to their respective IDs. Even if you're not modifying the tokenizer, you need this file to ensure that the text can be correctly tokenized into the format expected by the model.

2. **`merges.txt`** (for BPE tokenizers like GPT-2 or GPT-3): If your tokenizer uses a Byte Pair Encoding mechanism or similar strategies, this file is necessary. It details how to merge pairs of characters or character sequences into larger tokens.

3. **`tokenizer.json`**: For some tokenizers, this file can act as a standalone configuration file that includes the vocabulary, merges, and other tokenizer settings. In cases where `tokenizer.json` is provided, it might be the only file you need, as it can encapsulate the entire tokenizer setup.

Optionally, if your tokenizer utilizes or defines special tokens beyond the standard set or if you've made modifications to how special tokens are handled:

4. **`special_tokens_map.json`**: This file specifies any special tokens (such as start-of-sequence, end-of-sequence, padding, etc.) that your model expects. It's useful for ensuring that these tokens are consistently represented and treated within your pipeline.

5. **`tokenizer_config.json`**: Contains additional configuration details specific to the tokenizer, such as whether to strip accents, use lowercasing, or other preprocessing steps. This file helps in customizing the tokenizer's behavior to match the model's requirements.

For a fine-tuning task, especially when reusing an existing tokenizer, you may not need to alter `vocab.json`, `merges.txt`, or `tokenizer.json`, but ensuring their presence is crucial for the tokenizer to function correctly. However, depending on the specific changes you're making or the nature of your task, you might need to adjust `special_tokens_map.json` and `tokenizer_config.json` to better suit your fine-tuning or application needs.

In summary, the minimal tokenizer-specific Hugging Face files needed would likely be `vocab.json` and possibly `merges.txt` if you're using a BPE tokenizer, with `tokenizer.json` serving as a comprehensive file that might replace the need for separate `vocab.json` and `merges.txt` files. `special_tokens_map.json` and `tokenizer_config.json` are optional and depend on whether you need to customize or modify the tokenizer's behavior and special tokens.

## SentencePiece

[https://github.com/google/sentencepiece](https://github.com/google/sentencepiece)

[Sentencepiece training options](https://github.com/google/sentencepiece/blob/master/doc/options.md)

Excellent articles for background and intuition:

- [Deep Learning, NLP, and Representations](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

- [A Deep Dive into the Wonderful World of Preprocessing in NLP](https://web.archive.org/web/20201101005930/https://mlexplained.com/2019/11/06/a-deep-dive-into-the-wonderful-world-of-preprocessing-in-nlp/)


- [Understanding SentencePiece ([Under][Standing][_Sentence][Piece])](https://colabdoge.medium.com/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08)


---

SentencePiece is a software library for unsupervised text tokenization and detokenization mainly used in neural network models for natural language processing (NLP). It's particularly popular in the context of machine translation, text generation, and other tasks that require handling text data efficiently at scale. One of the key features of SentencePiece is its ability to treat the input text as a raw stream of Unicode characters, which means it doesn't rely on pre-tokenized or whitespace-separated words. This approach offers several advantages:

### Language Agnostic
SentencePiece is designed to be language-agnostic, making it particularly useful for processing text from languages where whitespace doesn't serve as a clear word delimiter, such as in Chinese or Japanese. It can be used for virtually any language, facilitating the development of multilingual models.

### Subword Tokenization
SentencePiece supports subword tokenization methods like Byte Pair Encoding (BPE) and Unigram Language Model. These methods break down words into smaller units (subwords or characters), allowing the model to handle unknown words more gracefully, share representations among related words, and reduce the vocabulary size, which, in turn, can lead to more efficient training and inference.

### Integration and Usage
SentencePiece has been widely adopted in major NLP frameworks and projects. It's particularly favored in scenarios where controlling the tokenization process directly within the model training and inference pipeline is crucial. By integrating SentencePiece, developers can ensure consistency in how text is tokenized and detokenized, avoiding issues that might arise from discrepancies between training and deployment environments.

### Benefits of Subword Tokenization
- **Handling of Unknown Words**: By breaking words into subwords or characters that are seen during training, the model can handle words it has never seen before.
- **Efficiency**: A subword vocabulary is much smaller than a vocabulary of whole words, which shrinks the embedding and output layers and so reduces the model's memory requirements.
- **Flexibility**: It allows for fine-grained control over the tokenization process, making it suitable for a wide range of languages and tasks.

### Customization
SentencePiece allows for customization of its tokenization process, letting users specify the vocabulary size, character coverage, and the specific tokenization algorithm (e.g., BPE or Unigram). This level of control enables users to tailor the tokenization to the specific needs of their applications and datasets.

### Conclusion
SentencePiece represents a powerful and flexible tool in the NLP toolkit, enabling efficient and effective text processing for a wide range of languages and tasks. Its language-agnostic approach to tokenization, along with support for subword models, helps tackle common challenges such as unknown words and large vocabularies, making it a popular choice among researchers and practitioners in the NLP community.

## A key distinction between TikToken and SentencePiece-based tokenizers

The explanation highlights a key distinction between TikToken and SentencePiece-based tokenizers, particularly in how they handle the initial text representation and the granularity of tokenization:

### TikToken Approach:
- **Initial Representation**: TikToken operates at the byte level, initially encoding text using UTF-8, which results in a sequence of bytes. This is particularly useful for handling a wide variety of languages and special characters uniformly.
- **Tokenization Granularity**: It merges bytes based on their frequency of occurrence next to each other in the training data. The Byte Pair Encoding (BPE) or similar algorithms are applied to these byte sequences, allowing for efficient handling of unknown words and subword tokenization.
- **Character Coverage**: TikToken's method does not directly consider character coverage or rare code points, as the encoding and merging occur at the byte level.

### SentencePiece Approach:
- **Initial Representation**: SentencePiece operates directly on Unicode code points. It does not convert the text to bytes but works with the text's original Unicode representation. This allows it to more directly consider the linguistic elements of the text as represented by their code points.
- **Tokenization Granularity**: SentencePiece applies BPE or similar algorithms directly to these code points. This means that the merging operations are conducted on a more linguistically meaningful level, potentially offering advantages for certain languages or specific character sets.
- **Character Coverage and Byte Fallback**: SentencePiece introduces the concept of character coverage to manage rare code points. If certain code points are deemed too rare (based on the character coverage hyperparameter), they may either be mapped to an unknown token or, if byte fallback is enabled, encoded into UTF-8 bytes and then tokenized. This provides a flexible mechanism for handling text's more infrequent linguistic or symbolic elements.
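
The byte fallback idea can be sketched in a few lines. This is a simplification that works on single characters before any merges, not SentencePiece's actual implementation; the function `encode_with_fallback` and the set of "known" characters are assumptions. The `<0x..>` naming mirrors how SentencePiece writes its byte tokens.

```python
def encode_with_fallback(text, known_chars, byte_fallback=True):
    """Keep known code points; map rare ones to byte tokens or to <unk>."""
    tokens = []
    for ch in text:
        if ch in known_chars:
            tokens.append(ch)
        elif byte_fallback:
            # One token per UTF-8 byte of the uncovered character.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
        else:
            tokens.append("<unk>")
    return tokens

# Hypothetical set of characters that character coverage kept in the vocab.
known = set("helo ")

print(encode_with_fallback("hello 😄", known))
print(encode_with_fallback("hello 😄", known, byte_fallback=False))
```

With fallback enabled the emoji becomes four byte tokens and can be reconstructed exactly; without it, the emoji collapses to `<unk>` and the information is lost.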

### Key Differences:
1. **Level of Operation**: TikToken begins with a byte-level representation of text, while SentencePiece starts with Unicode code points. This fundamental difference affects how each tokenizer views and processes the input text.
2. **Handling of Rare Characters**: SentencePiece's character coverage and byte fallback mechanisms offer a nuanced approach to managing rare characters or symbols, translating them into a mix of known tokens and byte-level encodings if necessary. In contrast, TikToken's byte-level approach inherently handles a wide range of characters without distinguishing between common and rare characters at the encoding stage.
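
The difference in starting point is easy to see with plain Python. The sample string is arbitrary; the snippet only shows the two initial representations, before any merges are applied.

```python
text = "h\u00e9llo \U0001F604"  # "héllo 😄"

# SentencePiece-style starting point: one item per Unicode code point.
code_points = [ord(ch) for ch in text]

# TikToken-style starting point: one item per UTF-8 byte, always in 0..255.
utf8_bytes = list(text.encode("utf-8"))

print(len(code_points), code_points)
print(len(utf8_bytes), utf8_bytes)
```

The byte view needs only 256 base symbols but produces a longer sequence, while the code point view is shorter but draws from a much larger set of possible symbols.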

### Conclusion:
- **TikToken** might be considered more straightforward and uniform in its approach, treating all text as sequences of bytes initially, which might simplify processing but potentially abstracts away some linguistic nuances.
- **SentencePiece** offers a nuanced, linguistically aware approach by working directly with Unicode code points and providing mechanisms to handle rare characters, which can be particularly beneficial for languages with complex scripts or when fine-grained control over tokenization behavior is needed.

Both approaches have their strengths and are chosen based on the specific requirements of the NLP tasks, languages involved, and preferences for handling linguistic diversity and character rarity.

## Discussion: SentencePiece library

The transcript provides a detailed overview of SentencePiece and its comparison to TikToken, including their applications, configurations, and differences in handling tokenization. Here are the most interesting takeaways:

1. **SentencePiece's Versatility**: SentencePiece is highlighted for its efficiency in both training and inference, supporting multiple tokenization algorithms, including Byte Pair Encoding (BPE). It's widely used in various language models like LLaMA and Mistral, indicating its robustness and adaptability.

2. **Direct Operation on Code Points**: Unlike TikToken, which first encodes text to bytes, SentencePiece operates directly on Unicode code points. This allows for tokenization that is more closely aligned with the linguistic elements represented by those code points, offering potentially more nuanced handling of languages with complex scripts.

3. **Handling of Rare Code Points**: SentencePiece has a unique approach to managing rare code points through a character coverage hyperparameter. Rare code points can be mapped to a special unknown token or, with byte fallback enabled, encoded into UTF-8 bytes and then tokenized. This mechanism provides flexibility in dealing with text's less frequent elements.

4. **Configuration Complexity and Historical Baggage**: The transcript notes that SentencePiece has accumulated historical baggage over time, resulting in a complex array of configuration options. While this flexibility allows SentencePiece to cater to a wide range of requirements, it can also make the tool daunting for new users due to its configuration complexity and the need for careful documentation review.

5. **Pre-processing and Normalization Rules**: SentencePiece's approach to pre-processing and normalization is mentioned as potentially overcomplicated, especially in the context of language models where preserving the raw form of data is often preferred. The tool includes numerous options for text simplification and normalization, which may not always be desirable for training modern NLP models.

6. **Concept of Sentences**: SentencePiece treats sentences as individual training examples, with various options for managing sentence length and shuffling. This concept may not align perfectly with the needs of language models, which often benefit from treating text as continuous streams without hard segmentations.

7. **Practical Application and Customization**: Despite its complexities, SentencePiece remains a powerful tool for training custom tokenizers, offering detailed control over tokenization behavior. However, achieving desired results can require significant effort to understand and appropriately set the multitude of available options.

In summary, SentencePiece stands out for its direct operation on Unicode code points, sophisticated handling of rare characters, and broad configurability. However, its complexity and the extensive array of options highlight a trade-off between versatility and user-friendliness, emphasizing the importance of clear documentation and thoughtful configuration in leveraging SentencePiece effectively for tokenization tasks.

## SentencePiece is widely used for two main reasons

The sentence highlights SentencePiece's notable capabilities in the context of Byte Pair Encoding (BPE) tokenizers, distinguishing it from TikToken. Specifically, it states that SentencePiece is widely used for two main reasons:

1. **Efficiency in Training**: SentencePiece can train new tokenizers from scratch. Training involves analyzing a large corpus of text to identify the most common pairs of characters (or tokens) and iteratively merging them to form a vocabulary of more complex tokens. This process is crucial for developing a tokenizer that is optimized for the specific linguistic patterns and vocabulary of the corpus it was trained on.

2. **Efficiency in Inference**: Beyond just training, SentencePiece can also perform inference efficiently. Inference, in this context, refers to the tokenizer's ability to take unseen text and break it down into the tokens defined during training. Efficient inference is critical for applying the tokenizer in practical NLP tasks, such as input preparation for language models during text generation, translation, or classification tasks.

The phrase "unlike TikToken" refers to the point made in the video that tiktoken is distributed as inference code: it encodes and decodes text with pretrained vocabularies and is not designed for training new ones. This makes SentencePiece a common choice for projects that need both to train a custom BPE tokenizer and to apply it to text afterwards.
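
The two phases described above can be sketched in plain Python. This is a toy version operating on code points, not SentencePiece's optimized implementation; the function names are hypothetical, and new token ids start above the Unicode range (an assumption made here so they can never collide with a real code point).

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Training: repeatedly fuse the most frequent adjacent pair."""
    ids = [ord(ch) for ch in text]
    merges = {}
    next_id = 0x110000  # first value above the largest Unicode code point
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges

def encode(text, merges):
    """Inference: replay the learned merges, in training order, on new text."""
    ids = [ord(ch) for ch in text]
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

merges = train_bpe("aaabdaaabac", num_merges=3)
print(len("aaabdaaabac"), "->", len(encode("aaabdaaabac", merges)))
```

Training produces the merge table once; inference only replays it, which is why a library can ship inference alone as long as a pretrained merge table is available.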

## Pretokenization

Pretokenization is the process of splitting text into initial chunks (pretokens) before applying more sophisticated tokenization algorithms, such as Byte Pair Encoding (BPE), WordPiece, or Unigram language models. Pretokenization usually involves breaking down the text into words, spaces, and possibly punctuation, serving as a first step to simplify and standardize the input for further processing.

The key difference is that many implementations of BPE, WordPiece, and Unigram tokenizers expect this initial splitting step, whereas SentencePiece (which itself implements BPE and Unigram) can be trained on and applied to raw text without a separate pretokenization pass. It treats the input as a sequence of Unicode code points, with whitespace handled as an ordinary symbol, so splitting and subword tokenization happen in one process.
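
A pretokenization step is often just a regular expression. The pattern below is a simplified, hypothetical stand-in in the spirit of GPT-2-style splitting; real patterns typically use Unicode categories through the third-party `regex` module, while this sketch sticks to the standard library and ASCII classes.

```python
import re

# Simplified, hypothetical pattern: an optional leading space followed by a run
# of letters, digits, or other symbols; otherwise a run of whitespace.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    """Split text into chunks; later merges never cross chunk boundaries."""
    return PRETOKEN_PATTERN.findall(text)

text = "OpenAI's GPT-3 is amazing!"
chunks = pretokenize(text)
print(chunks)

# The split is lossless: joining the chunks restores the input.
print("".join(chunks) == text)
```

Because a subword algorithm is then run inside each chunk separately, letters, digits, and punctuation can never be merged into a single token under this scheme.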

### Example:

Let's clarify with an example using the text "OpenAI's GPT-3 is amazing!"

**With Pretokenization (e.g., for BPE):**
1. **Pretokenization:** The text is initially split into words and punctuation:
   - "OpenAI's", "GPT-3", "is", "amazing", "!"
2. **BPE Tokenization:** These tokens are then further split or combined into subwords based on the BPE algorithm's learned merges, potentially resulting in something like:
   - "Open", "AI", "'", "s", "G", "PT-3", "is", "amaz", "ing", "!"

**Without Pretokenization (e.g., SentencePiece):**
1. **Direct Tokenization:** SentencePiece processes the raw text directly without explicitly separating words first. It might break down the text into subwords and characters based on its own learned patterns, potentially resulting in a different set of tokens that could include pieces of words, punctuation, or characters as tokens:
   - "▁Open", "AI", "'", "s", "▁GPT", "-", "3", "▁is", "▁amazing", "!"

Here, "▁" (U+2581) is the meta symbol that SentencePiece substitutes for whitespace. It marks word boundaries inside the tokens themselves, which allows the original text to be reconstructed from the tokens.
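
The reversibility that the "▁" marker provides can be sketched as follows. The helper names are hypothetical, and the sketch assumes a dummy space is always prepended before marking, and that the input text does not itself contain "▁".

```python
WS = "\u2581"  # "▁", the whitespace meta symbol

def mark_spaces(text):
    """Prepend a dummy space, then make every space a visible symbol."""
    return (" " + text).replace(" ", WS)

def detokenize(pieces):
    """Join the pieces, turn markers back into spaces, drop the dummy prefix."""
    return "".join(pieces).replace(WS, " ")[1:]

pieces = ["▁Open", "AI", "'", "s", "▁GPT", "-", "3", "▁is", "▁amazing", "!"]

print(mark_spaces("OpenAI's GPT-3 is amazing!"))
print(detokenize(pieces))
```

Since the spaces travel inside the tokens, detokenization is a plain string join with no language-specific rules about where spaces belong.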

In summary, pretokenization is an initial step to simplify text into a consistent format for certain tokenizers, but SentencePiece's design allows it to bypass this step, working directly on the full text to produce tokens.

## `Protobuf`

Protocol Buffers (Protobuf) are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data, similar to XML or JSON but smaller, faster, and simpler. In Python, as well as in other programming languages, Protobuf is used for a variety of purposes:

1. **Cross-Language Services Communication**: Protobuf supports generating code in multiple languages from a single `.proto` schema. This feature is particularly useful in microservices architectures where services are written in different languages but need to exchange data in a robust and efficient manner.

2. **Data Storage and Transmission**: Due to its compact size and efficiency in serialization and deserialization, Protobuf is ideal for storing and transmitting data, especially in environments where bandwidth or storage capacity is limited.

3. **APIs and RPC Systems**: Protobuf can be used to define APIs and remote procedure call (RPC) interfaces. With tools like gRPC, developers can automatically generate client and server code from Protobuf definitions, facilitating easier and more reliable system integration.

4. **Configuration Files**: The structured nature of Protobuf makes it suitable for use in configuration files. Unlike traditional configuration files that might be prone to errors due to their free-form nature, Protobuf ensures that configurations are both structured and type-checked.

5. **Efficient Data Logging and Exchange in Distributed Systems**: Protobuf's compact binary format is beneficial for logging and exchanging data in distributed systems where efficiency is crucial for performance and scalability.

6. **Machine Learning and Data Analysis**: In machine learning workflows, Protobuf is used to serialize and deserialize datasets, model parameters, and metadata. Its efficiency helps in quickly loading and saving large volumes of data.

The main advantages of using Protobuf in these use cases include its strong typing, backward and forward compatibility, efficiency in both size and speed, and the automatic generation of data access classes, which reduce boilerplate code and potential for human error.

---

To give you a simple example of using Protocol Buffers in Python, let's create a basic `Person` data structure, compile it, and then write some Python code to serialize and deserialize it.

### Step 1: Define the Protobuf Schema

First, you need to define your data structure in a `.proto` file. Let's create a file named `person.proto`:

```proto
syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}
```

This defines a `Person` message with three fields: `name`, `id`, and `email`.

### Step 2: Compile the `.proto` File

Next, you'll need to compile this `.proto` file to generate the Python classes. You'll need the Protocol Buffers compiler, `protoc`, installed on your machine for this step, and the `protobuf` Python package (`pip install protobuf`) to run the generated code.

Run the following command in the terminal in the directory where your `person.proto` file is located:

```sh
protoc --python_out=. person.proto
```

This will generate a `person_pb2.py` file, which contains the generated Python classes.

### Step 3: Python Code to Serialize and Deserialize

Now, you can write Python code to create a `Person`, serialize it to bytes (despite its name, `SerializeToString` returns a `bytes` object in Python 3), and then deserialize those bytes back into a `Person` object. Create a new Python file in the same directory, for example, `main.py`, and add the following code:

```python
from person_pb2 import Person

# Create a new Person
person = Person(name="John Doe", id=1234, email="johndoe@example.com")

# Serialize the Person to a binary string
person_binary = person.SerializeToString()
print(f"Serialized person: {person_binary}")

# Deserialize the binary string back into a Person
person_new = Person()
person_new.ParseFromString(person_binary)
print(f"Deserialized person: name={person_new.name}, id={person_new.id}, email={person_new.email}")
```

Run this Python script:

```sh
python main.py
```

You should see output showing the serialized binary string and the deserialized `Person` object's fields.
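
To make the serialized bytes less opaque, here is a standard-library sketch of the Protobuf wire format for the same three fields. It hand-encodes only what this message needs (non-negative varints and length-delimited strings) and is not a replacement for the generated classes; the helper names are hypothetical. For this simple message the result is expected to match what `SerializeToString` produces, though that is an assumption not checked here.

```python
import json

def varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def tag(field_number, wire_type):
    """A field key is the field number and wire type packed into one varint."""
    return varint((field_number << 3) | wire_type)

def string_field(field_number, value):
    data = value.encode("utf-8")
    return tag(field_number, 2) + varint(len(data)) + data  # 2 = length-delimited

def int_field(field_number, value):
    return tag(field_number, 0) + varint(value)  # 0 = varint

wire = (
    string_field(1, "John Doe")
    + int_field(2, 1234)
    + string_field(3, "johndoe@example.com")
)
as_json = json.dumps({"name": "John Doe", "id": 1234, "email": "johndoe@example.com"})

print(wire.hex())
print(len(wire), "bytes on the wire vs", len(as_json.encode("utf-8")), "bytes of JSON")
```

Field names never appear in the output, only field numbers, which is a large part of why the binary form is more compact than the equivalent JSON.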

This simple example demonstrates defining a data structure using Protocol Buffers, generating the corresponding Python classes, and using those classes to serialize and deserialize data.