# Building Your Own LLM (Tokenizer Edition)

## Byte pair encoding

Suppose the data to be encoded is

```
aaabdaaabac
```

Treating each character as one token, that is 11 tokens.

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:

```
ZabdZabac
Z=aa
```
Then the process is repeated with byte pair "ab", replacing it with "Y":

```
ZYdZYac
Y=ab
Z=aa
```
The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":

```
XdXac
X=ZY
Y=ab
Z=aa
```

The data is now 5 tokens long, down from 11.

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order.
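The whole walkthrough can be reproduced with a minimal sketch. It applies the three replacements from the example in the order given, rather than choosing pairs automatically, and then undoes them in reverse order. The helper names `merge` and `expand` are illustrative, not taken from any library.

```python
def merge(tokens, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both symbols of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


def expand(tokens, table):
    """Undo the merges by applying the replacement table in reverse order."""
    for new_symbol, pair in reversed(table):
        out = []
        for t in tokens:
            out.extend(pair if t == new_symbol else [t])
        tokens = out
    return tokens


# Replacement table from the example above, in the order the merges were made.
table = [("Z", ("a", "a")), ("Y", ("a", "b")), ("X", ("Z", "Y"))]

tokens = list("aaabdaaabac")
for new_symbol, pair in table:
    tokens = merge(tokens, pair, new_symbol)
    print("".join(tokens))  # ZabdZabac, then ZYdZYac, then XdXac

encoded = "".join(tokens)
decoded = "".join(expand(tokens, table))
```

Note that after the first merge, "Za" and "ab" both occur twice, so an implementation that picks the most frequent pair automatically needs a tie-breaking rule and may build a different table of the same final length.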

[Reference](https://en.wikipedia.org/wiki/Byte_pair_encoding)

In [17]:
# download text from wikipedia to make a dataset for training a language model
import wikipedia as wiki
import os
import glob

wiki.set_lang("en")

topics: list = ["Python (programming language)", "Attention Is All You Need", "Harry Potter", "The Big Bang Theory"]

for topic in topics:
    try:
        if os.path.exists("data/{}.txt".format(topic.replace(" ", "_"))):
            print("Skipping \"{}\" as it already exists".format(topic))
            continue
        page = wiki.page(topic, auto_suggest=False)
        content = page.content
        os.makedirs("data", exist_ok=True)
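        with open("data/{}.txt".format(topic.replace(" ", "_")), "w", encoding="utf-8") as f:
            f.write(content)
        print("Downloaded \"{}\"".format(topic))
    except Exception as e:
        # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the loop
        print("Failed to download \"{}\": {}".format(topic, e))
        continue

# Concatenate every downloaded article into a single training file
# (sorted so the order is the same on every run)
data_paths = sorted(glob.glob("data/*.txt"))
training_data_path = os.path.join("./", "training_data.txt")

with open(training_data_path, "w", encoding="utf-8") as f:
    for path in data_paths:
        with open(path, "r", encoding="utf-8") as f2:
            content = f2.read()
            f.write(content)

with open(training_data_path, "r", encoding="utf-8") as f:
    full_content = f.read()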
    print("File: {} has {} characters".format(training_data_path, len(full_content)))



Skipping "Python (programming language)" as it already exists
Skipping "Attention Is All You Need" as it already exists
Skipping "Harry Potter" as it already exists
Skipping "The Big Bang Theory" as it already exists
File: ./training_data.txt has 148703 characters


In [19]:
# Encode the text as UTF-8 bytes (non-ASCII characters take more than one byte)
unicode_text = full_content.encode("utf-8")
print("number of UTF-8 bytes: {} bytes".format(len(unicode_text)))


number of UTF-8 bytes: 148854 bytes
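The encoded length (148854) exceeds the character count (148703) by 151 because UTF-8 uses more than one byte for each non-ASCII character. A small sketch with a made-up sample string (not taken from the dataset) shows the effect, and also that every encoded value falls in the range 0 to 255:

```python
# Hypothetical sample: "naïve café", written with escapes to keep the source ASCII-only.
text = "na\u00efve caf\u00e9"

encoded = text.encode("utf-8")

num_chars = len(text)        # counts Unicode code points
num_bytes = len(encoded)     # counts UTF-8 bytes
byte_values = list(encoded)  # iterating over bytes yields integers in 0..255

print(num_chars, num_bytes)
print(byte_values)
```

The two accented letters take two bytes each, so the byte count is two higher than the character count.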


[Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

```
Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These implementations would require including the full space of Unicode symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in sub-optimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.
```
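One way to picture "not merging across character categories" is to split the text into chunks before running BPE, and then apply merges only within each chunk. The sketch below uses a simplified, ASCII-only pattern written for illustration. It is an assumption, not the pattern GPT-2 actually uses, and the names `PRETOKENIZE` and `pretokenize` are hypothetical.

```python
import re

# Hypothetical, ASCII-only approximation of category-based splitting:
# letters, digits, and other non-space symbols each form their own chunk,
# and a single leading space may attach to the chunk that follows it.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into chunks; BPE merges would then be learned inside each chunk only."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("dog. dog! dog?")
print(chunks)

# Each chunk is then turned into bytes before any merging takes place.
byte_chunks = [c.encode("utf-8") for c in chunks]
```

With this split, punctuation never shares a chunk with "dog", so merges such as "dog." or "dog!" cannot be learned, while the leading space stays attached to the word.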