In [1]:
import copy
import json
import os
import regex as re

import sentencepiece as spm
import tiktoken

### **"Much glory awaits someone who can delete the need for tokenization" -- (Andrej Karpathy)**

The tokenizer is a completely separate, independent module from the LLM. It has its own training dataset of text (which could be different from that of the LLM), on which the vocabulary is trained using the Byte Pair Encoding (BPE) algorithm. It then translates back and forth between raw text and sequences of tokens. The LLM only ever sees the tokens and never deals with text directly.

<div align="center">
  <img src="../assets/tokenizer-llm-diagram.jpg" width="500"/>
</div>

# 1. Strings in Python

According to Python's documentation, "strings are immutable *sequences* of *Unicode code points*". The function `ord()` returns the Unicode code point of a character, and `chr()` returns the character for a given code point. Unicode text is stored and transmitted as binary data *using one of several encodings*: `UTF-8`, `UTF-16`, `UTF-32`, among others. Of these, `UTF-8` is the most widely used, in part due to its backwards compatibility with ASCII. The string method `encode()` converts a string into bytes, and the bytes method `decode()` converts bytes back into a string.
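As a quick check of the functions named above, the following sketch round-trips a character through `ord()` and `chr()` and compares how many bytes the same string takes under the three encodings. The little-endian codec names are chosen here only so that no byte-order mark is counted; the variable names are illustrative.

```python
text = '안녕하세요'

# ord() and chr() are inverses of each other
code_point = ord(text[0])
assert chr(code_point) == text[0]

# Byte length of the same string under different encodings
# (assumption: the "-le" variants are used so that no byte-order mark is added)
sizes = {name: len(text.encode(name)) for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
print('Code point of the first character: ', code_point)
print('Bytes per encoding: ', sizes)

# encode() and decode() are inverses as long as the same encoding is used
round_trip = text.encode('utf-8').decode('utf-8')
print('Round trip OK: ', round_trip == text)
```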

`UTF-8` means *Unicode Transformation Format - 8 bit* and supports all valid Unicode code points using a *variable-width encoding* of one to four one-byte code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. In the following table, the characters `u` to `z` are replaced by the bits of the code point, from the positions U+uvwxyz:

<div align="center">
  <img src="../assets/utf8-encoding.jpg" width="700"/>
</div>

Examples:
- U+0041 (‘A’) → 01000001 → 01000001 (same as ASCII)
- U+00A9 (‘©’) → 10101001 → 11000010 10101001
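The bit layout can also be checked by hand. The sketch below packs a code point into one to four bytes following the standard UTF-8 layout and compares the result with Python's built-in encoder. The helper name `utf8_encode_code_point` is made up for this illustration, and it skips the validity checks (for example, surrogates) that a real encoder performs.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encodes a single code point as UTF-8 (illustrative, no validation)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

for ch in ['A', '©', '안', '😄']:
    manual = utf8_encode_code_point(ord(ch))
    bits = ' '.join(f'{b:08b}' for b in manual)
    print(f'U+{ord(ch):04X} -> {bits} (matches str.encode: {manual == ch.encode("utf-8")})')
```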

Because `UTF-8` text is a stream of bytes, using the raw bytes as tokens gives a vocabulary of at most 256 possible tokens. This means a tiny embedding table, but at the cost of very long token sequences. That is a problem for transformer-based neural networks, where the context length is finite and each token needs to attend to all other tokens in the sequence.

In [2]:
unicode_enc = [ord(x) for x in '안녕하세요']
print('Unicode length: ', len(unicode_enc))
print('Unicode representation: ', unicode_enc)

Unicode length:  5
Unicode representation:  [50504, 45397, 54616, 49464, 50836]


In [3]:
utf8_enc = '안녕하세요'.encode('utf-8')
print('UTF-8 length: ', len(utf8_enc))
print('UTF-8 representation: ', list(utf8_enc))
print('UTF-8 byte string: ', utf8_enc)

UTF-8 length:  15
UTF-8 representation:  [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148]
UTF-8 byte string:  b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x84\xb8\xec\x9a\x94'


# 2. Byte Pair Encoding (BPE)

This algorithm was first described in 1994, by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table. It builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).

Suppose the data to be encoded is:

```
aaabdaaabac
```

The byte pair "aa" occurs most often, so it is merged into a single token:

```
ZabdZabac
Z = aa
```

The process is repeated with byte pair "ab", replacing it with Y:

```
ZYdZYac
Y = ab
Z = aa
```

Finally, the byte pair "ZY" is merged into a single token X:

```
XdXac
X = ZY
Y = ab
Z = aa
```

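The data cannot be compressed further because there are no pairs of symbols that occur more than once. We started with a sequence of 11 symbols over a vocabulary of 4 (`a`, `b`, `c`, `d`), and ended with a sequence of 5 symbols over a vocabulary of 7 (the original 4 plus `X`, `Y` and `Z`).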
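The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch that works on characters rather than bytes; the helper names `toy_bpe` and `expand` are made up for this illustration. Ties between equally frequent pairs are broken by first occurrence here, so the intermediate definitions of `Y` and `X` differ from the walkthrough (which merges `ab` second), although the final string comes out the same.

```python
from collections import Counter

def toy_bpe(text, new_symbols='ZYXWVU'):
    """Greedy BPE on characters; returns the compressed string and the merge table."""
    seq = list(text)
    merges = {}
    for symbol in new_symbols:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # no pair occurs more than once: nothing left to compress
            break
        merges[symbol] = pair
        merged, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
                merged.append(symbol)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return ''.join(seq), merges

def expand(compressed, merges):
    """Undoes the merges, newest first, to recover the original string."""
    for symbol, (left, right) in reversed(list(merges.items())):
        compressed = compressed.replace(symbol, left + right)
    return compressed

original = 'aaabdaaabac'
compressed, merges = toy_bpe(original)
print('Compressed: ', compressed, '| length', len(compressed))
print('Merge table: ', merges)
print('Vocabulary size: ', len(set(original)) + len(merges))
print('Round trip OK: ', expand(compressed, merges) == original)
```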

In [4]:
with open('../data/unicode.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print('Number of characters in the text: ', len(text))

Number of characters in the text:  1414


In [5]:
# NOTE: unicode text encoded in utf-8 has up to 4 bytes per character
tokens = list(map(int, text.encode('utf-8')))
print('Number of single tokens in the text: ', len(tokens))

Number of single tokens in the text:  2058


In [6]:
unique_tokens = set(tokens)
print('Number of unique tokens in the text: ', len(unique_tokens))
print('Max token: ', max(unique_tokens))

Number of unique tokens in the text:  105
Max token:  240


In [7]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print('Number of unique bigrams: ', len(stats))
print('Most common bigrams: ', sorted(stats.items(), key=lambda x: x[1], reverse=True)[:5])
print('Most common bigram in text: ', (chr(101), chr(32)))

Number of unique bigrams:  617
Most common bigrams:  [((101, 32), 24), ((204, 173), 18), ((205, 153), 18), ((204, 178), 18), ((115, 32), 17)]
Most common bigram in text:  ('e', ' ')


In [8]:
# merging the most common pair
top_pair = max(stats, key=stats.get)
top_pair

(101, 32)

In [9]:
def merge(tokens: list, pair: tuple[int, int], new_token: int) -> list:
    """Replaces every occurrence of `pair` in `tokens` with the single token `new_token`."""
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
            new_tokens.append(new_token)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens

print('Example tokens before merging: ', ex_tokens := [5, 6, 6, 7, 9, 1])
print('Example tokens after merging: ', merge(ex_tokens, (6, 6), 10))

Example tokens before merging:  [5, 6, 6, 7, 9, 1]
Example tokens after merging:  [5, 10, 7, 9, 1]


In [10]:
merged_tokens = merge(tokens, top_pair, max(unique_tokens) + 1)
print('Number of tokens before merging: ', len(tokens))
print('Number of tokens after merging: ', len(merged_tokens))
print('Number of unique tokens before merging: ', len(unique_tokens))
print('Number of unique tokens after merging: ', len(set(merged_tokens)))
print('Max token before merging: ', max(unique_tokens))
print('Max token after merging: ', max(set(merged_tokens)))

Number of tokens before merging:  2058
Number of tokens after merging:  2034
Number of unique tokens before merging:  105
Number of unique tokens after merging:  106
Max token before merging:  240
Max token after merging:  241


### 2.1. Training the tokenizer

In [11]:
vocab_size = 276                 # desired number of unique tokens in vocabulary
max_tokens_per_byte = 2 ** 8     # encoding string into utf-8 converts characters into bytes
num_merges = vocab_size - max_tokens_per_byte
trainable_tokens = copy.deepcopy(tokens)

In [12]:
# `bpe_forest` is an inverted tree that stores merges: (int, int) -> int
bpe_forest = {}
for i in range(num_merges):
    stats = get_stats(trainable_tokens)
    top_pair = max(stats, key=stats.get)
    new_token = max_tokens_per_byte + i
    print(f'Merging pair {top_pair} into new token {new_token}')
    trainable_tokens = merge(trainable_tokens, top_pair, new_token)
    bpe_forest[top_pair] = new_token

Merging pair (101, 32) into new token 256
Merging pair (204, 173) into new token 257
Merging pair (205, 153) into new token 258
Merging pair (204, 178) into new token 259
Merging pair (115, 32) into new token 260
Merging pair (204, 171) into new token 261
Merging pair (204, 177) into new token 262
Merging pair (240, 159) into new token 263
Merging pair (205, 136) into new token 264
Merging pair (204, 185) into new token 265
Merging pair (226, 128) into new token 266
Merging pair (105, 110) into new token 267
Merging pair (205, 150) into new token 268
Merging pair (204, 187) into new token 269
Merging pair (205, 135) into new token 270
Merging pair (204, 188) into new token 271
Merging pair (204, 164) into new token 272
Merging pair (204, 166) into new token 273
Merging pair (97, 110) into new token 274
Merging pair (204, 176) into new token 275


In [13]:
print('Number of unique tokens before BPE: ', len(unique_tokens))
print('Number of unique tokens after BPE: ', len(set(trainable_tokens)))
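# NOTE: this ratio compares counts of unique tokens, not sequence lengths, so it is not
#       a compression ratio; that would be len(tokens) / len(trainable_tokens)
print(f'Growth in unique tokens: {len(set(trainable_tokens)) / len(unique_tokens):.2f}X')

Number of unique tokens before BPE:  105
Number of unique tokens after BPE:  113
Growth in unique tokens: 1.08X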
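A compression ratio in the usual sense compares the length of the token sequence before and after the merges. The following self-contained sketch restates the two helpers so that it can run on its own, trains a few merges on a made-up sample string (not `unicode.txt`), and reports that ratio. The function name `train_bpe` and the sample text are assumptions for illustration only.

```python
def get_stats(ids):
    """Counts how often each adjacent pair occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replaces every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Hypothetical helper: runs `num_merges` greedy merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode('utf-8'))
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        new_id = 256 + i
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

# made-up sample text, repeated so that frequent pairs exist
sample = 'the cat sat on the mat, the rat sat on the cat. ' * 4
raw_ids = list(sample.encode('utf-8'))
bpe_ids, merges = train_bpe(sample, num_merges=10)
ratio = len(raw_ids) / len(bpe_ids)
print(f'Sequence length: {len(raw_ids)} -> {len(bpe_ids)} ({ratio:.2f}X compression)')
```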


### 2.2. Decoding tokens into strings

UTF-8 follows a specific schema that constrains the values bytes can take, and this schema is used to encode and decode strings. Per this schema, each byte of a multi-byte character must follow certain structural rules (see section 1 above), so not every byte sequence is valid UTF-8. To avoid running into errors, the bytes `decode()` method accepts an `errors` argument. When it is set to `replace`, any byte that cannot be decoded is replaced with the Unicode replacement character `�` (U+FFFD).
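The difference between the default strict behavior and `errors='replace'` can be seen on a single invalid byte. This is a small sketch with illustrative variable names; byte 128 is used because a lone continuation byte is never valid UTF-8 on its own.

```python
invalid = bytes([128])  # 0b10000000: a continuation byte with no leading byte

# The default is errors='strict', which raises on invalid input
try:
    invalid.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('Strict decoding failed: ', exc)

# With errors='replace', the offending byte becomes U+FFFD
replaced = invalid.decode('utf-8', errors='replace')
print('Replaced: ', replaced, '| code point: ', hex(ord(replaced)))
```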

In [14]:
vocab = {i: bytes([i]) for i in range(max_tokens_per_byte)}
for (i, j), new_token in bpe_forest.items():
    vocab[new_token] = vocab[i] + vocab[j]
print('Number of tokens in the vocabulary: ', len(vocab))

Number of tokens in the vocabulary:  276


In [15]:
def decode(tokens: list[int]) -> str:
    """Decodes the given list of tokens using the given vocabulary."""
    binary = b''.join(vocab[token] for token in tokens)
    return binary.decode('utf-8', errors='replace')

print('Decoded token: ', decode([97]))

Decoded token:  a


In [16]:
decode([128])

'�'

### 2.3. Encoding strings into tokens

In [17]:
text = 'hello software engineering'
print('Length of the text: ', len(text))

Length of the text:  26


In [18]:
def encode(text: str) -> list[int]:
    """Encodes the given text using the given vocabulary."""
    tokens = list(text.encode('utf-8'))

    while len(tokens) > 1:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: bpe_forest.get(p, float('inf')))
        if pair not in bpe_forest:
            break
          
        new_token = bpe_forest[pair]
        tokens = merge(tokens, pair, new_token)

    return tokens

print('Encoded tokens: ', encode(text))
print('Length of the encoded tokens: ', len(encode(text)))

Encoded tokens:  [104, 101, 108, 108, 111, 32, 115, 111, 102, 116, 119, 97, 114, 256, 101, 110, 103, 267, 101, 101, 114, 267, 103]
Length of the encoded tokens:  23


In [19]:
print('Decoded text: ', decode(encode(text)))

Decoded text:  hello software engineering


# 3. Regex patterns to force splits across categories

This section is based on the following excerpt from the GPT-2 paper: "We observed BPE including many versions of common words like `dog` since they occur in many variations such as `dog.`, `dog!` and `dog?`. This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence". 

In order to prevent BPE from merging across character categories, regex patterns are used to force splits across categories and then tokenization can be performed on the resulting splits. In the end, the results of that processing are concatenated back together. This way, byte-pair merges can only happen within the same category.
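The effect of splitting first can be sketched with the standard library. The pattern below is a simplified stand-in written for Python's built-in `re` module, which does not support `\p{L}` and `\p{N}`; the character classes used instead are an approximation and not the GPT-2 pattern itself. The helper `count_pairs` is made up for this illustration. Counting byte pairs per chunk shows that a pair such as `g.` can no longer be seen, and so can never be merged.

```python
import re
from collections import Counter

# Assumption: `[^\W\d_]` approximates \p{L} and `\d` approximates \p{N}
SPLIT_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)

def count_pairs(chunks):
    """Counts adjacent byte pairs inside each chunk, never across chunk boundaries."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode('utf-8'))
        counts.update(zip(ids, ids[1:]))
    return counts

text = 'dog. dog! dog?'
chunks = SPLIT_PATTERN.findall(text)
unsplit_counts = count_pairs([text])  # the whole text treated as one chunk
split_counts = count_pairs(chunks)    # pairs counted within each chunk only

cross_pair = (ord('g'), ord('.'))
print('Chunks: ', chunks)
print('Pair "g." visible without splitting: ', cross_pair in unsplit_counts)
print('Pair "g." visible with splitting: ', cross_pair in split_counts)
print('Count of pair "do" with splitting: ', split_counts[(ord('d'), ord('o'))])
```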

In [20]:
pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
print(re.findall(pattern, "It's aren't they're they've I'm I'll He'd        Hello123 World!?!?"))

['It', "'s", ' aren', "'t", ' they', "'re", ' they', "'ve", ' I', "'m", ' I', "'ll", ' He', "'d", '       ', ' Hello', '123', ' World', '!?!?']


### 3.1. `Tiktoken` library intro

In [21]:
text = '    hello world!!!'

# GPT-2 (does not merge spaces)
gpt2_encoding = tiktoken.get_encoding('gpt2')
print('GPT-2 encoding: ', gpt2_encoding.encode(text))

# GPT-4 (merges spaces)
gpt4_encoding = tiktoken.get_encoding('cl100k_base')
print('GPT-4 encoding: ', gpt4_encoding.encode(text))

GPT-2 encoding:  [220, 220, 220, 23748, 995, 10185]
GPT-4 encoding:  [262, 24748, 1917, 12340]


# 4. GPT-2 `encoder.py` walkthrough

References: 
- Code repository: https://github.com/openai/gpt-2/blob/master/src/encoder.py
- Vocabulary: https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
- BPE merges: https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe

In [22]:
with open('../data/encoder.json', 'r', encoding='utf-8') as f:
    vocab = json.load(f)

print('Number of tokens in the vocab: ', len(vocab))
print('First 10 tokens in the vocab: ', list(vocab.items())[:10])

Number of tokens in the vocab:  50257
First 10 tokens in the vocab:  [('!', 0), ('"', 1), ('#', 2), ('$', 3), ('%', 4), ('&', 5), ("'", 6), ('(', 7), (')', 8), ('*', 9)]


In [23]:
with open('../data/vocab.bpe', 'r', encoding='utf-8') as f:
    bpe_data = f.read()
    
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
print('Number of BPE merges: ', len(bpe_merges))
print('First 10 BPE merges: ', bpe_merges[:10])

Number of BPE merges:  50000
First 10 BPE merges:  [('Ġ', 't'), ('Ġ', 'a'), ('h', 'e'), ('i', 'n'), ('r', 'e'), ('o', 'n'), ('Ġt', 'he'), ('e', 'r'), ('Ġ', 's'), ('a', 't')]


In [24]:
# Returns a dict that maps each utf-8 byte value to a printable unicode character.
# The reversible bpe codes work on unicode strings.
# This means you need a large number of unicode characters in your vocab if you want to avoid UNKs.
# When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
# This is a significant percentage of your normal, say, 32K bpe vocab.
# To avoid that, we want lookup tables between utf-8 bytes and unicode strings,
# which also avoid mapping to whitespace/control characters the bpe code barfs on.
def bytes_to_unicode():
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

print('First 10 bytes to unicode mapping: ', list(bytes_to_unicode().items())[:10])

First 10 bytes to unicode mapping:  [(33, '!'), (34, '"'), (35, '#'), (36, '$'), (37, '%'), (38, '&'), (39, "'"), (40, '('), (41, ')'), (42, '*')]
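This mapping explains the `Ġ` that appears in the merges listed earlier: it is the printable stand-in for the space byte. The sketch below restates the function so that it can run on its own, then maps a short string to its printable form and back. The names `byte_encoder`, `byte_decoder`, `visible` and `restored` are illustrative.

```python
def bytes_to_unicode():
    # printable bytes keep their own character
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(ord('¡'), ord('¬') + 1))
          + list(range(ord('®'), ord('ÿ') + 1)))
    cs = bs[:]
    n = 0
    # the remaining bytes (whitespace, control characters, ...) are shifted to 256 and above
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

# space (byte 32) and newline (byte 10) become visible characters
visible = ''.join(byte_encoder[b] for b in ' the\n'.encode('utf-8'))
restored = bytes(byte_decoder[c] for c in visible).decode('utf-8')

print('Space maps to: ', byte_encoder[32])
print('Printable form: ', visible)
print('Restored text: ', repr(restored))
```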


In [25]:
def get_pairs(word):
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

print('Pairs in the word "hello": ', get_pairs('hello'))

Pairs in the word "hello":  {('e', 'l'), ('h', 'e'), ('l', 'o'), ('l', 'l')}


In [26]:
class Tokenizer:
    def __init__(self, encoder, bpe_merges, errors='replace'):
        self.encoder = encoder
        self.decoder = {v:k for k,v in self.encoder.items()}
        self.errors = errors # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}

        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token)
        pairs = get_pairs(word)

        if not pairs:
            return token

        while True:
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:  # `first` does not occur in the rest of the word
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
        return text
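To make the rank-based loop in `bpe` easier to follow, here is a stripped-down sketch that runs on a made-up merge table. The table `toy_ranks` and the helper `apply_bpe` are hypothetical and are not part of GPT-2's `encoder.py`; real ranks come from `vocab.bpe`. At each step, the pair with the lowest rank (that is, the earliest learned merge) is applied first.

```python
def apply_bpe(word, ranks):
    """Repeatedly merges the lowest-ranked adjacent pair until none is left in `ranks`."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float('inf')))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# hypothetical merge table: lower rank means the merge was learned earlier
toy_ranks = {('l', 'l'): 0, ('h', 'e'): 1, ('he', 'll'): 2}
result = apply_bpe('hello', toy_ranks)
print('Pieces: ', result)
```

With this table, the merges are applied in the order `ll`, `he`, `hell`, and the trailing `o` stays on its own because no merge involving it was learned.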

In [27]:
gpt2_tokenizer = Tokenizer(encoder=vocab, bpe_merges=bpe_merges)
bpe_tokens = gpt2_tokenizer.encode('hello world')
print('Original text: ', 'hello world')
print('BPE tokens: ', bpe_tokens)
text = gpt2_tokenizer.decode(bpe_tokens)
print('Decoded text: ', text)

Original text:  hello world
BPE tokens:  [31373, 995]
Decoded text:  hello world


# 5. Special tokens

In addition to the tokens that come from raw bytes and BPE merges, special tokens are added to the vocabulary to delimit different parts of the data or to impose special structure on the token stream. For example, OpenAI GPT-2's vocabulary is composed of 50257 tokens:

- 256 tokens from raw bytes
- 50000 tokens from BPE merges
- 1 special token: `<|endoftext|>`

This special token is used to delimit documents in the training set to signal to the language model that the document has ended and what follows is unrelated to the previous document.

Special tokens can be registered in the vocabulary arbitrarily, and they are used pervasively in the fine-tuning stage, for instance to delimit the conversation turns in a dialogue dataset. Doing so requires some model *surgery* on the parameters of the transformer so that the new tokens are handled properly, i.e., the embedding matrix has to be extended with one row per added token, and the final output projection needs a matching extension.
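One common way to handle special tokens at encoding time is to split the text on them first and assign their ids directly, so that they never go through BPE. This is an assumption about tokenizer design in general, not something the GPT-2 `encoder.py` above does. In the sketch below, `encode_with_special` is a made-up helper, and ordinary text is encoded as raw UTF-8 bytes as a placeholder for a real BPE encoder.

```python
import re

# id taken from the GPT-2 vocabulary shown above
SPECIAL_TOKENS = {'<|endoftext|>': 50256}

def encode_with_special(text, special_tokens=SPECIAL_TOKENS):
    """Hypothetical helper: special tokens get their reserved id, the rest is encoded normally."""
    # the capturing group keeps the special tokens in the result of re.split
    pattern = '(' + '|'.join(re.escape(token) for token in special_tokens) + ')'
    ids = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            ids.append(special_tokens[part])
        else:
            # placeholder for ordinary BPE encoding: raw UTF-8 bytes only
            ids.extend(part.encode('utf-8'))
    return ids

ids = encode_with_special('first doc<|endoftext|>second doc')
print('Encoded ids: ', ids)
```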

In [28]:
print("Number of OpenAI GPT-2's BPE merges: ", len(bpe_merges))
print("Number of tokens in OpenAI GPT-2's vocab: ", len(vocab))
print('Special token <|endoftext|> in the vocab: ', vocab['<|endoftext|>'])

Number of OpenAI GPT-2's BPE merges:  50000
Number of tokens in OpenAI GPT-2's vocab:  50257
Special token <|endoftext|> in the vocab:  50256


# 6. `Sentencepiece` library intro

**Sentencepiece** is commonly used because (unlike **Tiktoken**) it can efficiently both train BPE tokenizers and run inference with them. It is used in both the Llama and Mistral series.

**Sentencepiece** runs BPE directly on the Unicode code points. It has an option, `character_coverage`, that controls which code points are treated as rare. Rare code points are either mapped onto an `UNK` token or, if `byte_fallback` is turned on, encoded as raw UTF-8 bytes. It also has *normalization* settings, such as lowercasing everything or removing double whitespaces. Normalization used to be very common in natural language processing before LLMs, but it is used less now because LLMs can learn to handle unnormalized text by themselves.

Because it implements a large variety of tokenization algorithms, it has many configuration options, which are documented here:

- [Training options](https://github.com/google/sentencepiece/blob/master/doc/options.md)
- [Protocol buffer used to represent TrainerSpec](https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto)
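The difference between the `UNK` and byte-fallback treatments of rare code points can be sketched in plain Python without the library. The helper `encode_chars` and the set `known` are made up for this illustration, and the sketch skips BPE merges entirely. It also emits one `<unk>` per unknown character, which is a simplification and may not match how the real library groups unknown characters. The id offset of 3 assumes that the byte pieces come right after three special tokens.

```python
def encode_chars(text, known, byte_fallback=True):
    """Hypothetical helper: maps each character to a piece, with UNK or byte fallback."""
    pieces = []
    for ch in text:
        if ch in known:
            pieces.append(ch)
        elif byte_fallback:
            # one piece per UTF-8 byte, written in the <0xNN> style
            pieces.extend(f'<0x{b:02X}>' for b in ch.encode('utf-8'))
        else:
            pieces.append('<unk>')
    return pieces

known = set('abcdefghijklmnopqrstuvwxyz ')  # made-up set of characters seen in training
with_fallback = encode_chars('a안', known, byte_fallback=True)
without_fallback = encode_chars('a안', known, byte_fallback=False)
print('With byte fallback: ', with_fallback)
print('Without byte fallback: ', without_fallback)

# Assumption: byte pieces start at id 3, after <unk>, <s> and </s>
byte_ids = [3 + b for b in '안'.encode('utf-8')]
print('Ids of the byte pieces: ', byte_ids)
```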

In [29]:
# settings are (best effort) those used for training Llama 2
options = dict(
  # input spec
  input='../data/unicode.txt',
  input_format='text',
  # output spec
  model_prefix='nano-tokenizer', # output filename prefix
  # algorithm spec -- BPE
  model_type='bpe',
  vocab_size=400,
  # normalization
  normalization_rule_name='identity', # no normalization
  remove_extra_whitespaces=False,
  input_sentence_size=1e6,
  max_sentence_length=4192, # max number of bytes per sentence
  seed_sentencepiece_size=1e6,
  shuffle_input_sentence=True,
  # rare codepoint treatment
  character_coverage=0.99995, # fraction of character occurrences the vocab must cover; rarer characters become UNK (or bytes, with byte_fallback)
  byte_fallback=True,
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True, # 'world' and 'hello world' are both mapped to '_world'
  allow_whitespace_only_pieces=True,
  # special_tokens
  unk_id=0,  # UNK token must exist
  bos_id=1,  # optional BOS token - set to -1 if not needed
  eos_id=2,  # optional EOS token - set to -1 if not needed
  pad_id=-1, # optional PAD token - set to -1 if not needed
  # system spec
  num_threads=max(1, (os.cpu_count() or 2) // 2), # use half of the available cores (must be an integer)
)

spm.SentencePieceTrainer.train(**options)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: ../data/unicode.txt
  input_format: text
  model_prefix: nano-tokenizer
  model_type: BPE
  vocab_size: 400
  self_test_sample_size: 0
  character_coverage: 0.99995
  input_sentence_size: 1000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 4
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 1
  required_chars: 
  byte_fallback: 1
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_

In [30]:
sp = spm.SentencePieceProcessor()
sp.load('nano-tokenizer.model')
vocab = [[sp.id_to_piece(i), i] for i in range(sp.get_piece_size())]
print('Number of tokens in the vocab: ', len(vocab))
print('First 5 tokens in the vocab: ', vocab[:5])
print('Last 5 tokens in the vocab: ', vocab[-5:])

Number of tokens in the vocab:  400
First 5 tokens in the vocab:  [['<unk>', 0], ['<s>', 1], ['</s>', 2], ['<0x00>', 3], ['<0x01>', 4]]
Last 5 tokens in the vocab:  [['🇮', 395], ['🇳', 396], ['🇴', 397], ['🇺', 398], ['😄', 399]]


In [31]:
# NOTE: SentencePiece has not seen 안녕하세요 codepoints during training
#       Because `byte_fallback=True`, it will fallback to bytes
utf8_enc = '안녕하세요'.encode('utf-8')
print('UTF-8 length: ', len(utf8_enc))
print('UTF-8 representation: ', list(utf8_enc))
print('UTF-8 byte string: ', utf8_enc)

UTF-8 length:  15
UTF-8 representation:  [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148]
UTF-8 byte string:  b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x84\xb8\xec\x9a\x94'


In [32]:
ids = sp.encode('안녕하세요 world')
print('Encoded ids: ', ids)

Encoded ids:  [268, 239, 152, 139, 238, 136, 152, 240, 152, 155, 239, 135, 187, 239, 157, 151, 268, 299, 265, 280, 277]


In [33]:
num_special_tokens_used = 3  # <unk>, <s> and </s> occupy ids 0-2, so byte tokens start at id 3
adjusted_encoded_ids = [x - num_special_tokens_used for x in ids[1:len(utf8_enc)+1]]
list(utf8_enc), adjusted_encoded_ids

([236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148],
 [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148])

In [34]:
print('Tokens: ', [sp.id_to_piece(i) for i in ids])

Tokens:  ['▁', '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', '▁', 'w', 'or', 'l', 'd']


In [35]:
text = sp.decode(ids)
print('Decoded text: ', text)

Decoded text:  안녕하세요 world


In [36]:
# NOTE: demo of `add_dummy_prefix=True` option
ids = sp.encode('world')
print('Encoded ids for "world": ', ids)
print('Tokens for "world": ', [sp.id_to_piece(i) for i in ids])

ids = sp.encode('hello world')
print('Encoded ids for "hello world": ', ids)
print('Tokens for "hello world": ', [sp.id_to_piece(i) for i in ids])

Encoded ids for "world":  [268, 299, 265, 280, 277]
Tokens for "world":  ['▁', 'w', 'or', 'l', 'd']
Encoded ids for "hello world":  [268, 278, 269, 280, 280, 274, 268, 299, 265, 280, 277]
Tokens for "hello world":  ['▁', 'h', 'e', 'l', 'l', 'o', '▁', 'w', 'or', 'l', 'd']


In [37]:
# NOTE: demo of `byte_fallback=False` option
options['byte_fallback'] = False
spm.SentencePieceTrainer.train(**options)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: ../data/unicode.txt
  input_format: text
  model_prefix: nano-tokenizer
  model_type: BPE
  vocab_size: 400
  self_test_sample_size: 0
  character_coverage: 0.99995
  input_sentence_size: 1000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 4
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 1
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_

In [38]:
sp = spm.SentencePieceProcessor()
sp.load('nano-tokenizer.model')

ids = sp.encode('안녕하세요 world')
print('Encoded ids: ', ids)

Encoded ids:  [268, 0, 132, 280, 277]


In [39]:
print('Tokens: ', [sp.id_to_piece(i) for i in ids])

Tokens:  ['▁', '<unk>', '▁wor', 'l', 'd']


In [40]:
text = sp.decode(ids)
print('Decoded text: ', text)

Decoded text:   ⁇  world


# Sources

1. [Ground truth - Let's build the GPT Tokenizer, by Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE&t=38s)
2. [A programmer's introduction to Unicode, by Nathan Reed](https://www.reedbeta.com/blog/programmers-intro-to-unicode)
3. [Language models are unsupervised multitask learners [GPT-2 paper], by Alec Radford; Dario Amodei; Ilya Sutskever; et al.](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
4. [Efficient Training of Language Models to Fill in the Middle, by Bavarian, M; et al.](https://arxiv.org/abs/2207.14255)
5. [Learning to Compress Prompts with Gist Tokens, by Mu, J; et al.](https://arxiv.org/pdf/2304.08467)
6. [Taming Transformers for High-Resolution Image Synthesis, by Esser, P; et al.](https://compvis.github.io/taming-transformers)
7. [Integer tokenization is insane, by Beren Millidge](https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/)
8. [SolidGoldMagikarp (plus, prompt generation), by Jessica Rumbelow](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)