# Tokenizers II
Tokenization occurs in steps:
1. Normalization
2. Pre-tokenization
3. Model
4. Postprocessor
   
# 1. Normalization
Normalization cleans up the raw text so that superficially different strings map to the same input for the model. This could include:
- changing to lowercase
- removing accents (beware: whether this is safe depends on the language)
- mapping Unicode characters that look identical onto one unique character code

You can access the tokenizer's normalizer through the `tokenizer.backend_tokenizer.normalizer` object.

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

<class 'tokenizers.Tokenizer'>
hello how are u?
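The third bullet (look-alike characters) can be checked with Python's standard `unicodedata` module. This is only a small illustration of Unicode normalization forms, not a description of what the BERT normalizer does internally; the sample strings are made up for the example.

```python
import unicodedata

composed = "\u00e9"     # "é" stored as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but are different code point sequences.
print(composed == decomposed)  # False

# NFC composes both into the same canonical form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
print(nfc_equal)  # True

# NFKC additionally folds compatibility characters such as the "fi" ligature.
ligature_folded = unicodedata.normalize("NFKC", "\ufb01")
print(ligature_folded)  # fi
```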


# 2. Pre-tokenization
Applies some rules to split the input into words based on spaces and punctuation.

You can access the tokenizer's pre-tokenizer through the `tokenizer.backend_tokenizer.pre_tokenizer` object.

In [7]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('how', (7, 10)),
 ('are', (11, 14)),
 ('you', (16, 19)),
 ('?', (19, 20))]

Note: this is where the offsets are recorded. Each `(start, end)` pair gives the character span of the piece in the input string.
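To make the offsets concrete, here is a minimal standard-library sketch that splits on whitespace and punctuation and records character spans. The regular expression and the `pre_tokenize` helper are assumptions made for illustration; they are not the rules the BERT pre-tokenizer actually implements.

```python
import re

def pre_tokenize(text):
    """Split into runs of word characters or single punctuation marks, with offsets."""
    pattern = re.compile(r"\w+|[^\w\s]")
    return [(m.group(), m.span()) for m in pattern.finditer(text)]

text = "Hello, how are  you?"
pieces = pre_tokenize(text)
print(pieces)

# Every offset pair slices the input text back to the piece it describes.
offsets_ok = all(text[start:end] == piece for piece, (start, end) in pieces)
print(offsets_ok)
```

Because the offsets index into the input string, the double space between `are` and `you` shows up as a gap between `(11, 14)` and `(16, 19)`.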

In [8]:
# Other tokenizers have different rules
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]

This one also splits on whitespace and punctuation, but it keeps the spaces and replaces them with a `Ġ` symbol, so the original spaces can be recovered when the tokens are decoded. Also note that it did NOT ignore the double space between `are` and `you`.
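A rough standard-library approximation of this behaviour is sketched below. The pattern is a simplified stand-in (GPT-2's real pattern uses Unicode character classes and handles contractions), and `gpt2_like_pre_tokenize` is a hypothetical helper name.

```python
import re

# Simplified stand-in for GPT-2's splitting pattern: an optional leading space
# stays attached to the word or punctuation run that follows it.
PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+(?!\S)|\s+")

def gpt2_like_pre_tokenize(text):
    return [(m.group().replace(" ", "Ġ"), m.span()) for m in PATTERN.finditer(text)]

text = "Hello, how are  you?"
pieces = gpt2_like_pre_tokenize(text)
print(pieces)

# Because no character is dropped, the text can be rebuilt exactly.
restored = "".join(piece for piece, _ in pieces).replace("Ġ", " ")
print(restored == text)
```

The extra space between `are` and `you` survives as its own `Ġ` piece, which is why the round trip is lossless.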

In [9]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('▁Hello,', (0, 6)),
 ('▁how', (7, 10)),
 ('▁are', (11, 14)),
 ('▁you?', (16, 20))]

Like the GPT-2 tokenizer, this one keeps spaces and replaces them with a specific symbol (`▁`), but the T5 tokenizer only splits on whitespace, not punctuation. Also note that it added a space by default at the beginning of the sentence (before `Hello`) and ignored the double space between `are` and `you`.
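The whitespace-only behaviour can be imitated in a few lines. This is a sketch under the assumption that each whitespace-delimited chunk simply gets a `▁` marker in front; `metaspace_like_pre_tokenize` is a hypothetical name, not a library function.

```python
import re

def metaspace_like_pre_tokenize(text, marker="\u2581"):
    """Split on whitespace only and mark the start of each piece with the marker."""
    return [(marker + m.group(), m.span()) for m in re.finditer(r"\S+", text)]

pieces = metaspace_like_pre_tokenize("Hello, how are  you?")
print(pieces)
```

Punctuation stays glued to its word, and runs of spaces collapse into a single marker, so the double space cannot be recovered from the pieces alone.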

# 3. Tokenizer model
There are three main subword tokenization algorithms: 
- BPE (used by GPT-2 and others)
- WordPiece (used for example by BERT)
- Unigram (used by T5 and others)

![algorithm table](./img/tokenizer_algorithms_table.png)

## Byte-Pair Encoding (BPE)
Break words into subword units that appear frequently in a reference corpus. Basic idea:

toy corpus: ["the best is there"]
1. Start with your corpus and split it into words, recording the frequency of each word.
2. Create a vocabulary from the characters that appear in the splits: `vocab=['b', 'e', 'h', 'i', 'r', 's', 't']`
3. Count how often each pair of adjacent symbols occurs across all the splits (e.g. `th`).
4. Add the most frequent pair (here `th`, which ties with `he`) to a merge list, `merge_list = ['th']`, and to the vocabulary, `vocab=['b', 'e', 'h', 'i', 'r', 's', 't', 'th']`.
5. Remake the splits, replacing each occurrence of the merged pair with the new symbol.
6. Repeat steps 3-5 until the vocabulary has grown to the desired size.

To tokenize new text, split each word into characters and apply the learned merge rules in the order they were learned; the symbols that remain are the tokens.
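The steps above can be sketched in plain Python on the toy corpus. This is a teaching sketch, not the `tokenizers` implementation: it works on characters rather than bytes, stops after a fixed number of merges instead of a target vocabulary size, and breaks ties by taking the first pair encountered (an assumption of this sketch).

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    word_freqs = Counter(word for text in corpus for word in text.split())
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)  # first pair wins on ties
        merges.append(best)
        vocab.append(best[0] + best[1])
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return vocab, merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

vocab, merges = train_bpe(["the best is there"], num_merges=2)
print(vocab)
print(merges)
print(tokenize("these", merges))
```

After two merges the vocabulary contains `th` and `the`, and an unseen word such as `these` is tokenized by replaying those two merges.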

## WordPiece
Developed by Google to train BERT. It is very similar to BPE, except that characters inside a word (every character but the first) are marked with a `##` prefix. E.g. `tot` is split into `['t', '##o', '##t']`.

The pair that gets merged is NOT the most frequent one, but the one with the highest score, given by:
$$\text{score}=\frac{(\text{freq of pair})}{\text{freq of first element} \times \text{freq of second element}}$$
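The effect of the score is easiest to see on numbers. The sketch below uses hypothetical word counts, chosen only to keep the arithmetic simple, and compares the most frequent pair (what BPE would merge) with the highest-scoring pair (what WordPiece would merge).

```python
from collections import Counter

# Hypothetical word counts, not taken from any real corpus.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def split_word(word):
    """First character as-is, every following character with the ## prefix."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

symbol_freqs, pair_freqs = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = split_word(word)
    for symbol in symbols:
        symbol_freqs[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_freqs[pair] += freq

# score = freq(pair) / (freq(first) * freq(second))
scores = {
    pair: freq / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}

most_frequent = max(pair_freqs, key=pair_freqs.get)
best_scored = max(scores, key=scores.get)
print(most_frequent, pair_freqs[most_frequent])
print(best_scored, round(scores[best_scored], 3))
```

The pair `('##u', '##g')` is the most frequent, but its parts are common everywhere, so its score is low; `('##g', '##s')` is rare yet almost always appears together, so it scores highest.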

Unlike BPE, WordPiece saves only the final vocabulary (not the merge rules), so text is tokenized by repeatedly taking the longest vocabulary token that matches the start of the remaining characters.
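A minimal sketch of that longest-match lookup is shown below. The vocabulary is hypothetical, and returning a single `[UNK]` for a word that cannot be fully matched is an assumption of this sketch.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no prefix matched: give up on the whole word
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
vocab = {"to", "tok", "##k", "##en", "##eni", "##i", "##zer"}
print(wordpiece_tokenize("tokenizer", vocab))
print(wordpiece_tokenize("tokenizing", vocab))
```

Note that `tok` wins over `to` and `##eni` over `##en` simply because they are longer, not because of any stored merge order.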

## Unigram
A statistical model that assigns a probability to every token in the vocabulary. Basic idea:
1. Start from a very large vocabulary of candidate substrings and estimate the probability of each token as its share of occurrences in the corpus.
2. Calculate the unigram loss over the corpus.
3. Remove the token(s) whose removal increases the loss the least.
4. Repeat steps 1-3 until the vocab has reached the desired size.
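One round of this loop can be sketched as follows. The token counts and the one-word corpus are hypothetical, the loss is taken as the negative log-likelihood of each word's best segmentation, and probabilities are not renormalized after a token is dropped; all of these are simplifying assumptions, not the trainer's exact procedure.

```python
import math

# Hypothetical token counts; a real trainer would estimate these from the corpus.
token_counts = {"u": 4, "n": 3, "d": 2, "o": 3, "un": 5, "do": 6, "nd": 1, "undo": 1}
total = sum(token_counts.values())
log_probs = {tok: math.log(count / total) for tok, count in token_counts.items()}

def best_segmentation(word, log_probs):
    """Viterbi search for the most probable split of `word` into known tokens."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev_score, prev_tokens = best[start]
            if piece in log_probs and prev_tokens is not None:
                score = prev_score + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, prev_tokens + [piece])
    return best[-1]

def corpus_loss(word_freqs, log_probs):
    """Negative log-likelihood of the corpus under the best segmentations."""
    return sum(-freq * best_segmentation(word, log_probs)[0] for word, freq in word_freqs.items())

word_freqs = {"undo": 1}  # hypothetical one-word corpus
score, tokens = best_segmentation("undo", log_probs)
base_loss = corpus_loss(word_freqs, log_probs)

# Loss increase when each multi-character token is dropped in turn.
increase = {}
for tok in token_counts:
    if len(tok) > 1:
        reduced = {t: lp for t, lp in log_probs.items() if t != tok}
        increase[tok] = corpus_loss(word_freqs, reduced) - base_loss

print(tokens)
print(increase)
```

Here `un` + `do` is more probable than the single token `undo`, so dropping `undo` or `nd` costs nothing and they are the first candidates for removal, while dropping `un` or `do` raises the loss.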


# Building a tokenizer piece by piece

![tokenizer_steps](./img/tokenizer_steps.png)

# Step 1: Acquire a corpus
Use the `wikitext-2` dataset.

In [12]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

# generator that yields batches of 1000 texts
def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

In [13]:
# # write to file, in case you're curious
# with open("wikitext-2.txt", "w", encoding="utf-8") as f:
#     for i in range(len(dataset)):
#         f.write(dataset[i]["text"] + "\n")
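The batching pattern used by `get_training_corpus()` can be tried on a plain list to see what it yields. `iter_batches` is a hypothetical helper written for this illustration; it mirrors the slicing logic without needing the dataset.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]

batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)
```

The last batch is simply shorter when the length is not a multiple of the batch size. A generator can only be consumed once, which is why the training cells call `get_training_corpus()` again each time.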

# 1. WordPiece tokenizer

Start by instantiating a `Tokenizer` object with a model, then set its `normalizer`, `pre_tokenizer`, `post_processor`, and `decoder` attributes to the values we want.

For this example, use a `Tokenizer` with a WordPiece model.

In [None]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# instantiate tokenizer with WordPiece model
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

### Add normalizer

In [None]:
# # Use BERT normalizer (off-the-shelf)
# tokenizer.normalizer = normalizers.BertNormalizer(
#     lowercase=True, 
#     strip_accents=True,
#     clean_text = True, # remove all control characters and replace repeating spaces with a single one
#     handle_chinese_chars = True # places spaces around Chinese characters
# )

# OR... Make a BERT-like normalizer from scratch
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), #  unicode NFD needed so StripAccents properly recognizes accents
    normalizers.Lowercase(), 
    normalizers.StripAccents()
])

# test it out
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
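Why the NFD step matters can be shown with the standard library alone. In this sketch, `strip_marks` is a hypothetical helper that drops combining marks, which is similar in spirit to `StripAccents` but is not its actual implementation.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "Héllò hôw are ü?" written with precomposed characters
text = "H\u00e9ll\u00f2 h\u00f4w are \u00fc?"

without_nfd = strip_marks(text).lower()
with_nfd = strip_marks(unicodedata.normalize("NFD", text)).lower()

print(without_nfd)  # accents survive: each accented letter is a single code point
print(with_nfd)     # hello how are u?
```

A precomposed `é` is one letter-class code point, so there is no separate mark to remove until NFD splits it into `e` plus a combining accent.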

### Add pre-tokenizer

In [None]:
# # Use the BERT pre-tokenizer (off-the-shelf)
# tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# # OR... use a BERT-like pre-tokenizer
# tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # splits on whitespace AND punctuation, keeping punctuation

# OR... compose low-level ones
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),  # splits on whitespace ONLY
    pre_tokenizers.Punctuation(),  # splits on punctuation, keeping punctuation
])

# test it out
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
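The idea of composing two simple stages can be sketched without the library. Both helper functions below are hypothetical, and treating ASCII `string.punctuation` as the punctuation set is an assumption (the library works with Unicode punctuation).

```python
import re
import string

def whitespace_split(text):
    """Stage 1: split on whitespace only, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]

PUNCT = re.escape(string.punctuation)
PUNCT_PATTERN = re.compile(f"[{PUNCT}]|[^{PUNCT}]+")

def punctuation_split(pieces):
    """Stage 2: isolate each punctuation mark inside the pieces from stage 1."""
    out = []
    for piece, (start, _) in pieces:
        for m in PUNCT_PATTERN.finditer(piece):
            out.append((m.group(), (start + m.start(), start + m.end())))
    return out

def pre_tokenize(text):
    return punctuation_split(whitespace_split(text))

pieces = pre_tokenize("Let's test my pre-tokenizer.")
print(pieces)
```

The second stage shifts its match positions by the start offset of the piece it is splitting, so the final offsets still refer to the full input string.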

### Train model

In [None]:
# Add special tokens because they are not in the training corpus
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# instantiate trainer
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# train model
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [30]:
# test tokenizer
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']


### Post-processing

In [38]:
encoding

Encoding(num_tokens=20, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

We don't yet have the post-processing step that adds the `[CLS]` token at the beginning and the `[SEP]` token at the end.

We will use `TemplateProcessing` for this, where we specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by `$A`, while the second sentence (if encoding a pair) is represented by `$B`. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.
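The template syntax is easier to read once it is expanded by hand. The function below is a hypothetical pure-Python imitation of how such a template could be applied; it is not the library's code and ignores token IDs entirely.

```python
def apply_template(template, a_tokens, b_tokens=None):
    """Expand a template such as "[CLS]:0 $A:0 [SEP]:0" into tokens and type IDs."""
    tokens, type_ids = [], []
    for item in template.split():
        name, type_id = item.rsplit(":", 1)  # text before the colon, type ID after it
        if name == "$A":
            pieces = a_tokens
        elif name == "$B":
            pieces = b_tokens
        else:
            pieces = [name]  # a literal special token such as [CLS]
        tokens.extend(pieces)
        type_ids.extend([int(type_id)] * len(pieces))
    return tokens, type_ids

single = apply_template("[CLS]:0 $A:0 [SEP]:0", ["on", "a"])
pair = apply_template("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", ["on", "a"], ["pair"])
print(single)
print(pair)
```

Each whitespace-separated item is either a placeholder for a whole sentence (`$A`, `$B`) or one special token, and the number after the colon is copied to every token that item produces.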

In [34]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

# Mimic the classic BERT template: `$A`/`$B` stand for the sentences,
# and the number after each colon is the token type ID
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
# test tokenizer
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

2 3
['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']


In [35]:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '.', '.', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


In [36]:
# add decoder
tokenizer.decoder = decoders.WordPiece(prefix="##")

# test it out
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer... on a pair of sentences."

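What the decoder does with the `##` prefix can be sketched in a few lines. `wordpiece_decode` is a hypothetical helper; unlike the library decoder, it does not clean up the spaces around punctuation.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Glue continuation pieces onto the previous word and join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)

decoded = wordpiece_decode(["let", "'", "s", "test", "this", "tok", "##eni", "##zer", "."])
print(decoded)
```

The lowercase `let ' s` in the output is expected: normalization and pre-tokenization discarded the original casing and spacing, so decoding cannot restore them.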
### Save tokenizer
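We can save the tokenizer as a single `.json` file, which lets us load it and rebuild it later.

To use it in Transformers with the full tokenizer API (padding, truncation, tensor outputs, ...), we have to wrap it in `PreTrainedTokenizerFast`. The `Tokenizer` object itself is already fast; the wrapper adds functionality, not speed.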

In [37]:


# save parameters to file
tokenizer.save("tokenizer.json")

# load from file
new_tokenizer = Tokenizer.from_file("tokenizer.json")

### Wrap it in a fast tokenizer

In [None]:
# for fast tokenizer
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

If you are using a specific tokenizer class (like `BertTokenizerFast`), you only need to specify the special tokens that differ from the defaults (here, none).

In [40]:
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

# 2. BPE Tokenizer
As used by GPT-2.

In [41]:
# instantiate tokenizer
tokenizer = Tokenizer(models.BPE())

# GPT-2 does not use a normalizer, so skip and go directly to the pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]

In [42]:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
# tokenizer.train(["wikitext-2.txt"], trainer=trainer) # or from text file

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']


The `trim_offsets=False` option tells the post-processor to leave the offsets of tokens that begin with `Ġ` as they are: the start offset then points to the space before the word rather than to its first character (the space is technically part of the token).

In [43]:
# apply byte-level post-processing
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
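The difference between trimmed and untrimmed offsets can be illustrated on the pre-tokenization output shown earlier. `trim_leading_space` is a hypothetical helper that only handles a leading `Ġ`; it is a simplification of what the post-processor would do with `trim_offsets=True`.

```python
def trim_leading_space(tokens, offsets, marker="Ġ"):
    """Move the start offset past the leading space of tokens that begin with the marker."""
    trimmed = []
    for token, (start, end) in zip(tokens, offsets):
        if token.startswith(marker) and end > start:
            start += 1
        trimmed.append((start, end))
    return trimmed

text = "Let's test pre-tokenization!"
tokens = ["Let", "'s", "Ġtest", "Ġpre"]
offsets = [(0, 3), (3, 5), (5, 10), (10, 14)]

untrimmed_spans = [text[start:end] for start, end in offsets]
trimmed_spans = [text[start:end] for start, end in trim_leading_space(tokens, offsets)]
print(untrimmed_spans)
print(trimmed_spans)
```

With untrimmed offsets the span for `Ġtest` is `" test"`, including the space; with trimmed offsets it is `"test"`.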

In [45]:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
encoding.tokens

['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']

In [47]:
# add a byte-level decoder
tokenizer.decoder = decoders.ByteLevel()

tokenizer.decode(encoding.ids)

"Let's test this tokenizer."

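The `Ġ` symbol comes from a byte-to-character table: every UTF-8 byte is mapped to a printable character, and the space byte happens to land on `Ġ`. The sketch below is modelled on the table published with GPT-2; the `tokenizers` implementation may differ in detail, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted above 255 so they become visible characters.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_byte_level(text):
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_byte_level(text):
    return bytes(CHAR_TO_BYTE[c] for c in text).decode("utf-8")

encoded = to_byte_level("Let's test")
print(encoded)
print(from_byte_level("".join(["Ġto", "ken", "izer"])))
```

Decoding is just the inverse lookup followed by UTF-8 decoding, which is why concatenating the tokens and decoding restores the original text, spaces included.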
In [48]:
# wrap it in a fast tokenizer
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

In [49]:
# or... if the special tokens are the same as the gpt2 defaults
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

# 3. Unigram tokenizer
As used by XLNet.

In [54]:
from tokenizers import Regex

# instantiate tokenizer
tokenizer = Tokenizer(models.Unigram())

# add normalizer
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

# add pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer!', (14, 29))]

In [55]:
# Add model, which needs training (XLNet has quite a few special tokens)
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
# tokenizer.train(["wikitext-2.txt"], trainer=trainer)

A very important argument not to forget for the UnigramTrainer is the `unk_token`. We can also pass along other arguments specific to the Unigram algorithm, such as the `shrinking_factor` for each step where we remove tokens (defaults to 0.75) or the `max_piece_length` to specify the maximum length of a given token (defaults to 16).

In [57]:
# add decoder
tokenizer.decoder = decoders.Metaspace()

# wrap it in a fast tokenizer
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

# or ...
# from transformers import XLNetTokenizerFast
# wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)

In [59]:
encoding = tokenizer.encode("Let's test this tokenizer.")
tokenizer.decode(encoding.ids)

"Let's test this tokenizer."