THE HUGGING FACE TOKENIZERS LIBRARY

Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It is deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

# We can check the training split to see which columns we have access to

raw_datasets["train"]

# OUTPUT:
# Dataset({
#    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 
#      'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 
#      'func_code_url'
#    ],
#    num_rows: 412178
#})

In [None]:
# We'll use the whole_func_string column to train our tokenizer. We can look at an example of one of these functions by indexing into the train split

print(raw_datasets["train"][123456]["whole_func_string"])

# Which should print the following:
#def handle_simple_responses(
#      self, timeout_ms=None, info_cb=DEFAULT_MESSAGE_CALLBACK):
#    """Accepts normal responses from the device.
#
#    Args:
#      timeout_ms: Timeout in milliseconds to wait for each response.
#      info_cb: Optional callback for text sent from the bootloader.
#
#    Returns:
#      OKAY packet's message.
#    """
#    return self._accept_responses('OKAY', info_cb, timeout_ms=timeout_ms)

First we need to transform the dataset into an iterator of lists of texts. Using lists of texts lets the tokenizer train faster (it works on batches of texts instead of one text at a time), and using an iterator avoids loading everything into memory at once.

In [None]:
# We define a function that returns a generator
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()

# Or, equivalently, written as a generator function:


def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
        
# These two functions do exactly the same thing, but the second one allows you to use more complex logic than
# you can in a generator expression
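
As a minimal, self-contained sketch of the same batching idea, the generator below yields lists of texts one batch at a time. The helper name `batch_iterator` and the small in-memory list `texts` are assumptions made for illustration (they stand in for the dataset). The sketch also shows that a generator can only be consumed once, so it has to be recreated before a second pass over the corpus.

```python
def batch_iterator(texts, batch_size):
    """Yield successive lists of at most `batch_size` texts."""
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]


# Stand-in corpus (hypothetical data, not the CodeSearchNet dataset)
texts = [f"def f{i}(): pass" for i in range(5)]

batches = batch_iterator(texts, batch_size=2)
first_pass = list(batches)   # batches of sizes 2, 2 and 1
second_pass = list(batches)  # empty: the generator is already exhausted

print([len(batch) for batch in first_pass])
print(second_pass)
```

This is why wrapping the generator in a function is convenient: calling the function again gives a fresh iterator.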

In [None]:
# To train a new tokenizer, first we need to load the tokenizer we want to pair with our model
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Starting from gpt2 avoids starting entirely from scratch. The only thing we will change is the vocabulary, which will be determined by the training on our corpus


In [None]:
#EXAMPLE

example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

#OUTPUT:
# ['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
# 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

#This is not efficient: each indentation space and the underscore get their own token. Let's train a new tokenizer and see if it solves these issues
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)



In [None]:
# Save the tokenizer. This will create a new folder named code-search-net-tokenizer, which will contain all the files the tokenizer
# needs to be reloaded.
tokenizer.save_pretrained("code-search-net-tokenizer")
# To share the tokenizer on the Hub, you first need to log in. If you are working in a notebook, there's a convenience function to help you with this

from huggingface_hub import notebook_login

notebook_login()

# This will display a widget where you can enter your Hugging Face login credentials.

# If you are not working in a notebook, type huggingface-cli login in your terminal. Once you've logged in, you can push
# your tokenizer by executing tokenizer.push_to_hub("code-search-net-tokenizer")
# This will create a new repository in your namespace with the name code-search-net-tokenizer, containing the tokenizer files


Slow tokenizers are those written in Python inside the Hugging Face Transformers library.
Fast tokenizers are the ones provided by Hugging Face Tokenizers, which are written in Rust.

The output of a tokenizer is a special BatchEncoding object. It's a subclass of a dictionary, but with additional methods that are mostly used by fast tokenizers.

The key functionality of fast tokenizers is that they always keep track of the original span of text each final token comes from (offset mapping).

EXAMPLE

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

WHAT A FAST TOKENIZER ENABLES US TO DO

- Access the tokens without having to convert the IDs back to tokens --> encoding.tokens()
- Get the index of the word each token comes from --> encoding.word_ids()
- Map a token to the sentence it came from --> encoding.sequence_ids()
- Map any word or token to characters in the original text and vice versa --> word_to_chars/token_to_chars/char_to_word/char_to_token()
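
To make the idea of an offset mapping concrete, here is a minimal sketch using a toy regex-based splitter. It is not the Rust implementation behind fast tokenizers, and the function name `tokenize_with_offsets` is an assumption made for illustration. The point is only that each token carries the (start, end) character span it came from, so slicing the original text with that span gives back the token.

```python
import re


def tokenize_with_offsets(text):
    """Split on words and punctuation, keeping (start, end) character spans."""
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group())
        offsets.append(match.span())
    return tokens, offsets


text = "My name is Sylvain."
tokens, offsets = tokenize_with_offsets(text)

# Each span slices the original text back to exactly the token it came from
for token, (start, end) in zip(tokens, offsets):
    assert text[start:end] == token

print(tokens)
print(offsets)
```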

In [None]:
#Getting the base results with the pipeline

from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

#OUTPUT:
# [{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
#{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
#{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
#{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
#{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
#{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
#{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
#{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

In [None]:
# We can also ask the pipeline to group together the tokens that correspond to the same entity
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

#OUTPUT:
# [{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
#{'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
#{'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

With aggregation_strategy="simple", the score of each entity is just the mean of the scores of each token in that entity.
Other strategies are available:
- "first": the score of each entity is the score of the first token of that entity
- "max": the score of each entity is the maximum score of the tokens in that entity
- "average": the score of each entity is the average of the scores of the words composing that entity
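
As a small worked example, the token-level strategies can be reproduced by hand for the entity "Sylvain", using the four token scores from the ungrouped pipeline output above. This is only a sketch of the arithmetic, not the pipeline's own code. The "average" strategy is left out because it first needs the tokens grouped into words.

```python
# Token scores for "S", "##yl", "##va", "##in", copied from the ungrouped output
token_scores = [0.9993828, 0.99815476, 0.99590725, 0.9992327]

simple_score = sum(token_scores) / len(token_scores)  # mean over the tokens
first_score = token_scores[0]                         # score of the first token
max_score = max(token_scores)                         # highest token score

print(simple_score, first_score, max_score)
```

The mean agrees, up to rounding, with the 0.9981694 reported for 'Sylvain' in the grouped output.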

In [None]:
# To obtain these results without the pipeline() function, we first have to tokenize our input and pass it through the model
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

print(inputs["input_ids"].shape)
print(outputs.logits.shape)

#OUTPUT:
# torch.Size([1, 19])
# torch.Size([1, 19, 9])

We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9. Like for the text classification pipeline, we use a softmax function to convert those logits to probabilities, and we take the argmax to get predictions (note that we can take the argmax on the logits because the softmax does not change the order):
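
The claim that the argmax can be taken directly on the logits can be checked with a short pure-Python sketch. The logits below are hypothetical values for a single token over 9 labels, and `softmax` and `argmax` are small helpers written for this illustration rather than the torch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for one token over 9 labels
logits = [0.3, -1.2, 0.1, -0.5, 4.8, 0.0, 1.1, -2.0, 0.7]
probs = softmax(logits)

# Softmax is monotonically increasing, so the largest logit stays the largest probability
print(argmax(logits), argmax(probs))
```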

In [None]:
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

#OUTPUT: [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]

# The model.config.id2label attribute contains the mapping of indexes to labels that we can use to make sense of the predictions

model.config.id2label 

#OUTPUT: {0: 'O',
# 1: 'B-MISC',
# 2: 'I-MISC',
# 3: 'B-PER',
# 4: 'I-PER',
# 5: 'B-ORG',
# 6: 'I-ORG',
# 7: 'B-LOC',
# 8: 'I-LOC'}

In [None]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "index": idx,
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)


# OUTPUT:
#[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
# {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
# {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
# {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
# {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
# {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
# {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
# {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

To write the code that post-processes the predictions while grouping entities, we will group together entities that are consecutive and labeled with I-XXX, except for the first one, which can be labeled as B-XXX or I-XXX (so, we stop grouping an entity when we get an O, a new type of entity, or a B-XXX that tells us an entity of the same type is starting):

In [None]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, end = offsets[idx]

        # The first token of an entity can be labeled B-label or I-label
        all_scores = [probabilities[idx][pred]]
        idx += 1

        # Grab all the following tokens labeled with I-label
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][predictions[idx]])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    else:
        # Only advance here: after an entity, idx already points at the next unprocessed token
        idx += 1

print(results)

# OUTPUT:
# [{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
# {'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
# {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

FAST TOKENIZERS IN THE QA PIPELINE
LONG CONTEXTS

The QA pipeline allows us to split the context into smaller chunks, specifying the maximum length. It also includes some overlap between the chunks, so that an answer cut at the end of one chunk still appears whole in the next one.
We have the tokenizer do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument.

***
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
***
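
The chunking logic itself can be sketched in plain Python as a sliding window. The helper `split_with_stride` is hypothetical and works on a list of words rather than on real subword tokens. It uses max_length=4 instead of 6, on the assumption that two of the six positions in the example above are taken by special tokens such as [CLS] and [SEP].

```python
def split_with_stride(tokens, max_length, stride):
    """Split `tokens` into chunks of at most `max_length` items,
    where consecutive chunks share `stride` items."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must be non-negative and smaller than max_length")
    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + max_length])
        if start + max_length >= len(tokens):
            break  # this chunk already reaches the end of the input
        start += step
    return chunks


words = "This sentence is not too long but we are going to split it anyway .".split()
chunks = split_with_stride(words, max_length=4, stride=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts `max_length - stride` positions after the previous one, which is what makes the last `stride` items of one chunk reappear at the start of the next.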

We mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens.

***
sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

# Now we can use the softmax to convert our logits to probabilities

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

# We attribute a score to all possible spans of answer, then take the span with the best score

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

# The model found two candidates corresponding to the best answers in each chunk

# Now we have to map those two token spans to spans of characters in the context

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
***
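
The span scoring step above can also be written without tensors, which makes the role of torch.triu easier to see: it keeps only the spans where the start does not come after the end. The sketch below uses hypothetical probabilities over four token positions, and the helper name `best_span` is an assumption made for illustration.

```python
def best_span(start_probs, end_probs):
    """Return (start, end, score) maximizing start_probs[start] * end_probs[end]
    subject to start <= end."""
    best = (0, 0, start_probs[0] * end_probs[0])
    for start, p_start in enumerate(start_probs):
        # Only consider ends at or after the start (the upper triangle of the score matrix)
        for end in range(start, len(end_probs)):
            score = p_start * end_probs[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical probabilities over 4 token positions
start_probs = [0.1, 0.6, 0.2, 0.1]
end_probs = [0.5, 0.1, 0.3, 0.1]

start_idx, end_idx, score = best_span(start_probs, end_probs)
print(start_idx, end_idx, score)
```

Without the start <= end constraint, the highest product here would pair start 1 with end 0, which is not a valid answer span.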


The pre-tokenization step splits the text into small entities (typically words) before the tokenization model is applied.
SentencePiece is a tokenization algorithm for the preprocessing of text that considers the text as a sequence of Unicode characters and replaces spaces with a special character, ▁. Used in conjunction with the Unigram algorithm, it doesn't require a pre-tokenization step.

BPE:
Training--> Starts from a small vocabulary and learns rules to merge tokens
Training step--> Merges the tokens corresponding to the most common pair
Learns--> Merge rules and a vocabulary
Encoding--> Splits a word into characters and applies the merges learned during training

WordPiece:
Training--> Starts from a small vocabulary and learns rules to merge tokens
Training step--> Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent
Learns--> Just a vocabulary
Encoding--> Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word

Unigram:
Training--> Starts from a large vocabulary and learns rules to remove tokens
Training step--> Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus
Learns--> A vocabulary with a score for each token
Encoding--> Finds the most likely split into tokens, using the scores learned during training
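
Unigram encoding, as summarized above, can be sketched as a Viterbi search over all ways of cutting a word into vocabulary tokens. The token probabilities below are hypothetical values chosen for illustration, not learned from a corpus, and `most_likely_split` is a name made up for this sketch.

```python
import math


def most_likely_split(word, log_probs):
    """Viterbi search for the segmentation of `word` with the highest total
    log-probability. Returns (tokens, score), or (None, -inf) if no split exists."""
    # best[i] = (best score for word[:i], start index of the last token)
    best = [(-math.inf, None)] * (len(word) + 1)
    best[0] = (0.0, None)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None, -math.inf
    # Walk back through the stored start indices to recover the tokens
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[-1][0]


# Hypothetical token probabilities, not learned from any corpus
probs = {"h": 0.05, "u": 0.05, "g": 0.1, "s": 0.1, "hu": 0.1, "ug": 0.15, "hug": 0.2}
log_probs = {token: math.log(p) for token, p in probs.items()}

tokens, score = most_likely_split("hugs", log_probs)
print(tokens, score)
```

Working in log-probabilities turns the product of token probabilities into a sum, which avoids numerical underflow on long words.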

BPE tokenization algorithm (applied to new inputs):
    - Normalization
    - Pre-tokenization
    - Splitting the words into individual characters
    - Applying the merge rules learned in order on those splits

In [None]:
#BPE Tokenization
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

# Pre-tokenize corpus into words
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

#Compute the frequencies of each word in the corpus
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

#Compute the base vocabulary (all characters used in the corpus)
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)

#Also add the special tokens used by the model at the beginning of that vocabulary
vocab = ["<|endoftext|>"] + alphabet.copy()

#Split each word into individual characters to start training
splits = {word: [c for c in word] for word in word_freqs.keys()}

#Compute the frequency of each pair
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs

#Find the most frequent pair
pair_freqs = compute_pair_freqs(splits)
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

#The answer to this is ('Ġ', 't') 7, so we need to learn the merge ('Ġ', 't') -> 'Ġt' and add it to the vocabulary
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")

#Apply that merge in our splits dictionary
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

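# Apply the first merge learned above, so that the loop below does not learn it a second time
splits = merge_pair("Ġ", "t", splits)

# Now we have everything we need to loop until we've learned all the merges we want. We'll aim for a vocab size of 50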
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])
    
    
#To tokenize a new text, we pre-tokenize it, split it, and then apply all the merge rules learned
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    return sum(splits, [])


In [None]:
#WordPiece tokenization: it only saves the final vocabulary, not the merge rules learned.
# If part of a word cannot be matched against the vocabulary, the whole word is tokenized as [UNK]

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

#Pre-tokenize corpus
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

#Compute frequencies of each word
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

#The alphabet is the unique set composed of all the first letters of words, and all the other letters that appear in words prefixed by ##
alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
alphabet

print(alphabet)

#Add the special tokens 
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

#Split each word 
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

#Compute the score of each pair
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

#Find the pair with the best score
pair_scores = compute_pair_scores(splits)
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

#The first merge to learn is ('a', '##b') -> 'ab', so we add 'ab' to the vocabulary
vocab.append("ab")

#Apply that merge in our splits dictionary
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

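#Apply the first merge learned above, so that the loop below does not learn it a second time
splits = merge_pair("a", "##b", splits)

#We have everything we need. Aim for a vocab size of 70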
vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

#To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word

def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens
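
To see the longest-match-first encoding and the [UNK] behaviour in isolation, here is a self-contained variant of the same logic that takes the vocabulary as an argument. The function name `encode_word_with_vocab` and the tiny vocabulary are hypothetical, chosen only for illustration.

```python
def encode_word_with_vocab(word, vocab):
    """Greedy longest-match-first encoding of a single word (WordPiece style)."""
    tokens = []
    while word:
        end = len(word)
        # Shrink the candidate prefix until it is found in the vocabulary
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            # No prefix matches, so the whole word becomes unknown
            return ["[UNK]"]
        tokens.append(word[:end])
        word = word[end:]
        if word:
            # The remainder is a word continuation, so it gets the ## prefix
            word = f"##{word}"
    return tokens


# Hypothetical vocabulary, chosen only for illustration
toy_vocab = {"Hu", "H", "##g", "##gi", "##n", "##u", "##i"}

print(encode_word_with_vocab("Hugging", toy_vocab))
print(encode_word_with_vocab("HOgging", toy_vocab))
```

In the second call, "H" is matched but no entry covers "##O", so the partial result is discarded and the whole word is returned as [UNK].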
