# Role of Tokenizers
---

Tokenization is the process of breaking text into smaller units (tokens) such as words, subwords, or characters. It is a crucial step in preparing text for models, because a model operates on token IDs rather than raw strings. Subword algorithms such as BPE, WordPiece, and Unigram each use a different strategy to build a vocabulary and split text.
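As a quick point of reference before the subword algorithms, the two extremes (word-level and character-level splitting) can be sketched with plain Python. This is only an illustration of granularity; the variable names are made up for this example.

```python
text = "walker walked a long walk"

# Word-level: split on whitespace
word_tokens = text.split()

# Character-level: every non-space character is its own token
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)
print(len(word_tokens), "words vs", len(char_tokens), "characters")
```

Subword tokenizers sit between these two: they keep frequent words whole and fall back to smaller pieces for rare ones.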

In [1]:
import warnings
warnings.filterwarnings('ignore')

training_data = [
    "walker walked a long walk",
]

## Byte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that starts from individual characters and iteratively merges the most frequent pair of adjacent symbols until the vocabulary reaches a target size. It reduces the number of unknown tokens while keeping the vocabulary size manageable.
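The merge loop described above can be sketched in plain Python. This is a toy version for intuition, not the `tokenizers` implementation: `train_bpe` is a hypothetical helper, and its tie-breaking rule (first pair seen wins) is an assumption, so intermediate merges may differ from what the library learns.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: return the learned merges in the order they were made."""
    # Each word starts as a tuple of single characters, weighted by frequency
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across all words
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; on ties, max() keeps the first pair seen
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = train_bpe(["walker walked a long walk"], num_merges=4)
print(merges)
```

On this corpus the sketch ends up building `walk` and then `walke`, the same long tokens the trained tokenizer produces further down.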

In [2]:
## Byte Pair Encoding - BPE

from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace

bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()

bpe_trainer = BpeTrainer(vocab_size=14)

bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)

In [16]:
"""
Display the learned vocabulary
"""
bpe_tokenizer.get_vocab()

{'r': 8,
 'k': 4,
 'wal': 11,
 'walk': 12,
 'w': 9,
 'walke': 13,
 'al': 10,
 'n': 6,
 'l': 5,
 'd': 1,
 'o': 7,
 'a': 0,
 'e': 2,
 'g': 3}

In [17]:
"""
This cell demonstrates how the BPE tokenizer splits and encodes text based on the learned vocabulary.
"""
bpe_tokenizer.encode("walker walked a long walk").tokens

['walke', 'r', 'walke', 'd', 'a', 'l', 'o', 'n', 'g', 'walk']

In [5]:
bpe_tokenizer.encode("wlk").ids

[9, 5, 4]

In [6]:
bpe_tokenizer.encode("wlk").tokens

['w', 'l', 'k']

## Unknown Tokens in BPE

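BPE does not rely on a fixed dictionary of words, so it can split an unseen word into smaller subword units it already knows. Characters that never appeared in the training data are a different matter: this tokenizer was built without an unknown token, so the `s` and `h` of `she` are silently dropped in the output below.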
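What happens to characters outside the training alphabet can be simulated in a few lines. This is a simplified sketch of the idea, not the library's code: `map_characters` is a hypothetical helper, and the alphabet is rebuilt from the training sentence by hand.

```python
def map_characters(word, alphabet, unk_token=None):
    """Keep characters found in the alphabet; replace or drop the rest."""
    mapped = []
    for ch in word:
        if ch in alphabet:
            mapped.append(ch)
        elif unk_token is not None:
            mapped.append(unk_token)
        # With no unk_token, the unseen character is silently dropped
    return mapped

# Every character seen in the training sentence, minus the space
alphabet = set("walker walked a long walk") - {" "}

dropped = map_characters("she", alphabet)
replaced = map_characters("she", alphabet, unk_token="[UNK]")
print(dropped, replaced)
```

Dropping characters loses information without any signal, which is why an explicit unknown token is usually the safer choice.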

In [7]:
bpe_tokenizer.encode("she walked").tokens

['e', 'walke', 'd']

## WordPiece Tokenization

WordPiece is similar to BPE, but instead of merging the most frequent pair it picks the merge that most increases the likelihood of the training data. It is widely used in models like BERT.
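One commonly described way to rank WordPiece merges is to score each adjacent pair as `freq(pair) / (freq(first) * freq(second))`, which favors pairs whose parts rarely occur apart. The notebook does not state this formula, so treat it as an assumption; `wordpiece_pair_scores` is a hypothetical helper that computes the score for the initial character-level split only.

```python
from collections import Counter

def wordpiece_pair_scores(corpus):
    """Score adjacent pairs as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for line in corpus:
        for word in line.split():
            # The first character is kept as-is; the rest get the '##' prefix
            symbols = [word[0]] + ["##" + ch for ch in word[1:]]
            symbol_freq.update(symbols)
            pair_freq.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

scores = wordpiece_pair_scores(["walker walked a long walk"])
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Under this score the rare pair `('l', '##o')` outranks the frequent pair `('w', '##a')`, which is the opposite of what plain frequency counting in BPE would choose.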

In [8]:
## WordPiece

from real_wordpiece.trainer import RealWordPieceTrainer
from tokenizers.models import WordPiece

real_wordpiece_tokenizer = Tokenizer(WordPiece())
real_wordpiece_tokenizer.pre_tokenizer = Whitespace()

real_wordpiece_trainer = RealWordPieceTrainer(
    vocab_size=27,
)

In [9]:
real_wordpiece_trainer.train_tokenizer(
    training_data, real_wordpiece_tokenizer
)
real_wordpiece_tokenizer.get_vocab()

{'wa': 24,
 '##g': 11,
 'w': 0,
 'g': 18,
 '##ng': 20,
 '##k': 3,
 'o': 16,
 'e': 13,
 'walk': 26,
 'l': 6,
 '##n': 10,
 'a': 5,
 '##ed': 23,
 '##l': 2,
 'd': 15,
 'r': 14,
 'n': 17,
 'lo': 19,
 '##o': 9,
 'long': 21,
 '##lk': 25,
 '##r': 7,
 '##a': 1,
 '##e': 4,
 '##d': 8,
 'k': 12,
 '##er': 22}

## Tokenized Output for WordPiece

WordPiece uses subword tokens to represent words while reducing unknown tokens. The prefix ```##``` marks a piece that continues the current word rather than starting a new one.
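Because the prefix carries the word-boundary information, pieces can be glued back into words without any other metadata. A minimal sketch, where `merge_wordpieces` is a hypothetical helper and not part of the `tokenizers` API:

```python
def merge_wordpieces(tokens, prefix="##"):
    """Rebuild whole words from WordPiece tokens."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: append to the word being built
            words[-1] += token[len(prefix):]
        else:
            # A token without the prefix starts a new word
            words.append(token)
    return words

words = merge_wordpieces(["walk", "##er", "walk", "##ed", "a", "long", "walk"])
print(words)
```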

In [10]:
real_wordpiece_tokenizer.encode("walker walked a long walk").tokens

['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']

In [18]:
real_wordpiece_tokenizer.encode("wlk").tokens

['w', '##lk']

## Handling Unknown Tokens in WordPiece and HuggingFace WordPiece

The ```real_wordpiece``` tokenizer trained above splits unknown words into subword pieces, but it raises an error when a word cannot be built from the vocabulary and no ```[UNK]``` token is available. The implementation with HuggingFace WordPiece adds special tokens like ```[UNK]``` to handle out-of-vocabulary words more gracefully.
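The failure mode and its fix can be reproduced with a greedy longest-match-first encoder, which is how WordPiece encoding is usually described. This is a sketch under that assumption, not the library's code: `wordpiece_encode` is a hypothetical helper and `vocab` is a small hand-picked subset of the vocabularies shown in this notebook.

```python
def wordpiece_encode(word, vocab, unk_token=None, prefix="##"):
    """Greedy longest-match-first encoding of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matches: the whole word becomes the unknown token
            if unk_token is None or unk_token not in vocab:
                raise ValueError("Missing unknown token in the vocabulary")
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"w", "walk", "walked", "##lk", "##e", "##d", "e", "[UNK]"}

print(wordpiece_encode("wlk", vocab, unk_token="[UNK]"))
print(wordpiece_encode("she", vocab, unk_token="[UNK]"))
```

Calling the same function without `unk_token` on `"she"` raises `ValueError`, mirroring the exception in the next cell.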

In [23]:
real_wordpiece_tokenizer.encode("she walked").tokens

Exception: WordPiece error: Missing [UNK] token from the vocabulary

In [24]:
## HuggingFace WordPiece and special tokens

from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"

wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token]
)

In [25]:
wordpiece_tokenizer.train_from_iterator(
    training_data, 
    wordpiece_trainer
)
wordpiece_tokenizer.get_vocab()

{'e': 3,
 'o': 8,
 'd': 2,
 '##l': 12,
 '##n': 18,
 'n': 7,
 '##lk': 21,
 'l': 6,
 '##ng': 25,
 'a': 1,
 '##g': 19,
 '##o': 17,
 'w': 10,
 'walke': 23,
 'r': 9,
 '##r': 15,
 'walked': 27,
 '##a': 11,
 'lo': 24,
 '##k': 13,
 'walk': 22,
 'g': 4,
 'walker': 26,
 '##e': 14,
 '##d': 16,
 '[UNK]': 0,
 'k': 5,
 'wa': 20}

In [26]:
wordpiece_tokenizer.encode("walker walked a long walk").tokens

['walker', 'walked', 'a', 'lo', '##ng', 'walk']

In [27]:
wordpiece_tokenizer.encode("wlk").tokens

['w', '##lk']

In [28]:
wordpiece_tokenizer.encode("she walked").tokens

['[UNK]', 'walked']

## Unigram Tokenization

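Unigram tokenization starts from a large candidate vocabulary and repeatedly prunes the subwords whose removal least reduces the likelihood of the training data, until the target vocabulary size is reached. At encoding time it picks the most probable segmentation of each word, balancing efficiency and coverage.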
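Finding the most probable segmentation under a unigram model is typically done with Viterbi-style dynamic programming. The sketch below shows that step only (not vocabulary training); `viterbi_segment` is a hypothetical helper and the token probabilities are invented for illustration, not taken from the trained model.

```python
import math

def viterbi_segment(word, log_probs):
    """Return the highest-probability segmentation of word, or None."""
    n = len(word)
    # best[i] = (score of the best split of word[:i], start of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # some character is not covered by the vocabulary
    # Walk the back-pointers from the end of the word to recover the tokens
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Hypothetical probabilities, chosen only for this example
probs = {"walke": 0.2, "walk": 0.15, "w": 0.05, "a": 0.1, "l": 0.1,
         "k": 0.1, "e": 0.1, "r": 0.1, "d": 0.1}
log_probs = {token: math.log(p) for token, p in probs.items()}

print(viterbi_segment("walker", log_probs))
print(viterbi_segment("wlk", log_probs))
```

With these made-up probabilities, `walke` + `r` beats `walk` + `e` + `r` because a single likely token outweighs two less likely ones.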

In [29]:
## Unigram

from tokenizers.trainers import UnigramTrainer
from tokenizers.models import Unigram

unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
    vocab_size=14, 
    special_tokens=[unk_token],
    unk_token=unk_token,
)

unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()

{'walke': 1,
 'e': 7,
 'n': 11,
 'o': 9,
 'd': 8,
 'k': 2,
 'l': 5,
 'g': 12,
 '[UNK]': 0,
 'walk': 4,
 'a': 6,
 'w': 3,
 'r': 10}

In [30]:
unigram_tokenizer.encode("walker walked a long walk").tokens

['walke', 'r', 'walke', 'd', 'a', 'l', 'o', 'n', 'g', 'walk']

In [31]:
unigram_tokenizer.encode("wlk").tokens

['w', 'l', 'k']

In [32]:
unigram_tokenizer.encode("she walked").tokens

['sh', 'e', 'walke', 'd']

In [33]:
unigram_tokenizer.encode("she walked").ids

[0, 7, 1, 8]

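This notebook demonstrated four tokenization techniques: Byte Pair Encoding (BPE), WordPiece, HuggingFace WordPiece with special tokens, and Unigram tokenization. They differ in how the vocabulary is built and in how unseen characters are handled: BPE without an unknown token dropped them, WordPiece without ```[UNK]``` raised an error, and the HuggingFace WordPiece and Unigram tokenizers mapped them to ```[UNK]```.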

**Well Done!**