# IR Lab Tutorial: Tokenization

This tutorial shows how to use a tokenizer.
Tokenization turns a text into a token sequence (token $\approx$ word).

**Attention:** The scenario below is cherry-picked to explain the concept of Tokenization with a minimal example.


## Preparation: Install dependencies

In [1]:
# This is needed in both Google Colab and in a dev container
!pip3 install -q nltk transformers sentencepiece


In [2]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Our Scenario

We want to build a search engine and need to split text into tokens in both the text analysis step and the query analysis step.

In [3]:
sentence_en = "At eight o'clock on Thursday morning Arthur didn't feel very good."
sentence_de = "Donnerstag morgens um acht Uhr fühlte sich Arthur nicht so gut."
spaces = ">   <"

# Word Tokenization

## Naive Approach

A naive approach to word tokenization is to split the text at whitespace characters.

However, this approach does not handle punctuation properly. 
For example, the sentence `"Hello, world!"` would be tokenized into `["Hello,", "world!"]`, which includes punctuation marks attached to the words.

In [4]:
print(sentence_en.split())
print(sentence_de.split())
print(spaces.split())

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', "didn't", 'feel', 'very', 'good.']
['Donnerstag', 'morgens', 'um', 'acht', 'Uhr', 'fühlte', 'sich', 'Arthur', 'nicht', 'so', 'gut.']
['>', '<']
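A small step beyond whitespace splitting is a regular expression that treats each punctuation mark as its own token. The following is only a minimal sketch (the helper name `regex_tokenize` is our own, not part of any library); it fixes the `"Hello, world!"` case but still mishandles contractions:

```python
import re

def regex_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
print(regex_tokenize("Arthur didn't feel very good."))
# ['Arthur', 'didn', "'", 't', 'feel', 'very', 'good', '.']
```

Note how `didn't` falls apart into three pieces, which is one reason to prefer a dedicated word tokenizer.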


## NLTK's Word Tokenizer

A more advanced approach is to use [NLTK's word tokenizer](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize), which separates punctuation and special characters from the surrounding words.
The tokenizer `word_tokenize` is a wrapper for the `Punkt` sentence tokenizer and the `Treebank` word tokenizer.

```word_tokenize() --> Punkt Sentence Tokenizer --> Treebank Word Tokenizer```

- The [`Punkt`](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) [sentence tokenizer](https://www.nltk.org/_modules/nltk/tokenize.html#sent_tokenize) uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The NLTK data package includes a pre-trained Punkt tokenizer for English. [[Reference](https://www.nltk.org/api/nltk.tokenize.punkt.html)]
- The [`Treebank`](https://www.nltk.org/_modules/nltk/tokenize/treebank.html) word tokenizer uses regular expressions to tokenize text as done in the [Penn Treebank](https://aclanthology.org/J93-2004.pdf). [[Reference](https://www.nltk.org/api/nltk.tokenize.treebank.html)]

- Find more sample usages of NLTK's word tokenizers in the [documentation](https://www.nltk.org/_modules/nltk/tokenize/punkt.html).

We downloaded the `punkt_tab` resource with `nltk.download('punkt_tab')` above for this purpose.

In [5]:
print(nltk.word_tokenize(sentence_en))
print(nltk.word_tokenize(sentence_de))
print(nltk.word_tokenize(spaces))

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
['Donnerstag', 'morgens', 'um', 'acht', 'Uhr', 'fühlte', 'sich', 'Arthur', 'nicht', 'so', 'gut', '.']
['>', '<']
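To make the two-stage idea (sentence splitting first, then word splitting with Treebank-style contraction handling) more tangible, here is a toy sketch in plain Python. It is not NLTK's implementation: `split_sentences` is a naive stand-in for Punkt, `split_words` covers only the `n't` rule, and both names are hypothetical.

```python
import re

def split_sentences(text):
    # Naive stand-in for Punkt: split after ., ! or ? when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Treebank-style convention: detach the clitic n't (didn't -> did n't)
    sentence = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", sentence)
    # Then match the clitic, words (keeping inner apostrophes), or single punctuation marks
    return re.findall(r"n't|\w+(?:'\w+)*|[^\w\s]", sentence)

def toy_word_tokenize(text):
    return [tok for sent in split_sentences(text) for tok in split_words(sent)]

print(toy_word_tokenize("At eight o'clock Arthur didn't feel very good. He slept."))
# ['At', 'eight', "o'clock", 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.', 'He', 'slept', '.']
```

A real sentence tokenizer has to cope with abbreviations such as `Dr.` or `e.g.`, which is exactly what the trained Punkt model is for.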


## Byte Pair Encoding (BPE)

BPE ([Sennrich et al.](https://aclanthology.org/P16-1162/)) is a **subword tokenization algorithm** used by many Transformer models, such as GPT, GPT-2, RoBERTa, BART, and DeBERTa [[Reference](https://huggingface.co/learn/llm-course/chapter6/5)].
It allows tokenizers to handle out-of-vocabulary words by splitting words into smaller subword units.

The following example illustrates how training BPE works [[Reference](https://huggingface.co/learn/llm-course/chapter6/5)]:

Assume our corpus contains the words: `hug`, `pug`, `pun`, `bun`, `hugs` with the following frequencies:
`("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)`.

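1. First, the vocabulary is initialized with the characters from the corpus, and each word is represented as a sequence of characters. (The original paper additionally appends a special end-of-word symbol `</w>` to each word; it is left out of this example to keep it short.)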

   ```
   base vocabulary: ["b", "g", "h", "n", "p", "s", "u"]
   ```

2. Count all symbol pairs and find the most frequent pair of consecutive symbols across the corpus.

   ```
   Pairs in corpus: ("h" "u", 15), ("u" "g", 20), ("p" "u", 17), ("u" "n", 16), ("b" "u", 4), ("g" "s", 5)
   Most frequent pair: ("u", "g")
   ```

3. Merge the most frequent pair and replace all occurrences of the most frequent pair with a new symbol. [[Reference](https://arxiv.org/abs/1508.07909)]  

    The first merge rule learned by the tokenizer is `("u", "g") -> "ug"`, which means that `"ug"` will be added to the vocabulary, and the pair should be merged in all the words of the corpus.
    At the end of this stage, the vocabulary and corpus look like this:

   ```
   Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
   Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
   ```

4. Continue merging the most frequent pairs.

   ```
   Pairs in corpus: ("h" "ug", 15), ("p" "ug", 5), ("p" "u", 12), ("u" "n", 16), ("b" "u", 4), ("ug" "s", 5)
   Most frequent pair: ("u", "n")
   ```
   The second merge rule learned by the tokenizer is `("u", "n") -> "un"`, which means that `"un"` will be added to the vocabulary, and the pair should be merged in all the words of the corpus.
   At the end of this stage, the vocabulary and corpus look like this:
   ```
   Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
   Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
   ```

   Let's look at one more merge step:
   ```
   Pairs in corpus: ("h" "ug", 15), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("ug" "s", 5)
   Most frequent pair: ("h", "ug")
   ```
   The next merge rule learned by the tokenizer is `("h", "ug") -> "hug"`, which means that `"hug"` will be added to the vocabulary, and the pair should be merged in all the words of the corpus.
   At the end of this stage, the vocabulary and corpus look like this:
   ```
   Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
   Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
   ```

   Continue like this until we reach the desired vocabulary size.
5. The final vocabulary contains **frequent subword units**, which allows encoding unseen words as sequences of subwords.
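
The training steps above can be sketched in a few lines of Python. This is a minimal illustration on the toy corpus, not a production implementation; the function names (`count_pairs`, `merge_pair`, `train_bpe`) are our own, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from single characters; the base vocabulary is the sorted alphabet
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
        vocab.append("".join(best))
    return vocab, merges, corpus

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, corpus = train_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
```

In practice, `num_merges` is chosen so that the vocabulary reaches the desired size.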


Have a look at the [re-implementation of BPE](https://huggingface.co/learn/llm-course/chapter6/5#implementing-bpe)!
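
Once the merge rules are learned, a new word is tokenized by splitting it into characters and replaying the merges in the order they were learned. The sketch below hard-codes the three merges from the example; `bpe_encode` and the `[UNK]` placeholder for characters outside the base vocabulary are assumptions made for illustration.

```python
def bpe_encode(word, merges, base_vocab, unk="[UNK]"):
    # Characters that never occurred in the training corpus become the unknown token
    symbols = [ch if ch in base_vocab else unk for ch in word]
    for left, right in merges:  # replay merges in training order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

print(bpe_encode("hugs", merges, base_vocab))  # ['hug', 's']
print(bpe_encode("bug", merges, base_vocab))   # ['b', 'ug']
print(bpe_encode("mug", merges, base_vocab))   # ['[UNK]', 'ug']
```

The word `bug` never appeared in the corpus, yet it is still covered by known subwords; only the unseen character `m` falls back to the unknown token.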


In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
print(tokenizer.tokenize(sentence_en))
print(tokenizer.tokenize(sentence_de))
print(tokenizer.tokenize(spaces))
print(tokenizer.tokenize("... :-( :( !!!!!!!!!!\nd"))

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


['At', 'Ġeight', 'Ġo', "'", 'clock', 'Ġon', 'ĠThursday', 'Ġmorning', 'ĠArthur', 'Ġdidn', "'t", 'Ġfeel', 'Ġvery', 'Ġgood', '.']
['Don', 'ner', 'st', 'ag', 'Ġmor', 'g', 'ens', 'Ġum', 'Ġa', 'cht', 'ĠU', 'hr', 'Ġf', 'Ã¼', 'hl', 'te', 'Ġs', 'ich', 'ĠArthur', 'Ġn', 'icht', 'Ġso', 'Ġgut', '.']
['>', 'Ġ', 'Ġ', 'Ġ<']
['...', 'Ġ:', '-(', 'Ġ:(', 'Ġ', '!!!!!!!!', '!!', 'Ċ', 'd']
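
The odd-looking symbols in this output (`Ġ`, `Ċ`, `Ã¼`) are not encoding errors. GPT-2 runs BPE on UTF-8 bytes and displays each byte as a printable character. The sketch below rebuilds that byte-to-character table as we understand it from the GPT-2 reference code; treat it as an illustration rather than the authoritative implementation. The helper `show` is our own.

```python
def bytes_to_unicode():
    # Bytes that are already printable, non-space characters keep their code point
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Space, control characters, etc. are shifted above U+00FF
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_to_char = bytes_to_unicode()

def show(text):
    # Render the UTF-8 bytes of `text` the way a byte-level BPE vocabulary displays them
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(show(" gut"))  # Ġgut
print(show("\n"))    # Ċ
print(show("ü"))     # Ã¼
```

So `Ġ` marks a leading space, `Ċ` a newline, and `Ã¼` is the two-byte UTF-8 encoding of `ü`.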


## WordPiece

WordPiece is used in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNET.
It’s very similar to BPE in terms of the training, but the actual tokenization is done differently [[Reference](https://huggingface.co/learn/llm-course/chapter6/6)].

Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. 
It identifies subwords by adding a prefix (like `##` for BERT): each word is initially split into characters, and the prefix is added to every character except the first.
So, for instance, "pun" gets split like this:
```
["p", "##u", "##n"]
```

The main difference between BPE and WordPiece is in the way they select pairs to be merged during training.
Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula: 
$$\mathrm{score}(pair) = \frac{\mathrm{count}(pair)}{\mathrm{count}(first\_subword) \times \mathrm{count}(second\_subword)}$$
By dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary. 

Assume our corpus contains the words: `hug`, `pug`, `pun`, `bun`, `hugs` with the following frequencies:
`("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)`.

Which results in the following corpus representation at the beginning of the training:
```
("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)
```
and the initial vocabulary (without special tokens):
```
["b", "h", "p", "##g", "##n", "##s", "##u"]
```

1. Count all symbol pairs and compute their scores.
```
   Pairs in corpus: ("h" "##u", 15), ("##u" "##g", 20), ("p" "##u", 17), ("##u" "##n", 16), ("b" "##u", 4), ("##g" "##s", 5)
   Most frequent pair: ("##u", "##g"), but individual counts are high, resulting in a lower score of 1/36.
   Highest scoring pair: ("##g", "##s") with a score of 1/20
```
2. Merge the highest scoring pair and replace all occurrences of the pair with a new symbol.
   The first merge rule learned by the tokenizer is `("##g", "##s") -> "##gs"`, which means that `"##gs"` will be added to the vocabulary, and the pair should be merged in all the words of the corpus.
   At the end of this stage, the vocabulary and corpus look like this:
```
   Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
   Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)
```
3. Continue like this until we reach the desired vocabulary size.
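
The first scoring step above can be reproduced with a short script. This is a minimal sketch on the toy corpus (the helper names are our own); it uses exact fractions so the scores match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def to_wordpiece_symbols(word):
    # First character as is, every following character with the ## prefix
    return tuple([word[0]] + ["##" + ch for ch in word[1:]])

def merge_symbols(left, right):
    # Drop the ## prefix of the right part when gluing two symbols together
    return left + (right[2:] if right.startswith("##") else right)

corpus = {to_wordpiece_symbols(w): f for w, f in word_freqs.items()}

symbol_counts, pair_counts = Counter(), Counter()
for symbols, freq in corpus.items():
    for s in symbols:
        symbol_counts[s] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

# score(pair) = count(pair) / (count(first) * count(second))
scores = {
    pair: Fraction(count, symbol_counts[pair[0]] * symbol_counts[pair[1]])
    for pair, count in pair_counts.items()
}
best = max(scores, key=scores.get)

print(scores[("##u", "##g")])      # 1/36
print(best, scores[best])          # ('##g', '##s') 1/20
print(merge_symbols(*best))        # ##gs
```

Every pair that contains `##u` ends up with the same score of 1/36, because `##u` occurs in every word; the rarer pair `("##g", "##s")` wins.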

Have a look at the [re-implementation of WordPiece](https://huggingface.co/learn/llm-course/chapter6/6#implementing-wordpiece)!

In [7]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
print(tokenizer.tokenize(sentence_en))
print(tokenizer.tokenize(sentence_de))
print(tokenizer.tokenize(spaces))
print(tokenizer.tokenize("... :-( :( !!!!!!!!!! .a."))
print(tokenizer.tokenize("wierd"))
print(tokenizer.tokenize("weird"))

['at', 'eight', 'o', "'", 'clock', 'on', 'thursday', 'morning', 'arthur', 'didn', "'", 't', 'feel', 'very', 'good', '.']
['don', '##ners', '##tag', 'mor', '##gens', 'um', 'ac', '##ht', 'uh', '##r', 'fu', '##hl', '##te', 'sic', '##h', 'arthur', 'nic', '##ht', 'so', 'gut', '.']
['>', '<']
['.', '.', '.', ':', '-', '(', ':', '(', '!', '!', '!', '!', '!', '!', '!', '!', '!', '!', '.', 'a', '.']
['wi', '##er', '##d']
['weird']
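
Splits such as `['wi', '##er', '##d']` come from the way WordPiece tokenizes at inference time: it does not replay merge rules but greedily takes the longest vocabulary entry that matches the start of the remaining word. The sketch below illustrates this with a hypothetical toy vocabulary (the entry `hug` is assumed to have been learned in a later merge); `wordpiece_tokenize` is our own helper, not the library function.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter match
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hug"}

print(wordpiece_tokenize("hugs", toy_vocab))  # ['hug', '##s']
print(wordpiece_tokenize("bugs", toy_vocab))  # ['b', '##u', '##gs']
print(wordpiece_tokenize("mug", toy_vocab))   # ['[UNK]']
```

Unlike the BPE sketch, a single unmatched character turns the entire word into `[UNK]` here.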


## SentencePiece

[SentencePiece](https://github.com/google/sentencepiece) is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. 
SentencePiece supports two segmentation algorithms: byte-pair encoding (BPE) ([Sennrich et al.](https://aclanthology.org/P16-1162/)) and the unigram language model ([Kudo](https://arxiv.org/abs/1804.10959)).
SentencePiece is an unsupervised text tokenizer and detokenizer.
It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. 
The other main feature of SentencePiece is reversible tokenization:
since there is no special treatment of spaces, decoding the tokens is done simply by concatenating them and replacing the `▁` with spaces, which yields the normalized text [[Reference](https://huggingface.co/learn/llm-course/en/chapter6/4#sentencepiece)].

Find more details about SentencePiece in their [GitHub repository README](https://github.com/google/sentencepiece?tab=readme-ov-file#overview).

In [8]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")
print(tokenizer.tokenize(sentence_en))
print(tokenizer.tokenize(sentence_de))
print(tokenizer.tokenize(spaces))
print(tokenizer.tokenize("... :-( :( !!!!!!!!!! .a."))

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


['▁At', '▁eight', '▁', 'o', "'", 'clock', '▁on', '▁Thursday', '▁morning', '▁Arthur', '▁didn', "'", 't', '▁feel', '▁very', '▁good', '.']
['▁Don', 'ner', 's', 'tag', '▁mor', 'gen', 's', '▁um', '▁acht', '▁Uhr', '▁fühlt', 'e', '▁sich', '▁Arthur', '▁nicht', '▁so', '▁gut', '.']
['▁>', '▁', '<']
['▁', '...', '▁', ':', '-', '(', '▁', ':', '(', '▁', '!!!!!', '!!!!!', '▁', '.', 'a', '.']
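
The reversibility can be checked by hand on the token lists above. The following minimal sketch hard-codes two of the printed outputs; `detokenize` is our own helper, and stripping the leading space is an assumption that matches how the first piece is marked.

```python
def detokenize(pieces):
    # Concatenate the pieces, turn the meta symbol ▁ (U+2581) back into spaces,
    # and drop the space that marks the start of the text
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

tokens_en = ['▁At', '▁eight', '▁', 'o', "'", 'clock', '▁on', '▁Thursday', '▁morning',
             '▁Arthur', '▁didn', "'", 't', '▁feel', '▁very', '▁good', '.']
tokens_spaces = ['▁>', '▁', '<']

print(detokenize(tokens_en))
# At eight o'clock on Thursday morning Arthur didn't feel very good.
print(detokenize(tokens_spaces))
# > <
```

The second example shows what "normalized" means here: the three spaces of `">   <"` come back as a single space.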


## MorphBPE, MorphPiece and other Morpheme Tokenizations

[MorphBPE](https://arxiv.org/abs/2502.00894) is a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency.

[MorphPiece](https://arxiv.org/pdf/2307.07262v2) is based partly on morphological segmentation of the underlying text.

# Tweet Tokenizer

[Tweet Tokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer) in NLTK is a specialized tokenizer designed for tokenizing social media text, especially tweets. It handles the quirks of informal writing on platforms like Twitter.

In [9]:
from nltk.tokenize import TweetTokenizer

tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @tag"
print(nltk.word_tokenize(tweet))
print(TweetTokenizer().tokenize(tweet))

['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'tag']
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@tag']
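
The difference between the two outputs comes down to which patterns are tried before falling back to single characters. The sketch below is a heavily simplified, hypothetical pattern list (NLTK's actual patterns are far more extensive) that is just enough to reproduce the second output for this tweet:

```python
import re

# Order matters: more specific patterns come first
PATTERNS = [
    r"[@#]\w+",         # mentions and hashtags
    r"<3",              # heart
    r"[:;=]-?[()DPp]",  # a few Western-style emoticons
    r"<-+|-+>",         # ASCII arrows
    r"\w+",             # ordinary words
    r"\S",              # any other single non-space character
]
TOKEN_RE = re.compile("|".join(PATTERNS))

def casual_tokenize(text):
    return TOKEN_RE.findall(text)

tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @tag"
print(casual_tokenize(tweet))
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@tag']
```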
