# Implementing Transformer Models
## Practical III
Carel van Niekerk & Hsien-Chin Lin

21-25.10.2024

---

In previous practicals, we delved into the attention mechanism, which serves as the foundation of transformer-style models. We noted that such mechanisms require text to be represented as numerical vectors. In this session, we investigate tokenizers, which are methods for splitting text into meaningful subword units called 'tokens'. Specifically, we will implement a basic Byte Pair Encoding (BPE) tokenizer to gain insight into how this kind of tokenizer works.

### 1. Tokenizers

Tokenizers are used to split text into tokens, which can be words or subwords. In this practical we investigate the BPE tokenizer. BPE is a simple algorithm that iteratively replaces the most frequent pair of adjacent symbols in a text (initially single characters) with a new symbol. This process is repeated until a predefined vocabulary size is reached. The BPE algorithm is described in the following paper: [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf).

### 2. The Byte-Pair Encoding (BPE) Tokenizer

The BPE algorithm is implemented in the following steps:

#### 2.1. Building the base vocabulary

The base vocabulary is the set of all characters present in the data. To obtain the base vocabulary, we first find the set of all unique words in a corpus. We then find the set of all unique characters in these words.

For example, given the following set of words:

`['hug', 'pug', 'pun', 'bun', 'hugs']`

The base vocabulary is:

`['h', 'u', 'g', 'p', 'n', 'b', 's']`
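The step above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference solution; the function name `build_base_vocab` is hypothetical, and keeping characters in order of first appearance is an assumption made only so that the result matches the list shown above.

```python
def build_base_vocab(words):
    """Collect the unique characters of `words`, in order of first appearance."""
    vocab = []
    for word in words:
        for char in word:
            if char not in vocab:
                vocab.append(char)
    return vocab


words = ['hug', 'pug', 'pun', 'bun', 'hugs']
base_vocab = build_base_vocab(words)
print(base_vocab)  # ['h', 'u', 'g', 'p', 'n', 'b', 's']
```

A plain `set` would also work if the order of the vocabulary does not matter.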

#### 2.2. Building the BPE vocabulary

Once we have the base vocabulary, we learn a set of merges. These are rules indicating which pairs of adjacent tokens should be merged. Each merge becomes a new token in the vocabulary. The merges are learned by iteratively finding the most frequent pair of adjacent tokens in the data and merging them. This process is repeated until a predefined vocabulary size is reached.

Let us assume that each of the above words has a frequency of:

`{'hug': 10, 'pug': 5, 'pun': 12, 'bun': 4, 'hugs': 5}`

We can now compute how often each pair of adjacent tokens co-occurs, weighting every word by its frequency:

`{('h', 'u'): 15, ('u', 'g'): 20, ('p', 'u'): 17, ('u', 'n'): 16, ('b', 'u'): 4, ('g', 's'): 5}`
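These counts can be reproduced with a short sketch. The helper `count_pairs` and the `splits` dictionary (mapping each word to its current list of tokens) are hypothetical names chosen for this illustration.

```python
from collections import Counter


def count_pairs(splits, word_freqs):
    """Count adjacent token pairs, weighting each word by its frequency."""
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for left, right in zip(tokens, tokens[1:]):
            pair_freqs[(left, right)] += freq
    return pair_freqs


word_freqs = {'hug': 10, 'pug': 5, 'pun': 12, 'bun': 4, 'hugs': 5}
# Before any merges, every word is split into single characters.
splits = {word: list(word) for word in word_freqs}
pair_freqs = count_pairs(splits, word_freqs)
print(pair_freqs[('u', 'g')])  # 20
```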

We see that the characters `('u', 'g')` co-occur the most. We create the merge rule `('u', 'g')` resulting in the new token 'ug'. We can now update the vocabulary and co-occurrence frequencies to:

`['h', 'u', 'g', 'p', 'n', 'b', 's', 'ug']`

`{('p', 'u'): 12, ('u', 'n'): 16, ('b', 'u'): 4, ('h', 'ug'): 15, ('p', 'ug'): 5, ('ug', 's'): 5}`

The most frequent pair is now `('u', 'n')` with a count of 16, so the next merge rule is `('u', 'n')`, resulting in the token 'un'.

If we stop here we obtain the vocabulary:

`['h', 'u', 'g', 'p', 'n', 'b', 's', 'ug', 'un']`

and the set of merge rules:

`{('u', 'g'): 'ug', ('u', 'n'): 'un'}`.
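Putting the pieces together, the whole training procedure might look like the sketch below. It is one possible implementation under stated assumptions, not the expected solution: the names `count_pairs`, `merge_pair` and `train_bpe` are hypothetical, the input is assumed to be a dictionary of word frequencies, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter


def count_pairs(splits, word_freqs):
    """Count adjacent token pairs, weighting each word by its frequency."""
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for left, right in zip(tokens, tokens[1:]):
            pair_freqs[(left, right)] += freq
    return pair_freqs


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` in `tokens` by the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(word_freqs, vocab_size):
    """Learn merges until the vocabulary holds `vocab_size` tokens."""
    splits = {word: list(word) for word in word_freqs}
    vocab = []
    for word in word_freqs:  # base vocabulary: unique characters
        for char in word:
            if char not in vocab:
                vocab.append(char)
    merges = {}
    while len(vocab) < vocab_size:
        pair_freqs = count_pairs(splits, word_freqs)
        if not pair_freqs:
            break  # every word is already a single token
        best = max(pair_freqs, key=pair_freqs.get)
        merges[best] = best[0] + best[1]
        vocab.append(merges[best])
        splits = {w: merge_pair(t, best) for w, t in splits.items()}
    return vocab, merges


word_freqs = {'hug': 10, 'pug': 5, 'pun': 12, 'bun': 4, 'hugs': 5}
vocab, merges = train_bpe(word_freqs, vocab_size=9)
print(vocab)   # ['h', 'u', 'g', 'p', 'n', 'b', 's', 'ug', 'un']
print(merges)  # {('u', 'g'): 'ug', ('u', 'n'): 'un'}
```

Recounting all pairs after every merge is simple but slow on large corpora; updating only the counts affected by the merge is a common optimisation.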

#### 2.3. Encoding a word

Based on this vocabulary we can now encode a word. First the word, for example 'pugs', is split into characters:

`['p', 'u', 'g', 's']`

Then the merge rules are applied to the word in the order in which they were learned (here 'u' and 'g' are combined to become 'ug'):

`['p', 'ug', 's']`

No further merge rule applies, so the word is encoded as the sequence of tokens:

`['p', 'ug', 's']`.
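The encoding step can be sketched as follows. The names `merge_pair` and `encode_word` are hypothetical, and applying the merge rules in the order they were learned is an assumption of this sketch (it relies on Python dictionaries preserving insertion order).

```python
def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` in `tokens` by the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode_word(word, merges):
    """Split `word` into characters, then apply each merge rule in turn."""
    tokens = list(word)
    for pair in merges:  # insertion order = order in which rules were learned
        tokens = merge_pair(tokens, pair)
    return tokens


merges = {('u', 'g'): 'ug', ('u', 'n'): 'un'}
print(encode_word('pugs', merges))  # ['p', 'ug', 's']
print(encode_word('bun', merges))   # ['b', 'un']
```

Note that this sketch does not handle characters missing from the base vocabulary; a full tokenizer would need a policy for those, such as an unknown token.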

# Exercises

1. Implement the BPE tokenizer module. The module should be able to extract the vocabulary from a corpus of text.
2. Given the corpus below, train your BPE tokenizer. Use a vocabulary size of 64.

```python
[
    "Machine learning helps in understanding complex patterns.",
    "Learning machine languages can be complex yet rewarding.",
    "Natural language processing unlocks valuable insights from data.",
    "Processing language naturally is a valuable skill in machine learning.",
    "Understanding natural language is crucial in machine learning."
]
```

3. Using the BPETokenizer implementation of Huggingface ([more info](https://pypi.org/project/tokenizers/)), train a BPE tokenizer on the above corpus. Use a vocabulary size of 295 (due to the larger default base vocabulary of this implementation).
4. Tokenize the following sentence: "Machine learning is a subset of artificial intelligence." using both your implementation and the Huggingface implementation.

### Additional Material
- [Huggingface tutorial on BPE](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt)

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from src.modelling.bpe_tokenizer import BPETokenizer

# Corpus
corpus = [
    "Machine learning helps in understanding complex patterns.",
    "Learning machine languages can be complex yet rewarding.",
    "Natural language processing unlocks valuable insights from data.",
    "Processing language naturally is a valuable skill in machine learning.",
    "Understanding natural language is crucial in machine learning."
]

# Train custom BPETokenizer
custom_tokenizer = BPETokenizer(vocab_size=64)
custom_tokenizer.train(corpus)

# Tokenize using custom BPETokenizer
sentence = "Machine learning is a subset of artificial intelligence."
custom_tokens = custom_tokenizer.encode(sentence)
print("Custom BPETokenizer tokens:", custom_tokens)

# Train Huggingface BPETokenizer
hf_tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=295)
hf_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
hf_tokenizer.train_from_iterator(corpus, trainer)

# Tokenize using Huggingface BPETokenizer
hf_tokens = hf_tokenizer.encode(sentence).tokens
print("Huggingface BPETokenizer tokens:", hf_tokens)
```

```
Training BPE: 100%|██████████| 35/35 [00:00<00:00, 25072.70it/s]
Custom BPETokenizer tokens: ['M', 'achine', 'learning', 'i', 's', 'a', 's', 'u', 'b', 's', 'e', 't', 'o', 'f', 'ar', 't', 'i', 'f', 'i', 'c', 'i', 'al', 'in', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', '.']
Huggingface BPETokenizer tokens: ['Machine', 'learning', 'is', 'a', 's', 'u', 'b', 's', 'et', 'o', 'f', 'ar', 't', 'i', 'f', 'i', 'ci', 'al', 'in', 't', 'el', 'l', 'i', 'ge', 'n', 'c', 'e', '.']
```



