```{contents}
```

## Tokenization

### Why Do We Need Tokenization?

Neural networks operate on **numbers**, not raw text.
Tokenization converts text into a sequence of integer IDs, each indexing an entry in a fixed vocabulary.

The challenge:

* Natural language has **millions** of possible words.
* New words appear constantly (“microplastics”, “uninstallable”, “ChatGPT”).
* Typing errors (“accomodate”), slang (“lol”), and morphology (“running”, “runs”, “ran”) increase variety.
* A fixed word-level vocabulary inevitably runs into **Out-of-Vocabulary (OOV)** problems.

#### Example:

Traditional word tokenizers break on unknown words:

```
"unhappiness" → [unknown]
```

This is unacceptable for real-world language modeling.
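The failure mode can be reproduced with a few lines of Python. This is a minimal sketch of a word-level tokenizer; the helper names (`build_word_vocab`, `encode_words`), the `<unk>` token, and the two training sentences are assumptions made for illustration.

```python
def build_word_vocab(sentences):
    # ID 0 is reserved for anything not seen during vocabulary construction
    word_vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            word_vocab.setdefault(word, len(word_vocab))
    return word_vocab

def encode_words(text, word_vocab):
    # Unknown words all collapse to the same <unk> ID
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.lower().split()]

word_vocab = build_word_vocab(["the cat is happy", "the dog is unhappy"])
known_ids = encode_words("the cat is unhappy", word_vocab)
unseen_ids = encode_words("the unhappiness", word_vocab)

print(known_ids)   # every word has its own ID
print(unseen_ids)  # "unhappiness" is mapped to <unk> (ID 0)
```

Once a word is mapped to `<unk>`, the model can no longer tell it apart from any other unknown word.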

**Solution:**
Use **subword tokenization** to break words into smaller, reusable pieces.

---

### Core Idea Behind Subword Tokenization

Instead of full words, models use **subwords**:

* “unhappiness” → “un”, “happi”, “ness”
* “microplastics” → “micro”, “plastic”, “s”
* “playing” → “play”, “ing”

Benefits:

* Covers all words (no OOV, as long as every character or byte is in the base vocabulary)
* Compact vocabulary (often 20k–50k tokens, though some models use more)
* Keeps common words whole
* Splits rare words into pieces
* Efficient for multilingual models
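To make the idea concrete, here is a small sketch that splits words with a greedy longest-match rule over a hand-picked subword set. The function `greedy_segment` and the set `toy_subwords` are hypothetical; real tokenizers learn their vocabulary from data and use the algorithms described in the following sections.

```python
def greedy_segment(word, subwords):
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character is always
        # accepted, so no word is ever rejected as out-of-vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

toy_subwords = {"un", "happi", "ness", "micro", "plastic", "play", "ing", "s"}

for w in ["unhappiness", "microplastics", "playing", "xyz"]:
    print(w, "→", greedy_segment(w, toy_subwords))
```

The last word shares no subword with the set, so it falls back to single characters instead of an unknown token.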

---

### Byte Pair Encoding (BPE)

#### **Algorithm Summary**

BPE builds tokens by repeatedly merging **the most frequent pair of symbols**.

Initial symbols are characters.

Example:

Training text:

```
low lowly low
```

Start with characters:

```
l o w
l o w l y
l o w
```

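Count adjacent pairs:

* ("l","o"): 3
* ("o","w"): 3
* ("w","l"): 1
* ("l","y"): 1

Merge the most frequent pair, then recount and repeat (ties are broken arbitrarily), e.g.:

1. merge "l"+"o" → "lo"
2. merge "lo"+"w" → "low"
3. merge "l"+"y" → "ly"
4. merge "low"+"ly" → "lowly"

Final tokens:

* low
* lowly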
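The pair counts for this training text can be checked with a short script. This is only a sketch of the counting step; the variable names are chosen for illustration.

```python
from collections import Counter

example_words = "low lowly low".split()

example_pair_counts = Counter()
for w in example_words:
    symbols = list(w)
    # zip(symbols, symbols[1:]) yields every adjacent pair inside the word
    example_pair_counts.update(zip(symbols, symbols[1:]))

for pair, count in example_pair_counts.most_common():
    print(pair, count)
```

Pairs are counted inside words only, so the "w" at the end of one word and the "l" at the start of the next never form a pair.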

#### **Properties**

* Deterministic (given a fixed tie-breaking rule)
* Based strictly on frequency
* Used, often in byte-level variants, in GPT-2, GPT-3, LLaMA-1, CLIP

---

### WordPiece (Used in BERT)

WordPiece is similar to BPE but chooses merges by **likelihood maximization**, not raw frequency.

#### Key difference:

**Select merges that maximize the likelihood of the training corpus under a language model**.

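Example: BPE merges a pair such as ("ing", "ly") only if it is the most frequent pair in the corpus. WordPiece instead merges the pair that most improves the model’s probability of generating the text, which tends to favor pairs whose parts are individually rare.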

This tends to lead to:

* A more semantically consistent vocabulary
* Better handling of rare words
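The difference between the two criteria shows up on a tiny corpus. A commonly described approximation of the WordPiece criterion scores each pair as `count(ab) / (count(a) * count(b))`. That formula and the toy word frequencies below are assumptions for illustration, and a real WordPiece trainer may differ in detail.

```python
from collections import Counter

# Hypothetical word frequencies for a toy corpus
toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def wordpiece_split(word):
    # First character as-is, later characters marked as continuations
    return [word[0]] + ["##" + c for c in word[1:]]

unit_counts, wp_pair_counts = Counter(), Counter()
for word, freq in toy_word_freqs.items():
    units = wordpiece_split(word)
    for u in units:
        unit_counts[u] += freq
    for a, b in zip(units, units[1:]):
        wp_pair_counts[(a, b)] += freq

def wp_score(pair):
    # Pair frequency normalized by the frequencies of its two parts
    return wp_pair_counts[pair] / (unit_counts[pair[0]] * unit_counts[pair[1]])

bpe_choice = max(wp_pair_counts, key=wp_pair_counts.get)
wordpiece_choice = max(wp_pair_counts, key=wp_score)

print("BPE would merge:      ", bpe_choice)
print("WordPiece would merge:", wordpiece_choice)
```

Here the most frequent pair is ("##u", "##g"), but "##u" is common everywhere, so the normalized score prefers ("##g", "##s"), whose parts are seen together a larger share of the time.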

WordPiece marks word-internal (continuation) tokens with a “##” prefix:

```
playing → play, ##ing
unwanted → un, ##want, ##ed
```
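At inference time, BERT-style WordPiece tokenizers split each word with a greedy longest-match-first search over the vocabulary. The sketch below follows that idea; the function name `wordpiece_encode` and the tiny vocabulary `toy_wp_vocab` are hypothetical and chosen only to reproduce the two examples above.

```python
def wordpiece_encode(word, wp_vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in wp_vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces

toy_wp_vocab = {"play", "##ing", "un", "##want", "##ed"}

print(wordpiece_encode("playing", toy_wp_vocab))
print(wordpiece_encode("unwanted", toy_wp_vocab))
print(wordpiece_encode("xyz", toy_wp_vocab))
```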

---

### SentencePiece (Used in T5, LLaMA-2)

SentencePiece does **not require whitespace-delimited words**.
It treats the input as a raw stream of Unicode characters, with whitespace handled as an ordinary symbol.

Example:

```
Hello world
```

Becomes:

```
H e l l o _ w o r l d
```

("_" stands for a space; SentencePiece itself uses the meta symbol “▁”, U+2581)

Advantages:

* Language-agnostic (works for languages written without spaces between words, such as Japanese and Chinese)
* Robust to noise and misspellings
* Can use BPE or Unigram algorithm internally
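Treating the space as a normal symbol makes tokenization reversible: joining the pieces and mapping the marker back to a space recovers the original text. The sketch below imitates only that convention in plain Python and does not call the SentencePiece library. The helper names are hypothetical, and the real library also normalizes text and, by default, adds a leading marker, which is omitted here.

```python
SPACE_MARK = "\u2581"  # the "▁" meta symbol

def to_symbols(text):
    # Replace spaces with the marker, then split into characters
    return list(text.replace(" ", SPACE_MARK))

def from_pieces(pieces):
    # Detokenize: concatenate the pieces and turn markers back into spaces
    return "".join(pieces).replace(SPACE_MARK, " ")

symbols = to_symbols("Hello world")
print(symbols)

# Any segmentation of the symbol stream decodes to the same text
print(from_pieces(["Hello", SPACE_MARK + "wor", "ld"]))
```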

---

### PyTorch Implementation (Toy but Faithful)

Below is a **minimal working demonstration** of:

* BPE training
* BPE tokenization
* PyTorch encoding

This is simplified but accurate enough to understand the mechanism.

---

### Toy BPE Training

```python
import re
from collections import Counter

def build_bpe_vocab(corpus, num_merges=10):
    # Step 1: Split words into characters + end-of-word marker
    vocab = Counter([' '.join(w) + ' </w>' for w in corpus])

    merges = []

    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()

        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols)-1):
                pair = (symbols[i], symbols[i+1])
                pair_counts[pair] += freq

        if not pair_counts:
            break

        # Most frequent pair (ties go to the pair counted first)
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)

        # Merge the pair in all words. The lookarounds keep the match on
        # symbol boundaries, so ("a", "b") cannot match inside "xa b".
        new_vocab = Counter()
        bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(best_pair)) + r'(?!\S)')
        merged = ''.join(best_pair)

        for word, freq in vocab.items():
            updated = bigram.sub(lambda m: merged, word)
            new_vocab[updated] += freq

        vocab = new_vocab

    return merges
```

#### Train BPE

```python
corpus = ["low", "lower", "lowest", "lowly"]
merges = build_bpe_vocab(corpus, num_merges=10)
print("Learned merges:", merges)
```

---

### Apply BPE Tokenization

```python
def bpe_tokenize(word, merges):
    # Start from characters plus the end-of-word marker used in training
    word = list(word) + ["</w>"]

    # Earlier merges have higher priority (lower rank)
    ranks = {pair: rank for rank, pair in enumerate(merges)}

    while len(word) > 1:
        pairs = [(word[i], word[i+1]) for i in range(len(word)-1)]
        merge_candidates = [pair for pair in pairs if pair in ranks]

        if not merge_candidates:
            break

        # Apply the earliest-learned merge first, mirroring training order
        a, b = min(merge_candidates, key=ranks.get)
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word)-1 and word[i] == a and word[i+1] == b:
                new_word.append(a+b)
                i += 2
            else:
                new_word.append(word[i])
                i += 1

        word = new_word

    return word

print(bpe_tokenize("lowly", merges))
```

---

### Convert Tokens to PyTorch Tensor

```python
import torch

# Build a vocab mapping from every token seen when tokenizing the corpus
vocab = {"<unk>": 0}
for w in corpus:
    for t in bpe_tokenize(w, merges):
        if t not in vocab:
            vocab[t] = len(vocab)

# Encode a word (tokens missing from this toy vocab map to <unk>)
sentence = "lowly"
tokens = bpe_tokenize(sentence, merges)
token_ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens], dtype=torch.long)

print("Tokens:", tokens)
print("Token IDs tensor:", token_ids)
```

This produces a PyTorch tensor ready for embedding layers.
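Going the other way is a useful check that the encoding loses nothing. The sketch below maps IDs back to tokens and uses the `</w>` marker to restore word boundaries. The function `bpe_decode` and the hand-written `toy_vocab` are hypothetical and independent of the vocabulary built above.

```python
def bpe_decode(token_ids, id_to_token):
    tokens = [id_to_token[i] for i in token_ids]
    # "</w>" marks the end of a word, so it becomes a space when joining
    return "".join(tokens).replace("</w>", " ").strip()

toy_vocab = {"low": 0, "ly</w>": 1, "er</w>": 2}
toy_id_to_token = {i: t for t, i in toy_vocab.items()}

print(bpe_decode([0, 1], toy_id_to_token))
print(bpe_decode([0, 2, 0, 1], toy_id_to_token))
```

With a tensor of IDs, calling `.tolist()` first gives the plain Python integers this function expects.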

---

**Summary Table**

| Method            | Main idea                                            | Used in                 |
| ----------------- | ---------------------------------------------------- | ----------------------- |
| **BPE**           | Merge most frequent adjacent symbol pairs            | GPT-2, RoBERTa, LLaMA-1 |
| **WordPiece**     | Maximize LM likelihood when merging                  | BERT, DistilBERT        |
| **SentencePiece** | Trains on raw text; whitespace is an ordinary symbol | T5, LLaMA-2             |