```{contents}
```
## **Tokenization**

Tokenization is the process of converting raw text into **discrete units (tokens)** that a machine learning model can process.
It is the **first step** in almost every NLP and LLM pipeline.

---

### **1. Why Tokenization Exists**

Neural networks operate on numbers, not text.
Tokenization bridges human language and numerical computation:

```
Text → Tokens → Token IDs → Embeddings → Model
```

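The model only ever sees the token sequence, so a tokenizer that fragments or discards information limits every later stage, however capable the model is.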

---

### **2. Core Intuition**

Tokenization breaks text into **building blocks** whose granularity balances:

* Expressiveness
* Vocabulary size
* Computational efficiency

> Tokens too coarse (whole words) → huge vocabulary, many unknown words
> Tokens too fine (characters, bytes) → long sequences, slow models

---

### **3. What Is a Token?**

A **token** can be:

* A word
* A subword
* A character
* A byte
* A combination of these (for example, subwords with a byte-level fallback)

Modern LLMs primarily use **subword tokens**.

---

### **4. Tokenization Pipeline**

```
Raw Text
   ↓
Normalization (lowercasing, unicode cleanup)
   ↓
Pre-tokenization (split by whitespace / punctuation)
   ↓
Subword segmentation
   ↓
Token IDs
```
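
The stages above can be sketched end to end in a few lines of Python. This is a minimal illustration, not how any particular library implements it: the vocabulary `VOCAB`, the helper names, and the segmentation rule (keep known pieces whole, otherwise fall back to characters) are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical toy vocabulary: a few whole pieces plus single letters as a fallback.
WORDS = ["hello", "world", ",", "!"]
CHARS = list("abcdefghijklmnopqrstuvwxyz")
VOCAB = {tok: i for i, tok in enumerate(["<UNK>"] + WORDS + CHARS)}

def normalize(text):
    # Unicode cleanup (NFKC) followed by lowercasing.
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def segment(piece):
    # Stand-in for a learned subword model (an assumption of this sketch):
    # keep pieces found in the vocabulary, otherwise split into characters.
    return [piece] if piece in VOCAB else list(piece)

def encode(text):
    pieces = pre_tokenize(normalize(text))
    tokens = [tok for piece in pieces for tok in segment(piece)]
    ids = [VOCAB.get(tok, VOCAB["<UNK>"]) for tok in tokens]
    return tokens, ids

tokens, ids = encode("Hello, NLP world!")
print(tokens)
print(ids)
```

Here `"nlp"` is not in the toy vocabulary, so it is split into single letters, while the known pieces map to one ID each.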

---

### **5. Major Tokenization Methods**

#### 5.1 Word-Level Tokenization

```
"I love NLP" → ["I", "love", "NLP"]
```

**Problems:**

* Huge vocabulary
* Unknown words (OOV)
* Poor multilingual support
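
The OOV problem is easy to reproduce with a whitespace tokenizer. In this sketch the corpus, the `<UNK>` convention, and the function names are illustrative assumptions:

```python
def build_vocab(corpus):
    # Assign an ID to every distinct whitespace-separated word; ID 0 is reserved for unknowns.
    vocab = {"<UNK>": 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_words(text, vocab):
    # Any word missing from the vocabulary collapses to the <UNK> ID.
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]

corpus = ["I love NLP", "I love tokenizers"]
vocab = build_vocab(corpus)
ids = encode_words("I love transformers", vocab)
print(vocab)
print(ids)
```

The unseen word `"transformers"` maps to the same ID as every other unknown word, so the model cannot tell them apart.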

---

#### 5.2 Character-Level Tokenization

```
"cat" → ["c", "a", "t"]
```

* **Pros:** No OOV for any character seen in training
* **Cons:** Long sequences, weak semantics per token
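
A quick comparison makes the trade-off concrete. The sample sentence is arbitrary, and whitespace splitting stands in for word-level tokenization:

```python
text = "Tokenization balances vocabulary size against sequence length."

word_tokens = text.split()   # word-level: few tokens, open-ended vocabulary
char_tokens = list(text)     # character-level: many tokens, tiny vocabulary

# Build a character vocabulary and round-trip the text through IDs.
char_vocab = sorted(set(char_tokens))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
ids = [char_to_id[ch] for ch in text]
decoded = "".join(char_vocab[i] for i in ids)

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters")
```

The character sequence is several times longer than the word sequence, but the vocabulary needed to cover it is only as large as the alphabet in use.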

---

#### 5.3 Subword Tokenization (Modern Standard)

Splits text into **frequent fragments**:

```
"unhappiness" → ["un", "happi", "ness"]
```

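This balances vocabulary size against sequence length. The split shown is illustrative: the actual pieces depend on the vocabulary the tokenizer has learned.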
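
One simple way to apply a fixed subword vocabulary is greedy longest-match-first segmentation, the strategy WordPiece uses at inference time. The sketch below is a simplified version under stated assumptions: the vocabulary is hand-picked, and the `##` continuation prefix that BERT's WordPiece attaches to non-initial pieces is omitted.

```python
def greedy_segment(word, vocab, unk="<UNK>"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No prefix of the remaining text is in the vocabulary.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary chosen to reproduce the example above.
vocab = {"un", "happi", "ness", "happy", "est"}

print(greedy_segment("unhappiness", vocab))
print(greedy_segment("unhappiest", vocab))
print(greedy_segment("xyz", vocab))
```

A word never seen as a whole, such as `"unhappiest"`, is still covered as long as its pieces are in the vocabulary.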

---

### **6. Popular Subword Algorithms**

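| Algorithm      | Used By                               | Key Idea                                                  |
| -------------- | ------------------------------------- | --------------------------------------------------------- |
| BPE            | GPT (original)                        | Merge the most frequent adjacent symbol pair, repeatedly  |
| WordPiece      | BERT                                  | Merge the pair that most increases corpus likelihood      |
| Unigram        | T5, ALBERT, XLNet (via SentencePiece) | Probabilistic token selection with a unigram language model |
| Byte-level BPE | GPT-2, RoBERTa, LLaMA 3               | BPE over UTF-8 bytes instead of characters                |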
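
To illustrate what probabilistic token selection means for the Unigram approach, the sketch below picks the segmentation with the highest total probability using dynamic programming (Viterbi). The piece probabilities are invented for the example; a real Unigram tokenizer estimates them from a corpus, and the function name is an assumption.

```python
import math

def viterbi_segment(word, log_probs):
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None  # the word cannot be built from the available pieces
    tokens, end = [], n
    while end > 0:
        tokens.append(word[back[end]:end])
        end = back[end]
    return tokens[::-1]

# Hypothetical piece probabilities (not taken from any real model).
probs = {"low": 0.2, "er": 0.2, "lower": 0.01,
         "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "r": 0.05}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("lower", log_probs))
```

With these numbers the two-piece split `low` + `er` (probability 0.04) beats the single piece `lower` (0.01), so it is the one selected.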

---

### **7. Example: BPE Tokenization**

```
Text: "lower"
Initial: l o w e r
Merge frequent pairs, one pair per step → lo w e r → low e r → low er → lower
```

Each merged symbol is added to the learned vocabulary. Which pairs merge, and in what order, depends on pair frequencies in the training corpus.
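
The training loop behind this example can be sketched as follows. This is a minimal version of BPE learning, assuming a tiny hand-made word-frequency table; the function names are illustrative, and ties between equally frequent pairs are broken by first occurrence, which is a choice of this sketch rather than part of the algorithm.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from single characters and apply the most frequent merge each round.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first pair wins on ties
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical corpus statistics.
word_freqs = {"low": 5, "lower": 2, "lowest": 3}
merges, words = train_bpe(word_freqs, 3)
print(merges)
print(words)
```

The ordered list of merges is what a BPE tokenizer stores: encoding new text replays the same merges in the same order.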

---

### **8. From Tokens to Model Input**

After tokenization:

```
Tokens → Token IDs → Embedding Vectors → Model
```

Example (token IDs from the GPT-2 vocabulary; the leading space belongs to the second token):

```
"Hello world"
→ ["Hello", " world"]
→ [15496, 995]
→ [[0.12, -0.04, ...], [...]]
```
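
The last step, from token IDs to vectors, is a table lookup. The sketch below uses a small randomly initialised table in plain Python; `VOCAB_SIZE`, `EMBED_DIM`, and the IDs are made-up values, and in a real model the table entries are learned parameters.

```python
import random

VOCAB_SIZE = 8   # hypothetical, far smaller than any real vocabulary
EMBED_DIM = 4    # hypothetical embedding width

rng = random.Random(0)
# One row of EMBED_DIM floats per token ID.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    # Each ID selects its row; the same ID always yields the same vector.
    return [embedding_table[i] for i in token_ids]

token_ids = [3, 5, 3]
vectors = embed(token_ids)
for token_id, vector in zip(token_ids, vectors):
    print(token_id, [round(x, 2) for x in vector])
```

The output has one vector per input token, so sequence length carries straight through from the tokenizer to the model.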

---

### **9. Special Tokens**

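| Token   | Purpose                                                        |
| ------- | -------------------------------------------------------------- |
| `<BOS>` | Beginning of sequence                                          |
| `<EOS>` | End of sequence                                                |
| `<PAD>` | Padding to equalize sequence lengths in a batch                |
| `<UNK>` | Stand-in for input the vocabulary cannot represent             |
| `<CLS>` | Sequence-level slot used for classification (`[CLS]` in BERT)  |
| `<SEP>` | Separator between segments (`[SEP]` in BERT)                   |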
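
A short sketch of how such tokens are typically used when batching: wrap each sequence in `<BOS>`/`<EOS>`, pad to a common length, and record which positions are real. The ID assignments and function names here are assumptions for illustration; actual IDs and spellings differ between tokenizers.

```python
# Hypothetical ID assignments for the special tokens.
SPECIAL = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2, "<UNK>": 3}

def add_special_tokens(ids):
    # Mark where the sequence starts and ends.
    return [SPECIAL["<BOS>"]] + ids + [SPECIAL["<EOS>"]]

def pad_batch(sequences):
    # Right-pad every sequence to the longest one and build a matching mask
    # (1 = real token, 0 = padding).
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [SPECIAL["<PAD>"]] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

batch = [add_special_tokens(ids) for ids in ([10, 11, 12], [20])]
padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

The mask lets the model ignore padded positions, so padding changes the tensor shape without changing the content.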

---

### **10. Tokenization Challenges**

* Multilingual text
* Rare words
* Emojis & symbols
* Domain-specific vocabulary
* Efficiency vs expressiveness
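
Byte-level tokenization is one common answer to the multilingual and emoji cases: every string has a UTF-8 encoding, so a base vocabulary of 256 byte values covers any input. The sketch below shows the cost side of that choice; the helper names are illustrative.

```python
def to_byte_tokens(text):
    # Every string maps to a sequence of integers in the range 0..255.
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    # The mapping is lossless, so decoding recovers the original text.
    return bytes(ids).decode("utf-8")

for sample in ["cat", "日本語", "🙂"]:
    ids = to_byte_tokens(sample)
    print(repr(sample), "characters:", len(sample), "bytes:", len(ids))
```

Nothing is ever out of vocabulary, but non-ASCII text expands to several byte tokens per character, which is why byte-level schemes are usually combined with BPE merges.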

---

### **11. Why Tokenization Quality Matters**

| Impact               | Explanation                           |
| -------------------- | ------------------------------------- |
| Model accuracy       | Fragmented or inconsistent tokens are harder to learn from |
| Training speed       | More tokens per text → more compute and memory             |
| Generalization       | Good subwords let unseen words reuse known pieces          |
| Multilingual support | Robust segmentation improves coverage across scripts       |

---

### **12. Tokenization in Modern LLMs**

* GPT-2, GPT-3, LLaMA 3: Byte-level BPE
* LLaMA 1 and 2: SentencePiece (BPE with byte fallback)
* BERT: WordPiece
* T5: SentencePiece (Unigram)
* Whisper: Byte-level BPE with a multilingual vocabulary

---

### **13. Summary**

| Aspect          | Description                       |
| --------------- | --------------------------------- |
| Purpose         | Convert text to model-ready units |
| Modern approach | Subword tokenization              |
| Design goal     | Compact, expressive, efficient    |
| Criticality     | Foundational to LLM performance   |

