# __Table of Contents__

<ol>
    <li><a href="#Introduction">Introduction</a></li>
    <li><a href="#Types-of-Tokenizer">Types of Tokenizer</a></li>
        <ol>
            <li><a href="#Word-based-tokenizer">Word-based tokenizer</a></li>
            <li><a href="#Character-based-tokenizer">Character-based tokenizer</a></li>
            <li><a href="#Subword-based-tokenizer">Subword-based tokenizer</a></li>
                <ol>
                    <li><a href="#WordPiece">WordPiece</a></li>
                    <li><a href="#Unigram-and-SentencePiece">Unigram and SentencePiece</a></li>
                </ol>
        </ol>
    <li>
        <a href="#Tokenization-with-PyTorch">Tokenization with PyTorch</a>
    </li>
    <li>
        <a href="#Token-indices">Token indices</a>
        <ol>
            <li><a href="#Out-of-vocabulary-(OOV)">Out-of-vocabulary (OOV)</a></li>
        </ol>
    </li>
    <li><a href="#Exercise:-Comparative-text-tokenization-and-performance-analysis">Exercise: Comparative text tokenization and performance analysis</a></li>
</ol>


## Introduction 

**Tokenization** is the process of breaking down a piece of text, such as a sentence or paragraph, into smaller units called **tokens**. Depending on the tokenizer, a token may be a word, a subword, or a single character.

## Libraries 

1. **NLTK (Natural Language Toolkit)**
    - A classic, general-purpose NLP toolkit used for teaching and prototyping.
    - Provides rule-based word tokenization and other basic preprocessing tools.
    - Not designed for deep learning models but great for traditional NLP tasks.
2. **spaCy**
    - An industrial-strength NLP pipeline designed for speed and efficiency.
    - Offers fast, rule-based tokenization along with POS tagging, named entity recognition, and dependency parsing.
    - Commonly used in production environments.
3. **BertTokenizer (Hugging Face Transformers)**
    - A subword-based tokenizer using the **WordPiece** algorithm.
    - Specifically designed to tokenize text for input into **BERT** models.
    - Ensures compatibility with pre-trained transformer architectures.
4. **XLNetTokenizer (Hugging Face Transformers)**
    - Implements **SentencePiece** tokenization with a **Unigram** language model.
    - Tailored for XLNet's architecture, which uses permutation-based language modeling.
    - Tokens that begin a word carry the `▁` marker (U+2581, which resembles an underscore) in place of the preceding space.
5. **torchtext**
    - Part of the PyTorch ecosystem, focusing on NLP data processing.
    - Provides tokenization, vocabulary construction, and batching.
    - Compatible with user-defined tokenizers and custom data pipelines.



## Types of Tokenizer 

### Word-Based Tokenization
- In this method, the text is split into individual words.
- Each word is treated as a single token.
- Example:
  - Input: `"Tokenization is important."`
  - Output: `["Tokenization", "is", "important", "."]`

**Advantages:**
- Preserves the meaning of entire words.
- Simple and intuitive.

**Disadvantages:**
- Results in a very large vocabulary.
- Treats similar words (e.g., "run", "running") as completely different tokens.
- Cannot handle out-of-vocabulary (OOV) words effectively.

---

### Character-Based Tokenization
- This approach splits text into individual characters.
- Each character is treated as a token.
- Example:
  - Input: `"NLP"`
  - Output: `["N", "L", "P"]`

**Advantages:**
- Very small vocabulary size.
- Can handle any word, including unseen ones.

**Disadvantages:**
- Loses semantic meaning of words.
- Requires longer sequences and more computation.
- Harder for the model to learn language structure.
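
The trade-off above can be seen in a minimal sketch using only the Python standard library. The helper name `char_tokenize` is hypothetical and chosen for illustration; it treats every character, including spaces and punctuation, as a token.

```python
def char_tokenize(text):
    """Treat every character, including spaces and punctuation, as one token."""
    return list(text)

sentence = "Tokenization is important."
char_tokens = char_tokenize(sentence)
word_count = len(sentence.split())

print(char_tokenize("NLP"))          # ['N', 'L', 'P']
print(len(set(char_tokens)))         # number of distinct characters (the vocabulary needed)
print(len(char_tokens), word_count)  # 26 3
```

The sequence is several times longer than the whitespace-split word count, while the set of distinct characters stays small.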

---

### Subword-Based Tokenization
- A hybrid approach that breaks text into smaller word components (subwords).
- Frequently occurring words may be kept whole, while rare or unknown words are split into subword units.
- Used in modern transformer models like BERT and XLNet.

**Examples of subword algorithms and frameworks:**
- WordPiece (used in BERT)
- Unigram (used in XLNet)
- SentencePiece (a framework that implements Unigram and BPE; used in T5 and others)

**Example:**
- Input: `"tokenization"`
- Output (WordPiece): `["token", "##ization"]`

**Advantages:**
- Balances vocabulary size and expressiveness.
- Can handle OOV words by composing them from subwords.
- More efficient than word-based models in terms of vocabulary.

**Disadvantages:**
- Slightly more complex to implement and interpret.

### Word-based tokenizer 
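
As a minimal sketch of word-based tokenization, the snippet below splits text with a regular expression from the Python standard library and builds a small vocabulary. The helpers `word_tokenize` and `build_vocab` and the `<unk>` token are assumptions made for illustration; they are not the NLTK, spaCy, or torchtext APIs.

```python
import re

def word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(sentences, unk_token="<unk>"):
    """Map each distinct token to an integer id; id 0 is reserved for unknown words."""
    vocab = {unk_token: 0}
    for sentence in sentences:
        for token in word_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

tokens = word_tokenize("Tokenization is important.")
print(tokens)  # ['Tokenization', 'is', 'important', '.']

vocab = build_vocab(["I run every day.", "Tokenization is important."])
# "like" and "running" were never seen, so both fall back to the <unk> id,
# even though the related word "run" is in the vocabulary.
ids = [vocab.get(tok, vocab["<unk>"]) for tok in word_tokenize("I like running.")]
print(ids)  # [1, 0, 0, 5]
```

The second output illustrates the out-of-vocabulary problem: a word-level vocabulary has no way to relate "running" to "run".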

### Subword-based tokenizer
#### WordPiece (Used in BERT)

- **Core idea**: WordPiece builds its vocabulary by starting with all characters in the training data and **progressively merging subword units** to maximize the likelihood of the training corpus.
- Unlike Byte Pair Encoding (BPE), which selects the most frequent pair, **WordPiece chooses the merge that improves data likelihood the most**.
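- During tokenization, WordPiece does not replay merge rules (as BPE does). It segments each word greedily, repeatedly taking the longest vocabulary entry that matches from the current position (longest-match-first).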
- **Implementation**: Used by `BertTokenizer` in Hugging Face.
- **Special markers**: Uses `##` to indicate that a token is a continuation of a word.
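
To make the difference between the two merge criteria concrete, here is a small sketch on a toy corpus. The word counts are hypothetical, and the WordPiece score used here (pair frequency divided by the product of the two parts' frequencies) is a commonly described approximation of the likelihood criterion; the exact training procedure used for BERT may differ.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def split_word(word):
    """Split a word into characters, marking non-initial pieces with '##'."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = split_word(word)
    for symbol in symbols:
        symbol_counts[symbol] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

# BPE criterion: merge the most frequent adjacent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece-style criterion (assumed form): pair frequency normalised by the
# frequencies of its two parts, favouring parts that rarely occur apart.
def wordpiece_score(pair):
    left, right = pair
    return pair_counts[pair] / (symbol_counts[left] * symbol_counts[right])

wordpiece_choice = max(pair_counts, key=wordpiece_score)

print(bpe_choice)        # ('##u', '##g')
print(wordpiece_choice)  # ('##g', '##s')
```

BPE picks the pair that is simply most common, while the normalised score prefers `##g` + `##s` because `##s` almost never appears without `##g` in this corpus.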
  
**Example**:  
Input: `"tokenization"`  
Tokens: `["token", "##ization"]`
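
A segmentation like the one above can be reproduced with a short sketch of greedy longest-match-first lookup, the strategy BERT-style WordPiece tokenizers use at inference time. The function `wordpiece_tokenize` and the toy vocabulary are hypothetical and written for illustration; a real tokenizer would load a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Segment one word by repeatedly taking the longest matching vocabulary entry."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokens", toy_vocab))        # ['token', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```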



#### Unigram Language Model (Used in XLNet, SentencePiece)

- **Core idea**: The Unigram model starts with a very large list of possible subword candidates and then **prunes the vocabulary** by removing those that contribute the least to the likelihood of the data.
- It is **probabilistic**, meaning it allows for multiple segmentations and selects the most likely one based on the model.
- Unlike WordPiece, it does not build vocabulary through merging, but through iterative elimination.
- **Flexible and powerful** for multilingual or noisy text.
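
Selecting the most likely segmentation can be sketched with dynamic programming (the Viterbi algorithm) over a table of subword probabilities. The probabilities below are hypothetical values chosen only for illustration and do not sum to one; `viterbi_segment` is an illustrative helper, not a library function.

```python
import math

def viterbi_segment(word, log_probs):
    """Return the highest-probability segmentation of `word` under a unigram model."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, chosen only for illustration.
probs = {"token": 0.05, "ization": 0.02, "tok": 0.01, "en": 0.03, "iza": 0.005, "tion": 0.04}
toy_log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("tokenization", toy_log_probs))  # ['token', 'ization']
```

Several segmentations are possible here (for example `tok` + `en` + `iza` + `tion`), but `token` + `ization` has the highest product of probabilities, so it is the one returned.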



#### SentencePiece (Tokenizer Framework)

- **Core idea**: SentencePiece is not a tokenization algorithm itself but a **tokenizer framework** that supports both Unigram and BPE.
- It treats input as **raw text without any whitespace pre-tokenization**, making it **language-agnostic**.
- SentencePiece also ensures **consistency and reproducibility**, assigning unique IDs to subwords so that the same text always results in the same tokens and indices.
- Can be trained with either Unigram or BPE strategies.

**Example**:  
Input: `"Tokenization is powerful"`  
Tokens: `["▁Token", "ization", "▁is", "▁power", "ful"]`  
(Note: `▁` (U+2581) replaces the space before a word and so marks word boundaries; the exact split depends on the trained model.)
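
The reason the `▁` marker makes tokenization reversible can be shown with a small sketch. This only imitates the whitespace escaping idea; `mark_spaces` and `detokenize` are hypothetical helpers, not the SentencePiece library API, and the pieces are the ones from the example above rather than the output of a trained model.

```python
SPACE_MARK = "\u2581"  # the "▁" character

def mark_spaces(text):
    """Replace spaces with the marker and prefix one, giving a whitespace-free string."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Invert the encoding: join pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(SPACE_MARK, " ")[1:]

marked = mark_spaces("Tokenization is powerful")
pieces = ["▁Token", "ization", "▁is", "▁power", "ful"]

print("".join(pieces) == marked)  # True
print(detokenize(pieces))         # Tokenization is powerful
```

Because spaces are kept as ordinary symbols, the original text can be recovered by concatenating the pieces, with no language-specific rules about where spaces belong.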



#### Integration: SentencePiece + Unigram

- SentencePiece is often used to **implement Unigram tokenization**.
- SentencePiece handles the **training, segmentation, and ID assignment**, while **Unigram** guides the vocabulary pruning process to optimize token efficiency.
- This combination is widely used in models such as **ALBERT**, **XLNet**, and the multilingual **mT5**.



#### Summary Comparison

| Feature                            | WordPiece                      | Unigram                  | SentencePiece                    |
|------------------------------------|--------------------------------|--------------------------|----------------------------------|
| Strategy                           | Merge-based                    | Prune-based (likelihood) | Tokenizer framework              |
| Probabilistic?                     | No                             | Yes                      | Yes, when using Unigram          |
| Pre-tokenization (e.g., by spaces) | Required                       | Not required             | Not required                     |
| Special markers                    | `##` for continuation subwords | `▁` (via SentencePiece)  | `▁` (U+2581) for word boundaries |
| Vocabulary learning                | Greedy merges                  | Vocabulary reduction     | Supports both BPE and Unigram    |
| Common usage                       | BERT, DistilBERT, ELECTRA      | XLNet, mT5, ALBERT       | T5, mT5, ALBERT, XLNet           |

---


```python
from transformers import BertTokenizer

# Load the pre-trained WordPiece vocabulary; the uncased model lowercases its input.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization.")
```

```
['ibm', 'taught', 'me', 'token', '##ization', '.']
```