```{contents}
```

# Tokenization

### 1. **Sentence Tokenization (Sentence Segmentation)**

* Breaks a paragraph/document into sentences.
* Useful for tasks like summarization, translation, and dialogue systems.
* Example:

  ```
  Text: "I love NLP. It is amazing!"
  Sentence Tokens: ["I love NLP.", "It is amazing!"]
  ```
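A rule-based splitter is enough to reproduce the example above. The sketch below is a minimal illustration, not how trained segmenters work; the helper name `naive_sent_tokenize` is an assumption introduced here.

```python
import re

def naive_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (rule-based sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = naive_sent_tokenize("I love NLP. It is amazing!")
print(sentences)  # ['I love NLP.', 'It is amazing!']

# The same rule breaks on abbreviations, which is why trained segmenters exist.
print(naive_sent_tokenize("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```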

---

### 2. **Word Tokenization**

* Splits sentences into individual words and punctuation marks.
* Example:

  ```
  Text: "I love NLP."
  Word Tokens: ["I", "love", "NLP", "."]
  ```
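One way to get this split, assuming a simple regular expression is acceptable, is to match runs of word characters and treat each punctuation mark as its own token. The function name `simple_word_tokenize` is hypothetical.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']
print(simple_word_tokenize("can't"))        # ['can', "'", 't']
```

Contractions show the limit of this rule: library tokenizers typically apply language-specific handling instead of splitting on every apostrophe.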

---

### 3. **Character Tokenization**

* Splits text into individual characters.
* Useful for handling misspellings, rare words, and languages written without spaces between words, such as Chinese.
* Example:

  ```
  Text: "NLP"
  Character Tokens: ["N", "L", "P"]
  ```
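In Python a string is already a sequence of characters, so character tokenization is a one-liner. The sketch below also maps characters to integer ids, which is usually the next step; the small `corpus` string used to build the vocabulary is an assumption for illustration.

```python
text = "NLP"
char_tokens = list(text)
print(char_tokens)  # ['N', 'L', 'P']

# Build a character vocabulary from a (tiny, illustrative) corpus.
corpus = "I love NLP"
vocab = {ch: idx for idx, ch in enumerate(sorted(set(corpus)))}

# Encode the text as a list of integer ids.
ids = [vocab[ch] for ch in text]
print(ids)  # [3, 2, 4]
```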

---

### 4. **Subword Tokenization**

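* Breaks words into units that sit between whole words and single characters; frequent words usually stay whole, while rare words are split into smaller pieces.
* Handles out-of-vocabulary (OOV) words better, because an unseen word can be assembled from known pieces.
* Used in modern NLP models like **BERT, GPT, T5**.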
* Example:

  ```
  Word: "unhappiness"
  Subword Tokens: ["un", "happi", "ness"]
  ```
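Given a fixed subword vocabulary, a split like the one above can be produced by greedy longest-match-first segmentation. This is a minimal sketch: `toy_vocab` is hand-picked for the example, and the single-character fallback is a simplification (real tokenizers may emit an unknown token instead).

```python
def greedy_subword_split(word: str, vocab: set[str]) -> list[str]:
    """Segment a word by repeatedly taking the longest prefix found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character (simplification).
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "happi", "happy", "ness"}  # hypothetical vocabulary
print(greedy_subword_split("unhappiness", toy_vocab))  # ['un', 'happi', 'ness']
print(greedy_subword_split("unhappy", toy_vocab))      # ['un', 'happy']
```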

Common algorithms:

* **Byte Pair Encoding (BPE)**
* **WordPiece (used in BERT)**
* **SentencePiece (used in T5, XLNet, ALBERT)**
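To make the first of these concrete, here is a toy sketch of how BPE learns its merges: start from characters, then repeatedly fuse the most frequent adjacent pair. The word frequencies are invented for illustration, and the helper names are assumptions; production implementations add end-of-word markers, pre-tokenization, and faster data structures.

```python
from collections import Counter

def get_pair_counts(corpus: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the fused symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs: dict[str, int], num_merges: int):
    """Return the learned merges and the corpus segmented with them."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Invented toy frequencies, for illustration only.
merges, corpus = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
print(corpus)
```

After two merges the shared stem `low` has become a single symbol in every word, while the rarer endings are still spelled out character by character.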

---

### 5. **Whitespace Tokenization**

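* Splits text on whitespace (spaces, tabs, newlines) and nothing else.
* Fast but naive: `"NLP-based tokenization"` → `["NLP-based", "tokenization"]`
* Problem: punctuation stays attached to the neighbouring word, so `"NLP."` and `"NLP"` become different tokens.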
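Python's built-in `str.split()` with no arguments does exactly this. The short sketch below shows both the speed-friendly simplicity and the punctuation problem; the sample sentence is made up for illustration.

```python
text = "NLP-based tokenization is fast, but naive!"
tokens = text.split()
print(tokens)
# ['NLP-based', 'tokenization', 'is', 'fast,', 'but', 'naive!']

# split() with no argument collapses runs of whitespace;
# split(" ") does not, and yields empty strings instead.
print("a  b".split())     # ['a', 'b']
print("a  b".split(" "))  # ['a', '', 'b']
```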

---

### 6. **Regex Tokenization**

* Uses regular expressions to define custom rules.
* Example: Split on non-alphanumeric characters, keeping only runs of letters and digits.

  ```
  Text: "Email me at abc123@gmail.com!"
  Regex Tokens: ["Email", "me", "at", "abc123", "gmail", "com"]
  ```
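The email example can be reproduced with `re.findall`. Note that the pattern has to accept digits to yield `abc123`; a letters-only pattern gives a different result. The third pattern, which keeps the address in one piece, is a rough assumption for illustration and not a complete email matcher.

```python
import re

text = "Email me at abc123@gmail.com!"

# Letters and digits: reproduces the example output.
alnum_tokens = re.findall(r"[A-Za-z0-9]+", text)
print(alnum_tokens)  # ['Email', 'me', 'at', 'abc123', 'gmail', 'com']

# Letters only: the digits are dropped.
alpha_tokens = re.findall(r"[A-Za-z]+", text)
print(alpha_tokens)  # ['Email', 'me', 'at', 'abc', 'gmail', 'com']

# Custom rule: try an email-like pattern first, then words, then punctuation.
email_aware = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+|[^\w\s]", text)
print(email_aware)  # ['Email', 'me', 'at', 'abc123@gmail.com', '!']
```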

---

### 7. **Morphological Tokenization**

* Splits words into **roots, prefixes, suffixes** (morphological units).
* Example (English): `"playing"` → `["play", "ing"]`
* Example (Turkish): `"evlerinizden"` (from your houses) → `["ev" (house), "ler" (plural), "iniz" (your), "den" (from)]`
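Both examples can be imitated with naive suffix stripping. This is only a sketch under strong assumptions: the suffix lists are hand-written for these two words, and the function `strip_suffixes` is hypothetical. Real morphological analyzers rely on lexicons and language-specific rules.

```python
def strip_suffixes(word: str, suffixes: list[str], min_stem: int = 2) -> list[str]:
    """Peel known suffixes off the end of a word, longest candidates first."""
    morphemes = []
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in sorted(suffixes, key=len, reverse=True):
            # Only strip if a stem of at least `min_stem` characters remains.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                morphemes.insert(0, suffix)
                stem = stem[: -len(suffix)]
                changed = True
                break
    return [stem] + morphemes

print(strip_suffixes("playing", ["ing", "ed", "s"]))
# ['play', 'ing']
print(strip_suffixes("evlerinizden", ["ler", "iniz", "den"]))
# ['ev', 'ler', 'iniz', 'den']

# The rule has no notion of real stems, so it over-strips:
print(strip_suffixes("bring", ["ing"]))
# ['br', 'ing']
```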

---

### 8. **Byte-Level Tokenization**

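* Works on the raw UTF-8 bytes of the text instead of on characters.
* Used in **GPT-2 and GPT-3**, which apply BPE merges on top of the byte sequence.
* Handles any language, emoji, or special character, because every string reduces to the same 256 possible byte values and nothing is out of vocabulary.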
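The base step can be shown with Python's built-in UTF-8 encoder. This sketch covers only the conversion to bytes, not the BPE merges that models apply afterwards.

```python
text = "🔥 NLP"

# Each byte is an integer in 0..255; the emoji takes four bytes in UTF-8.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # [240, 159, 148, 165, 32, 78, 76, 80]

print(len(text), len(byte_tokens))  # 5 characters, 8 bytes

# The mapping is lossless: decoding the bytes restores the original text.
restored = bytes(byte_tokens).decode("utf-8")
print(restored == text)  # True
```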

---

**Summary Table**

| Tokenization Type       | Example Input                                 | Example Output                      | Used In                    |
| ----------------------- | --------------------------------------------- | ----------------------------------- | -------------------------- |
| Sentence Tokenization   | "I love NLP. It's fun."                       | \["I love NLP.", "It's fun."]       | Summarization, Translation |
| Word Tokenization       | "I love NLP."                                 | \["I", "love", "NLP", "."]          | Most NLP tasks             |
| Character Tokenization  | "NLP"                                         | \["N", "L", "P"]                    | Chinese/Japanese, OCR      |
| Subword Tokenization    | "unhappiness"                                 | \["un", "happi", "ness"]            | BERT, GPT, T5              |
| Whitespace Tokenization | "NLP-based model"                             | \["NLP-based", "model"]             | Simple tasks               |
| Regex Tokenization      | "abc123@gmail.com"                            | \["abc123", "gmail", "com"]         | Custom NLP pipelines       |
| Morphological           | "playing"                                     | \["play", "ing"]                    | Morphology-heavy languages |
| Byte-Level Tokenization | "🔥 NLP"                                      | \[240, 159, 148, 165, 32, 78, 76, 80] | GPT-2, GPT-3               |

---

👉 In modern NLP, **subword tokenization (BPE, WordPiece, SentencePiece, byte-level BPE)** is the most widely used approach because it keeps the vocabulary at a manageable size while still representing rare and unseen words as sequences of known pieces.

```python
# Demonstration of different types of tokenization

import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# sent_tokenize/word_tokenize need the Punkt models;
# newer NLTK releases look for "punkt_tab" instead of "punkt".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = "I love NLP. It's amazing! Unhappiness can't stop us 😊."

# 1. Sentence Tokenization
sent_tokens = sent_tokenize(text)

# 2. Word Tokenization
word_tokens = word_tokenize(text)

# 3. Character Tokenization
char_tokens = list(text)

# 4. Whitespace Tokenization
whitespace_tokens = text.split()

# 5. Regex Tokenization (keep only runs of ASCII letters)
regex_tokens = re.findall(r"[A-Za-z]+", text)

# 6. Subword Tokenization (hard-coded illustration; real tokenizers learn their splits from data)
example_word = "unhappiness"
subword_tokens = ["un", "happi", "ness"]

# 7. Byte-Level Tokenization (encode text into UTF-8 bytes)
byte_tokens = list(text.encode("utf-8"))

results = {
    "Sentence Tokenization": sent_tokens,
    "Word Tokenization": word_tokens,
    "Character Tokenization": char_tokens[:20],  # show first 20 chars
    "Whitespace Tokenization": whitespace_tokens,
    "Regex Tokenization": regex_tokens,
    "Subword Tokenization (example)": subword_tokens,
    "Byte-Level Tokenization (first 20 bytes)": byte_tokens[:20],
}

results
```


```
{'Sentence Tokenization': ['I love NLP.', "It's amazing!", "Unhappiness can't stop us 😊."],
 'Word Tokenization': ['I', 'love', 'NLP', '.', 'It', "'s", 'amazing', '!', 'Unhappiness', 'ca', "n't", 'stop', 'us', '😊', '.'],
 'Character Tokenization': ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '.', ' ', 'I', 't', "'", 's', ' ', 'a', 'm', 'a'],
 'Whitespace Tokenization': ['I', 'love', 'NLP.', "It's", 'amazing!', 'Unhappiness', "can't", 'stop', 'us', '😊.'],
 'Regex Tokenization': ['I', 'love', 'NLP', 'It', 's', 'amazing', 'Unhappiness', 'can', 't', 'stop', 'us'],
 'Subword Tokenization (example)': ['un', 'happi', 'ness'],
 'Byte-Level Tokenization (first 20 bytes)': [73, 32, 108, 111, 118, 101, 32, 78, 76, 80, 46, 32, 73, 116, 39, 115, 32, 97, 109, 97]}
```