# Unit 4

## Tokenization and Out-of-Vocabulary (OOV) Handling in NLP

### Introduction to Tokenization and OOV Handling

Welcome to this lesson on **tokenization** and handling **Out-of-Vocabulary (OOV)** words. Tokenization is a fundamental step in Natural Language Processing (NLP) that breaks text into smaller units called **tokens**. Large Language Models (LLMs) operate on sequences of token IDs rather than raw characters, so the tokenizer determines what the model actually sees.

However, a common challenge in tokenization is dealing with OOV words, that is, words that are not present in the model's vocabulary. Handling them well is essential for maintaining the accuracy of language models. **Text cleaning** before tokenization (removing unnecessary symbols, handling case sensitivity, and ensuring proper encoding) can also improve tokenization quality. Finally, the choice of model matters: a tokenizer trained mostly on English text handles other languages poorly, while multilingual tokenizers are built for mixed-language input.

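To make the OOV problem concrete, the following minimal sketch assumes a word-level tokenizer with a tiny, hypothetical closed vocabulary. The names `word_tokenize` and `vocab` are illustrative and do not come from any library. Any token missing from the vocabulary is out-of-vocabulary:

```python
import re

def word_tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical closed vocabulary; real vocabularies hold tens of thousands of entries.
vocab = {"the", "new", "phone", "is", "fast", "!"}

tokens = word_tokenize("The new phone is ultrahyperfast!")
# Tokens that the vocabulary does not contain are OOV.
oov = [tok for tok in tokens if tok not in vocab]

print("Tokens:", tokens)
print("OOV:", oov)
```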
-----

### How Tokenizers Handle OOV Words

Different tokenization methods handle OOV words in distinct ways:

| Tokenizer Type | Method | OOV Handling Strategy |
| :--- | :--- | :--- |
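| **WordPiece (BERT)** | Greedy longest-match subword tokenization | Splits the word into `##` subwords; uses `[UNK]` if no match is found |
| **Byte-Pair Encoding (GPT-2, RoBERTa)** | Merges frequent symbol pairs (byte-level in GPT-2 and RoBERTa) | Breaks OOV words into smaller subwords, down to single bytes if needed |
| **SentencePiece (T5, mT5, XLM-R)** | Probabilistic (unigram language model) segmentation of raw text | Splits rare words into known subwords; characters outside the vocabulary map to `<unk>` unless the model enables byte fallback |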

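The "merges frequent pairs" entry in the table can be illustrated with a toy BPE learner. This is a minimal sketch under stated assumptions: the word frequencies are made up, the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical, and it works on characters, whereas GPT-2 applies the same idea to bytes.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent pair."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

# Hypothetical word frequencies, for illustration only.
word_freqs = {"fast": 5, "faster": 3, "fastest": 2}
merges, corpus = learn_bpe(word_freqs, num_merges=3)
print("Merges:", merges)
print("Segmented words:", list(corpus))
```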
-----

### Tokenization with BERT, GPT-2, and T5

Let's explore how different tokenization methods handle a complex text containing Korean words, emojis, and links. The text we will use is:

```
"🚀 The new XZ-900 스마트폰 is absolutely ultrahyperfast! Only €799 💰. Get yours now at www.techstore.aiudjashdf!"
```

#### 1\. WordPiece Tokenization with BERT

```python
from transformers import AutoTokenizer
# Load WordPiece tokenizer (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the sample text
text = "🚀 The new XZ-900 스마트폰 is absolutely ultrahyperfast! Only €799 💰. Get yours now at www.techstore.aiudjashdf!"
tokens_bert = tokenizer_bert.tokenize(text)
print("BERT Tokenization:", tokens_bert)
```

**Output:**

```
BERT Tokenization: ['[UNK]', 'the', 'new', 'x', '##z', '-', '900', 'ᄉ', '##ᅳ', '##ᄆ', '##ᅡ', '##ᄐ', '##ᅳ', '##ᄑ', '##ᅩ', '##ᆫ', 'is', 'absolutely', 'ultra', '##hy', '##per', '##fast', '!', 'only', '€', '##7', '##9', '##9', '[UNK]', '.', 'get', 'yours', 'now', 'at', 'www', '.', 'tech', '##stor', '##e', '.', 'ai', '##ud', '##jas', '##hd', '##f', '!']
```
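In this output the Korean word appears as individual jamo rather than as `[UNK]`. A likely explanation, assuming the uncased BERT tokenizer strips accents through Unicode NFD normalization, is that NFD also decomposes each Hangul syllable into its component jamo. The sketch below uses only the standard library to show that decomposition; it does not call the tokenizer itself.

```python
import unicodedata

word = "스마트폰"

# NFD splits each precomposed Hangul syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", word)

print(len(word), "syllables ->", len(decomposed), "jamo")
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC recomposes the jamo, so the decomposition itself loses no information.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == word)
```

The nine jamo printed here correspond to the nine Korean pieces in the BERT output.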

**BERT Tokenization Output Explanation:**

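  * Breaks words into subwords, using `##` to mark pieces that continue a word (e.g., `ultra`, `##hy`, `##per`, `##fast`).
  * Uses `[UNK]` when a word cannot be built from known pieces; here both emojis become `[UNK]`. The Korean word is not mapped to `[UNK]` but is decomposed into individual jamo, which carry little meaning for an English-only model.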
  * If working with multilingual text, using `bert-base-multilingual-cased` instead of `bert-base-uncased` can significantly improve tokenization accuracy for non-English languages.

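The `##` pieces and the `[UNK]` fallback follow from WordPiece's greedy longest-match-first lookup. The sketch below is a simplified illustration, not BERT's actual implementation: the function name `wordpiece` is hypothetical and the vocabulary is a hand-picked assumption chosen to mirror the example output.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece fits at this position: the whole word becomes [UNK].
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only to mirror the example output.
vocab = {"ultra", "##hy", "##per", "##fast", "tech", "##stor", "##e"}

print(wordpiece("ultrahyperfast", vocab))
print(wordpiece("techstore", vocab))
print(wordpiece("🚀", vocab))
```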
-----

#### 2\. Byte Pair Encoding (BPE) with GPT-2

```python
# Reuses AutoTokenizer and `text` from the BERT example above
# Load BPE tokenizer (GPT-2)
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
# Tokenize the sample text
tokens_gpt2 = tokenizer_gpt2.tokenize(text)
print("GPT-2 Tokenization:", tokens_gpt2)
```

**Output:**

```
GPT-2 Tokenization: ['ðŁ', 'ļ', 'Ģ', 'ĠThe', 'Ġnew', 'ĠX', 'Z', '-', '900', 'Ġì', 'Ĭ', '¤', 'ë', '§', 'Ī', 'í', 'Ĭ', '¸', 'í', 'ı', '°', 'Ġis', 'Ġabsolutely', 'Ġult', 'rah', 'y', 'per', 'fast', '!', 'ĠOnly', 'ĠâĤ¬', '799', 'ĠðŁ', 'Ĵ', '°', '.', 'ĠGet', 'Ġyours', 'Ġnow', 'Ġat', 'Ġwww', '.', 'tech', 'store', '.', 'ai', 'ud', 'j', 'ash', 'df', '!']
```

**GPT-2 Tokenization Output Explanation:**

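  * No `[UNK]` tokens: GPT-2 uses byte-level BPE, so any text can be represented as a sequence of byte pieces that are merged into frequent subwords.
  * Emojis and Korean text are preserved without loss but split into several byte-level pieces (e.g., `ðŁ`, `ļ`, `Ģ` for 🚀), so the tokenizer is still not optimized for non-English languages.
  * `Ġ` marks a token that begins with a space.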

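The odd-looking symbols in the GPT-2 output are UTF-8 bytes rendered as printable characters. The sketch below re-implements, as an illustration, the byte-to-character table used by GPT-2's byte-level BPE; the helper names (`bytes_to_unicode`, `to_byte_symbols`) are chosen here for clarity, and no merges are applied, so the result shows the raw byte symbols before BPE groups them.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (space, control bytes, ...) are shifted past U+00FF.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_byte_symbols(text):
    """Render the UTF-8 bytes of `text` as GPT-2-style printable symbols."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(to_byte_symbols(" The"))  # the leading space becomes Ġ
print(to_byte_symbols("🚀"))    # four bytes, shown as four symbols
print(to_byte_symbols("€"))     # three bytes, shown as three symbols
```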
-----

#### 3\. SentencePiece Tokenization with T5

```python
# Reuses AutoTokenizer and `text` from the BERT example above
# Load SentencePiece tokenizer (T5)
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-base")
# Tokenize the sample text
tokens_t5 = tokenizer_t5.tokenize(text)
print("T5 Tokenization:", tokens_t5)
```

**Output:**

```
T5 Tokenization: ['▁', '🚀', '▁The', '▁new', '▁', '스마트폰', '▁is', '▁absolutely', '▁ultra', 'hyp', 'er', 'fast', '!', '▁Only', '▁€', '7', '99', '▁', '💰', '.', '▁Get', '▁your', 's', '▁now', '▁at', '▁www', '.', 'tech', 'store', '.', 'a', 'i', 'u', 'd', 'j', 'ash', 'd', 'f', '!']
```

**T5 Tokenization Output Explanation:**

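  * Uses SentencePiece, which tokenizes raw text directly without language-specific pre-tokenization.
  * Adds `▁` (U+2581) markers to indicate the start of a new word.
  * Displays the emojis and the Korean word as whole pieces. SentencePiece itself is language-agnostic, which is why multilingual models such as mT5 and XLM-R use it. The `t5-base` vocabulary, however, was built mostly from English text (plus German, French, and Romanian), so pieces missing from it are typically mapped to the `<unk>` ID, even though `tokenize()` shows their surface form.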

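The `▁` marker and the `<unk>` fallback can be sketched without the library. This toy example assumes whitespace-only segmentation and a hypothetical four-entry vocabulary; real SentencePiece models split pieces further with a learned unigram or BPE model. All function names are illustrative.

```python
MARKER = "\u2581"  # "▁", the SentencePiece whitespace marker

def escape_whitespace(text):
    """Prefix the text with the marker and replace every space with it."""
    return MARKER + text.replace(" ", MARKER)

def split_on_marker(escaped):
    """Toy segmentation: start a new piece at every marker."""
    pieces, current = [], ""
    for ch in escaped:
        if ch == MARKER and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(MARKER, " ")[1:]

# Hypothetical vocabulary: any piece missing from it maps to the <unk> ID.
vocab = {"<unk>": 0, "▁Get": 1, "▁yours": 2, "▁now": 3}

text = "Get yours now 💰"
pieces = split_on_marker(escape_whitespace(text))
ids = [vocab.get(piece, vocab["<unk>"]) for piece in pieces]

print("Pieces:", pieces)
print("IDs:", ids)
print("Round trip:", detokenize(pieces))
```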
-----

### Comparison of WordPiece, BPE, and SentencePiece Tokenization

| Feature | BERT (WordPiece) | GPT-2 (BPE) | T5 (SentencePiece) |
| :--- | :--- | :--- | :--- |
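| **Handles OOV words** | Splits into `##` subwords; falls back to `[UNK]` | Breaks into byte-level subwords, never `[UNK]` | Splits into subwords; unseen characters map to `<unk>` rather than `[UNK]` |
| **Emoji Support** | `[UNK]` | Splits into byte-level pieces | Shown as a single piece |
| **Non-Latin Text (e.g., Korean)** | Decomposed into individual jamo in this example | Splits into byte-level pieces | Shown as a single piece |
| **Number Handling** | Keeps `900` whole; splits `€799` into `€`, `##7`, `##9`, `##9` | Keeps `900` and `799` whole | Splits `799` into `7`, `99` |
| **Hyphenated Words** | Splits at the hyphen (`x`, `##z`, `-`, `900`) | Splits at the hyphen (`ĠX`, `Z`, `-`, `900`) | Depends on the learned pieces (not shown in the sample output) |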

-----

### How to Improve OOV Handling?

To reduce OOV issues, you can:

  * Use **Multilingual Tokenizers** (`xlm-roberta-base`, `bert-base-multilingual-cased`).
  * **Train a Custom Tokenizer** (e.g., `sentencepiece`, `BPE`) on domain-specific text.
  * **Expand Vocabulary** by training the tokenizer on a larger, more diverse corpus (the model must then be pretrained or fine-tuned with the new vocabulary).
  * Ensure **Proper Text Cleaning** before tokenization (e.g., removing unnecessary symbols, handling casing, ensuring correct encoding).

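The text-cleaning step can be sketched with the standard library alone. This is a minimal example under stated assumptions: `clean_text` and `ZERO_WIDTH` are hypothetical names, and NFKC normalization is one reasonable choice, not a universal rule, since it also rewrites characters such as "™" that some applications want to keep.

```python
import re
import unicodedata

# Invisible characters that often leak in from web pages and file headers.
ZERO_WIDTH = {"\u200b", "\ufeff"}  # zero-width space, byte-order mark

def clean_text(text, lowercase=False):
    """Normalize Unicode, drop invisible characters, and collapse whitespace."""
    # NFKC folds compatibility forms (full-width letters, no-break spaces).
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")  # tabs and newlines become plain spaces
        elif ch in ZERO_WIDTH or unicodedata.category(ch) == "Cc":
            continue          # skip zero-width and control characters
        else:
            kept.append(ch)
    text = re.sub(r" +", " ", "".join(kept)).strip()
    # Lowercasing is optional because cased models rely on capitalization.
    return text.lower() if lowercase else text

raw = "Ｏｎｌｙ\u00a0€799\u200b  now!\n"
cleaned = clean_text(raw)
print(repr(cleaned))
```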
-----

### Example: Handling Chinese Text with XLM-RoBERTa

```python
from transformers import AutoTokenizer

# Load XLM-RoBERTa tokenizer
tokenizer_xlm = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Sample Chinese text
chinese_text = "这是一个测试。"
# Tokenize the Chinese text
tokens_xlm = tokenizer_xlm.tokenize(chinese_text)
print("XLM-RoBERTa Tokenization:", tokens_xlm)
```

**Output:**

```
XLM-RoBERTa Tokenization: ['▁', '这是一个', '测试', '。']
```

-----

### Summary and Next Steps

In this lesson, we explored different tokenization methods and their strategies for handling OOV words. We compared WordPiece, BPE, and SentencePiece tokenizers and discussed how to improve OOV handling. As a next step, practice implementing these tokenization techniques on various text samples, including multilingual data, to better understand their strengths and limitations.


## Tokenization Showdown: BERT vs. GPT-2


You've learned about different tokenization methods and how they handle OOV (Out-Of-Vocabulary) words. Now, let's put that knowledge into practice!

In this task, you will:

  * Compare how BERT (WordPiece) and GPT-2 (BPE) tokenizers handle OOV words.
  * Tokenize multiple example phrases with OOV words, including technical terms and combined words.
  * Analyze the differences and count UNK tokens for BERT to see how it handles OOV words.

Dive in and see how these tokenizers perform!

```python
from transformers import AutoTokenizer

# Example phrases with OOV words
phrases = [
    "The new hyperlooptechnology is groundbreaking.",
    "The XZ-900 스마트폰 is ultrahyperfast!",
    "Heading to the beach! 🌊☀️ Can’t wait!"
]

# TODO: Load WordPiece tokenizer (BERT)

# TODO: Load BPE tokenizer (GPT-2)

for text in phrases:
    tokens_bert = tokenizer_bert.tokenize(text)
    tokens_gpt2 = tokenizer_gpt2.tokenize(text)
    
    # TODO: Count UNK tokens only for BERT
    unk_count_bert = tokens_bert.count('____')
    
    print(f"Text: {text}")
    print("BERT Tokenization:", tokens_bert)
    print("GPT-2 Tokenization:", tokens_gpt2)
    print(f"BERT UNK Tokens: {unk_count_bert}")
    print("-" * 50)

```

To correctly fill in the code, you need to know how to load the BERT and GPT-2 tokenizers and what the UNK token is for BERT's WordPiece tokenizer. The `TODO` comments in the code snippet guide you to:

1.  **Load WordPiece tokenizer (BERT):** You need to instantiate an `AutoTokenizer` and specify the model name for BERT.
2.  **Load BPE tokenizer (GPT-2):** Similarly, you need to load the tokenizer for GPT-2.
3.  **Count UNK tokens only for BERT:** You need to replace the placeholder `____` with the actual UNK token used by BERT's tokenizer to get the correct count.

Once you have the completed code, running it will provide the analysis you need to compare how the two tokenizers handle the provided phrases.

Here is the completed code with the necessary changes:

```python
from transformers import AutoTokenizer

# Example phrases with OOV words
phrases = [
    "The new hyperlooptechnology is groundbreaking.",
    "The XZ-900 스마트폰 is ultrahyperfast!",
    "Heading to the beach! 🌊☀️ Can’t wait!"
]

# Load WordPiece tokenizer (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load BPE tokenizer (GPT-2)
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")

for text in phrases:
    tokens_bert = tokenizer_bert.tokenize(text)
    tokens_gpt2 = tokenizer_gpt2.tokenize(text)
    
    # Count UNK tokens only for BERT
    unk_count_bert = tokens_bert.count('[UNK]')
    
    print(f"Text: {text}")
    print("BERT Tokenization:", tokens_bert)
    print("GPT-2 Tokenization:", tokens_gpt2)
    print(f"BERT UNK Tokens: {unk_count_bert}")
    print("-" * 50)
```
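Raw `[UNK]` counts are hard to compare across phrases of different lengths. One possible extension, sketched here with a hypothetical helper named `unk_stats` and a made-up token list, reports the share of unknown tokens alongside the count:

```python
def unk_stats(tokens, unk_token="[UNK]"):
    """Return (count, share) of unknown tokens in a token list."""
    count = tokens.count(unk_token)
    share = count / len(tokens) if tokens else 0.0
    return count, share

# Made-up token list, standing in for a tokenizer's output.
sample = ["[UNK]", "the", "new", "x", "##z", "-", "900", "is", "absolutely"]
count, share = unk_stats(sample)
print(f"{count} of {len(sample)} tokens are unknown ({share:.0%})")
```

Inside the loop above, the same helper could be called as `unk_stats(tokens_bert)`.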

## Multilingual Tokenization Challenge

Nice job exploring tokenization methods! Now, let's dive deeper into how different tokenizers handle multilingual text.

In this task, you will:

  * Add the T5 (SentencePiece) tokenizer to your comparison.
  * Test BERT, GPT-2, and T5 tokenizers on multilingual text samples with different writing systems (e.g., Latin, Korean, Arabic, Chinese).
  * Examine how each tokenizer handles OOV tokens and compare tokenization patterns.
  * Analyze which tokenizer is most effective for non-English text.

This will help you understand the strengths and limitations of each method. Let's see how these tokenizers perform!

```python
from transformers import AutoTokenizer

# Sample multilingual text
text = "Hello, 안녕하세요, مرحبا, 你好"

# Load WordPiece tokenizer (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens_bert = tokenizer_bert.tokenize(text)
print("BERT Tokenization:", tokens_bert)

# Load BPE tokenizer (GPT-2)
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens_gpt2 = tokenizer_gpt2.tokenize(text)
print("GPT-2 Tokenization:", tokens_gpt2)

# TODO: Load SentencePiece tokenizer (T5)
# TODO: Tokenize the sample multilingual text using the T5 tokenizer
# TODO: Print the tokenization result for T5

```

That's an excellent next step! Expanding the comparison to include a multilingual sample will clearly demonstrate the differences between these tokenization methods.

The key to completing the code is to load the `t5-base` model with `AutoTokenizer`, apply it to the multilingual text, and then print the result. This will allow for a direct comparison of how BERT, GPT-2, and T5 each handle non-Latin characters.

Here is the completed code with the T5 tokenizer added as requested:

```python
from transformers import AutoTokenizer

# Sample multilingual text
text = "Hello, 안녕하세요, مرحبا, 你好"

# Load WordPiece tokenizer (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens_bert = tokenizer_bert.tokenize(text)
print("BERT Tokenization:", tokens_bert)

# Load BPE tokenizer (GPT-2)
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens_gpt2 = tokenizer_gpt2.tokenize(text)
print("GPT-2 Tokenization:", tokens_gpt2)

# Load SentencePiece tokenizer (T5)
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-base")
# Tokenize the sample multilingual text using the T5 tokenizer
tokens_t5 = tokenizer_t5.tokenize(text)
# Print the tokenization result for T5
print("T5 Tokenization:", tokens_t5)

```
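The strings returned by `tokenize()` do not always reveal whether a piece is known to the model, because a piece can look intact and still map to the unknown-token ID. The sketch below checks at the ID level, using `tokenize`, `convert_tokens_to_ids`, and `unk_token_id`, which Hugging Face tokenizers expose. To stay runnable without downloading a model, it uses a hypothetical `StubTokenizer` with a made-up vocabulary as a stand-in for a real one.

```python
def unknown_pieces(tokenizer, text):
    """Return the pieces of `text` whose ID equals the unknown-token ID."""
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return [tok for tok, idx in zip(tokens, ids) if idx == tokenizer.unk_token_id]

class StubTokenizer:
    """Hypothetical stand-in exposing the three members used above."""
    unk_token_id = 0
    vocab = {"<unk>": 0, "▁Hello": 1, ",": 2}

    def tokenize(self, text):
        # Toy rule: split on spaces, mark word starts, detach trailing commas.
        pieces = []
        for word in text.split():
            pieces.append("▁" + word.rstrip(","))
            if word.endswith(","):
                pieces.append(",")
        return pieces

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.unk_token_id) for tok in tokens]

unknown = unknown_pieces(StubTokenizer(), "Hello, 안녕하세요, 你好")
print("Unknown pieces:", unknown)
```

With the tokenizers loaded in the code above, a call such as `unknown_pieces(tokenizer_t5, text)` would run the same check against a real vocabulary.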

## Multilingual Tokenization and OOV Reduction

Excellent work with the multilingual tokenization comparison! Now, let's take it a step further by implementing practical strategies to reduce those pesky unknown tokens.

In this exercise, you'll use the same multilingual text to compare how a standard RoBERTa tokenizer and a multilingual XLM-RoBERTa tokenizer handle different languages. RoBERTa's byte-level BPE never emits an unknown token, so focus on how many pieces each tokenizer needs and how readable those pieces are.

This hands-on exercise will give you practical skills for choosing the right tokenizer for real-world multilingual NLP applications.

```python
from transformers import AutoTokenizer

# Sample multilingual text (same as previous exercise)
text = "Hello, 안녕하세요, مرحبا, 你好"

# TODO: Load standard RoBERTa tokenizer
# TODO: Load multilingual XLM-RoBERTa tokenizer

# TODO: Tokenize text with RoBERTa tokenizer
# TODO: Tokenize text with XLM-RoBERTa tokenizer

# TODO: Print RoBERTa tokenization 
# TODO: Print XLM-RoBERTa tokenization

```

That's an excellent final step to solidify your understanding of multilingual tokenization. The difference between a standard and a multilingual model's tokenizer is one of the most important concepts for handling diverse text data.

To complete this task, you'll need to load the `roberta-base` model for the standard tokenizer and the `xlm-roberta-base` model for the multilingual one using `AutoTokenizer`. Once loaded, you can apply them to the provided text and print the results to see the difference firsthand.

Here is the completed code:

```python
from transformers import AutoTokenizer

# Sample multilingual text (same as previous exercise)
text = "Hello, 안녕하세요, مرحبا, 你好"

# Load standard RoBERTa tokenizer
tokenizer_roberta = AutoTokenizer.from_pretrained("roberta-base")

# Load multilingual XLM-RoBERTa tokenizer
tokenizer_xlm_roberta = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Tokenize text with RoBERTa tokenizer
tokens_roberta = tokenizer_roberta.tokenize(text)

# Tokenize text with XLM-RoBERTa tokenizer
tokens_xlm_roberta = tokenizer_xlm_roberta.tokenize(text)

# Print RoBERTa tokenization 
print("RoBERTa Tokenization:", tokens_roberta)

# Print XLM-RoBERTa tokenization
print("XLM-RoBERTa Tokenization:", tokens_xlm_roberta)
```
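One way to put a number on the difference between the two outputs is the average number of tokens per word. The helper below is a rough sketch: `tokens_per_word` is a hypothetical name, the token lists are made up for illustration, and splitting on whitespace is only an approximation for scripts such as Chinese that do not separate words with spaces.

```python
def tokens_per_word(tokens, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    return len(tokens) / len(words) if words else 0.0

sample_text = "Hello, 안녕하세요"

# Made-up token lists, standing in for the two tokenizers' outputs.
byte_level = ["Hello", ",", "Ġì", "ķ", "Ī", "ë", "ħ", "ķ"]
multilingual = ["▁Hello", ",", "▁안녕하세요"]

print("Byte-level:", tokens_per_word(byte_level, sample_text))
print("Multilingual:", tokens_per_word(multilingual, sample_text))
```

With the real outputs, the calls would be `tokens_per_word(tokens_roberta, text)` and `tokens_per_word(tokens_xlm_roberta, text)`; a lower value means the tokenizer represents the text more compactly.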