[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/02_tokenizers.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/02_tokenizers.ipynb)

# 02 - Tokenizers: The Foundation of Transformer Models

## Learning Objectives
By the end of this notebook, you will understand:
- What tokenization is and why it's crucial for NLP
- Different tokenization strategies (word, subword, character)
- How Hugging Face tokenizers work
- Special tokens and their purposes
- Handling padding, truncation, and attention masks
- Advanced tokenizer features and customization

## What is Tokenization?

Tokenization is the process of splitting text into a sequence of tokens (words, subwords, or characters) and mapping each token to an integer ID that a machine learning model can process. It is the bridge between human-readable text and the numerical data that models operate on.
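
As a minimal sketch of that text-to-IDs round trip, the snippet below uses a tiny hand-written vocabulary. The vocabulary, the `encode` and `decode` helpers, and the whitespace splitting are all assumptions made for illustration; real tokenizers learn their vocabularies from data and split text far more carefully.

```python
# Toy vocabulary: every entry here is made up for illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def encode(text):
    """Split on whitespace, then map each token to its integer ID."""
    tokens = text.lower().split()
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    """Map IDs back to token strings and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Hello world !")
print(ids)                    # [1, 2, 3]
print(decode(ids))            # hello world !
print(encode("Hello there"))  # [1, 0]; "there" is not in the vocabulary
```

The last line shows the weakness of a fixed word-level vocabulary: any unseen word collapses to the same unknown ID, which is the problem subword tokenization is designed to reduce.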

## Why is Tokenization Important?

- **Neural networks work with numbers**, not text
- **Vocabulary management**: Handling unknown words and rare terms
- **Efficiency**: Balancing vocabulary size with representation quality
- **Consistency**: Ensuring the same text is always tokenized the same way

In [None]:
# Import necessary libraries
from transformers import (
    AutoTokenizer, BertTokenizer, GPT2Tokenizer, T5Tokenizer,
    RobertaTokenizer, DistilBertTokenizer
)
import torch
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from collections import Counter
import os
import warnings
warnings.filterwarnings('ignore')

# Load environment variables from .env.local for local development
try:
    from dotenv import load_dotenv
    load_dotenv('.env.local', override=True)
    print("Environment variables loaded from .env.local")
except ImportError:
    print("python-dotenv not installed, skipping .env.local loading")

# Credential management function
from typing import Optional

def get_api_key(key_name: str) -> Optional[str]:
    """Get API key from Colab secrets, falling back to environment variables."""
    try:
        # Try to import Colab userdata (only available in Colab)
        from google.colab import userdata
        return userdata.get(key_name)
    except Exception:
        # Not running in Colab, or the secret is unavailable: use the local environment
        api_key = os.getenv(key_name)
        if not api_key:
            print(f"Info: {key_name} not found. Public models will work without authentication.")
            return None
        return api_key

# Device detection function
def get_device():
    """Get the best available device for training/inference."""
    if torch.cuda.is_available():
        return torch.device("cuda")
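    elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        # The getattr guard keeps this working on PyTorch builds without the MPS backend
        return torch.device("mps")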
    else:
        return torch.device("cpu")

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Setup authentication and device
hf_token = get_api_key('HF_TOKEN')
if hf_token:
    os.environ['HF_TOKEN'] = hf_token
    print("Hugging Face token configured")

device = get_device()
print("\n=== Setup Information ===")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif device.type == 'mps':
    print("Apple Silicon GPU (MPS) detected")
print("Libraries loaded successfully!")

## Part 1: Basic Tokenization Concepts

Let's start with understanding different levels of tokenization.

In [None]:
# Sample text for demonstration
sample_text = "Hello! How are you doing today? I'm learning about tokenization."

print(f"Original text: {sample_text}")
print(f"Text length: {len(sample_text)} characters\n")

# Character-level tokenization
char_tokens = list(sample_text)
print(f"Character tokens: {char_tokens[:20]}...")  # Show first 20
print(f"Number of character tokens: {len(char_tokens)}\n")

# Simple word-level tokenization (split by spaces)
word_tokens = sample_text.split()
print(f"Word tokens: {word_tokens}")
print(f"Number of word tokens: {len(word_tokens)}\n")

# Punctuation handling
import string
text_no_punct = sample_text.translate(str.maketrans('', '', string.punctuation))
word_tokens_no_punct = text_no_punct.split()
print(f"Word tokens (no punctuation): {word_tokens_no_punct}")
print(f"Number of tokens without punctuation: {len(word_tokens_no_punct)}")
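
Splitting on spaces leaves punctuation glued to words, and deleting punctuation discards information. A middle ground is to keep punctuation marks as separate tokens. The sketch below does this with a regular expression; the helper name `split_words_and_punctuation` and the exact pattern are assumptions for illustration, not the rule any particular library uses.

```python
import re

def split_words_and_punctuation(text):
    """Return words and punctuation marks as separate tokens."""
    # \w+(?:'\w+)? matches a word with an optional apostrophe part (e.g. "I'm");
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = split_words_and_punctuation(
    "Hello! How are you doing today? I'm learning about tokenization."
)
print(tokens)
print(f"Number of tokens: {len(tokens)}")
```

This is close in spirit to the pre-tokenization step that many subword tokenizers run before splitting words into smaller pieces.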

## Part 2: Hugging Face Tokenizers in Action

Let's explore how different transformer models tokenize text.

In [None]:
# Load different tokenizers
tokenizers = {
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),
    "DistilBERT": AutoTokenizer.from_pretrained("distilbert-base-uncased")
}

# Test text with some challenging words
test_text = "The unhappiness of the students was unfathomable."

print(f"Test text: {test_text}\n")

for name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(test_text)
    print(f"{name} tokenization:")
    print(f"  Tokens: {tokens}")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Vocabulary size: {len(tokenizer)}\n")
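
The models above differ partly because they use different subword algorithms: BERT and DistilBERT use WordPiece, while GPT-2 and RoBERTa use byte-level BPE (byte pair encoding). The following is a simplified sketch of how BPE learns merge rules, assuming a made-up four-word corpus and character-level symbols. Real implementations work on bytes, apply pre-tokenization, and learn tens of thousands of merges; the helper names are hypothetical.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word is a tuple of characters with a made-up count.
word_freqs = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair_counts = get_pair_counts(word_freqs)
    best_pair = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    merges.append(best_pair)
    word_freqs = merge_pair(best_pair, word_freqs)

print("Learned merges:", merges)
print("Words after merging:", list(word_freqs))
```

Each merge adds one symbol to the vocabulary, so the number of merges directly controls the trade-off between vocabulary size and tokens per word.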

## Part 3: Understanding Subword Tokenization

Subword tokenization is key to modern transformer models. Let's see how it handles out-of-vocabulary words.
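
To make the idea concrete, here is a minimal sketch of the greedy longest-match-first splitting used by WordPiece-style tokenizers. The toy vocabulary and the `wordpiece_tokenize` helper are assumptions for illustration; the real BERT vocabulary may split the same words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only to make the example readable.
toy_vocab = {"un", "happy", "##happi", "##ness", "token", "##ization", "##s"}

print(wordpiece_tokenize("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A word only becomes `[UNK]` when no sequence of pieces covers it, which is why subword vocabularies rarely produce unknown tokens for ordinary text.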

In [None]:
# Load BERT tokenizer for detailed analysis
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Test words with varying complexity
test_words = [
    "hello",           # Common word
    "tokenization",    # Technical term
    "unhappiness",     # Complex word with prefix/suffix
    "antidisestablishmentarianism",  # Very long word
    "COVID-19",        # Recent term
    "ChatGPT",         # Proper noun/brand
    "unfathomable"     # Less common word
]

print("Subword Tokenization Analysis:\n")

for word in test_words:
    tokens = bert_tokenizer.tokenize(word)
    token_ids = bert_tokenizer.encode(word, add_special_tokens=False)
    
    print(f"Word: '{word}'")
    print(f"  Tokens: {tokens}")
    print(f"  Token IDs: {token_ids}")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Reconstructed: '{bert_tokenizer.decode(token_ids)}'\n")

### Visualizing Subword Tokenization

In [None]:
# Create a visualization of tokenization differences
def compare_tokenization(text, tokenizers_dict):
    results = []
    
    for name, tokenizer in tokenizers_dict.items():
        tokens = tokenizer.tokenize(text)
        results.append({
            'Model': name,
            'Tokens': len(tokens),
            'Vocab_Size': len(tokenizer)
        })
    
    return pd.DataFrame(results)

# Compare tokenization for different texts
texts = [
    "Hello world",
    "The quick brown fox jumps",
    "Antidisestablishmentarianism is difficult",
    "COVID-19 pandemic affected everyone"
]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Tokenization Comparison Across Different Models', fontsize=16)

for i, text in enumerate(texts):
    row, col = i // 2, i % 2
    
    df = compare_tokenization(text, tokenizers)
    
    axes[row, col].bar(df['Model'], df['Tokens'])
    axes[row, col].set_title(f'Text: "{text[:30]}..."' if len(text) > 30 else f'Text: "{text}"')
    axes[row, col].set_ylabel('Number of Tokens')
    axes[row, col].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Part 4: Special Tokens

Special tokens serve specific purposes in transformer models. Let's explore them.
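
As a rough sketch of what these tokens do structurally, the snippet below assembles a BERT-style input by hand: `[CLS]` opens the sequence, `[SEP]` closes each segment, and a parallel list of segment IDs records which sentence each token belongs to. The helper name `build_bert_style_inputs` is hypothetical, and the input tokens are assumed to be already tokenized.

```python
def build_bert_style_inputs(tokens_a, tokens_b=None):
    """Wrap one or two token lists as [CLS] A [SEP] (B [SEP])."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 covers [CLS], A, and the first [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 covers B and the final [SEP]
    return tokens, segment_ids

tokens, segment_ids = build_bert_style_inputs(["hello", "world"], ["how", "are", "you"])
print(tokens)       # ['[CLS]', 'hello', 'world', '[SEP]', 'how', 'are', 'you', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The segment IDs correspond to the `token_type_ids` field that BERT tokenizers return for sentence pairs.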

In [None]:
# Examine special tokens for different models
def analyze_special_tokens(tokenizer, model_name):
    print(f"{model_name} Special Tokens:")
    
    special_tokens = {
        'CLS': getattr(tokenizer, 'cls_token', None),
        'SEP': getattr(tokenizer, 'sep_token', None),
        'PAD': getattr(tokenizer, 'pad_token', None),
        'UNK': getattr(tokenizer, 'unk_token', None),
        'MASK': getattr(tokenizer, 'mask_token', None),
        'BOS': getattr(tokenizer, 'bos_token', None),
        'EOS': getattr(tokenizer, 'eos_token', None)
    }
    
    for token_type, token in special_tokens.items():
        if token is not None:
            token_id = tokenizer.convert_tokens_to_ids(token)
            print(f"  {token_type}: '{token}' (ID: {token_id})")
        else:
            print(f"  {token_type}: Not available")
    print()

# Analyze special tokens for each model
for name, tokenizer in tokenizers.items():
    analyze_special_tokens(tokenizer, name)

### Special Tokens in Practice

In [None]:
# Demonstrate how special tokens are used
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello world"

# Tokenize without special tokens
tokens_no_special = bert_tokenizer.tokenize(text)
ids_no_special = bert_tokenizer.encode(text, add_special_tokens=False)

# Tokenize with special tokens
ids_with_special = bert_tokenizer.encode(text, add_special_tokens=True)
# Derive the token strings from the IDs; this works for both fast and slow tokenizers
tokens_with_special = bert_tokenizer.convert_ids_to_tokens(ids_with_special)

print(f"Original text: '{text}'\n")

print("Without special tokens:")
print(f"  Tokens: {tokens_no_special}")
print(f"  IDs: {ids_no_special}")
print(f"  Decoded: '{bert_tokenizer.decode(ids_no_special)}'\n")

print("With special tokens:")
print(f"  Tokens: {tokens_with_special}")
print(f"  IDs: {ids_with_special}")
print(f"  Decoded: '{bert_tokenizer.decode(ids_with_special)}'")

# Show what each ID represents
print("\nToken ID breakdown:")
for i, token_id in enumerate(ids_with_special):
    token = bert_tokenizer.convert_ids_to_tokens(token_id)
    print(f"  Position {i}: ID {token_id} -> '{token}'")

## Part 5: Padding, Truncation, and Attention Masks

When processing batches of text, we need to handle sequences of different lengths.
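
The mechanics can be sketched in plain Python before turning to the tokenizer API. The helper below truncates each sequence to an optional `max_length`, pads every row to the longest remaining length, and builds the matching attention mask. The function name, the token IDs, and the pad ID of 0 are assumptions for illustration. Note one simplification: real tokenizers keep the closing special token when truncating, which this sketch does not.

```python
def pad_and_truncate(batch_ids, max_length=None, pad_id=0):
    """Return (input_ids, attention_mask) with every row the same length."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]  # truncation
    target = max(len(ids) for ids in batch_ids)  # pad to the longest row
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token IDs for three sequences of different lengths.
batch = [[5, 6], [5, 6, 7, 8], [5, 6, 7, 8, 9, 10]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

for row, mask in zip(input_ids, attention_mask):
    print(row, mask)
```

After this step every row has the same length, so the batch can be stacked into a single rectangular tensor.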

In [None]:
# Create texts of different lengths
texts = [
    "Short",
    "This is a medium length sentence.",
    "This is a much longer sentence that contains more words and information about various topics.",
    "Very long sentence that goes on and on and contains lots of information about many different subjects and concepts that might be relevant to natural language processing tasks."
]

print("Original texts:")
for i, text in enumerate(texts):
    print(f"{i+1}. '{text}' (length: {len(text.split())} words)")

print("\n" + "="*50)

# Tokenize without padding/truncation
print("\nTokenization without padding:")
for i, text in enumerate(texts):
    tokens = bert_tokenizer.encode(text)
    print(f"{i+1}. Length: {len(tokens)} tokens")

### Padding Strategies

In [None]:
# Different padding strategies
max_length = 20

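padding_strategies = {
    # Rows of unequal length cannot be stacked into one tensor, so this case raises an error that the try/except below reports
    "No padding": {"padding": False, "truncation": False},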
    "Pad to max_length": {"padding": "max_length", "max_length": max_length, "truncation": True},
    "Pad to longest": {"padding": "longest", "truncation": True, "max_length": max_length}
}

for strategy_name, params in padding_strategies.items():
    print(f"\n{strategy_name}:")
    try:
        result = bert_tokenizer(texts, return_tensors="pt", **params)
        
        print(f"  Input IDs shape: {result['input_ids'].shape}")
        print(f"  Attention mask shape: {result['attention_mask'].shape}")
        
        # Show first example in detail
        print(f"  First example input_ids: {result['input_ids'][0].tolist()}")
        print(f"  First example attention_mask: {result['attention_mask'][0].tolist()}")
        
    except Exception as e:
        print(f"  Error: {e}")

### Understanding Attention Masks

In [None]:
# Visualize attention masks
texts = ["Hello", "Hello world", "Hello world how are you"]

# Tokenize with padding
encoded = bert_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

print("Attention Mask Visualization:")
print("1 = attend to this token, 0 = ignore (padding)\n")

fig, ax = plt.subplots(1, 1, figsize=(10, 6))

# Create heatmap of attention masks
attention_masks = encoded['attention_mask'].numpy()
input_ids = encoded['input_ids'].numpy()

# Get token strings for labels
token_labels = []
for i in range(attention_masks.shape[0]):
    tokens = bert_tokenizer.convert_ids_to_tokens(input_ids[i])
    token_labels.append(tokens)

# Plot heatmap
sns.heatmap(attention_masks, 
            annot=True, 
            fmt='d', 
            cmap='RdYlBu_r',
            yticklabels=[f"Text {i+1}" for i in range(len(texts))],
            xticklabels=token_labels[-1],  # Use tokens from the longest example (the last one), so no label is [PAD]
            ax=ax)

ax.set_title('Attention Masks for Batched Input')
ax.set_xlabel('Token Position')
ax.set_ylabel('Example')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Show detailed breakdown
for i, text in enumerate(texts):
    tokens = bert_tokenizer.convert_ids_to_tokens(input_ids[i])
    mask = attention_masks[i]
    
    print(f"\nExample {i+1}: '{text}'")
    print("Token\t\tMask")
    print("-" * 20)
    for token, mask_val in zip(tokens, mask):
        status = "ATTEND" if mask_val == 1 else "IGNORE"
        print(f"{token:<15} {mask_val} ({status})")
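
To see why the mask matters inside the model, here is a minimal sketch of a masked softmax over one row of attention scores. Positions with a mask value of 0 are set to negative infinity before the softmax, so they receive exactly zero weight. The scores are made up, the helper name is hypothetical, and the sketch assumes at least one position is unmasked. Real models commonly achieve the same effect by adding a large negative number to the masked scores.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores` where positions with mask == 0 get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for four positions; the last two are padding.
weights = masked_softmax([2.0, 1.0, 0.5, 0.5], [1, 1, 0, 0])
print([round(w, 3) for w in weights])
```

Without the mask, the two padding positions would absorb part of the attention weight and distort the representation of the real tokens.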

## Part 6: Advanced Tokenizer Features

Let's explore some advanced features of Hugging Face tokenizers.

In [None]:
# Custom tokenizer configuration
def demonstrate_tokenizer_options(text):
    """Show different tokenizer configuration options"""
    
    configs = {
        "Default": {},
        "No special tokens": {"add_special_tokens": False},
        "Return tensors": {"return_tensors": "pt"},
        "Return attention mask": {"return_attention_mask": True},
        "Return token type IDs": {"return_token_type_ids": True},
        "Return offsets": {"return_offsets_mapping": True},
        "Truncate to 10": {"max_length": 10, "truncation": True},
        "Pad to 15": {"max_length": 15, "padding": "max_length"}
    }
    
    print(f"Text: '{text}'\n")
    
    for config_name, config in configs.items():
        print(f"{config_name}:")
        try:
            result = bert_tokenizer(text, **config)
            
            if hasattr(result, "items"):  # BatchEncoding is dict-like (a UserDict) but not a dict subclass
                for key, value in result.items():
                    if torch.is_tensor(value):
                        print(f"  {key}: {value.tolist()}")
                    else:
                        print(f"  {key}: {value}")
            else:
                print(f"  Result: {result}")
                
        except Exception as e:
            print(f"  Error: {e}")
        print()

demonstrate_tokenizer_options("Hello world, how are you today?")
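
The offsets returned by `return_offsets_mapping` are character spans of the form `(start, end)` that point back into the original string. The sketch below shows the same idea for plain whitespace tokens; the helper name is hypothetical, and a real tokenizer reports spans for subword pieces rather than whitespace-separated words.

```python
import re

def whitespace_tokens_with_offsets(text):
    """Return (token, (start, end)) pairs, with end exclusive."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]

text = "Hello world, how are you?"
pairs = whitespace_tokens_with_offsets(text)

for token, (start, end) in pairs:
    # Slicing the original text with the span recovers the token exactly.
    print(f"{token!r:10} -> ({start}, {end}) -> {text[start:end]!r}")
```

Offsets like these are what make it possible to map model predictions (for example, entity labels) back to exact character positions in the input.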

### Tokenizer Fast vs Slow

In [None]:
import time

# Compare fast vs slow tokenizers
text_list = ["This is a test sentence for tokenization speed."] * 1000

# Fast tokenizer (default)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Slow tokenizer (Python-based)
try:
    slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
    
    # Time fast tokenizer (perf_counter is a monotonic, high-resolution clock suited to benchmarking)
    start_time = time.perf_counter()
    fast_result = fast_tokenizer(text_list, padding=True, truncation=True)
    fast_time = time.perf_counter() - start_time
    
    # Time slow tokenizer
    start_time = time.perf_counter()
    slow_result = slow_tokenizer(text_list, padding=True, truncation=True)
    slow_time = time.perf_counter() - start_time
    
    print(f"Fast tokenizer time: {fast_time:.4f} seconds")
    print(f"Slow tokenizer time: {slow_time:.4f} seconds")
    print(f"Speedup: {slow_time/fast_time:.2f}x")
    
    # Check if results are identical
    identical = (fast_result['input_ids'] == slow_result['input_ids'])
    print(f"Results identical: {identical}")
    
except Exception as e:
    print(f"Could not load slow tokenizer: {e}")
    print("Fast tokenizers are the default and recommended option.")

## Part 7: Working with Multiple Languages

In [None]:
# Load multilingual tokenizer
multilingual_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Test texts in different languages
multilingual_texts = {
    "English": "Hello, how are you today?",
    "Spanish": "Hola, ¿cómo estás hoy?",
    "French": "Bonjour, comment allez-vous aujourd'hui?",
    "German": "Hallo, wie geht es dir heute?",
    "Chinese": "你好，你今天怎么样？",
    "Japanese": "こんにちは、今日はいかがですか？"
}

print("Multilingual Tokenization:")
print("=" * 50)

for language, text in multilingual_texts.items():
    tokens = multilingual_tokenizer.tokenize(text)
    token_ids = multilingual_tokenizer.encode(text, add_special_tokens=False)
    
    print(f"\n{language}:")
    print(f"  Text: {text}")
    print(f"  Tokens: {tokens}")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Reconstructed: {multilingual_tokenizer.decode(token_ids)}")

## Part 8: Tokenizer Vocabulary Analysis

In [None]:
# Analyze tokenizer vocabulary
def analyze_vocabulary(tokenizer, model_name, sample_size=1000):
    """Analyze tokenizer vocabulary characteristics"""
    
    vocab = tokenizer.get_vocab()
    vocab_size = len(vocab)
    
    # Sample some tokens for analysis
    sample_tokens = list(vocab.keys())[:sample_size]
    
    # Analyze token characteristics
    token_lengths = [len(token) for token in sample_tokens]
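    # Note: 'Ġ' marks a leading space in byte-level BPE vocabularies (GPT-2, RoBERTa); it is not a special-token marker,
    # so rely on the tokenizer's own list of special tokens instead of prefix heuristics
    known_special = set(tokenizer.all_special_tokens)
    special_tokens = [token for token in sample_tokens if token in known_special]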
    
    print(f"\n{model_name} Vocabulary Analysis:")
    print(f"  Total vocabulary size: {vocab_size:,}")
    print(f"  Average token length (first {sample_size}): {np.mean(token_lengths):.2f}")
    print(f"  Min token length: {min(token_lengths)}")
    print(f"  Max token length: {max(token_lengths)}")
    print(f"  Special tokens in sample: {len(special_tokens)}")
    
    # Show some example tokens
    print(f"  Sample tokens: {sample_tokens[:10]}")
    
    return {
        'vocab_size': vocab_size,
        'avg_token_length': np.mean(token_lengths),
        'token_lengths': token_lengths
    }

# Analyze different tokenizers
vocab_stats = {}
for name, tokenizer in tokenizers.items():
    vocab_stats[name] = analyze_vocabulary(tokenizer, name)

# Create comparison plot
models = list(vocab_stats.keys())
vocab_sizes = [vocab_stats[model]['vocab_size'] for model in models]
avg_lengths = [vocab_stats[model]['avg_token_length'] for model in models]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Vocabulary sizes
ax1.bar(models, vocab_sizes)
ax1.set_title('Vocabulary Sizes')
ax1.set_ylabel('Number of tokens')
ax1.tick_params(axis='x', rotation=45)

# Average token lengths
ax2.bar(models, avg_lengths)
ax2.set_title('Average Token Length')
ax2.set_ylabel('Characters per token')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Part 9: Common Tokenization Pitfalls and Solutions

In [None]:
# Common issues and their solutions
def demonstrate_common_issues():
    """Show common tokenization issues and how to handle them"""
    
    issues = {
        "Empty strings": ["", "   ", "\n\t"],
        "Very long text": ["word " * 1000],  # 1000 words
        "Special characters": ["Hello @user! Check out https://example.com 😊"],
        "Mixed languages": ["Hello 世界 bonjour"],
        "Numbers and dates": ["Born on 1990-01-01, phone: +1-555-123-4567"]
    }
    
    for issue_type, examples in issues.items():
        print(f"\n{issue_type.upper()}:")
        print("-" * 30)
        
        for text in examples:
            display_text = repr(text) if len(text) < 50 else f"{repr(text[:47])}..."
            print(f"\nText: {display_text}")
            
            try:
                # Basic tokenization
                tokens = bert_tokenizer.tokenize(text)
                print(f"Tokens ({len(tokens)}): {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
                
                # Safe tokenization with limits
                safe_encoded = bert_tokenizer(
                    text,
                    max_length=50,
                    truncation=True,
                    padding=False,
                    return_tensors="pt"
                )
                print(f"Safe encoding shape: {safe_encoded['input_ids'].shape}")
                
            except Exception as e:
                print(f"Error: {e}")

demonstrate_common_issues()
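
Truncation keeps very long inputs within the limit, but it silently discards everything past `max_length`. When the tail of the text matters, a common alternative is to split the token IDs into overlapping windows. The sketch below shows the idea in plain Python; `chunk_with_overlap` is a hypothetical helper, and here `stride` means the number of tokens shared between consecutive windows. Hugging Face tokenizers expose related `stride` and `return_overflowing_tokens` arguments, which are not used in this sketch.

```python
def chunk_with_overlap(token_ids, max_length, stride):
    """Split a long ID list into windows of `max_length` that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

# Ten made-up token IDs, split into windows of 4 that share 1 token.
chunks = chunk_with_overlap(list(range(10)), max_length=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap gives each window some context from its neighbor, at the cost of processing a few tokens twice.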

## Part 10: Best Practices and Tips

In [None]:
# Best practices demonstration
def tokenization_best_practices():
    """Demonstrate tokenization best practices"""
    
    print("TOKENIZATION BEST PRACTICES:")
    print("=" * 40)
    
    # Practice 1: Always handle edge cases
    print("\n1. Handle edge cases safely:")
    
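    def safe_tokenize(text, tokenizer, max_length=512):
        """Tokenize text defensively: coerce non-strings and replace empty input."""
        if text is None:
            text = ""  # avoid tokenizing the literal string "None"
        elif not isinstance(text, str):
            text = str(text)
        
        if len(text.strip()) == 0:
            text = "[EMPTY]"  # placeholder only; not a registered special token, so it is split like ordinary text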
        
        return tokenizer(
            text,
            max_length=max_length,
            truncation=True,
            padding=False,
            return_tensors="pt"
        )
    
    # Test with problematic inputs
    test_inputs = ["", None, 12345, "Normal text"]
    
    for inp in test_inputs:
        try:
            result = safe_tokenize(inp, bert_tokenizer)
            print(f"  Input: {repr(inp)} -> Tokens: {result['input_ids'].shape[1]}")
        except Exception as e:
            print(f"  Input: {repr(inp)} -> Error: {e}")
    
    # Practice 2: Batch processing
    print("\n2. Use batch processing for efficiency:")
    
    texts = ["Text 1", "Text 2", "Text 3"]
    
    # Inefficient: one by one (perf_counter avoids zero-length intervals on coarse clocks)
    start_time = time.perf_counter()
    individual_results = [bert_tokenizer(text, return_tensors="pt") for text in texts]
    individual_time = time.perf_counter() - start_time
    
    # Efficient: batch processing
    start_time = time.perf_counter()
    batch_result = bert_tokenizer(texts, padding=True, return_tensors="pt")
    batch_time = time.perf_counter() - start_time
    
    print(f"  Individual processing: {individual_time:.6f} seconds")
    print(f"  Batch processing: {batch_time:.6f} seconds")
    print(f"  Speedup: {individual_time/batch_time:.2f}x")
    
    # Practice 3: Memory considerations
    print("\n3. Consider memory usage:")
    
    # Memory-efficient approach for large datasets
    def memory_efficient_tokenization(texts, tokenizer, batch_size=32):
        """Tokenize large datasets in batches"""
        results = []
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            batch_result = tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt"
            )
            results.append(batch_result)
        
        return results
    
    # Example with larger dataset
    large_dataset = [f"Sample text {i}" for i in range(100)]
    batched_results = memory_efficient_tokenization(large_dataset, bert_tokenizer)
    print(f"  Processed {len(large_dataset)} texts in {len(batched_results)} batches")

tokenization_best_practices()
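
One caveat about `memory_efficient_tokenization`: it still collects every encoded batch in the `results` list, so peak memory grows with the dataset. A generator avoids this by handing out one batch at a time. The sketch below shows only the batching logic, with a hypothetical `iter_batches` helper; in practice each yielded batch would be passed to the tokenizer and consumed before the next one is requested.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"Sample text {i}" for i in range(100)]
batch_sizes = [len(batch) for batch in iter_batches(texts, 32)]
print(batch_sizes)  # [32, 32, 32, 4]
```

Because only one batch exists at a time, memory use depends on `batch_size` rather than on the total number of texts.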

## Summary

In this notebook, we covered:

1. **Tokenization Fundamentals**: Understanding what tokenization is and why it's important
2. **Hugging Face Tokenizers**: How different models tokenize text differently
3. **Subword Tokenization**: How modern tokenizers handle complex and unknown words
4. **Special Tokens**: Understanding CLS, SEP, PAD, UNK, MASK and their purposes
5. **Padding and Attention Masks**: Handling variable-length sequences in batches
6. **Advanced Features**: Tokenizer configuration options and performance considerations
7. **Multilingual Support**: Working with multiple languages
8. **Vocabulary Analysis**: Understanding tokenizer vocabularies
9. **Common Pitfalls**: Issues you might encounter and how to handle them
10. **Best Practices**: Efficient and safe tokenization patterns

## Key Takeaways

- **Subword tokenization** balances vocabulary size with representation quality
- **Special tokens** mark sequence boundaries, padding, unknown words, and masked positions
- **Attention masks** are crucial for handling padded sequences
- **Different models** use different tokenization strategies
- **Batch processing** is more efficient than individual tokenization
- **Always handle edge cases** in production code

## Next Steps

- **Notebook 03**: Working with the datasets library
- **Notebook 04**: Mini-project integrating tokenization with datasets

Understanding tokenization is fundamental to working effectively with transformer models. The concepts learned here will be essential for all subsequent notebooks!

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*