# Exercise: Custom Tokenizer and Vocabulary Builder

**Estimated Time**: 15 minutes | **Status**: 🚧 Your implementation

**Scenario**: Your chatbot splits "can't" into "can" + "'t" (losing negation) and "USB-C" into three tokens. Build a custom tokenizer that handles edge cases while keeping vocabulary manageable.

**What You'll Learn**: Tokenization challenges, vocabulary building, encode/decode functionality, unknown word analysis, and why production systems use subword tokenization.

---

## 🤖 Why This Matters in the GenAI Era

**"Doesn't ChatGPT handle all this already?"** Great question! Here's why tokenization is MORE critical now:

- **API Costs**: Hosted models bill per token (GPT-4 launched at about $0.03 per 1K input tokens), so text that tokenizes into fewer tokens costs less
- **Custom Models**: Fine-tuning, RAG systems, and domain-specific AI need proper tokenization
- **Debugging AI**: When models fail, tokenization issues are often the root cause
- **Performance**: Efficient tokenization directly impacts inference speed and memory usage

**Bottom line**: Understanding tokenization helps you build better, faster, cheaper AI systems. Let's see how! 🚀

---

In [1]:
# Import necessary libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import re
from datasets import load_dataset
from transformers import AutoTokenizer
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

✅ Libraries imported successfully!
PyTorch version: 2.9.0


## Part A: Implement Basic Tokenizer

Create a simple tokenizer that splits text into tokens and handles basic punctuation.
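
As a point of comparison for your own `basic_tokenize`, here is a minimal sketch of one possible approach. It assumes a single regular expression is acceptable and that contractions and hyphenated terms should stay whole. The names `TOKEN_PATTERN` and `sketch_tokenize` are hypothetical and not part of the exercise skeleton.

```python
import re

# Hypothetical pattern (an assumption, not the exercise's required solution):
# a run of ASCII letters/digits, optionally continued by apostrophe- or
# hyphen-joined runs, OR any single other non-whitespace character.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*|[^\sa-z0-9]")

def sketch_tokenize(text):
    """Lowercase the text, then return word and punctuation tokens in order."""
    return TOKEN_PATTERN.findall(text.lower())

print(sketch_tokenize("I can't help you today."))
print(sketch_tokenize("My USB-C cable is broken."))
```

This keeps "can't" and "usb-c" as single tokens and emits punctuation separately. It only joins on the straight apostrophe and the ASCII hyphen, and it treats non-ASCII letters as standalone symbols, so real support messages may need a broader pattern.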

In [None]:
class CustomTokenizer:
    """
    A custom tokenizer for handling text processing.
    """
    
    def __init__(self):
        # TODO - Initialize vocabulary dictionaries and special tokens
        # Hint: You'll need word_to_id, id_to_word, vocab_counts
        # Special tokens: PAD, UNK, START, END
        pass
    
    def _add_word(self, word):
        """Add word to vocabulary if not present."""
        # TODO - Add word to both word_to_id and id_to_word mappings
        pass
    
    def basic_tokenize(self, text):
        """
        Basic tokenization using whitespace splitting with punctuation handling.
        
        Strategy:
        1. Convert to lowercase for consistency
        2. Split on whitespace
        3. Handle punctuation separately
        """
        # TODO - Implement tokenization logic
        # Hint: Convert to lowercase, split on whitespace, handle punctuation
        pass
    
    def test_edge_cases(self):
        """Test tokenizer on various examples."""
        test_cases = [
            "I can't help you today.",           
            "My USB-C cable is broken.",         
            "The customer's order was delayed.", 
            "It's a state-of-the-art system!",  
            "Won't you help? That's odd.",       
        ]
        
        print("üß™ Testing tokenization:\n")
        for i, text in enumerate(test_cases, 1):
            tokens = self.basic_tokenize(text)
            print(f"{i}. Input: '{text}'")
            print(f"   Tokens: {tokens}")
            print(f"   Count: {len(tokens)} tokens\n")

# Test the basic tokenizer
# TODO - Create tokenizer instance and test it
tokenizer = None  # Replace with your implementation
# tokenizer.test_edge_cases()

## Part B: Build a Frequency-Based Vocabulary

Keep only frequently occurring tokens to manage vocabulary size and eliminate noise.
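
The frequency filter can be sketched on a toy corpus as follows. This is an illustration under stated assumptions: the function name `sketch_build_vocab`, the special-token spellings, and the choice to count special tokens toward `max_vocab_size` are all hypothetical, and whitespace splitting stands in for your tokenizer.

```python
from collections import Counter

def sketch_build_vocab(texts, min_frequency=2, max_vocab_size=10000):
    """Return (token -> id mapping, raw counts), keeping only frequent tokens."""
    # Assumed special-token spellings; the exercise leaves these up to you.
    special_tokens = ["<PAD>", "<UNK>", "<START>", "<END>"]

    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # stand-in for basic_tokenize

    word_to_id = {tok: i for i, tok in enumerate(special_tokens)}
    # Assumption: special tokens count toward the size limit.
    room = max_vocab_size - len(special_tokens)

    # most_common() yields tokens by descending count, so we can stop early.
    for token, count in counts.most_common():
        if count < min_frequency or room <= 0:
            break
        word_to_id[token] = len(word_to_id)
        room -= 1
    return word_to_id, counts

texts = ["cancel my order", "track my order", "cancel my subscription"]
vocab, counts = sketch_build_vocab(texts, min_frequency=2)
kept = len(vocab) - 4  # exclude the four special tokens
print(f"{kept} of {len(counts)} unique tokens kept")  # 3 of 5 unique tokens kept
```

Tokens that appear only once ("track", "subscription") are dropped and would later map to the unknown token, which is the trade-off this part of the exercise asks you to measure.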

In [None]:
# Load the customer support dataset (sampled into train and test sets below)
print("📊 Loading customer support dataset...")
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset", split="train")

# Extract 1,000 training messages and 200 test messages
np.random.seed(42)
all_indices = np.random.choice(len(dataset), size=1200, replace=False)
train_indices = all_indices[:1000]
test_indices = all_indices[1000:1200]

train_messages = [dataset[int(idx)]['instruction'] for idx in train_indices]
test_messages = [dataset[int(idx)]['instruction'] for idx in test_indices]

print(f"✅ Loaded {len(train_messages)} training messages and {len(test_messages)} test messages")
print("\nSample training messages:")
for i, msg in enumerate(train_messages[:3], 1):
    print(f"{i}. {msg}")

In [None]:
class CustomTokenizer(CustomTokenizer):  # Extend our existing class
    def build_vocabulary(self, texts, min_frequency=5, max_vocab_size=10000):
        """
        Build vocabulary from training texts with frequency-based filtering.
        
        Args:
            texts: List of training texts
            min_frequency: Minimum token frequency to include in vocabulary
            max_vocab_size: Maximum vocabulary size (keeps most frequent)
        """
        print(f"üèóÔ∏è Building vocabulary from {len(texts)} texts...")
        
        # TODO - Count all tokens from all texts
        # Hint: Loop through texts, tokenize each, update vocab_counts
        
        # TODO - Filter by frequency and limit vocabulary size
        # Hint: Use vocab_counts.most_common(), filter by min_frequency
        
        # TODO - Add filtered tokens to vocabulary
        # Hint: Use self._add_word() for each frequent token
        
        # TODO - Calculate and print statistics
        # Show: total unique tokens, tokens in vocab, vocabulary reduction %
        
        return self.get_vocab_stats()
    
    def get_vocab_stats(self):
        """Get detailed vocabulary statistics."""
        # TODO - Calculate coverage percentage and other stats
        # Return dictionary with vocab_size, total_unique_tokens, coverage_percentage
        pass
    
    def show_most_frequent_tokens(self, top_n=20):
        """Display most frequent tokens in vocabulary."""
        # TODO - Show top N most frequent tokens with their frequencies
        # Include whether each token is in the vocabulary or not
        pass

# Build vocabulary
# TODO - Create new tokenizer, build vocabulary, show frequent tokens


## Part C: Convert Messages to Sequences

Implement encode/decode to convert between text and numerical sequences.
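
A round trip on a tiny hand-written vocabulary shows the idea. This is a sketch, not the expected solution: the vocabulary, the special-token names and IDs, and the function names are hypothetical, and whitespace splitting stands in for your tokenizer. Leaving `<UNK>` visible in the decoded text is a deliberate choice here so that information loss is easy to spot.

```python
# Hypothetical toy vocabulary; special-token names and IDs are assumptions.
word_to_id = {"<PAD>": 0, "<UNK>": 1, "<START>": 2, "<END>": 3,
              "help": 4, "with": 5, "refund": 6}
id_to_word = {i: w for w, i in word_to_id.items()}

def sketch_encode(text, add_special_tokens=True, max_length=None):
    """Map text to token IDs, falling back to <UNK> for unseen tokens."""
    tokens = text.lower().split()  # stand-in for basic_tokenize
    if add_special_tokens:
        tokens = ["<START>"] + tokens + ["<END>"]
    ids = [word_to_id.get(tok, word_to_id["<UNK>"]) for tok in tokens]
    if max_length is not None:
        ids = ids[:max_length]  # plain truncation; this can cut off <END>
    return ids

def sketch_decode(token_ids, skip_special_tokens=True):
    """Map token IDs back to a space-joined string."""
    # <UNK> is kept on purpose so lossy round trips stay visible.
    hidden = {"<PAD>", "<START>", "<END>"} if skip_special_tokens else set()
    words = [id_to_word.get(i, "<UNK>") for i in token_ids]
    return " ".join(w for w in words if w not in hidden)

ids = sketch_encode("help with iPhone refund")
print(ids)                 # [2, 4, 5, 1, 6, 3]
print(sketch_decode(ids))  # help with <UNK> refund
```

The decoded string differs from the input in two ways: casing is lost, and "iPhone" comes back as `<UNK>`. A word-level vocabulary cannot recover either, which is worth noting when you run your own round-trip test.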

In [None]:
class CustomTokenizer(CustomTokenizer):  # Extend our existing class
    def encode(self, text, add_special_tokens=True, max_length=None):
        """
        Convert text to sequence of token IDs.
        
        Args:
            text: Input text to encode
            add_special_tokens: Whether to add START/END tokens
            max_length: Maximum sequence length (truncate if longer)
            
        Returns:
            List of token IDs
        """
        # TODO - Implement encoding logic
        # 1. Tokenize text
        # 2. Add special tokens if requested
        # 3. Convert tokens to IDs (use UNK for unknown tokens)
        # 4. Apply max_length truncation if specified
        pass
    
    def decode(self, token_ids, skip_special_tokens=True):
        """
        Convert sequence of token IDs back to text.
        
        Args:
            token_ids: List of token IDs
            skip_special_tokens: Whether to exclude special tokens from output
            
        Returns:
            Decoded text string
        """
        # TODO - Implement decoding logic
        # 1. Convert IDs to tokens
        # 2. Remove special tokens if requested
        # 3. Join tokens into text string
        pass
    
    def test_round_trip(self, test_texts):
        """
        Test encode/decode round-trip functionality.
        """
        print("üîÑ Testing round-trip encoding/decoding:\n")
        
        for i, text in enumerate(test_texts, 1):
            # TODO - Test encoding and decoding
            # Show: original text, token IDs, unknown token count, decoded text
            print(f"{i}. Original: '{text}'")
            # Add your implementation here
            print()

# Test round-trip functionality
test_examples = [
    "help with refund",
    "I can't access my account",
    "My USB-C cable is defective",
    "The customer's order was cancelled",
    "This is a new product called iPhone-15-Pro-Max"  # Likely unknown words
]

# TODO - Test round-trip functionality

## Part D: Analyze Limitations

Understand what types of words our tokenizer misses and compare with BERT.
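
The core measurement is the share of test tokens that fall outside the vocabulary, plus a rough label for each miss. The sketch below illustrates this on toy data. The function names, the category labels beyond "Number/Code" and "Hyphenated", and the rule order are assumptions, and whitespace splitting again stands in for your tokenizer.

```python
import re
from collections import Counter

def sketch_unknown_report(messages, vocab):
    """Count how many tokens in `messages` are missing from `vocab`."""
    total = 0
    unknown = Counter()
    for message in messages:
        for token in message.lower().split():  # stand-in for basic_tokenize
            total += 1
            if token not in vocab:
                unknown[token] += 1
    n_unknown = sum(unknown.values())
    return {
        "total_tokens": total,
        "unknown_tokens": n_unknown,
        "unknown_rate": n_unknown / total if total else 0.0,
        "unknown_counts": unknown,
    }

def sketch_categorize(word):
    """Assign a rough category; the first matching rule wins."""
    if re.search(r"\d", word):
        return "Number/Code"
    if "-" in word:
        return "Hyphenated"
    if "'" in word:
        return "Contraction/Possessive"
    return "Other"

vocab = {"my", "order", "is", "late"}
report = sketch_unknown_report(["my order 12345 is late", "my usb-c order"], vocab)
print(f"Unknown rate: {report['unknown_rate']:.0%}")  # Unknown rate: 25%
for word, count in report["unknown_counts"].most_common():
    print(word, count, sketch_categorize(word))
```

Because the rules are checked in order, a token such as "iphone-15" is labeled "Number/Code" rather than "Hyphenated". Reordering the checks changes the category breakdown, so pick an order that matches the question you want answered.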

In [None]:
class TokenizerAnalyzer:
    def __init__(self, tokenizer, test_messages):
        self.tokenizer = tokenizer
        self.test_messages = test_messages
        self.unknown_tokens = Counter()
        self.test_stats = {}
    
    def analyze_unknown_tokens(self):
        """
        Analyze unknown token patterns in test set.
        """
        print("üîç Analyzing unknown token patterns on test set...\n")
        
        # TODO - Analyze unknown tokens
        # 1. Loop through test messages
        # 2. Tokenize and encode each message
        # 3. Count total tokens and unknown tokens
        # 4. Track which specific tokens are unknown
        
        # TODO - Calculate and display statistics
        # Show: total tokens, unknown tokens, unknown rate, unique unknown words
        
        return self.test_stats
    
    def show_most_common_unknown(self, top_n=10):
        """
        Show the most common unknown words.
        """
        # TODO - Display most common unknown words with categories
        pass
    
    def _categorize_unknown_word(self, word):
        """
        Categorize unknown words by type.
        """
        # TODO - Categorize words (Number/Code, Hyphenated, etc.)
        pass
    
    def analyze_unknown_categories(self):
        """
        Analyze what types of words are most commonly unknown.
        """
        # TODO - Analyze categories of unknown words
        pass

# TODO - Run analysis


## Comparison with Production Tokenizer

Compare our custom tokenizer with BERT's subword approach.
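
BERT's tokenizer uses WordPiece, which splits a word greedily into the longest pieces found in its vocabulary and marks continuation pieces with a `##` prefix. The sketch below imitates that matching step on a hypothetical five-entry vocabulary so it runs without downloading a model. It omits the normalization and pre-tokenization that the real tokenizer performs, and `sketch_wordpiece` and `toy_vocab` are illustrative names.

```python
def sketch_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than any real one.
toy_vocab = {"refund", "un", "##refund", "##able", "##s"}
print(sketch_wordpiece("refundable", toy_vocab))    # ['refund', '##able']
print(sketch_wordpiece("unrefundable", toy_vocab))  # ['un', '##refund', '##able']
print(sketch_wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

A word-level tokenizer would map "unrefundable" to a single unknown token, while the subword split keeps three meaningful pieces. That difference is what the comparison in this part is meant to surface.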

In [None]:
def compare_with_bert_tokenizer(custom_tokenizer, unknown_words, top_n=10):
    """
    Compare how BERT handles our unknown words.
    """
    print("ü§ñ Comparing with BERT subword tokenizer...\n")
    
    # TODO - Load BERT tokenizer and compare
    # 1. Load BERT tokenizer using AutoTokenizer
    # 2. Compare vocabulary sizes
    # 3. Test how BERT tokenizes our unknown words
    # 4. Show examples of subword tokenization
    
    pass

# TODO - Run comparison


## Final Analysis & Visualization

Visualize the vocabulary trade-offs and summarize key insights.

In [None]:
def create_tokenizer_analysis_plot(custom_stats, test_stats, unknown_tokens):
    """
    Create visualization of tokenizer analysis.
    """
    # TODO - Create comprehensive visualization
    # Create 2x2 subplot showing:
    # 1. Vocabulary coverage pie chart
    # 2. Vocabulary size reduction bar chart
    # 3. Unknown word categories pie chart
    # 4. Test set performance bar chart
    
    pass

# TODO - Create visualization

## 🎯 Key Discoveries

**YOUR THOUGHTS**