# Exercise 0: Understanding Corpora, Tokens, and BPE Implementation

In this exercise, you'll build a Byte-Pair Encoding (BPE) tokenizer from scratch and understand the fundamental concepts of tokenization in NLP.

**Learning Objectives:**
- Understand text corpora and tokenization fundamentals
- Implement the BPE algorithm from scratch
- Train and analyze custom tokenizers
- Compare different tokenization strategies

## Part 1: Setting Up and Loading the TinyStories Dataset

In [None]:
# Install required packages
!pip install datasets transformers matplotlib -q

In [None]:
from datasets import load_dataset
from collections import Counter, defaultdict
import re
import json

# Load a subset of TinyStories dataset
print("Loading TinyStories dataset...")
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")
print(f"Loaded {len(dataset)} stories")

# Examine the first story
print("\nFirst story:")
print(dataset[0]['text'][:500])

## Part 2: Understanding Tokens

Before implementing BPE, let's understand different tokenization approaches:
- **Character-level**: Each character is a token
- **Word-level**: Each word is a token
- **Subword-level**: Parts of words are tokens (BPE, WordPiece)

In [None]:
sample_text = "Once upon a time, there was a little girl named Lily."

# TODO: Implement different tokenization strategies

def char_tokenize(text):
    """Split text into individual characters"""
    # YOUR CODE HERE
    return list(text)

def word_tokenize(text):
    """Split text into words (simple whitespace and punctuation split)"""
    # YOUR CODE HERE
    return re.findall(r'\w+|[^\w\s]', text)

# Test your tokenizers
print("Original:", sample_text)
print("\nCharacter tokens:", char_tokenize(sample_text)[:20])
print(f"Total character tokens: {len(char_tokenize(sample_text))}")
print("\nWord tokens:", word_tokenize(sample_text))
print(f"Total word tokens: {len(word_tokenize(sample_text))}")
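
The cell above covers the character and word levels. For the subword level, here is a minimal sketch that splits words by greedy longest-match against a small vocabulary. The vocabulary is hand-picked and hypothetical, and longest-match is not how BPE applies its merges; the only goal is to show what "parts of words" look like as tokens.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary, chosen by hand for illustration only.
toy_vocab = {"play", "ing", "ed", "un", "happy", "ness"}

for w in ["playing", "played", "unhappiness"]:
    print(w, "->", greedy_subword_tokenize(w, toy_vocab))
```

Words that share pieces with the vocabulary come out as a few tokens, while the middle of "unhappiness" falls back to single characters because "happy" is not a prefix of "happiness".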

## Part 3: Implementing Byte-Pair Encoding (BPE)

BPE is a subword tokenization algorithm that:
1. Starts with individual characters
2. Iteratively merges the most frequent adjacent pairs
3. Builds a vocabulary of subword units

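This keeps the vocabulary small while still letting rare or unseen words be represented as sequences of smaller known units, down to the single characters seen during training.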
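
The three steps can be traced by hand on a tiny corpus before writing the full class. The sketch below is a standalone illustration: the helper names `count_pairs` and `merge_pair` and the word frequencies are assumptions made for this example, not part of the exercise code.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words[tuple(out)] = freq
    return merged_words

# Hypothetical toy corpus: word (as a tuple of symbols) -> frequency.
toy_words = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

learned = []
for step in range(3):
    pairs = count_pairs(toy_words)
    best = max(pairs, key=pairs.get)  # ties go to the pair counted first
    toy_words = merge_pair(best, toy_words)
    learned.append(best)
    print(f"merge {step + 1}: {best} (count {pairs[best]})")

print(list(toy_words))
```

With these frequencies the suffix shared by "newest" and "widest" is the most frequent material, so after three merges it has become the single symbol `est</w>`, while "low" is still spelled out character by character.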

In [None]:
class BPETokenizer:
    def __init__(self, vocab_size=1000):
        self.vocab_size = vocab_size
        self.merges = []  # List of (pair, merged_token) tuples in order
        self.vocab = {}   # Token to ID mapping
        
    def get_stats(self, words):
        """
        Count frequency of adjacent symbol pairs in the corpus.
        
        Args:
            words: Dictionary mapping word tuples to their frequencies
            
        Returns:
            Counter of symbol pairs and their frequencies
        """
        pairs = Counter()
        # TODO: Implement pair counting
        for word, freq in words.items():
            symbols = list(word)
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs
    
    def merge_vocab(self, pair, words):
        """
        Merge every adjacent occurrence of `pair` in each word.
        
        Args:
            pair: Tuple of two symbols to merge (a, b)
            words: Dictionary mapping word tuples to their frequencies
            
        Returns:
            Updated words dictionary with merged pairs
        """
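        new_words = {}
        # TODO: Implement vocabulary merging
        merged = ''.join(pair)

        for word, freq in words.items():
            # Walk the symbols so a merge only fires on an exact adjacent match.
            # Replacing text in a space-joined string can also match inside a
            # longer symbol (e.g. 'h e' inside 'th e').
            new_word = []
            i = 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_words[tuple(new_word)] = freq
        return new_words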
    
    def train(self, texts, verbose=True):
        """
        Train the BPE tokenizer on a corpus of texts.
        
        Args:
            texts: List of text strings to train on
            verbose: Whether to print progress
        """
        # Step 1: Count word frequencies (lowercased, whitespace-split);
        # the end-of-word marker is added when words are split into characters
        word_freqs = Counter()
        for text in texts:
            words = text.lower().split()
            word_freqs.update(words)
        
        # Convert words to character sequences with end marker
        words = {}
        for word, freq in word_freqs.items():
            # Split into characters and add end-of-word marker
            word_tuple = tuple(list(word) + ['</w>'])
            words[word_tuple] = freq
        
        # Step 2: Build initial character vocabulary
        base_vocab = set()
        for word in words.keys():
            base_vocab.update(word)
        
        # Assign IDs: reserve 0 for unknown symbols, then the sorted base tokens
        self.vocab = {'<unk>': 0}
        for token in sorted(base_vocab):
            self.vocab[token] = len(self.vocab)
        
        # Step 3: Iteratively merge most frequent pairs
        num_merges = max(0, self.vocab_size - len(self.vocab))
        if verbose:
            print(f"Starting with {len(self.vocab)} base tokens")
            print(f"Will perform {num_merges} merges")
        
        # TODO: Implement the BPE training loop
        for i in range(num_merges):
            pairs = self.get_stats(words)
            if not pairs:
                break
                
            best_pair = pairs.most_common(1)[0][0]
            words = self.merge_vocab(best_pair, words)
            
            # Add merged token to vocabulary
            merged_token = ''.join(best_pair)
            self.merges.append((best_pair, merged_token))
            self.vocab[merged_token] = len(self.vocab)
            
            if verbose and (i + 1) % 100 == 0:
                print(f"Merge {i + 1}/{num_merges}: {best_pair} -> {merged_token}")
        
        if verbose:
            print(f"Training complete. Final vocabulary size: {len(self.vocab)}")
    
    def encode(self, text):
        """
        Encode text into token IDs.
        
        Args:
            text: String to encode
            
        Returns:
            List of token IDs
        """
        # TODO: Implement encoding
        words = text.lower().split()
        tokens = []
        
        for word in words:
            # Start with characters + end marker
            word_tokens = list(word) + ['</w>']
            
            # Apply merges in order
            for pair, merged in self.merges:
                i = 0
                while i < len(word_tokens) - 1:
                    if (word_tokens[i], word_tokens[i + 1]) == pair:
                        word_tokens = word_tokens[:i] + [merged] + word_tokens[i + 2:]
                    else:
                        i += 1
            
            # Convert tokens to IDs
            for token in word_tokens:
                if token in self.vocab:
                    tokens.append(self.vocab[token])
                else:
                    # Symbols never seen in training (e.g. a new character) map to
                    # '<unk>' if the vocabulary has one, otherwise to ID 0
                    tokens.append(self.vocab.get('<unk>', 0))
        
        return tokens
    
    def decode(self, token_ids):
        """
        Decode token IDs back into text.
        
        Args:
            token_ids: List of token IDs
            
        Returns:
            Decoded text string
        """
        # TODO: Implement decoding
        # Create reverse vocab
        id_to_token = {v: k for k, v in self.vocab.items()}
        
        tokens = [id_to_token.get(tid, '<unk>') for tid in token_ids]
        text = ''.join(tokens).replace('</w>', ' ').strip()
        return text
    
    def get_vocab_info(self):
        """Return vocabulary statistics"""
        return {
            'vocab_size': len(self.vocab),
            'num_merges': len(self.merges),
            'sample_tokens': list(self.vocab.keys())[:20]
        }
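
One detail worth checking in any pair-merging routine is whether a merge can fire across symbol boundaries. The sketch below compares two standalone helpers written for this illustration (`merge_by_string_replace` and `merge_by_symbols` are hypothetical names): one joins symbols with spaces and uses `str.replace`, the other walks the symbol sequence.

```python
def merge_by_string_replace(pair, symbols):
    """Join with spaces, replace the bigram text, then split again."""
    joined = " ".join(symbols)
    return tuple(joined.replace(" ".join(pair), "".join(pair)).split())

def merge_by_symbols(pair, symbols):
    """Walk the symbol sequence and merge only exact adjacent matches."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

# Suppose ('t', 'h') was merged earlier, so "the" is currently th + e + </w>.
word = ("th", "e", "</w>")
pair = ("h", "e")  # a later merge, learned from words such as "she"

print("string replace:", merge_by_string_replace(pair, word))
print("symbol walk:   ", merge_by_symbols(pair, word))
```

The string version finds the text `h e` inside `th e` and produces a `the` symbol that no merge rule describes, while the symbol walk leaves the word unchanged because `th` is not `h`.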

## Part 4: Training Your BPE Tokenizer

Train tokenizers with different vocabulary sizes to see the impact.

In [None]:
# Extract texts from dataset
texts = [item['text'] for item in dataset]

# TODO: Train your tokenizer with different vocabulary sizes
vocab_sizes = [500, 1000, 2000]
tokenizers = {}

for vocab_size in vocab_sizes:
    print(f"\n{'=' * 50}")
    print(f"Training tokenizer with vocab_size={vocab_size}")
    print('=' * 50)
    
    tokenizer = BPETokenizer(vocab_size=vocab_size)
    tokenizer.train(texts, verbose=True)
    tokenizers[vocab_size] = tokenizer
    
    # Show some statistics
    info = tokenizer.get_vocab_info()
    print(f"\nVocabulary info:")
    print(f"  - Size: {info['vocab_size']}")
    print(f"  - Merges: {info['num_merges']}")
    print(f"  - Sample tokens: {info['sample_tokens']}")

## Part 5: Analyzing Your Tokenizer

Compare how different vocabulary sizes affect tokenization.

In [None]:
# TODO: Analyze tokenization on sample texts
test_sentences = [
    "Once upon a time, there was a little girl named Lily.",
    "She loved to play outside in the sunshine.",
    "One day, she found a beautiful butterfly."
]

# Compare tokenization results
print("Tokenization Comparison\n" + "=" * 70)

for sentence in test_sentences:
    print(f"\nSentence: {sentence}")
    print(f"Character count: {len(sentence)}")
    print(f"Word count: {len(sentence.split())}")
    print()
    
    for vocab_size in vocab_sizes:
        tokenizer = tokenizers[vocab_size]
        token_ids = tokenizer.encode(sentence)
        decoded = tokenizer.decode(token_ids)
        
        print(f"  Vocab {vocab_size}:")
        print(f"    - Tokens: {len(token_ids)}")
        print(f"    - Token IDs: {token_ids[:10]}{'...' if len(token_ids) > 10 else ''}")
        print(f"    - Decoded: {decoded}")
    print("-" * 70)
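
Raw token counts are hard to compare across sentences of different lengths, so a normalized figure such as tokens per word can help. Below is a minimal sketch; `tokens_per_word` is a hypothetical helper, and the character-level `char_encode` is only a stand-in so the example runs on its own.

```python
def tokens_per_word(encode, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words if total_words else 0.0

def char_encode(text):
    """Stand-in encoder: one token per non-space character."""
    return [ord(c) for c in text if not c.isspace()]

sentences = ["the cat sat", "a dog ran"]
print(f"{tokens_per_word(char_encode, sentences):.2f} tokens per word")
```

In the notebook, any callable that maps a string to a list of IDs can be passed as `encode`, for example `tokenizers[1000].encode`.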

## Part 6: Vocabulary Analysis

Visualize and analyze the learned vocabularies.

In [None]:
import matplotlib.pyplot as plt

# TODO: Create visualizations

# 1. Average tokens per story for different vocab sizes
avg_tokens = {}

for vocab_size in vocab_sizes:
    tokenizer = tokenizers[vocab_size]
    total_tokens = 0
    
    sample = texts[:100]  # Sample up to 100 stories
    for text in sample:
        total_tokens += len(tokenizer.encode(text))
    
    avg_tokens[vocab_size] = total_tokens / len(sample)

plt.figure(figsize=(10, 5))
plt.bar([str(vs) for vs in vocab_sizes], [avg_tokens[vs] for vs in vocab_sizes])
plt.xlabel('Vocabulary Size')
plt.ylabel('Average Tokens per Story')
plt.title('Tokenization Efficiency vs Vocabulary Size')
plt.grid(axis='y', alpha=0.3)
plt.show()

print("\nAverage tokens per story:")
for vocab_size in vocab_sizes:
    print(f"  Vocab {vocab_size}: {avg_tokens[vocab_size]:.1f} tokens")

In [None]:
# 2. Show the most recently learned merges
print("\nMost Recent Merges (Last 20):")
print("=" * 50)

tokenizer = tokenizers[2000]  # Use largest vocab
for i, (pair, merged) in enumerate(tokenizer.merges[-20:], 1):
    print(f"{i:2d}. {pair[0]:10s} + {pair[1]:10s} -> {merged}")

## Part 7: Comparison with HuggingFace Tokenizers

Compare your BPE implementation with GPT-2's pretrained tokenizer, a byte-level BPE tokenizer with a much larger vocabulary.

In [None]:
from transformers import GPT2Tokenizer

# Load pre-trained GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# TODO: Compare your BPE tokenizer with GPT-2's tokenizer
print("Comparison with GPT-2 Tokenizer")
print("=" * 70)

for sentence in test_sentences:
    print(f"\nSentence: {sentence}")
    
    # GPT-2 tokenization
    gpt2_tokens = gpt2_tokenizer.encode(sentence)
    gpt2_decoded = gpt2_tokenizer.decode(gpt2_tokens)
    
    print(f"\nGPT-2 (vocab size ~50k):")
    print(f"  - Tokens: {len(gpt2_tokens)}")
    print(f"  - Token IDs: {gpt2_tokens}")
    print(f"  - Decoded: {gpt2_decoded}")
    
    # Your tokenizer
    my_tokenizer = tokenizers[2000]
    my_tokens = my_tokenizer.encode(sentence)
    my_decoded = my_tokenizer.decode(my_tokens)
    
    print(f"\nYour BPE (vocab size 2000):")
    print(f"  - Tokens: {len(my_tokens)}")
    print(f"  - Token IDs: {my_tokens[:20]}{'...' if len(my_tokens) > 20 else ''}")
    print(f"  - Decoded: {my_decoded}")
    
    ratio = len(my_tokens) / max(len(gpt2_tokens), 1)
    print(f"\nToken efficiency: your BPE uses {ratio:.2f}x as many tokens as GPT-2")
    print("-" * 70)
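
One structural difference is the base alphabet. GPT-2's BPE starts from bytes rather than characters, so any string can be represented from 256 base symbols. The sketch below shows only that idea using UTF-8 encoding; it does not reproduce GPT-2's byte-to-character mapping or its pre-tokenization, and `to_byte_symbols` is a hypothetical helper.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))

for sample in ["Lily", "café", "🦋"]:
    symbols = to_byte_symbols(sample)
    print(f"{sample!r}: {len(sample)} chars -> {len(symbols)} bytes {symbols}")
    # The byte sequence always decodes back to the original string.
    assert bytes(symbols).decode("utf-8") == sample
```

A character-level tokenizer trained on English stories has no symbol for "é" or "🦋", while a byte-level one spells them with two and four base symbols respectively.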

## Part 8: Save Your Best Tokenizer

Save your trained tokenizer for use in Exercise 1.

In [None]:
import pickle

# Save the tokenizer with vocab size 2000
best_tokenizer = tokenizers[2000]

with open('bpe_tokenizer.pkl', 'wb') as f:
    pickle.dump(best_tokenizer, f)

print("Tokenizer saved to 'bpe_tokenizer.pkl'")
print(f"Vocabulary size: {len(best_tokenizer.vocab)}")
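
A pickle file stores a reference to the `BPETokenizer` class rather than its code, so the class must be defined or importable wherever the file is loaded. As an alternative, the learned state can be written as JSON. The following is a sketch under assumptions: `save_bpe_state` and `load_bpe_state` are hypothetical helpers, and the state is assumed to be a token-to-ID dict plus a list of `(pair, merged_token)` tuples, as in the class above.

```python
import json
import os
import tempfile

def save_bpe_state(vocab, merges, path):
    """Write the vocabulary and ordered merge list to a JSON file."""
    state = {
        "vocab": vocab,
        # JSON has no tuples, so each pair is stored as a two-item list.
        "merges": [[list(pair), merged] for pair, merged in merges],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False)

def load_bpe_state(path):
    """Read the state back, restoring pairs to tuples."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    merges = [(tuple(pair), merged) for pair, merged in state["merges"]]
    return state["vocab"], merges

# Toy state so the sketch runs on its own.
vocab = {"</w>": 0, "a": 1, "b": 2, "ab": 3}
merges = [(("a", "b"), "ab")]

path = os.path.join(tempfile.mkdtemp(), "bpe_state.json")
save_bpe_state(vocab, merges, path)
loaded_vocab, loaded_merges = load_bpe_state(path)
print(loaded_vocab, loaded_merges)
```

In the notebook this would be called with `best_tokenizer.vocab` and `best_tokenizer.merges`, and the loaded values assigned to a fresh `BPETokenizer` instance.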

## Questions

Answer the following questions based on your experiments:

1. **How does vocabulary size affect the number of tokens needed to encode text?**
   - YOUR ANSWER HERE

2. **What are the trade-offs between small and large vocabulary sizes?**
   - YOUR ANSWER HERE

3. **How does your BPE tokenizer handle out-of-vocabulary words?**
   - YOUR ANSWER HERE

4. **What patterns do you notice in the most frequently merged token pairs?**
   - YOUR ANSWER HERE

5. **How does your tokenizer compare to GPT-2's tokenizer in terms of:**
   - Token efficiency (tokens per word): YOUR ANSWER HERE
   - Handling of rare words: YOUR ANSWER HERE
   - Vocabulary composition: YOUR ANSWER HERE