# Tokenization Exploration

Welcome to the tokenization exploration notebook! This notebook will help you understand how different tokenizers work and how they convert text into tokens that language models can process.

## What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenizer used. Different tokenizers have different strategies for splitting text, which can significantly impact how language models interpret and process the input.
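
To make the three granularities concrete, here is a minimal sketch in plain Python with no tokenizer library involved. The regular expression and the hand-picked subword pieces are assumptions chosen for illustration; a real subword tokenizer learns its pieces from data.

```python
import re

text = "Tokenizers split text."

# Word-level: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character (including spaces) is its own token.
char_tokens = list(text)

# Subword-level: hand-picked pieces for illustration only. A trained
# tokenizer would choose its own pieces based on corpus statistics.
subword_tokens = ["Token", "izers", " split", " text", "."]

print("words:     ", word_tokens)
print("characters:", char_tokens)
print("subwords:  ", subword_tokens)
```

Word-level splitting gives the fewest tokens but needs a huge vocabulary, character-level splitting gives the most tokens with a tiny vocabulary, and subword splitting sits between the two.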

## Installing Required Libraries

First, let's install the necessary libraries for tokenization exploration:

In [None]:
# Install required libraries (it might take a few minutes)
%pip install transformers torch ipywidgets pandas matplotlib seaborn

## Import Libraries

Let's import the libraries we'll need for tokenization:

In [8]:
from transformers import GPT2Tokenizer, BertTokenizer, AutoTokenizer
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## GPT-2 Tokenizer

Let's start with the GPT-2 tokenizer, which uses byte-level Byte Pair Encoding (BPE). BPE and its variants are among the most widely used tokenization schemes in modern language models.
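
The core idea of BPE training can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair. The code below is a toy version under simplifying assumptions (whitespace-split words, characters instead of bytes, ties broken by first occurrence); the function names are hypothetical and this is not the GPT-2 implementation.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters, mapped to its frequency.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

toy_corpus = "low low low lower lowest"
merges, vocab = learn_bpe(toy_corpus, num_merges=2)
print("Learned merges:", merges)
print("Words after merging:", vocab)
```

After two merges the frequent word `low` has become a single symbol, while the rarer `lower` and `lowest` are still split into `low` plus leftover characters.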

In [None]:
# Load GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Test sentence from the quest
test_sentence = "The quick brown fox jumps over the lazy dog."

print(f"Original sentence: {test_sentence}")
print(f"Length in characters: {len(test_sentence)}")
print()

### Tokenizing with GPT-2

Now let's see how GPT-2 tokenizes our test sentence:

In [None]:
# Tokenize the sentence
tokens = gpt2_tokenizer.encode(test_sentence)
token_strings = gpt2_tokenizer.convert_ids_to_tokens(tokens)

print("GPT-2 Tokenization Results:")
print(f"Number of tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")
print(f"Token strings: {token_strings}")
print()

# Show each token with its ID
print("Token breakdown:")
for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
    print(f"  {i+1:2d}. ID: {token_id:5d} | Token: '{token_str}'")

### Understanding GPT-2 Tokenization

Notice how GPT-2 handles different parts of the sentence. The tokenizer uses BPE, which means it can split words into subword units. Let's explore this further:
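
Many of the token strings GPT-2 prints begin with `Ġ`. That character is a printable stand-in for the space byte: GPT-2 works on raw bytes and maps each byte to a visible character before applying BPE. The sketch below re-creates that byte-to-character mapping in the style of GPT-2's reference encoder; treat it as an illustrative rewrite rather than the library's exact code, and note that `show_bytes` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are moved to
            # code points starting at 256 so that they display visibly.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

byte_to_char = bytes_to_unicode()

def show_bytes(text):
    """Render text the way byte-level token strings display it."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(show_bytes(" quick"))  # Ġquick (the leading space becomes a visible marker)
print(show_bytes("quick"))   # quick
```

Because the space is part of the token, `' quick'` and `'quick'` are different inputs to the tokenizer and can map to different token IDs.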

In [None]:
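# Let's analyze each word individually.
# Note: split() drops the spaces, and GPT-2 treats ' quick' (with a leading
# space) and 'quick' as different inputs, so the tokens can differ from the
# full-sentence tokenization above.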
words = test_sentence.split()

print("Word-by-word tokenization analysis:")
total_tokens = 0

for word in words:
    word_tokens = gpt2_tokenizer.encode(word)
    word_token_strings = gpt2_tokenizer.convert_ids_to_tokens(word_tokens)
    total_tokens += len(word_tokens)
    
    print(f"\nWord: '{word}'")
    print(f"  Tokens: {len(word_tokens)}")
    print(f"  Token strings: {word_token_strings}")

print(f"\nTotal tokens when processed word-by-word: {total_tokens}")
print(f"Tokens when processed as full sentence: {len(tokens)}")
print(f"Difference: {total_tokens - len(tokens)} (context matters!)")

## Comparing Different Tokenizers

Let's compare how different tokenizers handle the same sentence:

In [None]:
# Load different tokenizers
tokenizers = {
    'GPT-2': GPT2Tokenizer.from_pretrained('gpt2'),
    'BERT': BertTokenizer.from_pretrained('bert-base-uncased'),
    'RoBERTa': AutoTokenizer.from_pretrained('roberta-base')
}

# Compare tokenization results
comparison_results = []

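for name, tokenizer in tokenizers.items():
    # encode() adds special tokens by default: [CLS]/[SEP] for BERT and
    # <s>/</s> for RoBERTa, none for GPT-2. The counts below include them.
    tokens = tokenizer.encode(test_sentence)
    token_strings = tokenizer.convert_ids_to_tokens(tokens)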
    
    comparison_results.append({
        'Tokenizer': name,
        'Token Count': len(tokens),
        'Tokens': token_strings
    })
    
    print(f"\n{name} Tokenizer:")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Tokens: {token_strings}")

# Create a comparison dataframe
df_comparison = pd.DataFrame(comparison_results)
print("\nComparison Summary:")
print(df_comparison[['Tokenizer', 'Token Count']])
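
When reading these counts, keep in mind that `encode()` adds model-specific special tokens by default for BERT and RoBERTa but not for GPT-2, so the raw numbers are not a like-for-like comparison. The sketch below is one possible way to count with and without them. `count_tokens` is a hypothetical helper that assumes a Hugging Face style `encode(text, add_special_tokens=...)` method.

```python
def count_tokens(tokenizer, text, include_special=True):
    """Return the number of token IDs `tokenizer` produces for `text`.

    Assumes `tokenizer.encode` accepts the `add_special_tokens` keyword,
    as Hugging Face tokenizers do.
    """
    return len(tokenizer.encode(text, add_special_tokens=include_special))

# Possible usage in this notebook (assumes the `tokenizers` dict and
# `test_sentence` from the cells above are already defined):
#
# for name, tok in tokenizers.items():
#     with_special = count_tokens(tok, test_sentence)
#     without_special = count_tokens(tok, test_sentence, include_special=False)
#     print(f"{name}: {with_special} with special tokens, "
#           f"{without_special} without")
```

Counting without special tokens isolates how each tokenizer splits the text itself, while counting with them reflects what the model actually receives.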

## Visualizing Token Counts

Let's create a visual comparison of token counts across different tokenizers:

In [None]:
# Create a bar plot comparing token counts
plt.figure(figsize=(10, 6))
bars = plt.bar(df_comparison['Tokenizer'], df_comparison['Token Count'])

# Add value labels on bars
for bar, count in zip(bars, df_comparison['Token Count']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
             str(count), ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.title(f'Token Count Comparison Across Different Tokenizers\n"{test_sentence}"',
          fontsize=14, fontweight='bold')
plt.xlabel('Tokenizer', fontsize=12)
plt.ylabel('Number of Tokens', fontsize=12)
plt.ylim(0, max(df_comparison['Token Count']) + 2)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Interactive Tokenization

Now you can experiment with your own sentences! Try different inputs and see how the tokenizers handle them:

In [None]:
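def analyze_tokenization(sentence, tokenizer_name='GPT-2'):
    """
    Print a token-by-token breakdown of `sentence` and return its token count.

    `tokenizer_name` must be a key of the `tokenizers` dict defined above.
    """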
    tokenizer = tokenizers[tokenizer_name]
    tokens = tokenizer.encode(sentence)
    token_strings = tokenizer.convert_ids_to_tokens(tokens)
    
    print(f"\n{tokenizer_name} Tokenization Analysis:")
    print(f"Original sentence: '{sentence}'")
    print(f"Character length: {len(sentence)}")
    print(f"Number of tokens: {len(tokens)}")
    print(f"Tokens: {token_strings}")
    print("\nToken breakdown:")
    for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
        print(f"  {i+1:2d}. ID: {token_id:5d} | Token: '{token_str}'")
    
    return len(tokens)

# Try some examples
test_sentences = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Tokenization is fundamental to natural language processing.",
    "🤖 AI models use tokenization to understand text! 🚀"
]

for sentence in test_sentences:
    analyze_tokenization(sentence)

## Your Turn: Complete the Quest!

Now it's time to answer the quest question. Run the cell below to get the exact token count for the GPT-2 tokenizer on our test sentence:

In [None]:
quest_sentence = test_sentence  # the test sentence defined in the GPT-2 section
gpt2_tokens = gpt2_tokenizer.encode(quest_sentence)
token_count = len(gpt2_tokens)

print("🎯 QUEST ANSWER:")
print(f"Sentence: '{quest_sentence}'")
print(f"GPT-2 Token Count: {token_count}")

## Key Takeaways

1. **Different tokenizers produce different results** - The same sentence can have different token counts depending on the tokenizer used.

2. **Context matters** - GPT-2 folds the preceding space into each token, so a word tokenized on its own (without that space) can map to different tokens than the same word inside a sentence.

3. **BPE is efficient** - Byte Pair Encoding (used by GPT-2) keeps frequent words as single tokens and splits rare words into subword pieces, which balances vocabulary size against sequence length.

4. **Special characters matter** - Punctuation, spaces, and special characters are handled differently by different tokenizers.

5. **Tokenization affects model performance** - The choice of tokenizer can significantly impact how well a language model performs on different tasks.
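
Takeaway 4 is easiest to see at the byte level. A byte-level tokenizer such as GPT-2's starts from the UTF-8 encoding of the text, so characters outside ASCII occupy several bytes before any merges are applied. The sketch below only inspects byte lengths in plain Python; how many tokens each character ends up as depends on the merges a particular tokenizer has learned.

```python
# "\u00e9" is the accented letter é, written as an escape to keep it unambiguous.
samples = ["a", "\u00e9", "🤖"]

for ch in samples:
    raw = ch.encode("utf-8")
    print(f"{ch!r}: {len(raw)} byte(s) -> {list(raw)}")

# Number of UTF-8 bytes per sample character.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)
```

This is one reason sentences containing emoji or accented text often produce more tokens than their character count suggests.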

## Next Steps

Now that you understand tokenization, you're ready to explore how these tokens are processed by language models. The tokenization step is crucial because it determines how the model "sees" and interprets text input.