# 🧪 Practical Lab: GPT-4 vs LLaMA Tokenization Deep Dive

## 🎯 Objective
Compare GPT-4 (cl100k_base) vs LLaMA (SentencePiece/BPE) tokenizers focusing on **Token Count** and **Cost Analysis**.

## 🔧 Setup & Installation

In [None]:
# Install required packages
!pip install tiktoken transformers matplotlib pandas seaborn numpy

In [None]:
# Import libraries
import tiktoken  # GPT-4 tokenizer
from transformers import AutoTokenizer  # LLaMA tokenizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")

## 🚀 Initialize Tokenizers

In [None]:
# GPT-4 tokenizer (cl100k_base)
print("Loading GPT-4 tokenizer...")
gpt4_tokenizer = tiktoken.get_encoding("cl100k_base")
print(f"GPT-4 vocab size: {gpt4_tokenizer.n_vocab:,}")

# LLaMA tokenizer (using Llama-2 as example)
print("\nLoading LLaMA tokenizer...")
try:
    llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
except Exception:
    # Fall back to a similar tokenizer if the Llama-2 repo is unavailable
    # (it is gated and requires approved access on the Hugging Face Hub)
    llama_tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

print(f"LLaMA vocab size: {len(llama_tokenizer.get_vocab()):,}")
print("\n✅ Tokenizers loaded successfully!")

## 📝 Test Cases Setup

In [None]:
# Define test cases
test_texts = {
    "english_simple": "The quick brown fox jumps over the lazy dog.",
    "english_complex": "Artificial intelligence and machine learning algorithms are revolutionizing computational linguistics through advanced neural network architectures.",
    "multilingual": "Hello world! Bonjour le monde! Hola mundo! こんにちは世界！ مرحبا بالعالم!",
    "code_mixed": "def tokenize_text(input_str): return tokenizer.encode(input_str)",
    "technical": "The transformer architecture utilizes self-attention mechanisms with multi-head attention layers for sequence-to-sequence modeling.",
    "numbers_symbols": "Price: $1,234.56 | Date: 2024-01-15 | Email: user@example.com | Phone: +1-555-123-4567",
    "long_text": "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them."
}

print(f"📊 Created {len(test_texts)} test cases:")
for name, text in test_texts.items():
    print(f"  • {name}: {len(text)} characters")

## 🔍 Analysis Functions

In [None]:
def analyze_tokenization(text: str, text_name: str) -> Dict:
    """Analyze tokenization for both GPT-4 and LLaMA tokenizers"""
    
    # GPT-4 tokenization
    gpt4_tokens = gpt4_tokenizer.encode(text)
    gpt4_count = len(gpt4_tokens)
    gpt4_decoded = [gpt4_tokenizer.decode([token]) for token in gpt4_tokens]
    
    # LLaMA tokenization
    llama_tokens = llama_tokenizer.encode(text, add_special_tokens=False)
    llama_count = len(llama_tokens)
    llama_decoded = llama_tokenizer.convert_ids_to_tokens(llama_tokens)
    
    # Cost calculation (example rates - adjust based on actual pricing)
    gpt4_cost_per_1k = 0.03  # $0.03 per 1K tokens
    llama_cost_per_1k = 0.01  # $0.01 per 1K tokens (hypothetical)
    
    gpt4_cost = (gpt4_count / 1000) * gpt4_cost_per_1k
    llama_cost = (llama_count / 1000) * llama_cost_per_1k
    
    return {
        "text_name": text_name,
        "text": text,
        "char_count": len(text),
        "gpt4_tokens": gpt4_count,
        "llama_tokens": llama_count,
        "gpt4_decoded": gpt4_decoded,
        "llama_decoded": llama_decoded,
        "gpt4_cost": gpt4_cost,
        "llama_cost": llama_cost,
        "token_ratio": gpt4_count / llama_count if llama_count > 0 else 0,
        "cost_ratio": gpt4_cost / llama_cost if llama_cost > 0 else 0,
        "gpt4_chars_per_token": len(text) / gpt4_count if gpt4_count > 0 else 0,
        "llama_chars_per_token": len(text) / llama_count if llama_count > 0 else 0
    }

print("✅ Analysis function defined!")

## 📊 Run Analysis

In [None]:
# Run analysis on all test cases
print("🔄 Running tokenization analysis...\n")

results = []
for name, text in test_texts.items():
    result = analyze_tokenization(text, name)
    results.append(result)
    
    print(f"📝 {name}:")
    print(f"   Characters: {result['char_count']}")
    print(f"   GPT-4 tokens: {result['gpt4_tokens']} (${result['gpt4_cost']:.6f})")
    print(f"   LLaMA tokens: {result['llama_tokens']} (${result['llama_cost']:.6f})")
    print(f"   Token ratio (GPT-4/LLaMA): {result['token_ratio']:.2f}")
    print(f"   Cost ratio (GPT-4/LLaMA): {result['cost_ratio']:.2f}")
    print()

# Create DataFrame
df = pd.DataFrame(results)
print("✅ Analysis complete!")

## 📈 Results Summary Table

In [None]:
# Create summary table
summary_df = df[['text_name', 'char_count', 'gpt4_tokens', 'llama_tokens', 
                 'token_ratio', 'cost_ratio', 'gpt4_chars_per_token', 'llama_chars_per_token']].copy()

summary_df.columns = ['Text Type', 'Characters', 'GPT-4 Tokens', 'LLaMA Tokens', 
                      'Token Ratio', 'Cost Ratio', 'GPT-4 Chars/Token', 'LLaMA Chars/Token']

# Round numerical columns
summary_df = summary_df.round(2)

print("📊 TOKENIZATION COMPARISON SUMMARY")
print("=" * 80)
display(summary_df)

## 📊 Visualization: Token Count Comparison

In [None]:
# Token count comparison chart
plt.figure(figsize=(14, 8))

x = np.arange(len(df))
width = 0.35

plt.bar(x - width/2, df['gpt4_tokens'], width, label='GPT-4 (cl100k_base)', alpha=0.8, color='#FF6B6B')
plt.bar(x + width/2, df['llama_tokens'], width, label='LLaMA (SentencePiece)', alpha=0.8, color='#4ECDC4')

plt.xlabel('Text Types', fontsize=12)
plt.ylabel('Token Count', fontsize=12)
plt.title('Token Count Comparison: GPT-4 vs LLaMA', fontsize=14, fontweight='bold')
plt.xticks(x, df['text_name'], rotation=45, ha='right')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

# Add value labels on bars
for i, (gpt4, llama) in enumerate(zip(df['gpt4_tokens'], df['llama_tokens'])):
    plt.text(i - width/2, gpt4 + 0.5, str(gpt4), ha='center', va='bottom', fontsize=9)
    plt.text(i + width/2, llama + 0.5, str(llama), ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## 💰 Cost Analysis Visualization

In [None]:
# Cost comparison visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Cost comparison bar chart
x = np.arange(len(df))
ax1.bar(x - width/2, df['gpt4_cost'] * 1000, width, label='GPT-4', alpha=0.8, color='#FF6B6B')
ax1.bar(x + width/2, df['llama_cost'] * 1000, width, label='LLaMA', alpha=0.8, color='#4ECDC4')
ax1.set_xlabel('Text Types')
ax1.set_ylabel('Cost (thousandths of a dollar)')
ax1.set_title('Cost Comparison by Text Type')
ax1.set_xticks(x)
ax1.set_xticklabels(df['text_name'], rotation=45, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Efficiency scatter plot
ax2.scatter(df['gpt4_tokens'], df['gpt4_cost'] * 1000, label='GPT-4', alpha=0.7, s=100, color='#FF6B6B')
ax2.scatter(df['llama_tokens'], df['llama_cost'] * 1000, label='LLaMA', alpha=0.7, s=100, color='#4ECDC4')
ax2.set_xlabel('Token Count')
ax2.set_ylabel('Cost (thousandths of a dollar)')
ax2.set_title('Token Count vs Cost Efficiency')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 🔍 Efficiency Analysis

In [None]:
# Calculate overall efficiency metrics
total_chars = df['char_count'].sum()
total_gpt4_tokens = df['gpt4_tokens'].sum()
total_llama_tokens = df['llama_tokens'].sum()
total_gpt4_cost = df['gpt4_cost'].sum()
total_llama_cost = df['llama_cost'].sum()

print("🎯 OVERALL EFFICIENCY METRICS")
print("=" * 50)
print(f"Total characters processed: {total_chars:,}")
print()
print("📊 TOKEN EFFICIENCY:")
print(f"  GPT-4 average: {total_chars/total_gpt4_tokens:.2f} chars/token")
print(f"  LLaMA average: {total_chars/total_llama_tokens:.2f} chars/token")
print(f"  GPT-4 packs {((total_chars/total_gpt4_tokens)/(total_chars/total_llama_tokens)-1)*100:.1f}% more characters per token than LLaMA")
print()
print("💰 COST ANALYSIS:")
print(f"  GPT-4 total cost: ${total_gpt4_cost:.6f}")
print(f"  LLaMA total cost: ${total_llama_cost:.6f}")
print(f"  GPT-4 costs {(total_gpt4_cost/total_llama_cost):.1f}x as much as LLaMA (at the example rates)")
print(f"  Cost difference: ${(total_gpt4_cost-total_llama_cost):.6f}")
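
The percentage that compares chars/token above reduces to a ratio of token counts, because the shared character total cancels out. A minimal sketch with made-up totals; the helper name `relative_efficiency` and all numbers are illustrative assumptions, not results from this lab:

```python
def relative_efficiency(total_chars: int, tokens_a: int, tokens_b: int) -> float:
    """Percent by which tokenizer A packs more characters per token than B."""
    chars_per_token_a = total_chars / tokens_a
    chars_per_token_b = total_chars / tokens_b
    return (chars_per_token_a / chars_per_token_b - 1) * 100


# Hypothetical totals, chosen only to make the arithmetic easy to follow.
total_chars_demo = 1200
tokens_a = 240  # 5.0 chars/token
tokens_b = 300  # 4.0 chars/token

pct = relative_efficiency(total_chars_demo, tokens_a, tokens_b)
shortcut = (tokens_b / tokens_a - 1) * 100  # the character total cancels

print(f"full formula: {pct:.1f}%  shortcut: {shortcut:.1f}%")
```

A negative result would mean tokenizer A needs more tokens than B for the same text.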

## 🔬 Deep Dive: Token Inspection

In [None]:
# Inspect tokenization for a specific example
example_text = "The transformer architecture utilizes self-attention mechanisms."

gpt4_tokens = gpt4_tokenizer.encode(example_text)
gpt4_decoded = [gpt4_tokenizer.decode([token]) for token in gpt4_tokens]

llama_tokens = llama_tokenizer.encode(example_text, add_special_tokens=False)
llama_decoded = llama_tokenizer.convert_ids_to_tokens(llama_tokens)

print("🔍 TOKEN BREAKDOWN ANALYSIS")
print("=" * 60)
print(f"Text: '{example_text}'")
print(f"Length: {len(example_text)} characters")
print()
print("GPT-4 Tokenization:")
for i, (token_id, token_str) in enumerate(zip(gpt4_tokens, gpt4_decoded)):
    print(f"  {i+1:2d}. ID:{token_id:5d} → '{token_str}'")
print(f"Total GPT-4 tokens: {len(gpt4_tokens)}")
print()
print("LLaMA Tokenization:")
for i, (token_id, token_str) in enumerate(zip(llama_tokens, llama_decoded)):
    print(f"  {i+1:2d}. ID:{token_id:5d} → '{token_str}'")
print(f"Total LLaMA tokens: {len(llama_tokens)}")

## 🌍 Character Type Analysis

In [None]:
# Test different character types
char_test_cases = {
    'ascii_basic': 'Hello world 123',
    'unicode_accents': 'café naïve résumé',
    'symbols': '!@#$%^&*()_+-=[]{}|;:,.<>?',
    'mixed_unicode': 'Hello 世界! Price: $1,234.56',
    'code_syntax': 'function(x) { return x * 2; }',
    'emojis': '🚀 Hello! 😊 How are you? 🌟'
}

print("🌍 CHARACTER TYPE EFFICIENCY ANALYSIS")
print("=" * 60)

char_results = []
for char_type, text in char_test_cases.items():
    gpt4_tokens = len(gpt4_tokenizer.encode(text))
    llama_tokens = len(llama_tokenizer.encode(text, add_special_tokens=False))
    
    char_results.append({
        'type': char_type,
        'text': text,
        'chars': len(text),
        'gpt4_tokens': gpt4_tokens,
        'llama_tokens': llama_tokens,
        'gpt4_efficiency': len(text) / gpt4_tokens,
        'llama_efficiency': len(text) / llama_tokens
    })
    
    print(f"{char_type:15s}: '{text[:30]}{'...' if len(text) > 30 else ''}'")
    print(f"                 GPT-4: {gpt4_tokens:2d} tokens ({len(text)/gpt4_tokens:.1f} chars/token)")
    print(f"                 LLaMA: {llama_tokens:2d} tokens ({len(text)/llama_tokens:.1f} chars/token)")
    print()

char_df = pd.DataFrame(char_results)
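
One likely reason non-ASCII text tends to need more tokens per character: cl100k_base is a byte-level BPE, and the LLaMA SentencePiece tokenizer falls back to raw bytes for characters missing from its vocabulary, so the UTF-8 byte length of a character sets an upper bound on how many tokens it can cost. The sketch below only measures bytes per character with the standard library; `utf8_profile` and the sample strings are illustrative and no tokenizer is involved:

```python
def utf8_profile(text: str) -> dict:
    """Return character count, UTF-8 byte count, and bytes per character."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "chars": n_chars,
        "bytes": n_bytes,
        "bytes_per_char": n_bytes / n_chars if n_chars else 0.0,
    }


# Illustrative samples, loosely mirroring the character types tested above.
samples = {
    "ascii": "Hello world 123",
    "accents": "café naïve résumé",
    "cjk": "世界",
    "emoji": "🚀",
}

profiles = {name: utf8_profile(text) for name, text in samples.items()}
for name, p in profiles.items():
    print(f"{name:8s} chars={p['chars']:2d} bytes={p['bytes']:2d} "
          f"bytes/char={p['bytes_per_char']:.2f}")
```

Comparing these byte counts with the token counts printed by the cell above gives a rough sense of how much each tokenizer has merged beyond single bytes.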

## 📊 Character Type Efficiency Visualization

In [None]:
# Visualize character type efficiency
plt.figure(figsize=(12, 6))

x = np.arange(len(char_df))
width = 0.35

plt.bar(x - width/2, char_df['gpt4_efficiency'], width, label='GPT-4', alpha=0.8, color='#FF6B6B')
plt.bar(x + width/2, char_df['llama_efficiency'], width, label='LLaMA', alpha=0.8, color='#4ECDC4')

plt.xlabel('Character Types')
plt.ylabel('Characters per Token')
plt.title('Tokenization Efficiency by Character Type')
plt.xticks(x, char_df['type'], rotation=45, ha='right')
plt.legend()
plt.grid(True, alpha=0.3)

# Add value labels
for i, (gpt4_eff, llama_eff) in enumerate(zip(char_df['gpt4_efficiency'], char_df['llama_efficiency'])):
    plt.text(i - width/2, gpt4_eff + 0.05, f'{gpt4_eff:.1f}', ha='center', va='bottom', fontsize=9)
    plt.text(i + width/2, llama_eff + 0.05, f'{llama_eff:.1f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## 🎯 Cost Optimization Function

In [None]:
def optimize_cost_choice(text: str, gpt4_price_per_1k: float = 0.03, llama_price_per_1k: float = 0.01) -> Dict:
    """Determine which tokenizer is more cost-effective for given text"""
    
    gpt4_tokens = len(gpt4_tokenizer.encode(text))
    llama_tokens = len(llama_tokenizer.encode(text, add_special_tokens=False))
    
    gpt4_cost = (gpt4_tokens / 1000) * gpt4_price_per_1k
    llama_cost = (llama_tokens / 1000) * llama_price_per_1k
    
    savings = abs(gpt4_cost - llama_cost)
    savings_percent = (savings / max(gpt4_cost, llama_cost)) * 100
    
    recommendation = "LLaMA" if llama_cost < gpt4_cost else "GPT-4"
    
    return {
        "text_length": len(text),
        "gpt4_tokens": gpt4_tokens,
        "llama_tokens": llama_tokens,
        "gpt4_cost": gpt4_cost,
        "llama_cost": llama_cost,
        "recommendation": recommendation,
        "savings": savings,
        "savings_percent": savings_percent
    }

# Test the optimization function
test_optimization = "Write a Python function that implements a binary search algorithm with proper error handling and documentation."

result = optimize_cost_choice(test_optimization)

print("💡 COST OPTIMIZATION RECOMMENDATION")
print("=" * 50)
print(f"Text: '{test_optimization}'")
print(f"Length: {result['text_length']} characters")
print()
print(f"GPT-4: {result['gpt4_tokens']} tokens → ${result['gpt4_cost']:.6f}")
print(f"LLaMA: {result['llama_tokens']} tokens → ${result['llama_cost']:.6f}")
print()
print(f"💰 Recommendation: Use {result['recommendation']}")
print(f"💵 Potential savings: ${result['savings']:.6f} ({result['savings_percent']:.1f}%)")
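
The recommendation depends on two ratios: how many tokens each tokenizer produces, and how much each token costs. LLaMA is cheaper exactly when its token count divided by GPT-4's is below the price ratio (GPT-4 price over LLaMA price). A minimal sketch of that break-even check, using the notebook's example rates and hypothetical token counts; `breakeven_token_ratio` and `cheaper_option` are illustrative names:

```python
def breakeven_token_ratio(gpt4_price_per_1k: float, llama_price_per_1k: float) -> float:
    """LLaMA/GPT-4 token ratio at which both options cost the same."""
    return gpt4_price_per_1k / llama_price_per_1k


def cheaper_option(gpt4_tokens: int, llama_tokens: int,
                   gpt4_price_per_1k: float = 0.03,
                   llama_price_per_1k: float = 0.01) -> str:
    """Compare total costs directly; equal costs are reported as 'tie'."""
    gpt4_cost = (gpt4_tokens / 1000) * gpt4_price_per_1k
    llama_cost = (llama_tokens / 1000) * llama_price_per_1k
    if llama_cost < gpt4_cost:
        return "LLaMA"
    if gpt4_cost < llama_cost:
        return "GPT-4"
    return "tie"


# Example rates from this lab; token counts below are hypothetical.
ratio_limit = breakeven_token_ratio(0.03, 0.01)
print(f"LLaMA stays cheaper while it uses fewer than ~{ratio_limit:.1f}x GPT-4's tokens")
print(cheaper_option(gpt4_tokens=100, llama_tokens=120))
print(cheaper_option(gpt4_tokens=100, llama_tokens=400))
```

With the example rates, the tokenizer difference would only flip the recommendation for text where LLaMA needs roughly three times as many tokens as GPT-4.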

## 🧪 Interactive Testing Section

In [None]:
# Interactive testing - modify this cell to test your own text
YOUR_TEST_TEXT = "Enter your own text here to see how different tokenizers handle it!"

# Analyze your text
your_result = analyze_tokenization(YOUR_TEST_TEXT, "your_test")

print("🧪 YOUR TEXT ANALYSIS")
print("=" * 40)
print(f"Text: '{YOUR_TEST_TEXT}'")
print(f"Characters: {your_result['char_count']}")
print()
print(f"GPT-4 tokenization:")
print(f"  Tokens: {your_result['gpt4_tokens']}")
print(f"  Cost: ${your_result['gpt4_cost']:.6f}")
print(f"  Efficiency: {your_result['gpt4_chars_per_token']:.2f} chars/token")
print()
print(f"LLaMA tokenization:")
print(f"  Tokens: {your_result['llama_tokens']}")
print(f"  Cost: ${your_result['llama_cost']:.6f}")
print(f"  Efficiency: {your_result['llama_chars_per_token']:.2f} chars/token")
print()
print(f"Token ratio (GPT-4/LLaMA): {your_result['token_ratio']:.2f}")
print(f"Cost ratio (GPT-4/LLaMA): {your_result['cost_ratio']:.2f}")

## 📋 Key Findings Summary

In [None]:
print("🎯 KEY FINDINGS & RECOMMENDATIONS")
print("=" * 60)
print()
print("📊 TOKENIZATION EFFICIENCY:")
print(f"  • GPT-4 (cl100k_base): {total_chars/total_gpt4_tokens:.2f} chars/token average")
print(f"  • LLaMA (SentencePiece): {total_chars/total_llama_tokens:.2f} chars/token average")
print(f"  • GPT-4 is ~{((total_chars/total_gpt4_tokens)/(total_chars/total_llama_tokens)-1)*100:.0f}% more token-efficient")
print()
print("💰 COST IMPLICATIONS:")
print(f"  • At the example rates, GPT-4 costs {(total_gpt4_cost/total_llama_cost):.1f}x as much per text")
print(f"  • The gap comes from higher per-token pricing; the larger vocabulary tends to reduce token counts")
print(f"  • Token efficiency doesn't always translate to cost savings")
print()
print("🎯 WHEN TO USE EACH:")
print("  GPT-4 Tokenizer:")
print("    ✓ Code-heavy applications")
print("    ✓ Technical/scientific text")
print("    ✓ Mixed Unicode content")
print("    ✓ Quality over cost scenarios")
print()
print("  LLaMA Tokenizer:")
print("    ✓ Cost-sensitive applications")
print("    ✓ Consistent multilingual text")
print("    ✓ Memory-constrained environments")
print("    ✓ Research/academic use")
print()
print("🔍 WHY SAME TEXT COSTS DIFFERENT:")
print("  1. Different vocabulary sizes (100K vs 32K tokens)")
print("  2. Different tokenization algorithms (BPE variants)")
print("  3. Different training data and optimization goals")
print("  4. Different pricing models per token")
print("  5. Different handling of special characters and Unicode")

## 🚀 Next Steps & Exercises

### Try These Experiments:
1. **Test with your domain-specific text** (legal, medical, technical)
2. **Compare with other tokenizers** (BERT, T5, etc.)
3. **Analyze different languages** (Chinese, Arabic, etc.)
4. **Test with code in different programming languages**
5. **Measure actual inference speed differences**
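
For experiment 5, one possible starting point is to time the tokenizers themselves. The sketch below measures encode throughput only, not model inference; `time_encode` is an illustrative helper, and a whitespace splitter stands in for a real tokenizer so the code runs without any downloads. In the notebook you could pass `gpt4_tokenizer.encode` or a small wrapper around `llama_tokenizer.encode` instead.

```python
import time
from typing import Callable, Sequence


def time_encode(encode: Callable[[str], Sequence], text: str, repeats: int = 100) -> dict:
    """Time repeated calls to an encode function and report simple throughput numbers."""
    n_tokens = len(encode(text))  # warm-up call; also gives the token count
    start = time.perf_counter()
    for _ in range(repeats):
        encode(text)
    elapsed = time.perf_counter() - start
    return {
        "tokens": n_tokens,
        "seconds_per_call": elapsed / repeats,
        "tokens_per_second": (n_tokens * repeats / elapsed) if elapsed > 0 else float("inf"),
    }


# Stand-in encoder (an assumption for this sketch): splits on whitespace.
whitespace_encode = str.split

stats = time_encode(whitespace_encode, "The quick brown fox jumps over the lazy dog.")
print(stats)
```

Timings from a single short string are noisy; longer inputs and more repeats give steadier numbers.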

### Questions to Explore:
- How do tokenizers handle out-of-vocabulary words?
- What's the impact on model performance vs efficiency?
- How do different tokenizers affect multilingual capabilities?
- What are the memory implications during training vs inference?
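
As a starting point for the out-of-vocabulary question, the toy tokenizer below shows the byte-fallback idea: when no vocabulary piece matches, the character is emitted as its UTF-8 bytes, so no input is ever truly "unknown". This is a simplified greedy longest-match sketch, not how cl100k_base or the LLaMA tokenizer actually segment text; the vocabulary is hypothetical, and the `<0xNN>` spelling is borrowed from the way SentencePiece byte tokens are usually displayed.

```python
def tokenize_with_byte_fallback(text: str, vocab: set) -> list:
    """Greedy longest-match over a toy vocabulary; unmatched characters become byte tokens."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabulary, far smaller than any real one.
toy_vocab = {"Hello", " ", "wor", "ld", "!"}
tokens = tokenize_with_byte_fallback("Hello world! 世", toy_vocab)
print(tokens)
```

The single CJK character at the end costs three tokens here because it is three bytes in UTF-8 and the toy vocabulary has no piece for it.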