# Building LLMs From Scratch (Part 2): The Power of Tokenization

Welcome to Part 2 of our **"Building LLMs from Scratch"** series! 

In [Part 1](https://soloshun.medium.com/building-llms-from-scratch-part-1-the-complete-theoretical-foundation-e66b45b7f379), we established the theoretical foundation. Now it's time to get practical and take our first step into code.

## What We'll Learn Today

1. **Compare tokenization strategies** and their trade-offs
2. **Build a simple tokenizer** from scratch using Python
3. **Understand Byte Pair Encoding (BPE)** - the algorithm behind GPT-style tokenizers
4. **Use professional tokenizers** like OpenAI's `tiktoken`

Let's start coding! 🚀

---

**📖 This notebook accompanies the Medium article:** [Building LLMs From Scratch (Part 2): The Power of Tokenization]

**📂 Find the clean Python script version:** [src/part02_tokenization.py](../src/part02_tokenization.py)


## Chapter 1: The Tokenization Dilemma

Before a Large Language Model can understand text, it needs to convert that text into numbers. This process is called **tokenization**.

But how should we split a sentence? There are three main approaches:

### 1. Word-Based Tokenization
Split on whitespace and punctuation. Simple, but any word that wasn't in the training text is unknown, and the vocabulary grows with every new word form.

### 2. Character-Based Tokenization  
Split into individual characters. The vocabulary is tiny and nothing is ever unknown, but sequences get long and a single character carries little meaning.

### 3. Sub-word Tokenization ⭐
The modern approach! Keep common words whole, split rare words into meaningful parts.

Let's see these in action:


In [None]:
# Let's see the difference between tokenization approaches
text = "My hobby is playing cricket"

# 1. Word-based tokenization
word_tokens = text.split()
print("Word-based:", word_tokens)
print("Vocabulary size would be HUGE with all possible words!\n")

# 2. Character-based tokenization  
char_tokens = list(text)
print("Character-based:", char_tokens)
print("Very long sequences, meaning is lost!\n")

# 3. Sub-word tokenization (we'll build this!)
print("Sub-word tokenization:")
print("- Common words like 'is' stay as one token")
print("- Rare words like 'snowboarding' might become ['snow', 'board', 'ing']")
print("- Best of both worlds! ✨")
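
To make the sub-word idea concrete before we get to BPE, here is a minimal sketch that splits a word by greedy longest-match against a tiny vocabulary. Note the assumptions: `TOY_VOCAB` and `greedy_subword_split` are made up for illustration, the vocabulary is hand-picked rather than learned, and this matching strategy is not BPE itself.

```python
# Hypothetical toy vocabulary; a real sub-word vocabulary is learned from data.
TOY_VOCAB = {"snow", "board", "ing", "is", "play", "my", "hobby"}

def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Falls back to single characters when no vocabulary piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: emit one character so we always make progress.
            pieces.append(word[start])
            start += 1
    return pieces

print(greedy_subword_split("snowboarding", TOY_VOCAB))  # ['snow', 'board', 'ing']
print(greedy_subword_split("is", TOY_VOCAB))            # ['is']
print(greedy_subword_split("playing", TOY_VOCAB))       # ['play', 'ing']
```

The character fallback is what keeps this approach from ever hitting an unknown word: in the worst case a word is simply spelled out letter by letter.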


## Chapter 2: Building Our First Tokenizer From Scratch

Now let's build a simple word-based tokenizer to understand the mechanics. We'll use "The Verdict" by Edith Wharton as our training text.

### Step 1: Load the Text Data


In [None]:
# Load our text data
try:
    with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()
    print(f"✅ Successfully loaded text! Total characters: {len(raw_text)}")
    print(f"First 100 characters:\n'{raw_text[:100]}'")
except FileNotFoundError:
    print("❌ File not found. Make sure 'the-verdict.txt' is in the '../data/' folder")
    # For demonstration, let's use a sample text
    raw_text = """I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that he had been caught."""
    print(f"Using sample text instead: '{raw_text[:80]}...'")


### Step 2: Split Text into Tokens

We'll use regular expressions to split text by spaces and punctuation, while keeping punctuation as separate tokens.


In [None]:
import re

# First, let's see how regex splitting works on a simple example
example = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', example)
print("Raw split result:", result)

# Clean up empty strings and whitespace
cleaned = [item.strip() for item in result if item.strip()]
print("Cleaned tokens:", cleaned)
print("✨ Notice how punctuation becomes separate tokens!")
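
Why the parentheses in the pattern? A small sketch of what they change, using a simplified pattern (only `,`, `.` and whitespace) so the output stays short: `re.split` drops the delimiters unless the pattern contains a capturing group.

```python
import re

example = "Hello, world."

# Without a capturing group, the delimiters are discarded.
without_group = re.split(r'[,.]|\s', example)
# With a capturing group, each delimiter is kept as its own list item.
with_group = re.split(r'([,.]|\s)', example)

print("Without group:", without_group)  # ['Hello', '', 'world', '']
print("With group:   ", with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']
```

The empty strings and the lone space in the second result are exactly what the `strip()` filter above removes.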


In [None]:
# Now let's apply this to our full text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(f"Total tokens: {len(preprocessed)}")
print("First 20 tokens:", preprocessed[:20])

# Build our vocabulary (mapping tokens to IDs)
all_tokens = sorted(set(preprocessed))
vocab_size = len(all_tokens)
vocab = {token: integer for integer, token in enumerate(all_tokens)}

print(f"\nVocabulary size: {vocab_size}")
print("First 10 vocabulary entries:")
for token, id_ in list(vocab.items())[:10]:
    print(f"  '{token}' -> {id_}")
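
If the first vocabulary entries look like mostly punctuation, that comes from how `sorted` orders strings: by Unicode code point, so punctuation comes before uppercase letters, which come before lowercase ones. A tiny sketch with a made-up token list (the names `toy_tokens` and `toy_vocab` are only for this example):

```python
# A made-up token list with duplicates, mixed case, and punctuation.
toy_tokens = ["the", "cat", ",", "The", "cat", "!"]

# Same recipe as above: deduplicate, sort, then number the tokens.
toy_vocab = {token: i for i, token in enumerate(sorted(set(toy_tokens)))}

print(toy_vocab)  # {'!': 0, ',': 1, 'The': 2, 'cat': 3, 'the': 4}
```

Note that `"The"` and `"the"` get different IDs: this tokenizer is case-sensitive.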


### Step 3: Create the Tokenizer Class

Now let's wrap our logic in a reusable class with `encode` and `decode` methods:


In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID -> token
        
    def encode(self, text):
        """Convert text to a list of token IDs"""
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        """Convert token IDs back to text"""
        text = " ".join([self.int_to_str[i] for i in ids])
        # Clean up spacing around punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

# Test our tokenizer!
tokenizer = SimpleTokenizerV1(vocab)
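# Use words that occur in the training text; anything outside the vocabulary raises a KeyError
test_text = "I always thought Jack Gisburn a good fellow."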

encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"Original:  '{test_text}'")
print(f"Encoded:   {encoded}")
print(f"Decoded:   '{decoded}'")
print(f"Success:   {test_text == decoded}")
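
One caveat: the round trip is not always exact. Mapping tokens to IDs and back loses nothing, but the split and join steps do. The sketch below isolates those two steps with the same regular expressions as the class above (the helper names `split_tokens` and `join_tokens` are made up for this example):

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Same splitting and cleanup as in encode(), without the ID lookup."""
    return [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]

def join_tokens(tokens):
    """Same joining and punctuation cleanup as in decode()."""
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

samples = [
    "a good fellow enough.",   # survives the round trip
    "a cheap genius--though",  # '--' comes back with spaces around it
    "too   many    spaces",    # runs of whitespace collapse to one space
]
for s in samples:
    restored = join_tokens(split_tokens(s))
    print(f"{s!r} -> {restored!r} | identical: {s == restored}")
```

For training an LLM this loss is usually acceptable, but it is worth knowing that `decode(encode(text))` only approximates the original text.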


### The Problem: Unknown Words! ❌

Our tokenizer works great... until it encounters a word not in its vocabulary:


In [None]:
# Try encoding a word that's not in our vocabulary
try:
    unknown_text = "Hello, do you like pizza?"  # "Hello" and "pizza" are probably not in our vocabulary
    tokenizer.encode(unknown_text)
    print("✅ Encoded successfully!")
except KeyError as e:
    print(f"❌ KeyError: {e}")
    print("Our tokenizer crashes when it sees unknown words!")


## Chapter 3: Handling Unknown Words with Special Tokens

To fix this, we'll add special tokens to our vocabulary:

- `<|unk|>`: Represents unknown words
- `<|endoftext|>`: Separates different documents/passages


In [None]:
# Create improved vocabulary with special tokens
all_tokens_v2 = sorted(set(preprocessed))
all_tokens_v2.extend(["<|endoftext|>", "<|unk|>"])
vocab_v2 = {token: integer for integer, token in enumerate(all_tokens_v2)}

print(f"New vocabulary size: {len(vocab_v2)}")
print("Special tokens added:")
print(f"  '<|endoftext|>' -> {vocab_v2['<|endoftext|>']}")
print(f"  '<|unk|>' -> {vocab_v2['<|unk|>']}")

class SimpleTokenizerV2(SimpleTokenizerV1):
    def encode(self, text):
        """Convert text to token IDs, handling unknown words"""
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        
        # Replace unknown tokens with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" 
            for item in preprocessed
        ]
        
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

# Test the improved tokenizer
tokenizer_v2 = SimpleTokenizerV2(vocab_v2)

# Test with known words
test1 = "I always thought Jack Gisburn a good fellow."
print(f"Known words: '{test1}' -> {tokenizer_v2.encode(test1)}")

# Test with unknown words  
test2 = "Hello, do you like pizza?"
encoded_with_unknown = tokenizer_v2.encode(test2)
decoded_with_unknown = tokenizer_v2.decode(encoded_with_unknown)

print(f"With unknown: '{test2}'")
print(f"Encoded: {encoded_with_unknown}")
print(f"Decoded: '{decoded_with_unknown}'")
print("✅ No more crashes!")
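
We haven't used `<|endoftext|>` yet, so here is a minimal sketch of how it could separate two documents in one token stream. Everything in it is an assumption made for illustration: the two tiny documents, the toy training text, and the use of `dict.get` as a shortcut for the unknown-word replacement.

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Same splitting and cleanup as in our tokenizer classes."""
    return [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]

# Toy training text and vocabulary, with the special tokens appended at the end.
train_text = "I had tea. He had tea."
vocab_tokens = sorted(set(split_tokens(train_text))) + ["<|endoftext|>", "<|unk|>"]
str_to_int = {token: i for i, token in enumerate(vocab_tokens)}

# Two independent documents, joined with the separator token.
docs = ["I had tea.", "He had pizza."]
joined = " <|endoftext|> ".join(docs)

# Unknown tokens fall back to the ID of <|unk|>.
unk_id = str_to_int["<|unk|>"]
ids = [str_to_int.get(token, unk_id) for token in split_tokens(joined)]

print(joined)  # I had tea. <|endoftext|> He had pizza.
print(ids)     # [2, 3, 4, 0, 5, 1, 3, 6, 0]
```

The separator survives as a single token because none of its characters appear in the split pattern. ID 5 marks the document boundary, and ID 6 stands in for the unknown word "pizza".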


## Chapter 4: The Industry Standard - Byte Pair Encoding (BPE)

Our simple tokenizer works, but many modern LLMs, including the GPT family, use **Byte Pair Encoding (BPE)** - a subword algorithm that balances the trade-off between vocabulary size and unknown words.

### How BPE Works:
1. Start with character-level tokens (GPT-2 starts from raw bytes instead)
2. Iteratively merge the most frequent pair of adjacent tokens
3. Repeat until the vocabulary reaches a target size, leaving a vocabulary of common subwords
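
The steps above can be sketched as a toy training loop. This is a simplified illustration, not how `tiktoken` is implemented: it works on characters rather than bytes, splits words on whitespace, and breaks ties by taking the first most-frequent pair it encounters. The function names are made up for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Step 1: start with character-level symbols for each word.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        # Step 2: merge the most frequent adjacent pair.
        best = max(counts, key=counts.get)
        words = merge_pair(words, best)
        merges.append(best)
    # Step 3: the learned merges define the subword vocabulary.
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After just two merges, the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still spelled as "low" plus individual characters.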

Let's see BPE in action using OpenAI's `tiktoken` library:


In [None]:
# First, let's install tiktoken (run this if not already installed)
# !pip install tiktoken

import tiktoken

# Load GPT-2's BPE tokenizer
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

print("Available encodings:", tiktoken.list_encoding_names())
print("GPT-2 tokenizer loaded!")

# Test with text that mixes common words, a special token, and made-up words
text = "Hello, do you like tea? <|endoftext|> Akwirw ier"

# Encode
integers = gpt2_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(f"\nOriginal: '{text}'")
print(f"Encoded:  {integers}")

# Decode
decoded = gpt2_tokenizer.decode(integers)
print(f"Decoded:  '{decoded}'")

print("\n✨ Notice: BPE breaks unknown 'Akwirw ier' into subword tokens!")
print("This is much better than replacing it with <|unk|>")
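
Why does GPT-2's tokenizer never need `<|unk|>`? Its BPE works on UTF-8 bytes, and there are only 256 possible byte values, so any string can be spelled out from that base alphabet before any merges are applied. A small sketch of that fallback, using only the standard library (the helper `to_byte_ids` is a made-up name, and this shows raw bytes, not actual GPT-2 token IDs):

```python
def to_byte_ids(text):
    """Return the UTF-8 byte values of `text`: the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

made_up = "Akwirw ier"
byte_ids = to_byte_ids(made_up)
print(byte_ids)  # [65, 107, 119, 105, 114, 119, 32, 105, 101, 114]

# The mapping is lossless: the bytes decode back to the exact original string.
print(bytes(byte_ids).decode("utf-8") == made_up)  # True

# Non-ASCII characters simply take more than one byte each.
print(to_byte_ids("é✨"))  # [195, 169, 226, 156, 168]
```

BPE merges then compress frequent byte sequences into single tokens, which is why common words end up as one ID while made-up words fall back to shorter pieces.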


### Let's See BPE Magic in Action! ✨


In [None]:
# Compare our simple tokenizer vs. GPT-2's BPE tokenizer
test_words = ["tokenization", "snowboarding", "artificialintelligence", "pizza"]

print("🔄 Tokenization Comparison:\n")
print(f"{'Word':<20} {'Our Tokenizer':<15} {'GPT-2 BPE':<30}")
print("-" * 70)

for word in test_words:
    # Our tokenizer maps every out-of-vocabulary word to <|unk|>, so no try/except is needed
    our_result = tokenizer_v2.encode(word)
    our_tokens = [tokenizer_v2.int_to_str[id_] for id_ in our_result]
    
    # GPT-2 BPE tokenizer
    gpt2_ids = gpt2_tokenizer.encode(word)
    gpt2_tokens = [gpt2_tokenizer.decode([id_]) for id_ in gpt2_ids]
    
    print(f"{word:<20} {str(our_tokens):<15} {str(gpt2_tokens):<30}")

print("\n🎯 Key Insight: BPE breaks unknown words into smaller subword pieces from its vocabulary!")
print("   This lets the model reuse what it learned about those pieces, even in words it has never seen.")


## 🎉 Summary & What We've Learned

Congratulations! You've just built your first tokenizer and understand the fundamentals of modern tokenization.

### Key Takeaways:

1. **Tokenization is crucial** - It's the bridge between human text and machine numbers
2. **Word-level is simple** but has vocabulary explosion problems
3. **Character-level solves vocabulary** but loses meaning and creates long sequences  
4. **Subword (BPE) is the sweet spot** - handles unknown words gracefully while keeping sequences manageable
5. **Professional tokenizers like tiktoken** are highly optimized and ready for production use

### What's Next?

Now that we can convert text into token sequences, we need to prepare this data for training our LLM. 

**In Part 3**, we'll build a **data loader** that creates input-target pairs from our token sequences - the exact format our model needs for training!

---

🔗 **Find the complete code:**
- **This notebook**: `notebooks/part02_tokenization.ipynb`
- **Python script**: `src/part02_tokenization.py`  
- **GitHub repository**: [llm-from-scratch](https://github.com/soloeinsteinmit/llm-from-scratch)

📝 **Read the full article**: [Building LLMs From Scratch (Part 2): The Power of Tokenization](https://soloshun.medium.com/building-llms-from-scratch-part-2-the-power-of-tokenization)

Happy coding! 🚀
