# TP1: Understanding Tokenization

**Day 2 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day2/tp1.ipynb)

## Objectives
1. Understand **why tokenization matters** for machine learning
2. Compare tokenization strategies: character, word, and subword (BPE)
3. Explore **domain-specific tokenizers** for molecules, DNA, and proteins
4. Understand trade-offs for scientific applications

## Setup

In [None]:
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q transformers sentencepiece

import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

from aiforscience import (
    visualize_tokens,
    compare_tokenizers,
    tokenizer_stats,
)

print("Setup complete!")

---
# Part 1: Why Tokenization Matters

**The Problem:** Neural networks operate on numbers, not raw text.

**Tokenization** converts text into a sequence of integers:

```
"Hello world" → ["Hello", " world"] → [15496, 995]
```

How text is split determines the vocabulary size, the sequence length, and how rare words are represented, all of which affect model performance.
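
To make the text → tokens → integers pipeline concrete, here is a minimal sketch with a toy word-level vocabulary. The vocabulary `toy_vocab`, its IDs, and the helpers `encode` and `decode` are invented for illustration; they are not GPT-2's.

```python
# Toy word-level tokenizer: the vocabulary and IDs are invented for illustration.
toy_vocab = {"<unk>": 0, "hello": 1, "world": 2, "science": 3}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def encode(text):
    """Lowercase, split on whitespace, and map each word to its ID."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Hello world")
print(ids)                        # [1, 2]
print(decode(ids))                # hello world
print(encode("Hello chemistry"))  # [1, 0] (the unknown word maps to <unk>)
```

Note that the round trip loses the original capitalization, and the unknown word is lost entirely. Real tokenizers are designed to avoid exactly this kind of information loss.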

In [None]:
# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# See tokenization in action
text = "Machine learning transforms scientific research."
visualize_tokens(text, tokenizer)
plt.show()

## Three Tokenization Strategies

| Strategy | How it works | Vocab Size | Sequence Length |
|----------|--------------|------------|------------------|
| **Character** | Each character = 1 token | ~100 | Very long |
| **Word** | Each word = 1 token | 100K+ | Short |
| **Subword (BPE)** | Frequent substrings = tokens | ~30-50K | Medium |

In [None]:
text = "Photosynthesis occurs in chloroplasts."

# Character-level
char_tokens = list(text)
print(f"Character-level: {len(char_tokens)} tokens")
print(f"  {char_tokens[:15]}...\n")

# Word-level
word_tokens = text.replace('.', ' .').split()
print(f"Word-level: {len(word_tokens)} tokens")
print(f"  {word_tokens}\n")

# Subword (BPE) - GPT-2
subword_tokens = tokenizer.tokenize(text)
print(f"Subword (GPT-2): {len(subword_tokens)} tokens")
print(f"  {subword_tokens}")

**Key insight:** BPE keeps common words as single tokens but splits rare words into subwords.

Because a rare or unseen word can be broken into smaller known pieces, BPE can handle **any** input while keeping sequences reasonably short.
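
The sketch below contrasts a word-level lookup, which must fall back to an unknown token for unseen words, with a subword segmentation that falls back to smaller pieces. For brevity it uses greedy longest-match segmentation (closer to WordPiece than to BPE merges), and the vocabularies `word_vocab` and `subword_vocab` are hypothetical.

```python
# Hypothetical vocabularies, chosen only to illustrate the fallback behaviour.
word_vocab = {"photo", "synthesis", "occurs"}
subword_vocab = {"photo", "synthesis", "chloro", "plast", "s"}

def word_level(word):
    """Word-level lookup: unseen words collapse to a single <unk> token."""
    return [word] if word in word_vocab else ["<unk>"]

def greedy_subword(word):
    """Greedy longest-match segmentation; single characters are the last resort."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

print(word_level("chloroplasts"))      # ['<unk>']
print(greedy_subword("chloroplasts"))  # ['chloro', 'plast', 's']
print(greedy_subword("xyz"))           # ['x', 'y', 'z']
```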

---
## Exercise 1: Explore Tokenization

How does GPT-2 tokenize text from your domain?

In [None]:
# TODO: Try your own scientific text!
my_text = "Your scientific text here"  # <-- Modify this!

visualize_tokens(my_text, tokenizer)
plt.show()

**Questions:**
1. Are technical terms from your field single tokens or split?
2. What might this mean for a model's understanding?

---
# Part 2: How BPE Works

**Byte Pair Encoding (BPE)** builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols:

1. Start with individual characters
2. Count all adjacent pairs
3. Merge the most frequent pair → new token
4. Repeat until the target vocabulary size is reached

In [None]:
from collections import Counter

def simple_bpe_demo(words, n_merges=5):
    """Demonstrate BPE algorithm."""
    # Initialize: each word split into characters
    vocab = {word: list(word) + ['</w>'] for word in words}
    
    print(f"Corpus: {words}")
    print(f"Initial: {vocab}\n")
    
    for step in range(n_merges):
        # Count pairs
        pairs = Counter()
        for word, tokens in vocab.items():
            for i in range(len(tokens) - 1):
                pairs[(tokens[i], tokens[i+1])] += words.count(word)
        
        if not pairs:
            break
        
        # Find most frequent
        best = max(pairs, key=pairs.get)
        print(f"Step {step+1}: Merge '{best[0]}' + '{best[1]}' → '{best[0]+best[1]}' (count: {pairs[best]})")
        
        # Apply merge
        new_vocab = {}
        for word, tokens in vocab.items():
            new_tokens = []
            i = 0
            while i < len(tokens):
                if i < len(tokens)-1 and (tokens[i], tokens[i+1]) == best:
                    new_tokens.append(tokens[i] + tokens[i+1])
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            new_vocab[word] = new_tokens
        vocab = new_vocab
    
    print(f"\nFinal: {vocab}")

# Demo
simple_bpe_demo(['low', 'lower', 'lowest', 'new', 'newer'], n_merges=5)

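**Result:** On this tiny corpus the first merges build up the shared stem `low`. On larger corpora, with more merges, BPE also picks up frequent affixes such as `-er` and `-est` without being told about them!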
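
Once merges have been learned, a new word is segmented by replaying them in the order they were learned. Here is a minimal sketch, assuming a hypothetical merge list `merges` (not the exact output of the demo above) and a hypothetical helper `apply_merges`:

```python
def apply_merges(word, merges):
    """Segment a word by replaying learned merges in order."""
    tokens = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Fuse every adjacent occurrence of the current pair.
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (left, right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge list, in the order the merges were learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(apply_merges("slower", merges))  # ['s', 'low', 'er</w>']
print(apply_merges("lowly", merges))   # ['low', 'l', 'y', '</w>']
```

Neither word appears in the training corpus, yet both are segmented into known pieces, with single characters as the fallback.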

In [None]:
# Common words → single tokens
print("Common words:")
for word in ["the", "and", "computer", "science"]:
    tokens = tokenizer.tokenize(word)
    print(f"  '{word}' → {tokens}")

print("\nRare words (split into subwords):")
for word in ["antidisestablishmentarianism", "electroencephalography"]:
    tokens = tokenizer.tokenize(word)
    print(f"  '{word}' → {len(tokens)} tokens")
    print(f"    {tokens}")

---
# Part 3: Domain-Specific Tokenizers

**Key insight:** Tokenizers trained on English text often split scientific data (SMILES strings, DNA, protein sequences) into pieces that carry no domain meaning!

| Domain | Data | Specialized Tokenizer |
|--------|------|----------------------|
| Chemistry | SMILES strings | ChemBERTa |
| Genomics | DNA (ATCG) | DNABERT (k-mers) |
| Proteomics | Amino acids | ESM-2, ProtBERT |

## 3.1 Chemistry: SMILES Tokenization

**SMILES** represents molecules as strings:
- Aspirin: `CC(=O)OC1=CC=CC=C1C(=O)O`
- Caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`

In [None]:
# Load chemistry tokenizer
chem_tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Compare
compare_tokenizers(aspirin, {
    "ChemBERTa (chemistry)": chem_tokenizer,
    "GPT-2 (English)": gpt2_tokenizer,
})
plt.show()

ChemBERTa's tokenizer was trained on SMILES strings, so its vocabulary is built from chemical notation (`C`, `=O`, ring numbers) rather than English word pieces!
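
A simple way to see what chemically meaningful units can look like is an atom-level split. The sketch below is not ChemBERTa's tokenizer: the pattern `SMILES_PATTERN` and the helper `smiles_tokenize` are simplified assumptions that cover only common SMILES symbols.

```python
import re

# Simplified pattern (an assumption for illustration, not ChemBERTa's tokenizer):
# bracket atoms, two-letter halogens, one-letter atoms, ring closures, bonds.
SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+()/\\.@:~*$]"
)

def smiles_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens

print(smiles_tokenize("CC(=O)Cl"))     # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(smiles_tokenize("[Na+].[Cl-]"))  # ['[Na+]', '.', '[Cl-]']
```

Note that `Cl` and the bracketed ions stay intact as single tokens, which a tokenizer unaware of SMILES syntax has no reason to guarantee.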

## 3.2 Genomics: DNA k-mer Tokenization

DNA uses 4 bases: **A, T, C, G**

**k-mer tokenization** splits into overlapping windows:
```
ATCGATCG → [ATC, TCG, CGA, GAT, ATC, TCG]  (3-mers)
```

In [None]:
def kmer_tokenize(sequence, k=3):
    """Split DNA sequence into k-mers."""
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

dna = "ATCGATCGATCGATCG"

for k in [3, 4, 6]:
    kmers = kmer_tokenize(dna, k)
    vocab_size = 4 ** k  # 4 bases, k positions
    print(f"{k}-mers: {len(kmers)} tokens, vocab size = {vocab_size}")
    print(f"  {kmers[:6]}...\n")
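
As with text, each k-mer ultimately needs an integer ID. Here is a minimal sketch, assuming a vocabulary that simply enumerates all `4**k` k-mers in alphabetical order (real models typically add special tokens as well, which are omitted here). The helpers `build_kmer_vocab` and `kmer_encode` are hypothetical names.

```python
from itertools import product

def build_kmer_vocab(k, alphabet="ACGT"):
    """Enumerate all len(alphabet)**k k-mers and give each an integer ID."""
    return {"".join(p): idx for idx, p in enumerate(product(alphabet, repeat=k))}

def kmer_encode(sequence, vocab, k):
    """Overlapping k-mers (stride 1) mapped to their IDs."""
    return [vocab[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]

vocab3 = build_kmer_vocab(3)
print(len(vocab3))                         # 64
print(kmer_encode("ATCGATCG", vocab3, 3))  # [13, 54, 24, 35, 13, 54]
```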

In [None]:
# How GPT-2 sees DNA (not well!)
dna = "ATCGATCGATCGATCG"
visualize_tokens(dna, gpt2_tokenizer, title="GPT-2 on DNA")
plt.show()
print("GPT-2 doesn't understand DNA structure!")

## 3.3 Proteomics: Amino Acid Tokenization

Proteins are sequences of **20 amino acids** (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).

Protein models typically use **character-level tokenization** where each amino acid = 1 token.
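
Character-level tokenization is simple enough to sketch without any library. The ID assignment below is arbitrary and only illustrative (it does not match the IDs used by ESM-2 or ProtBERT); `AMINO_ACIDS`, `aa_to_id`, and `protein_encode` are hypothetical names.

```python
# The 20 standard amino acids; the ID assignment is arbitrary (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
aa_to_id = {aa: idx for idx, aa in enumerate(AMINO_ACIDS)}

def protein_encode(sequence):
    """Character-level tokenization: one ID per residue."""
    return [aa_to_id[aa] for aa in sequence]

fragment = "MALWMRLLPLLALLALWGPDPAAA"
ids = protein_encode(fragment)
print(len(fragment), len(ids))  # 24 24
print(ids[:5])                  # [10, 0, 9, 18, 10]
```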

In [None]:
# Load protein tokenizer
try:
    protein_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
    tokenizer_name = "ESM-2"
except Exception:  # fall back to ProtBERT if ESM-2 cannot be loaded
    protein_tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
    tokenizer_name = "ProtBERT"

# Insulin fragment (first 24 residues of human preproinsulin)
insulin = "MALWMRLLPLLALLALWGPDPAAA"

# ProtBERT expects spaces between amino acids; ESM-2 also accepts spaced input
spaced = " ".join(list(insulin))
tokens = protein_tokenizer.tokenize(spaced)

print(f"{tokenizer_name} tokenization of insulin fragment:")
print(f"  Sequence: {insulin}")
print(f"  Length: {len(insulin)} amino acids")
print(f"  Tokens: {len(tokens)}")
print(f"  → Each amino acid is one token!")

---
## Exercise 2: Compare Tokenizers on Your Data

In [None]:
# Load tokenizers
tokenizers = {
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "ChemBERTa": AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1"),
}

# Test data
test_samples = [
    "The mitochondrion is the powerhouse of the cell.",
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "ATCGATCGATCGATCG",           # DNA
    "MVLSPADKTNVKAAWGKVGAHAGEY",  # Protein
]

# Compare token counts
print("Token counts by tokenizer:")
print("-" * 50)
for sample in test_samples:
    short = (sample[:30] + "...") if len(sample) > 30 else sample
    print(f"\n{short}")
    for name, tok in tokenizers.items():
        n_tokens = len(tok.tokenize(sample))
        print(f"  {name}: {n_tokens} tokens")

In [None]:
# Vocabulary size comparison
tokenizer_stats(tokenizers)
plt.show()

---
# Part 4: Why This Matters

## The Problem with General Tokenizers on Scientific Data

In [None]:
# Scientific terms from different domains
domains = {
    "Biology": ["mitochondria", "chloroplast", "photosynthesis", "ribosome"],
    "Chemistry": ["stoichiometry", "electronegativity", "chromatography"],
    "Physics": ["thermodynamics", "superconductivity", "electromagnetism"],
    "Medicine": ["pharmacokinetics", "immunotherapy", "pathogenesis"],
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print("How GPT-2 tokenizes scientific terms:")
print("=" * 50)

for domain, terms in domains.items():
    print(f"\n{domain}:")
    for term in terms:
        tokens = tokenizer.tokenize(term)
        print(f"  {term:25} → {len(tokens)} tokens: {tokens}")

**Key takeaways:**

1. General tokenizers fragment scientific terms into many subwords
2. This means the model doesn't "see" these as single concepts
3. Domain-specific tokenizers have vocabularies optimized for their field
4. Using the right tokenizer can significantly improve model performance

---
## Exercise 3: Analyze Your Domain

Add technical terms from your research field and see how they're tokenized.

In [None]:
# TODO: Add terms from your domain!
my_domain = "Your Field"  # <-- Change this
my_terms = [
    "term1",  # <-- Add your terms
    "term2",
    "term3",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(f"\n{my_domain}:")
for term in my_terms:
    tokens = tokenizer.tokenize(term)
    print(f"  {term:25} → {len(tokens)} tokens: {tokens}")

# Calculate statistics
total_tokens = sum(len(tokenizer.tokenize(t)) for t in my_terms)
avg_tokens = total_tokens / len(my_terms)
single_token = sum(1 for t in my_terms if len(tokenizer.tokenize(t)) == 1)

print(f"\nStatistics:")
print(f"  Average tokens per term: {avg_tokens:.1f}")
print(f"  Single-token terms: {single_token}/{len(my_terms)} ({100*single_token/len(my_terms):.0f}%)")

---
# Summary

## Key Takeaways

1. **Tokenization is foundational** - it determines what the model "sees"

2. **Three strategies:**
   - Character: small vocab, long sequences
   - Word: large vocab, OOV problems
   - Subword (BPE): balanced trade-off

3. **Domain matters:**
   - GPT-2 fragments scientific terms
   - ChemBERTa understands molecular notation
   - DNABERT uses k-mers for genetic sequences
   - ESM-2 tokenizes amino acids individually

4. **Practical advice:**
   - Use domain-specific models when available
   - Check how your data is tokenized before training
   - More tokens = more computation + harder learning
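
As a rough illustration of the last point, the sketch below assumes a standard Transformer whose self-attention compares every token with every other token, so cost grows with the square of the sequence length. The token counts are made-up examples, not measurements, and `relative_attention_cost` is a hypothetical helper.

```python
def relative_attention_cost(n_tokens, baseline_tokens):
    """Ratio of pairwise attention comparisons, assuming cost ~ n_tokens**2."""
    return (n_tokens / baseline_tokens) ** 2

# Hypothetical token counts for the same input under two tokenizers.
domain_tokens = 20
general_tokens = 60

ratio = relative_attention_cost(general_tokens, domain_tokens)
print(f"{general_tokens / domain_tokens:.0f}x more tokens -> {ratio:.0f}x more attention comparisons")
# 3x more tokens -> 9x more attention comparisons
```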

## Reflection Questions

1. Are there domain-specific tokenizers/models for your research area?
2. How might tokenization affect the results in your field?
3. What trade-offs would you consider when choosing a tokenization strategy?

---
## Next: TP2 - Embeddings

In the next practical, we'll explore **embeddings** - the dense vector representations that come after tokenization. We'll see how domain-specific models create meaningful representations for molecules, proteins, DNA, and scientific text.