# TP1 - Part 1: Tokenization & Embeddings

**Day 2 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day2/tp1_part1.ipynb)

---

## Objectives

By the end of this practical, you will understand:

1. **The complete pipeline**: Text → Tokens → Token IDs → Embeddings
2. **What embeddings are**: Dense vector representations where similar things are close
3. **Why pre-trained models matter**: Leveraging knowledge from massive datasets
4. **How to use embeddings**: Similarity search and visualization

---

# Part 1: The Pipeline - From Text to Vectors

Neural networks only understand numbers. To process text (or molecules, proteins, DNA), we need to convert it to numbers.

```
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐
│  Raw Text   │ →  │   Tokens     │ →  │  Token IDs   │ →  │    Embeddings    │
│  (string)   │    │  (subwords)  │    │  (integers)  │    │ (dense vectors)  │
└─────────────┘    └──────────────┘    └──────────────┘    └──────────────────┘

"The cat sat"  →  ["The", "cat", "sat"] →  [464, 3797, 3332]  →  [[0.1, -0.2, ...], 
                                                                   [0.3, 0.5, ...],
                                                                   [-0.1, 0.4, ...]]
```

In this practical, we'll explore **tokenization** (how text is split) and then focus on **embeddings** (how tokens become meaningful vectors).

## Setup

In [None]:
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q transformers sentence-transformers

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

from aiforscience import (
    visualize_tokens,
    plot_similarity_matrix,
    semantic_search,
    print_search_results,
)

print("Setup complete!")

## Tokenization: Breaking Text into Pieces

Before we can process text, we need to **tokenize** it - break it into smaller units.

### Three Tokenization Strategies

| Strategy | How it works | Example: "unhappiness" | Pros/Cons |
|----------|--------------|------------------------|-----------|
| **Character** | Split into characters | `['u','n','h','a','p','p','i','n','e','s','s']` | ✓ Small vocab, ✗ Very long sequences |
| **Word** | Split by spaces/punctuation | `['unhappiness']` | ✓ Meaningful units, ✗ Huge vocab, can't handle new words |
| **Subword** | Split into common subparts | `['un', 'happiness']` | ✓ Balanced! Handles new words, reasonable vocab |
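
To make the first two rows of the table concrete, here is a minimal sketch of character-level and word-level splitting using only the standard library. The sample sentence and the regular expression are illustrative choices, not what any particular model uses.

```python
import re

text = "unhappiness is not unhappy"

# Character-level: every character (spaces included) becomes a token.
char_tokens = list(text)

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(char_tokens), "character tokens, vocabulary size", len(set(char_tokens)))
print(len(word_tokens), "word tokens, vocabulary size", len(set(word_tokens)))
```

The character version yields a long sequence over a tiny vocabulary. The word version yields a short sequence, but every new word needs its own vocabulary entry.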

**Modern models use subword tokenization** (like BPE - Byte Pair Encoding) because:
- Frequent words stay whole: "the", "cat"
- Rare words are split into known pieces, e.g. "transformers" → "transform" + "ers"
- Any word can be encoded, even typos, e.g. "looooong" → "l" + "oo" + "oo" + "ong" (exact splits depend on the tokenizer)
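
The core idea of BPE training can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair. This is a toy version under simplifying assumptions (a made-up four-word corpus, no byte-level handling, no end-of-word marker), not the exact procedure used to build the GPT-2 vocabulary.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): word -> count. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): c for w, c in corpus.items()}

merges = []
for _ in range(4):  # number of merges = how much the vocabulary grows
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Segmented words:", [list(w) for w in words])
```

After a few merges, frequent fragments such as "low" and "est" become single symbols while rarer words remain split into smaller pieces, which is the behaviour described in the list above.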

In [None]:
from transformers import AutoTokenizer

# Load two different tokenizers to compare
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")

print("=" * 50)
print("TOKENIZER COMPARISON")
print("=" * 50)
print(f"\n{'Model':<20} {'Vocabulary Size':<20}")
print("-" * 40)
print(f"{'GPT-2':<20} {len(tokenizer_gpt2):,}")
print(f"{'BERT':<20} {len(tokenizer_bert):,}")
print(f"\nGPT-2 has {len(tokenizer_gpt2) - len(tokenizer_bert):,} more tokens than BERT!")
print("Different models make different tokenization choices.")

In [None]:
# Compare how the SAME text is tokenized by different models
examples = [
    "Machine learning transforms scientific research.",
    "photosynthesis",
    "CRISPR-Cas9",
    "COVID-19 vaccine",
]

print("How different tokenizers split the SAME text:\n")
print("=" * 70)

for text in examples:
    tokens_gpt2 = tokenizer_gpt2.tokenize(text)
    tokens_bert = tokenizer_bert.tokenize(text)
    
    print(f"\nText: '{text}'")
    print(f"  GPT-2 ({len(tokens_gpt2)} tokens): {tokens_gpt2}")
    print(f"  BERT  ({len(tokens_bert)} tokens): {tokens_bert}")

### Question: Tokenization Choices

Look at the outputs above and think about:

1. **Why do GPT-2 and BERT tokenize the same text differently?** (Hint: they use different algorithms, byte-level BPE vs. WordPiece, and their vocabularies were built from different data)

2. **`bert-base-uncased` lowercases everything** (notice "machine" vs "Machine"). What are the trade-offs?
   - Advantage: "Machine" and "machine" become the same token
   - Disadvantage: We lose information (acronyms like "DNA" become "dna")

3. **For scientific text**, which tokenizer seems to handle domain terms better? Why might this matter for your research?

4. **Token count matters!** Models have a maximum context length (e.g., 512 or 4096 tokens). If your text is tokenized into more pieces, you can fit less content. How might this affect working with scientific papers?
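
One common way to deal with the context limit is to split a long sequence of token IDs into overlapping windows. The sketch below works on a plain list of integers. The function name `chunk_token_ids`, the window size, and the overlap are hypothetical choices for illustration. With a real tokenizer the IDs would come from something like `tokenizer_bert(text)["input_ids"]`, and special tokens would also take up part of the budget.

```python
def chunk_token_ids(token_ids, max_len, overlap=0):
    """Split a list of token IDs into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that context is not cut
    abruptly at a window boundary.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

# Stand-in for the IDs of a long document (made-up values).
fake_ids = list(range(1200))
chunks = chunk_token_ids(fake_ids, max_len=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

A tokenizer that splits scientific terms into many pieces produces more IDs for the same paper, and therefore more windows to process.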

In [None]:
# Visual comparison of tokenization
text = "The CRISPR-Cas9 system enables precise DNA editing."

fig, axes = plt.subplots(2, 1, figsize=(12, 4))

# GPT-2
plt.sca(axes[0])
visualize_tokens(text, tokenizer_gpt2)
axes[0].set_title("GPT-2 Tokenization", fontsize=11, fontweight='bold')

# BERT
plt.sca(axes[1])
visualize_tokens(text, tokenizer_bert)
axes[1].set_title("BERT Tokenization", fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nGPT-2: {len(tokenizer_gpt2.tokenize(text))} tokens")
print(f"BERT:  {len(tokenizer_bert.tokenize(text))} tokens")

## From Tokens to Token IDs

Each tokenizer has a fixed **vocabulary** - a dictionary mapping tokens to integer IDs:

| Model | Vocabulary Size |
|-------|-----------------|
| GPT-2 | 50,257 tokens |
| BERT | 30,522 tokens |

When you tokenize text, each token gets a unique ID from the vocabulary:
```
"Machine"  → 33423
"learning" → 4673
" research" → 2267
```

**The Problem: Token IDs are just arbitrary integers!**

Is token 33423 similar to token 33424? We have no idea - they're just numbers in a lookup table. The number itself carries no meaning.

**This is where embeddings come in →**
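
Mechanically, an embedding layer is a lookup table: the token ID selects one row of a matrix. The sketch below uses a tiny random matrix and made-up IDs. In a real model the matrix has one row per vocabulary entry and its values are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 10, 4   # tiny, made-up sizes
# Random values here; in a trained model these rows are learned.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [4, 7, 2, 4]        # made-up IDs; the first and last token are the same

# Lookup: each ID selects one row of the matrix.
vectors = embedding_matrix[token_ids]

print(vectors.shape)                            # (4, 4)
print(np.array_equal(vectors[0], vectors[3]))   # True: same ID, same vector
```

The ID is only an index. Any notion of similarity lives in the rows of the matrix, not in the integers.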

---

# Part 2: Embeddings - Meaning Becomes Geometry

The key insight of modern NLP:

> **Convert tokens into vectors where similar things are close together.**

```
"King"   → [0.2, -0.4, 0.8, 0.1, ...] (384 numbers)
"Queen"  → [0.3, -0.3, 0.7, 0.2, ...] (384 numbers)  ← Close to "King"!
"Banana" → [-0.5, 0.9, -0.2, 0.6, ...] (384 numbers)  ← Far from both
```

This is the magic of **embeddings**: meaning becomes geometry.

## The Naive Approach: One-Hot Encoding

The simplest way to represent words as vectors:

In [None]:
# One-hot encoding: each word is a vector with one "1" and rest "0"s
one_hot = {
    "cat":  np.array([1, 0, 0, 0, 0]),
    "dog":  np.array([0, 1, 0, 0, 0]),
    "fish": np.array([0, 0, 1, 0, 0]),
    "bird": np.array([0, 0, 0, 1, 0]),
    "tree": np.array([0, 0, 0, 0, 1]),
}

print("One-hot representations:")
for word, vec in one_hot.items():
    print(f"  {word:6} → {vec}")

In [None]:
# What's the distance between words?
cat, dog, tree = one_hot["cat"], one_hot["dog"], one_hot["tree"]

print(f"Distance 'cat' ↔ 'dog':  {np.linalg.norm(cat - dog):.2f}")
print(f"Distance 'cat' ↔ 'tree': {np.linalg.norm(cat - tree):.2f}")
print()
print("Problem: ALL words are equally distant from each other!")
print("We lose the semantic relationship: cat and dog are both animals.")

## The Better Approach: Dense Embeddings

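Instead of sparse one-hot vectors, we use **dense vectors** in which related words get similar coordinates. In the toy example below each dimension has a hand-picked meaning; in learned embeddings, individual dimensions are usually not interpretable on their own: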

In [None]:
# Conceptual example of learned embeddings
# (Real embeddings have hundreds of dimensions, learned from data)
#
#                [is_animal, has_fur, can_fly, is_pet]
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.0, 0.7]),
    "dog":   np.array([0.9, 0.9, 0.0, 0.9]),
    "fish":  np.array([0.7, 0.0, 0.0, 0.4]),
    "bird":  np.array([0.8, 0.0, 0.9, 0.5]),
    "tree":  np.array([0.0, 0.0, 0.0, 0.0]),
}

# Now distances reflect semantic similarity!
cat, dog, tree = embeddings["cat"], embeddings["dog"], embeddings["tree"]

print("With dense embeddings:")
print(f"  Distance 'cat' ↔ 'dog':  {np.linalg.norm(cat - dog):.2f}  (close - both pets!)")
print(f"  Distance 'cat' ↔ 'tree': {np.linalg.norm(cat - tree):.2f}  (far - different things)")

### Question 1

Looking at the conceptual embeddings above:
1. Why are "cat" and "dog" close in this embedding space?
2. What other words would you expect to be close to "bird"?

---

# Part 3: Cosine Similarity - Measuring Closeness

To measure similarity between embeddings, we use **cosine similarity**:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$

- Depends only on the **angle** between vectors, not on their lengths or the distance between them
- Range: -1 (opposite) to +1 (identical direction)
- Scale-invariant: `[1, 2, 3]` and `[2, 4, 6]` have similarity = 1.0
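
As a worked check of the scale-invariance claim, take $A = [1, 2, 3]$ and $B = [2, 4, 6] = 2A$:

$$A \cdot B = 1 \cdot 2 + 2 \cdot 4 + 3 \cdot 6 = 28, \qquad \|A\| = \sqrt{14}, \qquad \|B\| = \sqrt{56} = 2\sqrt{14}$$

$$\text{cosine\_similarity}(A, B) = \frac{28}{\sqrt{14} \cdot 2\sqrt{14}} = \frac{28}{28} = 1$$

The Euclidean distance between the same two vectors is $\|A - B\| = \|A\| = \sqrt{14} \approx 3.74$, so they point in exactly the same direction while being far apart as points.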

In [None]:
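def cosine_sim(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # "tree" is the zero vector in the toy embeddings: its angle to any other
    # vector is undefined, so return 0.0 instead of dividing by zero (NaN).
    return np.dot(a, b) / denom if denom > 0 else 0.0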

# Compare all pairs
words = list(embeddings.keys())
print("Cosine similarities:\n")

for i, w1 in enumerate(words):
    for w2 in words[i+1:]:
        sim = cosine_sim(embeddings[w1], embeddings[w2])
        bar = "█" * int(max(0, sim) * 20)
        print(f"  {w1:5} ↔ {w2:5}: {sim:+.2f}  {bar}")

---

# Part 4: The Power of Pre-trained Models

The conceptual embeddings above were hand-crafted. In practice, we use models that **learned embeddings from massive datasets**.

## Why This Matters for Scientists

| Your Situation | Pre-trained Model Advantage |
|----------------|----------------------------|
| Small dataset (100s of samples) | Model learned from billions of examples |
| Limited compute | Training already done for you |
| Domain-specific task | General language understanding transfers |

**Key insight**: You can leverage the knowledge captured by models trained on massive datasets, even if your own dataset is small.
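
One way this can play out with a small labelled dataset: keep the pre-trained model frozen, embed your few examples, and fit something very simple on top. The sketch below uses a nearest-centroid classifier on made-up 3-D vectors that stand in for real embeddings. The vectors, labels, and function names are all hypothetical.

```python
import numpy as np

# Stand-in for embeddings from a frozen pre-trained model (made-up 3-D vectors;
# a real sentence-embedding model would return hundreds of dimensions).
train_vectors = np.array([
    [0.9, 0.1, 0.0],   # label 0
    [0.8, 0.2, 0.1],   # label 0
    [0.1, 0.9, 0.2],   # label 1
    [0.0, 0.8, 0.3],   # label 1
])
train_labels = np.array([0, 0, 1, 1])

def normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One centroid per class: this is the only "training" step.
classes = np.unique(train_labels)
centroids = np.stack([train_vectors[train_labels == c].mean(axis=0) for c in classes])

def predict(vectors):
    """Assign each vector to the class whose centroid has the highest cosine similarity."""
    sims = normalize(vectors) @ normalize(centroids).T
    return classes[sims.argmax(axis=1)]

new_vectors = np.array([[0.7, 0.3, 0.0], [0.2, 0.7, 0.4]])
print(predict(new_vectors))  # [0 1]
```

With only four labelled examples there is nothing to learn from scratch; the heavy lifting is assumed to have been done by whatever produced the vectors.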

Let's use a real model: `all-MiniLM-L6-v2` (trained on 1+ billion sentence pairs)

In [None]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

print(f"Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"\nThis model was trained on over 1 billion sentence pairs.")
print("You get to use all that learning for free!")

In [None]:
# Embed some sentences
sentences = [
    "The cat is sleeping on the couch.",
    "A dog is resting on the sofa.",
    "Machine learning transforms data into insights.",
    "Deep learning models can recognize images.",
    "The weather is sunny today.",
]

# Use a new name so the toy `embeddings` dict from Part 2 is not overwritten
sentence_embeddings = model.encode(sentences)

print(f"Embedded {len(sentences)} sentences")
print(f"Each sentence → {sentence_embeddings.shape[1]} numbers")
print(f"\nFirst embedding (first 8 values):")
print(f"  {sentence_embeddings[0][:8]}...")

## Visualizing Similarity

Let's check whether semantically similar sentences get similar embeddings:

In [None]:
# Compute similarity matrix
sim_matrix = cosine_similarity(sentence_embeddings)

# Create short labels
labels = [s[:25] + "..." if len(s) > 25 else s for s in sentences]

fig = plot_similarity_matrix(sim_matrix, labels, title="Sentence Similarities")
plt.show()

### Question 2

Look at the similarity matrix:
1. Which two sentences are most similar? Why?
2. Why is "The weather is sunny today" different from all others?
3. Notice that "cat/couch" and "dog/sofa" score as similar even though they share few words. What does this suggest about how the model treats near-synonyms?

---

# Part 5: Visualizing the Embedding Space

The embeddings from this model have 384 dimensions, which we cannot plot directly. We can project them to 2D using **PCA** to visualize clusters:

In [None]:
# More diverse sentences
sentences_diverse = [
    # Animals
    "The cat is sleeping.",
    "Dogs are loyal companions.",
    "Fish swim in the ocean.",
    # Technology
    "Computers process data quickly.",
    "Software engineers write code.",
    "Artificial intelligence is advancing.",
    # Nature
    "Mountains are covered in snow.",
    "The forest is full of trees.",
    "Rivers flow to the sea.",
]

categories = ["Animals"] * 3 + ["Technology"] * 3 + ["Nature"] * 3

# Compute embeddings and reduce to 2D
emb = model.encode(sentences_diverse)
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(emb)

print(f"Reduced {emb.shape[1]}D → 2D")
print(f"Variance explained: {sum(pca.explained_variance_ratio_)*100:.1f}%")

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(10, 7))

colors = {'Animals': 'red', 'Technology': 'blue', 'Nature': 'green'}

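# Iterate over `colors` for a stable legend order; `category` avoids reusing
# the name `cat`, which held a vector earlier in the notebook.
for category, color in colors.items():
    mask = np.array(categories) == category
    ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1],
               c=color, label=category, s=120, alpha=0.7)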

# Add labels
for i, sent in enumerate(sentences_diverse):
    short = sent[:18] + "..." if len(sent) > 18 else sent
    ax.annotate(short, (emb_2d[i, 0], emb_2d[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

ax.set_xlabel('PCA 1', fontsize=11)
ax.set_ylabel('PCA 2', fontsize=11)
ax.set_title('Embedding Space (2D projection)', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Question 3

1. Do sentences from the same category cluster together?
2. The two PCA components explain only part of the variance (check the percentage printed above; it is typically well under half). What does this tell us about the embedding space?
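
To explore the second question, it can help to look at how the explained variance accumulates over all components rather than just the first two. The sketch below computes this with a plain SVD on random stand-in data. To try it on the real embeddings, replace `X` with `emb` from the cell above. The ratio computed here should correspond to what scikit-learn reports as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 384))   # random stand-in for 9 sentence embeddings

Xc = X - X.mean(axis=0)         # PCA centers the data first
singular_values = np.linalg.svd(Xc, compute_uv=False)
ratio = singular_values**2 / np.sum(singular_values**2)   # explained variance ratio
cumulative = np.cumsum(ratio)

print(f"First two components: {cumulative[1] * 100:.1f}%")
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% of the variance:", n_90)
```

If many components are needed to reach most of the variance, a 2D plot is hiding a lot of structure, so points that overlap in the plot may still be well separated in the full space.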

---

# Part 6: Application - Semantic Search

One powerful application: **find similar items without keyword matching**.

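Unlike keyword search, semantic search compares the embedding of the query with the embedding of each item, so it can match texts that share meaning but no words.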
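
The internals of the `semantic_search` helper used below are not shown in this notebook. A function along these lines could be sketched as follows, assuming it ranks items by cosine similarity. The name `top_k_by_cosine` and the 2-D vectors are made up for illustration.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                           # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]     # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Made-up 2-D vectors standing in for the output of model.encode(...).
doc_vecs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([0.9, 0.1])

for idx, score in top_k_by_cosine(query_vec, doc_vecs, k=2):
    print(idx, round(score, 3))
```

Because the database embeddings can be computed once and reused, each new query costs one call to the model plus a matrix-vector product.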

In [None]:
# A "database" of scientific topics
database = [
    "CRISPR gene editing technology for treating genetic diseases",
    "Protein folding prediction using deep learning models",
    "Climate change impact on marine ecosystems",
    "Quantum computing algorithms for optimization problems",
    "Drug discovery using molecular simulations",
    "Machine learning for analyzing medical images",
    "Renewable energy storage in batteries",
    "Neuroimaging techniques for brain research",
]

# Pre-compute embeddings for the database
db_embeddings = model.encode(database)

print(f"Database: {len(database)} items, each embedded as {db_embeddings.shape[1]}D vector")

In [None]:
# Search!
query = "How can AI help with biology?"

results = semantic_search(query, database, db_embeddings, model, top_k=3)
print_search_results(query, results)

### Exercise: Try Your Own Search

Modify the query below. Try queries related to your research!

In [None]:
# TODO: Try different queries!
my_query = "neural networks for medical diagnosis"  # <-- Modify this!

results = semantic_search(my_query, database, db_embeddings, model, top_k=3)
print_search_results(my_query, results)

### Question 4

1. Try a query that doesn't use any exact words from the database. Does it still work?
2. Why does semantic search work without keyword matching?
3. Can you think of cases where it might fail?

---

# Summary

## Key Takeaways

1. **The pipeline**: Text → Tokens → Token IDs → **Embeddings**

2. **Embeddings** are dense vectors where similar things are close
   - One-hot: all words equidistant (bad)
   - Dense embeddings: capture semantic similarity (good)

3. **Pre-trained models** let you leverage knowledge from massive datasets
   - Even with small data, you benefit from billions of training examples
   - This is the power of **transfer learning**

4. **Cosine similarity** measures how close embeddings are

5. **Applications**: semantic search, clustering, classification, visualization

## The Key Insight

> **Meaning becomes geometry.** With good embeddings, reasoning about meaning becomes reasoning about distances and directions in vector space.


---

## Reflection Questions

Before moving on to TP1 - Part 2, think about:

1. **In your research domain**, what data could be embedded? (text, sequences, structures?)

2. **What would "similarity" mean** for your data? What should be close in embedding space?

3. **How could pre-trained models help** your research? Do you have limited data? Limited compute?