# TP1 - Part 1: Tokenization & Embeddings

**Day 2 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day2/tp1_part1.ipynb)

---

## Objectives

By the end of this practical, you will understand:

1. **The complete pipeline**: Text → Tokens → Token IDs → Embeddings
2. **What embeddings are**: Dense vector representations where similar things are close
3. **Why pre-trained models matter**: Leveraging knowledge from massive datasets
4. **How to use embeddings**: Similarity search and visualization

---

# Part 1: The Pipeline - From Text to Vectors

Machine Learning models only understand numbers. To use text data (or molecules, proteins, DNA), we need to convert it to numbers.

```
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐
│  Raw Text   │ →  │   Tokens     │ →  │  Token IDs   │ →  │    Embeddings    │
│  (string)   │    │  (subwords)  │    │  (integers)  │    │ (dense vectors)  │
└─────────────┘    └──────────────┘    └──────────────┘    └──────────────────┘

"The cat sat"  →  ["The", "cat", "sat"] →  [464, 3797, 3332]  →  [[0.1, -0.2, ...], 
                                                                   [0.3, 0.5, ...],
                                                                   [-0.1, 0.4, ...]]
```

In this practical, we'll explore **tokenization** (how text is split) and then focus on **embeddings** (how tokens become meaningful vectors).

## Setup

In [None]:
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q transformers sentence-transformers

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

from aiforscience import (
    visualize_tokens,
    plot_similarity_matrix,
    plot_embeddings_pca,
    semantic_search,
    print_search_results,
)

print("Setup complete!")

## Tokenization: Breaking Text into Pieces

Before we can process text, we need to **tokenize** it - break it into units.

### Three Tokenization Strategies

| Strategy | How it works | Example: "unhappiness" | Pros/Cons |
|----------|--------------|------------------------|-----------|
| **Character** | Split into characters | `['u','n','h','a','p','p','i','n','e','s','s']` | ✓ Small vocab, ✗ Very long sequences |
| **Word** | Split by spaces/punctuation | `['unhappiness']` | ✓ Meaningful units, ✗ Huge vocab, can't handle new words |
| **Subword** | Split into common subparts | `['un', 'happiness']` | ✓ Balanced! Handles new words, reasonable vocab |

**Modern models use subword tokenization** (like BPE - Byte Pair Encoding) because:
- Frequent words stay whole: "the", "cat"
- Rare words are split: "transformers" → "transform" + "ers"
- Can handle any word, even typos: "looooong" → "l" + "oo" + "oo" + "ong"
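
To make the idea concrete, here is a minimal sketch of subword splitting by greedy longest-match against a tiny, hand-picked vocabulary. It is an illustration only: `TOY_VOCAB` and `toy_subword_tokenize` are hypothetical names, and real BPE tokenizers learn their merge rules from data, so the splits produced by actual models will differ.

```python
# Toy vocabulary (hypothetical): a few frequent words and word pieces,
# plus every single letter as a fallback so any lowercase word can be split.
TOY_VOCAB = {"the", "cat", "un", "happiness", "transform", "ers", "oo", "ong"}
TOY_VOCAB |= set("abcdefghijklmnopqrstuvwxyz")


def toy_subword_tokenize(word, vocab=TOY_VOCAB):
    """Split a lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink until a piece matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Character not covered by the vocabulary: emit an unknown marker.
            pieces.append("<unk>")
            start += 1
    return pieces


for w in ["cat", "unhappiness", "transformers", "looooong"]:
    print(f"{w:13} -> {toy_subword_tokenize(w)}")
```

Frequent words stay whole, rarer words fall back to smaller pieces, and in the worst case a word is spelled out letter by letter.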

## Tokenizers Should Preserve Data Structure

A good tokenizer splits data into **meaningful units** that preserve its structure.

### Natural Language Has Structure

Even "simple" text has structure a tokenizer must handle:

| Structure | Examples | Tokenizer choices |
|-----------|----------|-------------------|
| **Case** | "Cat" vs "cat" vs "CAT" | Keep case? Lowercase all? |
| **Alphabets** | Latin, Cyrillic (Привет), Chinese (你好), Arabic (مرحبا) | Which scripts to support? |
| **Punctuation** | "don't", "e-mail", "Ph.D." | Split on punctuation? Keep together? |
| **Morphology** | "un-believe-able", "run-ning" | Recognize prefixes/suffixes? |
| **Spaces** | English uses spaces, Chinese doesn't | How to find word boundaries? |

Different tokenizers make different choices. For example:
- **BERT uncased** lowercases everything: "DNA" → "dna"
- **GPT-2** preserves case: "DNA" stays "DNA"

Let's compare how different tokenizers handle the same text!

In [None]:
from transformers import AutoTokenizer

# GPT-2: case-sensitive, BPE tokenization
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")

# BERT uncased: lowercases everything, WordPiece tokenization
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")

# TODO: Choose your own tokenizer from https://huggingface.co/models
# Examples: "mistralai/Mistral-7B-v0.1", "deepseek-ai/DeepSeek-V3.2"
your_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # <-- Modify this! (defaults to GPT-2 so the cell runs)

print("=" * 60)
print("TOKENIZER COMPARISON")
print("=" * 60)
print(f"\n{'Model':<25} {'Vocab Size':<15} {'Case':<15}")
print("-" * 55)
print(f"{'GPT-2':<25} {len(tokenizer_gpt2):<15,} {'Preserved':<15}")
print(f"{'BERT uncased':<25} {len(tokenizer_bert):<15,} {'Lowercased':<15}")
print(f"{'Your choice':<25} {len(your_tokenizer):<15,}")

In [None]:
# Compare tokenizers on text with different structures
text_examples = [
    "The CRISPR-Cas9 system edits DNA.",  # Uppercase acronyms
    "COVID-19 vaccine efficacy study",     # Numbers and hyphens
    "Dr. Smith's research paper",          # Punctuation and possessives
]

print("=" * 70)
print("TOKENIZING TEXT")
print("=" * 70)

for text in text_examples:
    tokens_gpt2 = tokenizer_gpt2.tokenize(text)
    tokens_bert = tokenizer_bert.tokenize(text)
    tokens_yours = your_tokenizer.tokenize(text)
    
    print(f"\nText: '{text}'")
    print(f"  GPT-2  ({len(tokens_gpt2):2} tokens): {tokens_gpt2}")
    print(f"  BERT   ({len(tokens_bert):2} tokens): {tokens_bert}")
    print(f"  Yours  ({len(tokens_yours):2} tokens): {tokens_yours}")

### Question: Analyzing Tokenizer Outputs

Look at the outputs above and think about:

1. **Case sensitivity**: How does BERT handle "CRISPR" and "DNA"? What information is lost?

2. **Special characters**: How do the tokenizers handle hyphens in "CRISPR-Cas9" and "COVID-19"?

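3. **Subword splits**: Notice `Ġ` (GPT-2) and `##` (BERT). `Ġ` marks a token that starts a new word (it encodes the preceding space), while `##` marks a token that continues the previous word. Why encode this?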

4. **Trade-offs**: BERT uncased has a smaller vocabulary but loses case. When might this matter for scientific text?

5. **For your research**: What text structure is important in your domain? Would lowercasing hurt?

In [None]:
# Visual comparison
text = "The CRISPR-Cas9 system enables precise DNA editing."

print(f"Text: '{text}'\n")

print("GPT-2:")
visualize_tokens(text, tokenizer_gpt2)

print("\nBERT uncased:")
visualize_tokens(text, tokenizer_bert)

print("\nYours:")
visualize_tokens(text, your_tokenizer)

print(f"\nToken counts: GPT-2={len(tokenizer_gpt2.tokenize(text))}, BERT={len(tokenizer_bert.tokenize(text))}, Yours={len(your_tokenizer.tokenize(text))}")

In [None]:
# Try your own text!
your_text = "Your scientific text here"  # <-- Modify this!

print(f"Your text: '{your_text}'\n")

print("GPT-2:")
visualize_tokens(your_text, tokenizer_gpt2)

print("\nBERT uncased:")
visualize_tokens(your_text, tokenizer_bert)

print("\nYours:")
visualize_tokens(your_text, your_tokenizer)

print(f"\nToken counts: GPT-2={len(tokenizer_gpt2.tokenize(your_text))}, BERT={len(tokenizer_bert.tokenize(your_text))}, Yours={len(your_tokenizer.tokenize(your_text))}")

## From Tokens to Token IDs

Each tokenizer has a fixed **vocabulary** - a dictionary mapping tokens to integer IDs:

| Model | Vocabulary Size | Tokenization |
|-------|-----------------|--------------|
| GPT-2 | ~50,000 tokens | BPE |
| BERT uncased | ~30,000 tokens | WordPiece |

When you tokenize, each token gets a unique ID from the vocabulary:
```
"Machine"  → 33423
"learning" → 4673
```
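
The lookup itself can be sketched with a plain dictionary. The tokens and IDs below are made up for illustration (they are not taken from any real tokenizer), and `encode` / `decode` are hypothetical helper names.

```python
# Hypothetical mini-vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "The": 1, "cat": 2, "sat": 3, "dog": 4}
# Inverse table: integer ID -> token string.
id_to_token = {idx: tok for tok, idx in vocab.items()}


def encode(tokens):
    """Map tokens to IDs; tokens missing from the vocabulary map to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def decode(ids):
    """Map IDs back to token strings."""
    return [id_to_token[i] for i in ids]


ids = encode(["The", "cat", "sat"])
print(ids)                         # [1, 2, 3]
print(decode(ids))                 # ['The', 'cat', 'sat']
print(encode(["The", "axolotl"]))  # [1, 0]: unseen token falls back to <unk>
```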

**The Problem: Token IDs are just arbitrary integers!**

Is token 33423 similar to token 33424? We have no idea - they're just numbers in a lookup table.

**This is where embeddings come in →**

---

# Part 2: Embeddings - Meaning Becomes Geometry

The key insight of modern NLP:

> **Convert tokens into vectors where similar things are close together.**

```
"King"   → [0.2, -0.4, 0.8, 0.1, ...] (384 numbers)
"Queen"  → [0.3, -0.3, 0.7, 0.2, ...] (384 numbers)  ← Close to "King"!
"Banana" → [-0.5, 0.9, -0.2, 0.6, ...] (384 numbers)  ← Far from both
```

This is the magic of **embeddings**: meaning becomes geometry.

## The Naive Approach: One-Hot Encoding

The simplest way to represent words as vectors:

In [None]:
# One-hot encoding: each word is a vector with one "1" and rest "0"s
one_hot = {
    "cat":  np.array([1, 0, 0, 0, 0]),
    "dog":  np.array([0, 1, 0, 0, 0]),
    "fish": np.array([0, 0, 1, 0, 0]),
    "bird": np.array([0, 0, 0, 1, 0]),
    "tree": np.array([0, 0, 0, 0, 1]),
}

print("One-hot representations:")
for word, vec in one_hot.items():
    print(f"  {word:6} → {vec}")

In [None]:
# What's the distance between words?
cat, dog, tree = one_hot["cat"], one_hot["dog"], one_hot["tree"]

print(f"Distance 'cat' ↔ 'dog':  {np.linalg.norm(cat - dog):.2f}")
print(f"Distance 'cat' ↔ 'tree': {np.linalg.norm(cat - tree):.2f}")
print()
print("Problem: ALL words are equally distant from each other!")
print("We lose the semantic relationship: cat and dog are both animals.")

## The Better Approach: Dense Embeddings

Instead of sparse one-hot vectors, we use **dense vectors** where each dimension captures some aspect of meaning:

In [None]:
# Conceptual example of learned embeddings
# (Real embeddings have hundreds of dimensions, learned from data)
#
#                [is_living, is_animal, can_move, is_pet]
embeddings = {
    "cat":   np.array([0.9, 0.9, 0.8, 0.9]),
    "dog":   np.array([0.9, 0.9, 0.9, 0.9]),
    "fish":  np.array([0.9, 0.8, 0.7, 0.0]),
    "bird":  np.array([0.9, 0.8, 0.9, 0.3]),
    "tree":  np.array([0.9, 0.0, 0.0, 0.0]),
}

# Now distances reflect semantic similarity!
cat, dog, tree = embeddings["cat"], embeddings["dog"], embeddings["tree"]

print("With dense embeddings:")
print(f"  Distance 'cat' ↔ 'dog':  {np.linalg.norm(cat - dog):.2f}  (close - both pets!)")
print(f"  Distance 'cat' ↔ 'tree': {np.linalg.norm(cat - tree):.2f}  (far - different things)")

### Question 1

Looking at the conceptual embeddings above:
1. Why are "cat" and "dog" close in this embedding space?
2. What other words would you expect to be close to "bird"?

---

# Part 3: Cosine Similarity - Measuring Closeness

To measure similarity between embeddings, we use **cosine similarity**:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$

- Measures the **angle** between vectors (not distance)
- Range: -1 (opposite) to +1 (identical direction)
- Scale-invariant: `[1, 2, 3]` and `[2, 4, 6]` have similarity = 1.0
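
As a quick check of the scale-invariance claim, working through the formula by hand for $A = [1, 2, 3]$ and $B = [2, 4, 6]$ gives:

$$A \cdot B = 1\cdot 2 + 2\cdot 4 + 3\cdot 6 = 28, \qquad \|A\| = \sqrt{14}, \qquad \|B\| = \sqrt{56} = 2\sqrt{14}$$

$$\text{cosine\_similarity}(A, B) = \frac{28}{\sqrt{14}\cdot 2\sqrt{14}} = \frac{28}{28} = 1$$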

In [None]:
def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare all pairs
words = list(embeddings.keys())
print("Cosine similarities:\n")

for i, w1 in enumerate(words):
    for w2 in words[i+1:]:
        sim = cosine_sim(embeddings[w1], embeddings[w2])
        bar = "█" * int(max(0, sim) * 20)
        print(f"  {w1:5} ↔ {w2:5}: {sim:+.2f}  {bar}")

---

# Part 4: How Models Learn Embeddings

## Token IDs Are Categorical, Not Ordinal

Remember: token IDs are **arbitrary integers**. Token 2300 is not "less than" token 6500 - they're just different categories in a lookup table.

```
"cat"      → 3797    (just an ID, no meaning)
"dog"      → 3826    (close number ≠ close meaning!)
"computer" → 3644    (numerically near both, but unrelated semantically)
```

**In models, we treat token IDs as categorical data.** During training (classification, translation, reconstruction...), the model learns a **dense vector representation** for each category.

## The Embedding Layer: `nn.Embedding`

In PyTorch, `nn.Embedding` is simply a **learnable lookup table**:
- Input: token ID (integer)
- Output: dense vector (learned during training)

```
nn.Embedding(num_embeddings=50000, embedding_dim=128)
         ↓
Token ID 3797 → [0.12, -0.34, 0.56, ...] (128 numbers)
```

Initially, these vectors are **random**. Through training, the model adjusts them so that tokens used in similar contexts get similar embeddings.
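
One way to see why this is "just a lookup table": selecting row `i` of the embedding matrix gives the same result as multiplying a one-hot vector for token `i` by that matrix. The sketch below uses plain Python lists and a tiny, randomly initialized table. The sizes and helper names are illustrative assumptions, not PyTorch internals.

```python
import random

random.seed(0)  # reproducible "random initialization"

vocab_size, embedding_dim = 5, 3
# Embedding table: one row of `embedding_dim` numbers per token ID.
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def lookup(token_id):
    """Embedding lookup: simply pick the row for this token ID."""
    return table[token_id]


def one_hot_times_table(token_id):
    """Equivalent computation: one-hot vector (1 x V) times table (V x D)."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][d] for i in range(vocab_size))
            for d in range(embedding_dim)]


token_id = 2
print(lookup(token_id))
print(one_hot_times_table(token_id))  # same values as the lookup
```

The lookup avoids building the one-hot vector at all, which matters when the vocabulary has tens of thousands of entries.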

In [None]:
import torch
import torch.nn as nn

# Create an embedding layer (like in real models)
vocab_size = 50000   # Number of possible tokens
embedding_dim = 128  # Size of each embedding vector

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Look up embeddings for some token IDs
token_ids = torch.tensor([3797, 3826, 3644])  # "cat", "dog", "computer" (hypothetical)

# Get embeddings (random at initialization!)
embeddings = embedding_layer(token_ids)

print(f"Embedding layer: {vocab_size:,} tokens → {embedding_dim}D vectors")
print(f"\nToken IDs: {token_ids.tolist()}")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nEmbedding for token 3797 (first 10 values):")
print(f"  {embeddings[0, :10].detach().numpy().round(3)}")
print(f"\n⚠️  These are RANDOM - no meaning yet!")
print(f"   The model learns meaningful embeddings during training.")

## Leveraging Pre-trained Models

Training meaningful embeddings requires data, compute, and time. Instead, you can use **pre-trained models** that have already learned rich representations from large datasets.

## From Token Embeddings to Sequence Embeddings

Pre-trained models give embeddings for each **token**. To embed a whole **sequence** (sentence, document), a common approach is **mean pooling**:

```
"The cat sat" → tokens: ["The", "cat", "sat"]
             → embeddings: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
             → mean: [0.3, 0.4]  ← single vector for the sequence
```

This is what **Sentence Transformers** do (with some refinements).
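
The mean-pooling step can be sketched in plain Python. The optional `mask` argument is an assumption added here to show how padding tokens are usually excluded from the average. Real models operate on tensors and may apply further steps such as normalization.

```python
def mean_pool(token_embeddings, mask=None):
    """Average token vectors into one sequence vector.

    token_embeddings: list of equal-length vectors, one per token.
    mask: optional list of 0/1 flags; tokens with 0 (e.g. padding) are ignored.
    """
    if mask is None:
        mask = [1] * len(token_embeddings)
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    if not kept:
        raise ValueError("mean_pool needs at least one unmasked token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]    # "The", "cat", "sat"
print([round(x, 2) for x in mean_pool(tokens)])  # [0.3, 0.4]

# With a padding vector appended and masked out, the result is unchanged.
padded = tokens + [[0.0, 0.0]]
print([round(x, 2) for x in mean_pool(padded, mask=[1, 1, 1, 0])])  # [0.3, 0.4]
```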

## What Can You Do With Embeddings?

1. **Representation & visualization**: Understand your data structure
2. **Search**: Find similar items without keyword matching (see Part 6)
3. **Train a classifier**: Use embeddings as features for a smaller model (see TP1 Part 3)

Let's load a pre-trained model!

In [None]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model (trained on 1+ billion sentence pairs)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# TODO: Choose your own embedding model from https://huggingface.co/models?library=sentence-transformers
# Examples: "sentence-transformers/all-mpnet-base-v2", "BAAI/bge-small-en-v1.5"
your_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # <-- Modify this!

print(f"Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"\nYour model loaded!")
print(f"Your model embedding dimension: {your_model.get_sentence_embedding_dimension()}")

In [None]:
# Embed some sentences
sentences = [
    "The cat is sleeping on the couch.",
    "A dog is resting on the sofa.",
    "Machine learning transforms data into insights.",
    "Deep learning models can recognize images.",
    "The weather is sunny today.",
]

# Compare both models
embeddings = model.encode(sentences)
embeddings_yours = your_model.encode(sentences)

print(f"Embedded {len(sentences)} sentences")
print(f"\nDefault model: each sentence → {embeddings.shape[1]} numbers")
print(f"Your model:    each sentence → {embeddings_yours.shape[1]} numbers")
print(f"\nFirst embedding from default model (first 8 values):")
print(f"  {embeddings[0][:8]}...")
print(f"\nFirst embedding from your model (first 8 values):")
print(f"  {embeddings_yours[0][:8]}...")

## Visualizing Similarity

Let's see if the model understands that semantically similar sentences should have similar embeddings:

In [None]:
# Compute similarity matrices for both models
sim_matrix = cosine_similarity(embeddings)
sim_matrix_yours = cosine_similarity(embeddings_yours)

# Create short labels
labels = [s[:25] + "..." if len(s) > 25 else s for s in sentences]

# Default model
fig = plot_similarity_matrix(sim_matrix, labels, title="Default Model (all-MiniLM-L6-v2)")
plt.show()

# Your model
fig = plot_similarity_matrix(sim_matrix_yours, labels, title="Your Model", cmap='Greens')
plt.show()

### Question 2

Look at the similarity matrix:
1. Which two sentences are most similar? Why?
2. Why is "The weather is sunny today" different from all others?
3. Notice that the "cat/couch" and "dog/sofa" sentences score as similar even though they share almost no content words: the model captures related meanings such as "couch" ≈ "sofa".

---

# Part 5: Visualizing the Embedding Space

The default model's embeddings have 384 dimensions (your model's may differ). We can project them to 2D using **PCA** to visualize clusters:

In [None]:
# More diverse sentences
sentences_diverse = [
    # Animals
    "The cat is sleeping.",
    "Dogs are loyal companions.",
    "Fish swim in the ocean.",
    # Technology
    "Computers process data quickly.",
    "Software engineers write code.",
    "Artificial intelligence is advancing.",
    # Nature
    "Mountains are covered in snow.",
    "The forest is full of trees.",
    "Rivers flow to the sea.",
]

categories = ["Animals"] * 3 + ["Technology"] * 3 + ["Nature"] * 3

# Compute embeddings with both models
emb = model.encode(sentences_diverse)
emb_yours = your_model.encode(sentences_diverse)

print(f"Embedded {len(sentences_diverse)} sentences")
print(f"Default model: {emb.shape[1]}D embeddings")
print(f"Your model: {emb_yours.shape[1]}D embeddings")

In [None]:
# Visualize with PCA - Default model
short_labels = [s[:18] + "..." if len(s) > 18 else s for s in sentences_diverse]

fig = plot_embeddings_pca(emb, labels=short_labels, categories=categories, 
                          title="Default Model - Embedding Space (PCA)", annotate=True)
plt.show()

# Visualize with PCA - Your model
fig = plot_embeddings_pca(emb_yours, labels=short_labels, categories=categories, 
                          title="Your Model - Embedding Space (PCA)", annotate=True)
plt.show()

### Question 3

1. Do sentences from the same category cluster together in both models?
2. How do the clusters differ between the default model and your chosen model?
3. The first two principal components typically explain only ~30-40% of the variance. What does this tell us about the embedding space?

---

# Part 6: Application - Semantic Search

One powerful application: **find similar items without keyword matching**.

Unlike keyword search, semantic search understands meaning.
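
Conceptually, semantic search embeds the query, scores every database item by cosine similarity, and returns the best matches. The sketch below shows that ranking step on hand-made 3D vectors. It is not the implementation of the course's `semantic_search` helper (its internals are not shown in this notebook); `rank_by_cosine` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, items, item_vecs, top_k=3):
    """Return the top_k (item, score) pairs, most similar first."""
    scored = [(item, cosine(query_vec, vec)) for item, vec in zip(items, item_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (made up): dimensions loosely mean [biology, computing, climate].
items = ["gene editing", "protein folding with deep learning", "ocean warming"]
item_vecs = [[0.9, 0.1, 0.0], [0.7, 0.7, 0.0], [0.1, 0.0, 0.9]]
query_vec = [0.6, 0.8, 0.0]  # stand-in for an embedded query about AI and biology

for item, score in rank_by_cosine(query_vec, items, item_vecs, top_k=2):
    print(f"{score:.2f}  {item}")
```

Because the database embeddings are computed once and reused, each new query only costs one embedding call plus a similarity computation per item.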

In [None]:
# A "database" of scientific topics
database = [
    "CRISPR gene editing technology for treating genetic diseases",
    "Protein folding prediction using deep learning models",
    "Climate change impact on marine ecosystems",
    "Quantum computing algorithms for optimization problems",
    "Drug discovery using molecular simulations",
    "Machine learning for analyzing medical images",
    "Renewable energy storage in batteries",
    "Neuroimaging techniques for brain research",
]

# Pre-compute embeddings for the database with both models
db_embeddings = model.encode(database)
db_embeddings_yours = your_model.encode(database)

print(f"Database: {len(database)} items")
print(f"Default model: {db_embeddings.shape[1]}D embeddings")
print(f"Your model: {db_embeddings_yours.shape[1]}D embeddings")

In [None]:
# Search with both models!
query = "How can AI help with biology?"

print("=" * 60)
print("DEFAULT MODEL RESULTS")
print("=" * 60)
results = semantic_search(query, database, db_embeddings, model, top_k=3)
print_search_results(query, results)

print("\n" + "=" * 60)
print("YOUR MODEL RESULTS")
print("=" * 60)
results_yours = semantic_search(query, database, db_embeddings_yours, your_model, top_k=3)
print_search_results(query, results_yours)

### Exercise: Try Your Own Search

Modify the query below. Try queries related to your research!

In [None]:
# TODO: Try different queries!
my_query = "neural networks for medical diagnosis"  # <-- Modify this!

print("=" * 60)
print("DEFAULT MODEL RESULTS")
print("=" * 60)
results = semantic_search(my_query, database, db_embeddings, model, top_k=3)
print_search_results(my_query, results)

print("\n" + "=" * 60)
print("YOUR MODEL RESULTS")
print("=" * 60)
results_yours = semantic_search(my_query, database, db_embeddings_yours, your_model, top_k=3)
print_search_results(my_query, results_yours)

### Question 4

1. Try a query that doesn't use any exact words from the database. Does it still work?
2. Do the two models return the same results? Why might they differ?
3. Why does semantic search work without keyword matching?
4. Can you think of cases where it might fail?

---

# Summary

## Key Takeaways

1. **The pipeline**: Text → Tokens → Token IDs → **Embeddings**

2. **Embeddings** are dense vectors where similar things are close
   - One-hot: all words equidistant (bad)
   - Dense embeddings: capture semantic similarity (good)

3. **Pre-trained models** let you leverage knowledge from massive datasets
   - Even with small data, you benefit from billions of training examples
   - This is the power of **transfer learning**

4. **Cosine similarity** measures how close embeddings are

5. **Applications**: semantic search, clustering, classification, visualization

## The Key Insight

> **Meaning becomes geometry.** With good embeddings, reasoning about meaning becomes reasoning about distances and directions in vector space.


---

## Reflection Questions

Before moving to TP1 Part 2, think about:

1. **In your research domain**, what data could be embedded? (text, sequences, structures?)

2. **What would "similarity" mean** for your data? What should be close in embedding space?

3. **How could pre-trained models help** your research? Do you have limited data? Limited compute?