# TP1: Transformers - From Tokens to Language Understanding

**Day 1 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day1/tp2.ipynb)

## Objectives
1. Understand how text is converted to numbers (tokenization)
2. Explore the attention mechanism - the core of Transformers
3. Use pre-trained models from Hugging Face
4. Visualize word embeddings and their semantic relationships
5. See how Transformers extend beyond text to molecules, signals, and more

## Setup

Run the cell below to install and import the required packages.

In [None]:
# Install required packages
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q transformers

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

from aiforscience import (
    plot_attention_weights,
    plot_embeddings_2d,
    print_tokenization,
    print_model_summary,
)

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")

---
# Part 1: Text Representation - Tokenization

**Key Question:** How do we convert text (strings) into numbers that neural networks can process?

## The Problem

Neural networks work with numbers, not text. We need a way to:
1. Break text into meaningful units (tokens)
2. Convert each token to a unique number (token ID)
3. Optionally convert IDs to dense vectors (embeddings)

```
"Hello world" → ["Hello", "world"] → [15496, 995] → [[0.1, -0.3, ...], [0.4, 0.2, ...]]
```
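The three steps above can be traced end to end on a toy example. This is a minimal sketch: the vocabulary, the `<unk>` entry, the `encode` helper, and the random embedding table are all hypothetical stand-ins, not what a real tokenizer or model uses.

```python
import random

# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
vocab = {"<unk>": 0, "hello": 1, "world": 2}

def encode(text):
    """Split on whitespace and map each token to its ID (unknown words -> <unk>)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

# Toy embedding table: one small random vector per vocabulary entry.
rng = random.Random(0)
embed_dim = 4
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in vocab]

token_ids = encode("Hello world")
vectors = [embedding_table[i] for i in token_ids]

print(token_ids)                      # IDs from the toy vocabulary
print(len(vectors), len(vectors[0]))  # number of tokens, embedding dimension
```

In a trained model the embedding table is learned, so the vectors carry meaning instead of being random.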

## Simple Tokenization Strategies

Let's explore different ways to tokenize text:

In [None]:
text = "The quick brown fox jumps over the lazy dog."

# Strategy 1: Character-level tokenization
char_tokens = list(text)
print("Character-level tokenization:")
print(f"  Text: '{text}'")
print(f"  Tokens: {char_tokens}")
print(f"  Number of tokens: {len(char_tokens)}")
print(f"  Vocabulary size: {len(set(char_tokens))} unique characters")

print("\n" + "="*60 + "\n")

# Strategy 2: Word-level tokenization
word_tokens = text.lower().replace('.', '').split()
print("Word-level tokenization:")
print(f"  Text: '{text}'")
print(f"  Tokens: {word_tokens}")
print(f"  Number of tokens: {len(word_tokens)}")

## Trade-offs

| Strategy | Vocabulary Size | Sequence Length | Handles Unknown Words? |
|----------|----------------|-----------------|------------------------|
| Character | ~100 (small) | Very long | Yes (all chars known) |
| Word | Very large | Short | No (OOV problem) |
| **Subword (BPE)** | Medium (~50K) | Medium | Yes (breaks into subwords) |

**Most modern language models use subword tokenization** (such as BPE or WordPiece), a compromise between character-level and word-level tokenization.
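To make the subword idea concrete, here is a minimal sketch of how BPE learns its merges: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. The toy corpus and the helper names `most_frequent_pair` and `merge_pair` are assumptions for illustration; production tokenizers add byte-level handling, pre-tokenization, and many more merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_words[key] = merged_words.get(key, 0) + freq
    return merged_words

# Hypothetical toy corpus: word -> frequency, each word split into characters.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

After a few merges, frequent words become single tokens while rarer ones stay split into reusable pieces, which is why subword vocabularies can represent unseen words.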

## Using Real Tokenizers

Let's use the tokenizer from GPT-2, which uses byte-level BPE. In the printed tokens, a leading `Ġ` marks a space before the word:

In [None]:
from transformers import AutoTokenizer

# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog."

# Tokenize
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("GPT-2 Tokenization:")
print(f"  Text: '{text}'")
print(f"  Tokens: {tokens}")
print(f"  Token IDs: {token_ids}")
print(f"  Number of tokens: {len(tokens)}")
print(f"  Vocabulary size: {tokenizer.vocab_size:,}")

In [None]:
# Visualize the tokenization
print_tokenization(text, tokenizer)

## Exercise 1: Explore Tokenization

Try different texts and observe how they are tokenized:
- Common words vs rare words
- Scientific terms (like "photosynthesis" or "mitochondria")
- Numbers and special characters
- Words in different languages

**Questions:**
1. How are rare/scientific words handled?
2. What happens with numbers?
3. Are common words single tokens?

In [None]:
# TODO: Try different texts!
test_texts = [
    "Hello world!",                          # Simple
    "Photosynthesis occurs in chloroplasts", # Scientific
    "The price is $123.45",                  # Numbers
    "café résumé naïve",                     # Accented characters
    "CRISPR-Cas9 gene editing",              # Technical
]

for text in test_texts:
    print_tokenization(text, tokenizer)
    print()

---
# Part 2: The Attention Mechanism

**Key Question:** How do Transformers understand relationships between words in a sentence?

## The Core Idea

Attention allows each word to "look at" other words in the sentence and decide which ones are most relevant.

For example, in:
> "The cat sat on the mat because **it** was tired."

The word "it" should attend to "cat" to understand what "it" refers to.

## Self-Attention Step by Step

Given an input sequence, self-attention computes:

1. **Query (Q)**: What am I looking for?
2. **Key (K)**: What do I contain?
3. **Value (V)**: What information do I provide?

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
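Before the PyTorch version, it can help to trace the formula by hand on tiny inputs. The sketch below uses plain Python lists and made-up values for Q, K, and V (three tokens, $d_k = 2$); it skips the learned projections and is meant only to show the arithmetic.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    # scores[i][j] = (Q[i] . K[j]) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Each row of weights sums to 1
    weights = [softmax(row) for row in scores]
    # output[i] = weighted average of the rows of V
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
               for j in range(len(V[0]))] for w_row in weights]
    return output, weights

# Hypothetical 3-token example with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how one token distributes its attention over all three tokens; the third query matches the third key best, so its largest weight falls there.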

In [None]:
# Let's build a simple self-attention mechanism
class SimpleSelfAttention(nn.Module):
    """A simplified self-attention layer."""
    
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Linear projections for Q, K, V
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
    
    def forward(self, x):
        """
        Args:
            x: Input tensor of shape (batch_size, seq_len, embed_dim)
        Returns:
            output: Attended output
            attention_weights: Attention weights for visualization
        """
        # Project to Q, K, V
        Q = self.query(x)  # (batch, seq_len, embed_dim)
        K = self.key(x)
        V = self.value(x)
        
        # Compute attention scores: Q @ K^T / sqrt(d_k)
        # Single head, so d_k equals embed_dim; `scale` holds sqrt(d_k)
        scale = self.embed_dim ** 0.5
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale  # (batch, seq_len, seq_len)
        
        # Apply softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights

# Count parameters
embed_dim = 64
attention = SimpleSelfAttention(embed_dim)
n_params = sum(p.numel() for p in attention.parameters())
print(f"Self-Attention Layer:")
print(f"  Embedding dimension: {embed_dim}")
print(f"  Total parameters: {n_params:,}")
print(f"    - Query projection: {embed_dim * embed_dim} (W_Q)")
print(f"    - Key projection: {embed_dim * embed_dim} (W_K)")
print(f"    - Value projection: {embed_dim * embed_dim} (W_V)")
print(f"    - Biases: {3 * embed_dim}")

## Visualizing Attention

Let's see how attention works on a simple sentence:

In [None]:
# Create a simple example
sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(sentence)

# Create random embeddings for our words (in practice, these come from an embedding layer)
torch.manual_seed(42)
x = torch.randn(1, seq_len, embed_dim)  # (batch=1, seq_len=6, embed_dim=64)

# Apply self-attention
attention = SimpleSelfAttention(embed_dim)
output, attn_weights = attention(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

# Visualize attention weights
plot_attention_weights(attn_weights[0].detach().numpy(), sentence, 
                       title="Self-Attention Weights (Random Init)")
plt.show()

## A Complete Transformer Block

A real Transformer layer combines:
1. **Multi-Head Attention**: Multiple attention heads learning different relationships
2. **Feed-Forward Network**: A small neural network applied to each position
3. **Layer Normalization**: Stabilizes training
4. **Residual Connections**: Help gradients flow

In [None]:
class TransformerBlock(nn.Module):
    """A single Transformer block."""
    
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        
        # Multi-head self-attention
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        
        # Feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.ff(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

# Create a Transformer block
block = TransformerBlock(embed_dim=64, num_heads=4, ff_dim=256)

# Count parameters
print_model_summary(block, "Transformer Block")
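The parameter count of this block can also be worked out by hand, which helps when reasoning about how models scale. The helper below is a sketch for the `TransformerBlock` defined above, assuming default biases everywhere; `transformer_block_params` is a hypothetical name, not part of any library.

```python
def transformer_block_params(embed_dim, ff_dim):
    """Parameter count for one block as defined above (weights + biases)."""
    d = embed_dim
    # Q, K, V and output projections: four d x d matrices plus four bias vectors
    attention = 4 * d * d + 4 * d
    # Two linear layers: d -> ff_dim and ff_dim -> d, each with a bias
    feed_forward = (d * ff_dim + ff_dim) + (ff_dim * d + d)
    # Two LayerNorms, each with a scale and a shift vector of size d
    layer_norms = 2 * (2 * d)
    return attention + feed_forward + layer_norms

print(f"{transformer_block_params(64, 256):,}")  # 49,984
```

Note that `num_heads` does not appear: the heads split the same projection matrices, so changing the number of heads leaves the parameter count unchanged. Dropout has no parameters either.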

## Why Transformers Work Well for Sequences

| Property | RNNs/LSTMs | Transformers |
|----------|------------|---------------|
| Parallelization | Sequential over time steps (slow) | Parallel across positions during training (fast) |
| Long-range dependencies | Difficult (vanishing gradients) | Easy (direct attention) |
| Training speed | Slow | Fast (on GPU) |
| Memory | O(1) state per step | O(n²) in sequence length n for attention |
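The quadratic memory cost in the last row is easy to quantify. This back-of-the-envelope sketch counts only the attention weight matrix for a single head and a single sequence, assuming 4-byte float32 values; real implementations may use lower precision or avoid materializing the full matrix.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len attention matrix (float32 by default)."""
    return seq_len * seq_len * bytes_per_value

for n in [512, 1024, 4096, 16384]:
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n = {n:>6}: {mib:>8.1f} MiB per head, per sequence")
```

Doubling the sequence length quadruples the memory, which is why long inputs are costly for standard attention.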

## Exercise 2: Modify the Transformer Block

Try changing the hyperparameters and observe how the model size changes:
- `embed_dim`: Try 128, 256, 512
- `num_heads`: Try 2, 4, 8 (must divide embed_dim)
- `ff_dim`: Try 512, 1024, 2048

**Question:** How does GPT-3 (175B parameters) achieve its size?

In [None]:
# TODO: Modify these parameters
embed_dim = 64   # <-- Try 128, 256, 512
num_heads = 4    # <-- Try 2, 4, 8
ff_dim = 256     # <-- Try 512, 1024, 2048

block = TransformerBlock(embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim)
print_model_summary(block, f"Transformer Block (d={embed_dim}, h={num_heads}, ff={ff_dim})")

---
# Part 3: Using Pre-trained Models from Hugging Face

**Key Insight:** Training a large language model from scratch requires massive compute (one widely cited estimate puts a single GPT-3 training run at roughly $4.6M). Instead, we can start from **pre-trained models**.

Hugging Face Hub hosts thousands of pre-trained models: https://huggingface.co/models

## Loading a Pre-trained GPT-2 Model

GPT-2 is a decoder-only Transformer trained to predict the next token.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained GPT-2 (small version: 124M parameters)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Model info
n_params = sum(p.numel() for p in model.parameters())
print(f"Model: {model_name}")
print(f"Parameters: {n_params:,} ({n_params/1e6:.1f}M)")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")

In [None]:
# Look at the model architecture
print(model)

## Text Generation with GPT-2

GPT-2 generates text by predicting the next token, then using that to predict the next, and so on.

In [None]:
def generate_text(prompt, max_new_tokens=50, temperature=0.7):
    """
    Generate text continuation from a prompt.
    
    Args:
        prompt: Starting text
        max_new_tokens: Maximum number of new tokens to generate
        temperature: Higher = more varied output, lower = closer to deterministic
    """
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt")
    
    # Generate (unpacking **inputs also passes the attention mask)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode and return
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

# Try it!
prompt = "The future of artificial intelligence is"
print(f"Prompt: '{prompt}'")
print("\nGenerated text:")
print("-" * 50)
print(generate_text(prompt))

## Exercise 3: Experiment with Text Generation

Try different prompts and parameters:

**Prompts to try:**
- Scientific: "The experiment showed that"
- Creative: "Once upon a time"
- Technical: "To implement a neural network,"

**Parameters:**
- `temperature`: 0.1 (conservative) to 1.5 (creative)
- `max_new_tokens`: 20, 50, 100

**Questions:**
1. How does temperature affect the output?
2. What are the limitations of GPT-2?

In [None]:
# TODO: Try your own prompts!
prompt = "In machine learning, the most important concept is"  # <-- Modify this!
temperature = 0.7  # <-- Try 0.1, 0.5, 1.0, 1.5
max_tokens = 50    # <-- Try 20, 50, 100

print(f"Prompt: '{prompt}'")
print(f"Temperature: {temperature}")
print("\nGenerated:")
print("-" * 50)
print(generate_text(prompt, max_new_tokens=max_tokens, temperature=temperature))

In [None]:
# Compare different temperatures
prompt = "The best way to learn programming is"

for temp in [0.1, 0.7, 1.5]:
    print(f"\n{'='*60}")
    print(f"Temperature = {temp}")
    print(f"{'='*60}")
    print(generate_text(prompt, max_new_tokens=40, temperature=temp))
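To see why temperature changes the output, it helps to look at the probabilities directly. The sketch below applies standard temperature scaling (divide the logits by the temperature, then take a softmax) to made-up logits for four candidate tokens; the values and the helper name `softmax_with_temperature` are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
logits = [2.0, 1.0, 0.5, 0.0]

for temp in [0.1, 0.7, 1.5]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T = {temp}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the top token, so sampling behaves nearly like greedy decoding. At high temperature the distribution flattens and less likely tokens are sampled more often.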

---
# Part 4: Word Embeddings

**Key Question:** How do neural networks represent the meaning of words?

## From IDs to Vectors

Each token ID is mapped to a dense vector (embedding). These embeddings capture semantic relationships:

- Similar words have similar embeddings
- Relationships can show up as directions in the vector space; the classic word2vec example is king - man + woman ≈ queen

In [None]:
# Get the embedding layer from GPT-2
embedding_layer = model.transformer.wte  # word token embeddings

print(f"Embedding layer:")
print(f"  Vocabulary size: {embedding_layer.num_embeddings:,}")
print(f"  Embedding dimension: {embedding_layer.embedding_dim}")
print(f"  Total parameters: {embedding_layer.num_embeddings * embedding_layer.embedding_dim:,}")

In [None]:
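def get_word_embedding(word):
    """Get the input embedding for a word (averaged over its tokens).

    Note: GPT-2's tokenizer encodes "king" and " king" (with a leading space)
    as different tokens; here the word is encoded without a leading space.
    """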
    token_ids = tokenizer.encode(word, add_special_tokens=False)
    if len(token_ids) > 1:
        print(f"Note: '{word}' is split into {len(token_ids)} tokens")
    
    # Get embeddings for each token and average
    embeddings = embedding_layer(torch.tensor(token_ids))
    return embeddings.mean(dim=0).detach()

# Get embeddings for some words
words = ["king", "queen", "man", "woman", "cat", "dog"]
embeddings = {word: get_word_embedding(word) for word in words}

print("Embedding shapes:")
for word, emb in embeddings.items():
    print(f"  {word}: {emb.shape}")

## Measuring Similarity

Cosine similarity measures how similar two embeddings are:

In [None]:
def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors."""
    return F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)).item()

# Compare similarities
print("Cosine Similarities:")
print(f"  king - queen:  {cosine_similarity(embeddings['king'], embeddings['queen']):.4f}")
print(f"  king - man:    {cosine_similarity(embeddings['king'], embeddings['man']):.4f}")
print(f"  king - cat:    {cosine_similarity(embeddings['king'], embeddings['cat']):.4f}")
print(f"  cat - dog:     {cosine_similarity(embeddings['cat'], embeddings['dog']):.4f}")
print(f"  man - woman:   {cosine_similarity(embeddings['man'], embeddings['woman']):.4f}")
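The analogy king - man + woman ≈ queen can be illustrated with hand-made vectors where the geometry is visible. Everything below is a toy: the 2D vectors are invented so that one axis loosely stands for "royalty" and the other for "gender". Real embeddings have hundreds of dimensions and the relationship holds only approximately, if at all.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Find the word whose vector points in the direction closest to the target
best = max(toy, key=lambda word: cosine(toy[word], target))
print(target, "->", best)
```

Subtracting "man" and adding "woman" flips the gender axis while keeping the royalty axis, so the result lands on "queen".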

## Visualizing Embeddings in 2D

We can use dimensionality reduction (PCA or t-SNE) to visualize embeddings:

In [None]:
from sklearn.decomposition import PCA

# Get embeddings for related words
word_groups = {
    "Royalty": ["king", "queen", "prince", "princess", "crown"],
    "Animals": ["cat", "dog", "bird", "fish", "mouse"],
    "Science": ["physics", "chemistry", "biology", "math", "science"],
    "Programming": ["code", "program", "software", "computer", "algorithm"],
}

all_words = []
all_embeddings = []
all_categories = []

for category, words in word_groups.items():
    for word in words:
        emb = get_word_embedding(word)
        all_words.append(word)
        all_embeddings.append(emb.numpy())
        all_categories.append(category)

# Stack embeddings
embedding_matrix = np.stack(all_embeddings)
print(f"Embedding matrix shape: {embedding_matrix.shape}")

# Reduce to 2D with PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embedding_matrix)

# Visualize
plot_embeddings_2d(embeddings_2d, all_words, all_categories,
                   title="Word Embeddings Visualization (PCA)")
plt.show()

## Exercise 4: Explore Word Relationships

Try adding your own word groups and see how they cluster:
- Countries and cities
- Emotions (happy, sad, angry, ...)
- Scientific domains relevant to your research

**Question:** Do semantically similar words cluster together?

In [None]:
# TODO: Add your own word groups!
custom_groups = {
    "Countries": ["France", "Germany", "Italy", "Spain", "Japan"],  # <-- Modify!
    "Emotions": ["happy", "sad", "angry", "fear", "love"],           # <-- Modify!
    # Add more groups here!
}

all_words = []
all_embeddings = []
all_categories = []

for category, words in custom_groups.items():
    for word in words:
        emb = get_word_embedding(word)
        all_words.append(word)
        all_embeddings.append(emb.numpy())
        all_categories.append(category)

embedding_matrix = np.stack(all_embeddings)
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embedding_matrix)

plot_embeddings_2d(embeddings_2d, all_words, all_categories,
                   title="Custom Word Embeddings")
plt.show()

---
# Part 5: Transformers Beyond Text

**Key Insight:** The Transformer architecture is not limited to text! The attention mechanism works on any sequential or structured data.

## Transformers in Different Domains

### 1. Vision Transformers (ViT)

Images are split into patches, each patch becomes a "token":

```
Image (224x224) → 196 patches (16x16 each) → Transformer → Classification
```

In [None]:
# Visualize how images become sequences
def image_to_patches(image_size=224, patch_size=16):
    """Demonstrate how images are converted to sequences of patches."""
    n_patches = (image_size // patch_size) ** 2
    
    print("Vision Transformer (ViT) tokenization:")
    print(f"  Image size: {image_size}x{image_size} pixels")
    print(f"  Patch size: {patch_size}x{patch_size} pixels")
    print(f"  Number of patches: {n_patches}")
    print(f"  Sequence length: {n_patches} (like {n_patches} 'words')")
    print(f"\n  Each patch becomes a 'token' that attends to other patches!")
    
    # Visualize
    fig, ax = plt.subplots(figsize=(6, 6))
    
    # Draw grid
    for i in range(0, image_size + 1, patch_size):
        ax.axhline(y=i, color='blue', linewidth=1)
        ax.axvline(x=i, color='blue', linewidth=1)
    
    # Number some patches
    for i in range(image_size // patch_size):
        for j in range(image_size // patch_size):
            patch_num = i * (image_size // patch_size) + j
            if patch_num < 10 or patch_num >= n_patches - 3:
                ax.text(j * patch_size + patch_size/2, 
                       (image_size // patch_size - 1 - i) * patch_size + patch_size/2,
                       str(patch_num), ha='center', va='center', fontsize=8)
    
    ax.set_xlim(0, image_size)
    ax.set_ylim(0, image_size)
    ax.set_aspect('equal')
    ax.set_title(f'Image → {n_patches} patches (tokens)', fontsize=12)
    plt.tight_layout()
    plt.show()

image_to_patches()
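The grid above shows where the patches are; the sketch below shows what the resulting sequence looks like. It uses a tiny 4x4 single-channel grid in plain Python so the values can be checked by eye, and assumes the image size is divisible by the patch size. `image_to_patch_sequence` is a hypothetical helper, not a ViT library function.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a square 2D grid (list of rows) into flattened patches, row-major."""
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            # Flatten one patch_size x patch_size block into a single vector
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Hypothetical 4x4 single-channel "image" with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = image_to_patch_sequence(image, patch_size=2)

print(len(patches), "patches of", len(patches[0]), "values each")
print(patches[0])
```

For a 224x224 RGB image with 16x16 patches, the same procedure gives 196 patches of 16 × 16 × 3 = 768 values each, which a linear layer then projects to the model's embedding dimension.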

### 2. Transformers for Molecules

Molecules can be represented as:
- **SMILES strings**: Text representation of molecules
- **Graph structure**: Atoms as nodes, bonds as edges

```
Aspirin: CC(=O)OC1=CC=CC=C1C(=O)O
```

Transformers like **ChemBERTa** are trained on molecular data!

In [None]:
# Example: SMILES tokenization
smiles_examples = [
    ("Water", "O"),
    ("Ethanol", "CCO"),
    ("Aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    ("Caffeine", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"),
]

print("Molecules as sequences (SMILES notation):")
print("="*60)
for name, smiles in smiles_examples:
    print(f"\n{name}:")
    print(f"  SMILES: {smiles}")
    print(f"  Length: {len(smiles)} characters")
    # Simple character-level tokenization
    tokens = list(smiles)
    print(f"  Tokens: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
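Character-level splitting is a simplification: it breaks two-letter atoms such as `Cl` and `Br` and bracketed atoms such as `[C@H]` into meaningless pieces. A common alternative is a regex-based tokenizer. The pattern below is a simplified sketch written for this example, not the tokenizer used by ChemBERTa or any specific library, and it does not cover the full SMILES grammar.

```python
import re

# Simplified pattern (an assumption): bracket atoms, two-letter halogens,
# single-letter atoms, aromatic atoms, bond/branch symbols, ring-closure digits.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#()+\-/\\%@.]|\d")

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Fail loudly if the simplified pattern skipped any character
    if "".join(tokens) != smiles:
        raise ValueError(f"Unsupported characters in SMILES: {smiles!r}")
    return tokens

for name, smiles in [("Chloroethane", "CCCl"), ("Alanine", "C[C@H](N)C(=O)O")]:
    print(f"{name}: {tokenize_smiles(smiles)}")
```

With this scheme chloroethane becomes `C`, `C`, `Cl` (two carbons and a chlorine) instead of four unrelated characters.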

### 3. Transformers for Time Series / Signals

Time series data (ECG, stock prices, sensor data) can be:
- Divided into fixed-length windows (patches)
- Each window becomes a token

Examples:
- **Informer**: Long sequence time-series forecasting
- **Temporal Fusion Transformer**: Multi-horizon forecasting

In [None]:
# Generate a sample signal
np.random.seed(42)
t = np.linspace(0, 10, 500)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.5 * np.sin(2 * np.pi * 2 * t) + 0.2 * np.random.randn(500)

# Show how it's converted to patches
patch_size = 50
n_patches = len(signal) // patch_size

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Original signal
axes[0].plot(t, signal, 'b-', linewidth=1)
axes[0].set_title('Original Signal', fontsize=12)
axes[0].set_xlabel('Time')
axes[0].grid(True, alpha=0.3)

# Signal with patches highlighted
colors = plt.cm.tab10(np.linspace(0, 1, n_patches))
for i in range(n_patches):
    start = i * patch_size
    end = start + patch_size
    axes[1].plot(t[start:end], signal[start:end], color=colors[i], linewidth=2)
    axes[1].axvline(x=t[start], color='gray', linestyle='--', alpha=0.5)

axes[1].set_title(f'Signal divided into {n_patches} patches (tokens)', fontsize=12)
axes[1].set_xlabel('Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nSignal tokenization:")
print(f"  Total samples: {len(signal)}")
print(f"  Patch size: {patch_size} samples")
print(f"  Number of patches (tokens): {n_patches}")
print(f"  → Each patch is a vector of {patch_size} values")
print(f"  → Attention learns which time windows are related!")
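The plot colors the patches but never builds the token sequence itself. A minimal sketch of that step follows, using a hypothetical `signal_to_patches` helper on a short list so the result is easy to inspect. It drops any leftover samples that do not fill a whole patch; padding the last patch is another common choice.

```python
def signal_to_patches(samples, patch_size):
    """Cut a 1D signal into consecutive, non-overlapping patches."""
    n_patches = len(samples) // patch_size  # incomplete final patch is dropped
    return [samples[i * patch_size:(i + 1) * patch_size] for i in range(n_patches)]

# Hypothetical signal with 10 samples and a patch size of 4.
samples = list(range(10))
patches = signal_to_patches(samples, patch_size=4)

print(patches)
print(f"{len(samples) - 4 * len(patches)} leftover samples were dropped")
```

Each patch is then treated like a token: a linear layer maps its values to an embedding vector before attention is applied.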

### 4. AlphaFold: Transformers for Protein Structure

AlphaFold uses attention to understand relationships between amino acids:

```
Protein sequence: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH...
                   ↓ (Attention + MSA)
               3D Structure prediction
```

Each amino acid attends to others to predict how they fold in 3D space!

In [None]:
# Simple demo: protein sequence as tokens
hemoglobin_seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
aa_to_idx = {aa: i for i, aa in enumerate(amino_acids)}

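# Map each residue to its integer ID, much like a text tokenizer does
token_ids = [aa_to_idx[aa] for aa in hemoglobin_seq]

print("Protein as sequence (like text!):")
print(f"  Sequence: {hemoglobin_seq[:30]}...")
print(f"  Length: {len(hemoglobin_seq)} amino acids")
print(f"  Vocabulary: {len(amino_acids)} amino acid types")
print(f"  First 10 token IDs: {token_ids[:10]}")
print(f"\n  Each amino acid is a 'token'")
print(f"  Attention learns which amino acids interact in 3D!")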

## Summary: The Universal Pattern

| Domain | Input | Tokenization | What Attention Learns |
|--------|-------|--------------|----------------------|
| Text | Words/sentences | Subword (BPE) | Word relationships |
| Images | Pixels | Patches | Spatial relationships |
| Molecules | SMILES/Graphs | Characters/Atoms | Chemical bonds |
| Time series | Signal | Windows/Patches | Temporal patterns |
| Proteins | Amino acid sequence | Single residues | 3D interactions |

## Exercise 5: Think About Your Domain

**Questions to consider:**
1. What type of data do you work with in your research?
2. How could it be tokenized for a Transformer?
3. What relationships would attention learn?

Think about:
- Genomic sequences (DNA/RNA)
- Medical images (CT scans, X-rays)
- Climate data (spatial-temporal)
- Chemical reactions
- Social networks

In [None]:
# Your notes here:
# Data type: ?
# Tokenization strategy: ?
# What attention could learn: ?
print("Think about how Transformers could apply to your research domain!")

---
# Summary

In this practical, you learned:

1. **Tokenization**: Converting text to numbers
   - Character, word, and subword (BPE) approaches
   - Modern tokenizers balance vocabulary size and sequence length

2. **Attention Mechanism**: The core of Transformers
   - Query, Key, Value projections
   - Softmax attention weights
   - Allows direct connections between any positions

3. **Pre-trained Models**: Using Hugging Face
   - Load models with `AutoModelForCausalLM.from_pretrained()` and tokenizers with `AutoTokenizer.from_pretrained()`
   - Generate text with temperature control

4. **Embeddings**: Dense vector representations
   - Similar words have similar embeddings
   - Can visualize with PCA/t-SNE

5. **Beyond Text**: Transformers are universal
   - Vision, molecules, time series, proteins
   - Key: tokenization + attention

## Key Takeaways

- **Transformers are flexible**: Any sequential/structured data can be processed
- **Attention is the key**: Learns which parts of the input relate to each other
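- **Pre-training is powerful**: Models trained once on large corpora, from the 124M-parameter GPT-2 used here to models with billions of parameters, provide rich representations you can reuse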
- **Embeddings encode meaning**: Vector representations capture semantics