# Lesson 9: Embeddings Deep Dive

**Prerequisites:** Complete Lesson 8 (RAG basics)

**Goal:** Understand what embeddings are, how to use them, and how to optimize them for your use case.

**Why This Matters:** Your entire RAG/memory/trace system depends on embeddings. Bad embeddings = bad retrieval = bad responses.

## Overview

**What you'll learn:**
1. What embeddings actually are (not just "vectors")
2. How to use real embedding models (not a toy bag-of-words model)
3. How to choose the right embedding model
4. When to fine-tune embeddings
5. How to optimize embeddings for specific domains

In [None]:
# Install required libraries
%pip install sentence-transformers scikit-learn matplotlib seaborn

## Part 1: What Are Embeddings? (Theory + Visualization)

**Embeddings = Dense vector representations of meaning**

```
Text: "The cat is sleeping"
       ↓ (Embedding Model)
Vector: [0.23, -0.15, 0.67, ..., 0.42]  # 384 dimensions
```

**Key properties:**
- Similar meanings → nearby vectors
- Vector math approximately works in word-level models such as word2vec: "king" - "man" + "woman" ≈ "queen" (not guaranteed for sentence embeddings)
- Learned from data, not hand-coded
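"Nearby" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below shows the calculation on hand-made 3-dimensional vectors; the numbers are invented for illustration and do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
laptop = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, laptop)

print(f"cat vs kitten: {sim_related:.3f}")
print(f"cat vs laptop: {sim_unrelated:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, and vectors that share almost no direction score close to 0. Real embeddings follow the same arithmetic, just in hundreds of dimensions.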

### Experiment 1.1: Visualize Embedding Space

**Goal:** See how embeddings cluster semantically similar sentences

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import seaborn as sns

def visualize_embedding_space():
    """
    Show how embeddings cluster semantically similar sentences
    """
    
    # Load a real embedding model
    print("Loading sentence-transformers model...")
    model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions
    
    # Test sentences with known semantic clusters
    sentences = [
        # Cluster 1: Animals
        "The cat sits on the mat",
        "Dogs are loyal companions",
        "The bird flew over the tree",
        "Lions hunt in the savannah",
        
        # Cluster 2: Technology
        "The computer crashed again",
        "Python is a great programming language",
        "Artificial intelligence is the future",
        "My smartphone battery is dead",
        
        # Cluster 3: Food
        "The pizza tastes delicious",
        "I love eating sushi for dinner",
        "Vegetables are healthy for you",
        "The cake was too sweet",
        
        # Cluster 4: Emotions
        "I feel so happy today",
        "This makes me very angry",
        "She is sad about the news",
        "They feel overwhelmed",
    ]
    
    # Generate embeddings (384-dimensional vectors)
    print("Generating embeddings...")
    embeddings = model.encode(sentences)
    print(f"Embedding shape: {embeddings.shape}")  # (16, 384)
    
    # Reduce to 2D for visualization using t-SNE
    print("Reducing to 2D...")
    tsne = TSNE(n_components=2, random_state=42, perplexity=5)
    embeddings_2d = tsne.fit_transform(embeddings)
    
    # Plot
    plt.figure(figsize=(14, 10))
    
    # Color by cluster
    colors = ['red'] * 4 + ['blue'] * 4 + ['green'] * 4 + ['purple'] * 4
    labels = ['Animals'] * 4 + ['Technology'] * 4 + ['Food'] * 4 + ['Emotions'] * 4
    
    for i, (x, y) in enumerate(embeddings_2d):
        plt.scatter(x, y, c=colors[i], s=200, alpha=0.6, edgecolors='black')
        plt.annotate(
            sentences[i], 
            (x, y),
            xytext=(5, 5), 
            textcoords='offset points',
            fontsize=9
        )
    
    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='red', label='Animals'),
        Patch(facecolor='blue', label='Technology'),
        Patch(facecolor='green', label='Food'),
        Patch(facecolor='purple', label='Emotions'),
    ]
    plt.legend(handles=legend_elements, loc='best')
    
    plt.title("Semantic Similarity in Embedding Space\n(384D → 2D via t-SNE)", fontsize=14)
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\nWhat to look for:")
    print("  - Sentences about animals should sit near each other")
    print("  - Technology sentences should form a separate group")
    print("  - This clustering is LEARNED, not hard-coded!")
    print("  - This is why semantic search works")

if __name__ == "__main__":
    visualize_embedding_space()

## Part 2: Embedding Model Comparison (Empirical)

### The Landscape of Embedding Models

**Common options:**

| Model | Dimensions | Speed | Quality | Use Case |
|-------|------------|-------|---------|----------|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General purpose |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality |
| BAAI/bge-large-en-v1.5 | 1024 | Slow | Best | Maximum quality |

**Trade-offs:**
- Larger models = usually better quality, but slower and more memory per vector
- Smaller models = faster, and often good enough for most cases
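The memory side of this trade-off is easy to estimate: a dense index stores one float per dimension per vector. The sketch below assumes float32 storage (4 bytes per value) and a hypothetical corpus of one million vectors, and ignores any overhead a vector database adds on top.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a dense index, excluding any index overhead."""
    return num_vectors * dim * bytes_per_value

num_vectors = 1_000_000  # hypothetical corpus size

model_dims = [
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
    ("BAAI/bge-large-en-v1.5", 1024),
]

for name, dim in model_dims:
    size_gb = index_size_bytes(num_vectors, dim) / 1e9
    print(f"{name:<24} {dim:>5}d  {size_gb:.2f} GB")
```

Storage grows linearly with dimension, so moving from 384 to 1024 dimensions costs roughly 2.7x the memory for the same corpus.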

### Experiment 2.1: Speed Benchmarks

**Goal:** Measure real-world performance differences

In [None]:
import time
import numpy as np
from sentence_transformers import SentenceTransformer

def benchmark_embedding_models():
    """
    Compare speed of different embedding models
    """
    
    models = {
        "tiny_fast": "all-MiniLM-L6-v2",        # 384d
        "medium": "all-mpnet-base-v2",          # 768d
        # "large_slow": "BAAI/bge-large-en-v1.5", # 1024d (Commented out to save download time)
    }
    
    # Test corpus (simulate Chatbot traces)
    test_texts = [
        "User expressed anxiety about work deadlines and feeling overwhelmed",
        "Bot suggested breaking down tasks into smaller steps",
        "User felt relieved after discussing the plan",
        "Bot reminded user to take breaks",
        "Previous discussion about coping with stress and pressure",
    ] * 20  # 100 texts total
    
    results = {}
    
    print("=" * 70)
    print("EMBEDDING MODEL SPEED BENCHMARK")
    print("=" * 70)
    print(f"\nTest corpus: {len(test_texts)} texts")
    print()
    
    for name, model_name in models.items():
        print(f"Testing {name} ({model_name})...")
        
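        # Load model (on the first run this also includes download time)
        start_load = time.perf_counter()
        model = SentenceTransformer(model_name)
        load_time = time.perf_counter() - start_load
        
        # Warmup (the first call pays one-time setup costs)
        model.encode("warmup")
        
        # Benchmark encoding; perf_counter is a monotonic, high-resolution clock
        start_encode = time.perf_counter()
        embeddings = model.encode(test_texts)
        encode_time = time.perf_counter() - start_encode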
        
        texts_per_sec = len(test_texts) / encode_time
        
        results[name] = {
            "load_time": load_time,
            "encode_time": encode_time,
            "texts_per_sec": texts_per_sec,
            "dim": embeddings.shape[1]
        }
        
        print(f"  - Load time: {load_time:.2f}s")
        print(f"  - Encode time: {encode_time:.4f}s")
        print(f"  - Speed: {texts_per_sec:.1f} texts/sec")
        print(f"  - Dimensions: {embeddings.shape[1]}")
        print()
    
    # Comparison
    print("=" * 70)
    print("COMPARISON")
    print("=" * 70)
    
    baseline = results["tiny_fast"]
    
    for name, res in results.items():
        if name == "tiny_fast":
            continue
            
        slowdown = baseline["texts_per_sec"] / res["texts_per_sec"]
        print(f"{name} is {slowdown:.1f}x slower than tiny_fast")
    
    print("\n" + "=" * 70)
    print("RECOMMENDATION:")
    print("  For Chatbot trace retrieval:")
    print("    - Start with 'tiny_fast' (384d)")
    print("    - Only upgrade if retrieval quality is insufficient")
    print("    - Speed matters for real-time response")
    print("=" * 70)

if __name__ == "__main__":
    benchmark_embedding_models()
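
A single timed call can be noisy, since background processes and caching shift the result from run to run. One way to steady the numbers is to repeat the measurement and report the median. The helper below is a sketch: `time_callable` and `fake_encode` are hypothetical names, and `fake_encode` stands in for a call such as `lambda: model.encode(test_texts)`.

```python
import statistics
import time

def time_callable(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def fake_encode():
    """Stand-in workload so the sketch runs without downloading a model."""
    return sum(i * i for i in range(50_000))

median_s = time_callable(fake_encode, repeats=5)
print(f"median: {median_s * 1000:.2f} ms over 5 runs")
```

The median is less sensitive than the mean to one unusually slow run, which matters when comparing models whose speeds differ by a small factor.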

## Part 3: Domain-Specific Optimization

### Applying Embeddings to Your Architecture

**Current Chatbot embedding usage:**

1. **Trace retrieval** (Lesson 5 gap experiments)
   - Query: User message
   - Corpus: Successful reasoning traces
   - Goal: Find relevant past patterns

2. **Memory search** (across 5 databases)
   - Query: Current context
   - Corpus: Past conversations, facts, events
   - Goal: Retrieve relevant context
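
Both uses reduce to the same operation: embed the query, score it against every corpus vector, and keep the best matches. The sketch below shows that top-k step on hand-made vectors. The function names and the numbers are illustrative assumptions; in practice the vectors would come from `model.encode`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, corpus_vecs, k=2):
    """Return [(index, score)] for the k corpus vectors most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        v = normalize(vec)
        # The dot product of two unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, v))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical corpus and query vectors, invented for illustration only.
corpus = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]

hits = top_k(query, corpus, k=2)
for idx, score in hits:
    print(f"corpus[{idx}] score={score:.3f}")
```

Normalizing once at indexing time and storing unit vectors avoids repeating that work per query, which is a common design choice when the corpus is large.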

### Experiment 3.1: Optimize Trace Retrieval

**Goal:** Find best embedding setup for trace retrieval

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def optimize_trace_retrieval():
    """
    Test different embedding models on Chatbot trace retrieval
    """
    
    # Simulate trace database
    traces = [
        {
            "id": 0,
            "query_pattern": "User is stressed about work",
            "approach": "Validate feelings -> Break down problem -> Suggest small step",
            "tools_used": ["empathy_engine", "planner"]
        },
        {
            "id": 1,
            "query_pattern": "User wants to learn a new skill",
            "approach": "Assess current level -> Recommend resources -> Set goal",
            "tools_used": ["resource_finder", "goal_setter"]
        },
        {
            "id": 2,
            "query_pattern": "User asks about past conversation",
            "approach": "Search memory -> Summarize context -> Answer specific question",
            "tools_used": ["memory_search", "summarizer"]
        }
    ]
    
    # Format traces as text for embedding
    trace_texts = [
        f"Query: {t['query_pattern']}\nApproach: {t['approach']}\nTools: {', '.join(t['tools_used'])}"
        for t in traces
    ]
    
    # Test queries (what user actually asks)
    test_queries = [
        "I'm feeling stressed about my job",              # Should match trace 0
        "I want to learn how to code python",             # Should match trace 1
        "What did I mention last Tuesday?",               # Should match trace 2
    ]
    
    # Expected top result for each query
    expected = [0, 1, 2]
    
    models_to_test = {
        "tiny_384d": "all-MiniLM-L6-v2",
        # "medium_768d": "all-mpnet-base-v2",
    }
    
    print("=" * 70)
    print("TRACE RETRIEVAL OPTIMIZATION")
    print("=" * 70)
    print(f"Traces: {len(traces)}")
    print(f"Test queries: {len(test_queries)}")
    print()
    
    results = {}
    
    for model_name, model_path in models_to_test.items():
        print(f"\n--- Testing {model_name} ---")
        model = SentenceTransformer(model_path)
        
        # Embed traces
        trace_embeddings = model.encode(trace_texts)
        
        correct_count = 0
        
        for i, query in enumerate(test_queries):
            query_embedding = model.encode(query)
            
            # Calculate similarity
            similarities = cosine_similarity(
                [query_embedding], 
                trace_embeddings
            )[0]
            
            # Get top match
            top_match_idx = np.argmax(similarities)
            
            is_correct = (top_match_idx == expected[i])
            if is_correct:
                correct_count += 1
                
            print(f"  Query: '{query}'")
            print(f"    -> Matched Trace {top_match_idx} (Score: {similarities[top_match_idx]:.3f})")
            print(f"    -> Correct? {is_correct}")
            
        accuracy = correct_count / len(test_queries)
        results[model_name] = accuracy
        print(f"  Accuracy: {accuracy:.1%}\n")
    
    # Recommendation
    print("=" * 70)
    print("RECOMMENDATION FOR CHATBOT:")
    
    best_model = max(results, key=results.get)
    print(f"  Best model: {best_model} ({results[best_model]:.1%} accuracy)")
    
    if results["tiny_384d"] >= 0.8:
        print("\n  ✓ Tiny model (384d) performs well - use it!")
        print("    - Good enough quality")
    else:
        print("\n  ⚠ Consider upgrading embedding model")
        print("    - Try medium (768d) or fine-tune")
    
    print("=" * 70)

if __name__ == "__main__":
    optimize_trace_retrieval()
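
The experiment above scores only the top match, and with three test queries a single miss moves accuracy by about 33 points. A rank-aware metric such as mean reciprocal rank (MRR) also gives credit when the right trace lands in second or third place. The sketch below computes both metrics from ranked lists of trace ids; the rankings are hypothetical, not output from the experiment.

```python
def top1_accuracy(rankings, expected):
    """Fraction of queries whose first-ranked id is the expected one."""
    hits = sum(1 for ranked, target in zip(rankings, expected) if ranked[0] == target)
    return hits / len(expected)

def mean_reciprocal_rank(rankings, expected):
    """Average of 1/rank of the expected id (0 if it is missing from the ranking)."""
    total = 0.0
    for ranked, target in zip(rankings, expected):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(expected)

# Hypothetical ranked trace ids per query, best match first.
rankings = [
    [0, 2, 1],  # expected trace 0 is ranked first
    [2, 1, 0],  # expected trace 1 is ranked second
    [2, 0, 1],  # expected trace 2 is ranked first
]
expected = [0, 1, 2]

accuracy = top1_accuracy(rankings, expected)
mrr = mean_reciprocal_rank(rankings, expected)
print(f"top-1 accuracy: {accuracy:.3f}")
print(f"MRR:            {mrr:.3f}")
```

To produce such rankings from similarity scores, `np.argsort(-similarities)` gives the trace indices in descending order of similarity. A larger set of test queries makes either metric more trustworthy.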

## Summary & Key Takeaways

### What You Learned

1. **Embeddings = Semantic meaning in vector space**
   - Not just word overlap
   - Capture similarity, synonyms, paraphrases

2. **Model Selection Trade-offs**
   - 384d: Fast, good enough for most cases
   - 1024d: Slower and more memory; not benchmarked in this lesson, so measure the quality gain on your own queries

3. **Chatbot Specific**
   - Trace retrieval depends on embeddings
   - Upgrade only if retrieval is the bottleneck

### Decision Framework

**Start here:**
- Use `all-MiniLM-L6-v2` (384d)
- Fast, proven, good baseline

**Upgrade if:**
- Top-1 retrieval accuracy on your test queries is below 80%
- Queries keep returning the wrong traces
- Latency is not critical