---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: Embeddings & Semantic Search

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission from the professor.

---

## Welcome!

In this lecture, we'll explore **embeddings and semantic search**. These tools complement the lexical search tools that we covered last week.

By the end of this session, you'll be able to:
- Understand what embeddings are and how they encode meaning
- Use both local (Hugging Face) and API-based (OpenAI) embedding models
- Implement semantic search from scratch using cosine similarity
- Discover why similarity does not equal relevance
- Build hybrid search combining BM25 + embeddings
- Compare search approaches using NDCG

## Using modules in your code

Starting today, we'll use a **helpers module** to organize reusable code.

Instead of copying functions between notebooks, we **import** them:

```python
from helpers import load_wands_products, snowball_tokenize, score_bm25
```

This is how professional codebases work:
- **Single source of truth** - fix a bug once, fixed everywhere
- **Cleaner notebooks** - focus on the lesson, not boilerplate
- **Reusability** - same functions work across lectures and homework

In [None]:
# ruff: noqa: E402
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings("ignore")

# Import from our helpers module!
from helpers import (
    # Data loading
    load_wands_products, load_wands_queries, load_wands_labels,
    # BM25 (from Lecture 3)
    build_index, score_bm25, search_bm25,
    # Evaluation
    evaluate_search,
    # Embeddings
    get_embedding_openai, get_embedding_local, batch_embed_local,
    # Similarity
    cosine_similarity, batch_cosine_similarity,
    # Utility
    get_product_sample, normalize_scores
)

# Load environment variables for API keys
from dotenv import load_dotenv
load_dotenv()

pd.set_option('display.max_colwidth', 80)
print("All imports successful!")

---

# 1. From Keywords to Meaning

## 1.1 Recap: BM25 and Lexical Search

Last week, we built a search engine using **BM25** - a lexical search algorithm that:
- Matches documents based on **exact token matches**
- Uses **TF-IDF** scoring with saturation and length normalization
- Gives you **precise control** over what matches
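
As a reminder of what "saturation and length normalization" mean in practice, here is a minimal sketch of one common BM25 variant. It is not necessarily the exact formula used in our `helpers` module: the function name `bm25_score`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (common BM25 variant)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        if term not in tf:
            continue  # exact-match only: terms absent from the document add nothing
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized
        norm = 1 - b + b * len(doc_tokens) / avg_len
        # Saturation: each extra occurrence of the term adds less than the previous one
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score

# Toy corpus, invented for this example
docs = [["blue", "sofa"], ["sofa", "sofa", "sofa", "cover"], ["oak", "table"]]
doc_freq = Counter(term for d in docs for term in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)

scores = [bm25_score(["sofa"], d, doc_freq, len(docs), avg_len) for d in docs]
for d, s in zip(docs, scores):
    print(f"{' '.join(d):25s} {s:.3f}")
```

The second document mentions "sofa" three times but scores far less than three times the first one, which is the saturation effect; the third document scores zero because it shares no token with the query.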

Let's reload our WANDS data and BM25 index:

In [None]:
# Load the WANDS dataset (same as Lecture 3 and Homework 3)
products = load_wands_products()
queries = load_wands_queries()
labels = load_wands_labels()

print(f"Products: {len(products):,}")
print(f"Queries: {len(queries):,}")
print(f"Labels: {len(labels):,}")

In [None]:
# Build BM25 index on product names
name_index, name_lengths = build_index(products['product_name'].tolist())
print(f"Index contains {len(name_index):,} unique terms")

## 1.2 The Limitation: Exact Token Matching

BM25 is powerful, but it has a fundamental limitation: it only matches **exact tokens**.

What happens when we search for synonyms?

In [None]:
# Search for "couch"
couch_results = search_bm25("couch", name_index, products, name_lengths, k=5)
print("BM25 results for 'couch':")
couch_results[['product_name', 'bm25_score']]

In [None]:
# Search for "sofa" - a synonym!
sofa_results = search_bm25("sofa", name_index, products, name_lengths, k=5)
print("BM25 results for 'sofa':")
sofa_results[['product_name', 'bm25_score']]

**Notice the problem:**
- "couch" results contain products with "couch" in the name
- "sofa" results contain products with "sofa" in the name
- But they're **synonyms** - a user searching for "couch" would probably want sofas too!

BM25 treats them as completely different words because it only matches exact tokens.
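
The root cause can be shown without BM25 at all. The sketch below uses a simplified whitespace tokenizer and three made-up product names (both are assumptions, not the WANDS data or the tokenizer from `helpers`) to show that "couch" and "sofa" never share a token.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification of real tokenizers)."""
    return text.lower().split()

def has_lexical_match(query, document):
    """True if the query and the document share at least one exact token."""
    return bool(set(tokenize(query)) & set(tokenize(document)))

# Hypothetical product names, invented for this example
names = ["Velvet Sofa with Chaise", "Leather Couch", "Sofa Bed"]

couch_hits = [n for n in names if has_lexical_match("couch", n)]
sofa_hits = [n for n in names if has_lexical_match("sofa", n)]

print("couch ->", couch_hits)
print("sofa  ->", sofa_hits)
```

The two result lists do not overlap, even though a shopper would likely accept any of the three products for either query.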

## 1.3 Lexical vs Semantic Search

| Aspect | Lexical Search (BM25) | Semantic Search (Embeddings) |
|--------|----------------------|-----------------------------|
| **How it works** | Exact token matching | Meaning-based similarity |
| **Synonyms** | Misses them | Finds them! |
| **Control** | High - you decide what matches | Lower - model decides |
| **Speed** | Very fast (inverted index) | Slower (vector comparisons) |
| **Best for** | Precise queries, keywords, IDs | Fuzzy queries, concepts |

**Key insight**: Lexical search is a **scalpel** - precise but limited. Semantic search is a **net** - it catches more but gives you less control.

---

# 2. What are embeddings?

> **TERM: Embedding**  
> A **dense vector representation** that maps text (or other data) to a point in high-dimensional space where **semantically similar items are close together**.

Think of it as assigning "coordinates" to the **meaning** of text:
- "couch" and "sofa" would have similar coordinates (close together)
- "couch" and "refrigerator" would have different coordinates (far apart)
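
To make the "coordinates" idea concrete, here is a toy sketch with hand-picked 2-D vectors. The numbers are invented purely for illustration; real embeddings have hundreds or thousands of dimensions and are produced by a model, not written by hand.

```python
import numpy as np

# Hand-picked 2-D "coordinates", invented purely for illustration
toy_embeddings = {
    "couch": np.array([0.90, 0.80]),
    "sofa": np.array([0.85, 0.75]),
    "refrigerator": np.array([-0.70, 0.20]),
}

def distance(word_a, word_b):
    """Straight-line distance between two toy embeddings."""
    return float(np.linalg.norm(toy_embeddings[word_a] - toy_embeddings[word_b]))

near = distance("couch", "sofa")
far = distance("couch", "refrigerator")

print(f"couch <-> sofa:         {near:.3f}")
print(f"couch <-> refrigerator: {far:.3f}")
```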


## 2.1 Getting Your First Embedding

In [None]:
# Get an embedding using OpenAI's API
couch_emb = get_embedding_openai(text="couch", model="text-embedding-3-small")

print(f"Type: {type(couch_emb)}")
print(f"Dimension: {len(couch_emb)}")
print(f"First 10 values: {couch_emb[:10]}")

The embedding is a **1536-dimensional vector** of floating-point numbers.

Each dimension captures some aspect of the word's meaning - but unlike features we design ourselves, these are **latent features** learned by the model.

> **TERM: Latent Features**  
> Hidden dimensions in the embedding that capture abstract concepts. They're not directly interpretable like "color=red" or "size=large" - they're patterns the model discovered during training.

## 2.2 Embeddings Capture Meaning

Let's see how embeddings capture the relationship between words:

In [None]:
# Get embeddings for related words
words = ["couch", "sofa", "chair", "table", "refrigerator"]
embeddings = {word: get_embedding_openai(text=word, model="text-embedding-3-large") for word in words}

# Calculate similarity between all pairs

word_width = max(len(w) for w in words) + 2
# Header row: empty cell + each word as column header (each in word_width)
header = "".join([f"{'':<{word_width}s}"] + [f"{w:>{word_width}s}" for w in words])
print("Similarity matrix:")
print(header)
print("-" * (word_width + len(words) * word_width))
# Data rows: row label + scores, each column word_width wide
for w1 in words:
    row = [f"{w1:<{word_width}s}"]
    for w2 in words:
        sim = cosine_similarity(embeddings[w1], embeddings[w2])
        row.append(f"{sim:>{word_width}.2f}")
    print("".join(row))

**What do you notice?**
- "couch" and "sofa" have **very high similarity** (~0.75) - the model learned they're synonyms!
- Furniture items (couch, sofa, chair, table) are more similar to each other
- "refrigerator" is less similar to the furniture items

The embedding model learned these relationships from training on massive amounts of text.

In [None]:
# Get embeddings for related words
words_2 = ["Apostolos Filippas", "Tilda Swinton", "Technology", "Movies", "Suspiria", "Greek", "British"]
embeddings_2 = {word: get_embedding_openai(text=word, model="text-embedding-3-large") for word in words_2}

# Calculate similarity between all pairs

word_width = max(len(w) for w in words_2) + 2
# Header row: empty cell + each word as column header (each in word_width)
header = "".join([f"{'':<{word_width}s}"] + [f"{w:>{word_width}s}" for w in words_2])
print("Similarity matrix:")
print(header)
print("-" * (word_width + len(words_2) * word_width))
# Data rows: row label + scores, each column word_width wide
for w1 in words_2:
    row = [f"{w1:<{word_width}s}"]
    for w2 in words_2:
        sim = cosine_similarity(embeddings_2[w1], embeddings_2[w2])
        row.append(f"{sim:>{word_width}.2f}")
    print("".join(row))

---

# 2.5 Local vs API Embeddings

> **TERM: Hugging Face**  
> An open-source platform hosting thousands of pre-trained AI models. Think of it as "GitHub for AI models" - you can download and run models locally without API calls or costs.

So far we've used OpenAI's embedding API. But there's another option: **run models locally**!

## 2.5.1 Loading a Local Model

In [None]:
# Get a local embedding - first call downloads the model (~80MB)
local_emb = get_embedding_local("wooden coffee table")

print(f"Local embedding dimension: {len(local_emb)}")

In [None]:
# Compare dimensions
api_emb = get_embedding_openai("wooden coffee table")

print(f"OpenAI (API): {len(api_emb)} dimensions")
print(f"MiniLM (Local): {len(local_emb)} dimensions")

## 2.5.3 Trade-offs: API vs Local

| Aspect | API (OpenAI) | Local (Hugging Face) |
|--------|-------------|---------------------|
| **Cost** | ~$0.02 per 1M tokens | FREE |
| **Dimensions** | 1536 (more expressive) | 384 (more compact) |
| **Quality** | Generally higher | Good for most tasks |
| **Speed** | Network latency | Faster for batches |
| **Privacy** | Data sent to API | Data stays local |
| **Setup** | Just API key | Downloads model (~80MB) |

**When to use which?**
- **Prototyping/Learning**: Local - free experimentation!
- **Production with privacy needs**: Local
- **Production needing best quality**: API
- **High volume, cost-sensitive**: Local
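
To put the cost row of the table into perspective, here is a rough back-of-the-envelope sketch. The function `estimate_embedding_cost` is hypothetical, and the figure of about 20 tokens per product text is an assumption rather than a measured value; only the ~$0.02 per 1M tokens price comes from the table above.

```python
def estimate_embedding_cost(n_texts, avg_tokens_per_text, usd_per_million_tokens=0.02):
    """Rough API cost in USD for embedding n_texts once."""
    total_tokens = n_texts * avg_tokens_per_text
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assumption: about 20 tokens per product text (a guess, not a measured value)
sample_cost = estimate_embedding_cost(n_texts=5_000, avg_tokens_per_text=20)
print(f"Estimated cost for 5,000 products: ${sample_cost:.4f}")
```

Under these assumptions a one-off embedding run is very cheap, so cost tends to matter mostly at high volume or when embeddings are recomputed often.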

---

# 3. Measuring Similarity with Cosine

## 3.1 Why Cosine Similarity?

To find similar items, we need to measure the "distance" between embeddings. **Cosine similarity** measures the angle between two vectors:

$$\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \times \|b\|}$$

- **1.0** = identical direction (most similar)
- **0.0** = perpendicular (unrelated)
- **-1.0** = opposite direction (most dissimilar)

Why cosine instead of Euclidean distance? Cosine focuses on **direction** (meaning) not **magnitude** (length).
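
A small sketch of the direction-versus-magnitude point, using made-up vectors: scaling a vector changes its Euclidean distance to the original but leaves the cosine similarity at 1.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, following the formula above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
scaled = 10 * a      # same direction, 10x the length
opposite = -a        # opposite direction

cos_scaled = cosine(a, scaled)
cos_opposite = cosine(a, opposite)
cos_perpendicular = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
euclid_scaled = float(np.linalg.norm(a - scaled))

print(f"cosine(a, 10a)    = {cos_scaled:.2f}")
print(f"cosine(a, -a)     = {cos_opposite:.2f}")
print(f"cosine(x, y axes) = {cos_perpendicular:.2f}")
print(f"euclidean(a, 10a) = {euclid_scaled:.2f}")
```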

In [None]:
# A manual version of the cosine_similarity function we imported:
def cosine_similarity_manual(a, b):
    """Calculate cosine similarity between two vectors."""
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Verify it matches
sim1 = cosine_similarity(embeddings["couch"], embeddings["sofa"])
sim2 = cosine_similarity_manual(embeddings["couch"], embeddings["sofa"])
print(f"From helpers: {sim1:.6f}")
print(f"Manual: {sim2:.6f}")

## 3.2 Batch Similarity for Efficiency

When searching, we need to compare one query against **thousands of products**. Doing this one-by-one is slow. Instead, we use **matrix operations**:

In [None]:
# Stack all word embeddings into a matrix
word_matrix = np.array([embeddings[w] for w in words])
print(f"Matrix shape: {word_matrix.shape}")

# Query embedding
query_emb = embeddings["couch"]

# Calculate similarity to all words at once
similarities = batch_cosine_similarity(query_emb, word_matrix)

for word, sim in zip(words, similarities):
    print(f"{word:15s}: {sim:.4f}")
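
For intuition, here is one way a batched cosine similarity can be written as a single matrix operation. This is a sketch under the assumption that rows of the matrix are embeddings; `batch_cosine_sketch` is a hypothetical name, and the actual `batch_cosine_similarity` in `helpers` may be implemented differently. The random vectors are stand-ins for real embeddings.

```python
import numpy as np

def batch_cosine_sketch(query, matrix):
    """Cosine similarity between one query vector and every row of a matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix_unit @ query_unit

# Random stand-ins for embeddings (not real model output)
rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 8))
query = rng.normal(size=8)

batched = batch_cosine_sketch(query, matrix)

# Same computation, one row at a time, for comparison
looped = np.array([
    np.dot(query, row) / (np.linalg.norm(query) * np.linalg.norm(row))
    for row in matrix
])
print("Results match:", np.allclose(batched, looped))
```

Both versions give the same numbers; the batched one pushes the loop into NumPy, which is typically much faster on large matrices.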

---

# 4. Building Semantic Search from Scratch

Now let's build a working semantic search engine!

## 4.1 The Semantic Search Pipeline

1. **Embed all products** (offline, once)
2. **Embed the query** (at search time)
3. **Calculate similarity** between query and all products
4. **Return top-k** most similar products

## 4.2 Embedding Products

For speed in class, we'll work with a **sample of 5,000 products**:

In [None]:
# Get consistent sample (same for everyone)
products_sample = get_product_sample(products, n=5000)
print(f"Working with {len(products_sample):,} products")
products_sample[['product_id', 'product_name', 'product_class']].head()

In [None]:
# Create text for embedding: combine name and class
products_sample['embed_text'] = (
    products_sample['product_name'].fillna('') + ' ' +
    products_sample['product_class'].fillna('')
)

products_sample['embed_text'].head()

In [None]:
# Embed all products using local model
# This took me 3.5 seconds
print("Embedding products...")
start = time.time()
product_embeddings = batch_embed_local(
    products_sample['embed_text'].tolist(),
    show_progress=True
)
print(f"Done in {time.time() - start:.1f}s")
print(f"Embeddings shape: {product_embeddings.shape}")

In [None]:
# Save embeddings so we don't have to recompute
import os
os.makedirs('temp', exist_ok=True)  # make sure the output folder exists
np.save('temp/product_embeddings_sample.npy', product_embeddings)
products_sample.to_csv('temp/products_sample.csv', index=False)
print("Saved embeddings and sample to 'temp/'")

## 4.3 Implementing Semantic Search

In [None]:
def semantic_search_local(query, product_embeddings, products_df, k=10):
    """Search products using local embedding similarity."""
    # 1. Embed the query
    query_emb = get_embedding_local(query)
    
    # 2. Calculate similarity to all products
    similarities = batch_cosine_similarity(query_emb, product_embeddings)
    
    # 3. Get top-k indices
    top_k_idx = np.argsort(-similarities)[:k]
    
    # 4. Build results DataFrame
    results = products_df.iloc[top_k_idx].copy()
    results['similarity'] = similarities[top_k_idx]
    results['rank'] = range(1, len(results) + 1)  # works even if fewer than k products exist
    
    return results

In [None]:
# Test semantic search!
results = semantic_search_local("couch", product_embeddings, products_sample)
results[['rank', 'product_name', 'product_class', 'similarity']]

Let's test the synonym problem that BM25 couldn't solve:

In [None]:
# Build BM25 index for the sample
sample_index, sample_lengths = build_index(products_sample['product_name'].tolist())

# Search for "sofa" with BM25
bm25_results = search_bm25("sofa", sample_index, products_sample, sample_lengths, k=10)
print("BM25 for 'sofa':")
print(bm25_results[['product_name', 'bm25_score']].to_string())

print("\n" + "="*60 + "\n")

# Search for "sofa" with semantic search
sem_results = semantic_search_local("sofa", product_embeddings, products_sample, k=10)
print("Semantic for 'sofa':")
print(sem_results[['product_name', 'similarity']].to_string())

**Semantic search finds both "sofa" AND "couch" products!** It understands they're related concepts.

Let's try another query that BM25 struggles with:

In [None]:
# A conceptual query - no exact keyword match
query = "place to sit and relax"

bm25_results = search_bm25(query, sample_index, products_sample, sample_lengths, k=5)
print(f"BM25 for '{query}':")
print(bm25_results[['product_name', 'bm25_score']].to_string())

print("\n" + "="*60 + "\n")

sem_results = semantic_search_local(query, product_embeddings, products_sample, k=5)
print(f"Semantic for '{query}':")
print(sem_results[['product_name', 'similarity']].to_string())

---

# 5. The Critical Lesson: Similarity is NOT Relevance

Semantic search seems magical - it finds synonyms and understands concepts! But there's a **critical problem** you must understand.

## 5.1 The Domain Mismatch Problem

The embedding model was trained on **general web text** (Wikipedia, books, etc.). It learned what words mean in general.

But **relevance** in e-commerce search is domain-specific:
- A user searching for "star wars rug" wants a **rug** with a Star Wars theme
- They don't want a Star Wars **poster** or **blanket** - even though those are semantically similar!

Let's see this in action:

In [None]:
# Search for "star wars rug"
query = "star wars rug"

sem_results = semantic_search_local(query, product_embeddings, products_sample, k=10)
print(f"Semantic search for '{query}':")
sem_results[['rank', 'product_name', 'product_class', 'similarity']]

**What happened?**

The semantic search found items that are **similar to "star wars rug"** - but many of them might not be rugs at all!

The embedding model doesn't understand that:
- **"rug"** is the **product type** (must match)
- **"star wars"** is the **theme** (nice to have)

It treats all words equally in terms of meaning similarity.

## 5.2 Measuring the Problem with NDCG

Let's quantify how well each search method performs using NDCG (from Homework 3):

In [None]:
# Filter queries to those with products in our sample
sample_product_ids = set(products_sample['product_id'])
sample_labels = labels[labels['product_id'].isin(sample_product_ids)]
sample_query_ids = set(sample_labels['query_id'])
sample_queries = queries[queries['query_id'].isin(sample_query_ids)]

print(f"Queries with products in sample: {len(sample_queries)}")

In [None]:
# Evaluate BM25 on sample
print("Evaluating BM25...")
bm25_eval = evaluate_search(
    lambda q: search_bm25(q, sample_index, products_sample, sample_lengths, k=10),
    products_sample, sample_queries, sample_labels, k=10
)

In [None]:
# Evaluate Semantic Search on sample
print("Evaluating Semantic Search...")
semantic_eval = evaluate_search(
    lambda q: semantic_search_local(q, product_embeddings, products_sample, k=10),
    products_sample, sample_queries, sample_labels, k=10
)

In [None]:
# Compare!
print("\n" + "="*40)
print("COMPARISON")
print("="*40)
print(f"BM25 Mean NDCG@10:     {bm25_eval['ndcg'].mean():.4f}")
print(f"Semantic Mean NDCG@10: {semantic_eval['ndcg'].mean():.4f}")
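
As a refresher on what the numbers above measure, here is a minimal NDCG@k sketch. It assumes linear gains and graded relevance values such as 2, 1, 0 (hypothetical choices); the `evaluate_search` helper may use a different gain mapping or formula variant.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gains lower in the ranking count less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(all_gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items exist for this query
    return dcg(ranked_gains[:k]) / ideal_dcg

# Hypothetical relevance grades for one query: 2 = exact, 1 = partial, 0 = irrelevant
all_gains = [2, 1, 0]
perfect = ndcg_at_k([2, 1, 0], all_gains)
reversed_order = ndcg_at_k([0, 1, 2], all_gains)

print(f"Perfect ranking:  {perfect:.3f}")
print(f"Reversed ranking: {reversed_order:.3f}")
```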

## 5.3 Analyzing When Each Method Wins

In [None]:
# Combine evaluations
comparison = bm25_eval.merge(semantic_eval, on=['query_id', 'query'], suffixes=('_bm25', '_semantic'))
comparison['diff'] = comparison['ndcg_semantic'] - comparison['ndcg_bm25']

print(f"Semantic wins: {(comparison['diff'] > 0).sum()} queries")
print(f"BM25 wins: {(comparison['diff'] < 0).sum()} queries")
print(f"Tie: {(comparison['diff'] == 0).sum()} queries")

In [None]:
# Queries where semantic search wins big
print("Queries where SEMANTIC wins:")
semantic_wins = comparison.nlargest(5, 'diff')
semantic_wins[['query', 'ndcg_bm25', 'ndcg_semantic', 'diff']]

In [None]:
# Queries where BM25 wins big
print("Queries where BM25 wins:")
bm25_wins = comparison.nsmallest(5, 'diff')
bm25_wins[['query', 'ndcg_bm25', 'ndcg_semantic', 'diff']]

## 5.4 Key Takeaway

**Similarity is NOT the same as relevance!**

The embedding model learned general semantic similarity, but:
- It wasn't trained on e-commerce product search
- It doesn't know that product type is often more important than theme
- It doesn't understand your specific business rules

**Never assume embeddings will solve your search problem. Always evaluate with real relevance labels!**

---

# 6. Hybrid Search: Best of Both Worlds

Since BM25 and semantic search have different strengths, what if we **combine them**?

## 6.1 Weighted Combination

The simplest hybrid approach:
1. Get BM25 scores (normalize to 0-1)
2. Get semantic similarity scores (cosine similarity can range from -1 to 1, so normalize these to 0-1 as well)
3. Combine: `hybrid = alpha * semantic + (1-alpha) * bm25`
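
The three steps above can be sketched on toy numbers. This example assumes min-max normalization, which may or may not be what `normalize_scores` in `helpers` does; the function name `min_max` and the score values are made up for illustration.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to the 0-1 range (all zeros if every score is identical)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

# Made-up scores for three products
bm25 = np.array([0.0, 4.2, 8.4])          # unbounded BM25 scores
semantic = np.array([0.30, 0.50, 0.35])   # cosine similarities

alpha = 0.5
hybrid = alpha * min_max(semantic) + (1 - alpha) * min_max(bm25)
print(hybrid)
```

Without normalization, the BM25 scores would dominate the sum simply because they live on a larger scale.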

In [None]:
def hybrid_search(query, sample_index, product_embeddings, products_df, 
                  sample_lengths, alpha=0.5, k=10):
    """
    Combine BM25 and semantic search.
    
    alpha: weight for semantic (1-alpha for BM25)
    """
    # Get BM25 scores
    bm25_scores = score_bm25(query, sample_index, len(products_df), sample_lengths)
    bm25_norm = normalize_scores(bm25_scores)
    
    # Get semantic scores
    query_emb = get_embedding_local(query)
    semantic_scores = batch_cosine_similarity(query_emb, product_embeddings)
    # Cosine similarity can range from -1 to 1, so normalize to 0-1 as well
    semantic_norm = normalize_scores(semantic_scores)
    
    # Combine
    combined_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm
    
    # Get top-k
    top_k_idx = np.argsort(-combined_scores)[:k]
    
    results = products_df.iloc[top_k_idx].copy()
    results['hybrid_score'] = combined_scores[top_k_idx]
    results['bm25_score'] = bm25_norm[top_k_idx]
    results['semantic_score'] = semantic_norm[top_k_idx]
    results['rank'] = range(1, len(results) + 1)  # works even if fewer than k products exist
    
    return results

In [None]:
# Test hybrid search
query = "star wars rug"
hybrid_results = hybrid_search(query, sample_index, product_embeddings, 
                               products_sample, sample_lengths, alpha=0.5)

print(f"Hybrid search for '{query}':")
hybrid_results[['rank', 'product_name', 'product_class', 'bm25_score', 'semantic_score', 'hybrid_score']]

## 6.2 Finding the Optimal Alpha

In [None]:
# Try different alpha values
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
results = []

for alpha in alphas:
    print(f"Evaluating alpha={alpha}...")
    eval_df = evaluate_search(
        lambda q: hybrid_search(q, sample_index, product_embeddings, 
                               products_sample, sample_lengths, alpha=alpha),
        products_sample, sample_queries, sample_labels, k=10, verbose=False
    )
    results.append({
        'alpha': alpha,
        'mean_ndcg': eval_df['ndcg'].mean()
    })

results_df = pd.DataFrame(results)
results_df

In [None]:
# Plot
plt.figure(figsize=(8, 5))
plt.plot(results_df['alpha'], results_df['mean_ndcg'], 'bo-', linewidth=2, markersize=8)
plt.xlabel('Alpha (0=BM25 only, 1=Semantic only)', fontsize=12)
plt.ylabel('Mean NDCG@10', fontsize=12)
plt.title('Hybrid Search Performance vs Alpha', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

best_alpha = results_df.loc[results_df['mean_ndcg'].idxmax(), 'alpha']
print(f"\nBest alpha: {best_alpha}")

## 6.3 Final Comparison

In [None]:
# Evaluate hybrid with best alpha
print(f"Evaluating Hybrid (alpha={best_alpha})...")
hybrid_eval = evaluate_search(
    lambda q: hybrid_search(q, sample_index, product_embeddings, 
                           products_sample, sample_lengths, alpha=best_alpha),
    products_sample, sample_queries, sample_labels, k=10
)

print("\n" + "="*50)
print("FINAL COMPARISON")
print("="*50)
print(f"BM25 only:              {bm25_eval['ndcg'].mean():.4f}")
print(f"Semantic only:          {semantic_eval['ndcg'].mean():.4f}")
print(f"Hybrid (alpha={best_alpha}):     {hybrid_eval['ndcg'].mean():.4f}")

**Hybrid search often outperforms both individual methods!**

This is because:
- BM25 ensures exact keyword matches are found
- Semantic adds synonym and concept matching
- Together they cover each other's weaknesses

---

# 7. Challenges at Scale

In production, semantic search faces several challenges:

## 7.1 Computational Cost

Comparing a query to **millions of products** requires millions of similarity calculations.

In [None]:
# Time semantic search at different scales
query_emb = get_embedding_local("test query")

for n in [1000, 5000]:
    subset = product_embeddings[:n]
    start = time.time()
    for _ in range(100):  # Run 100 times for stable measurement
        _ = batch_cosine_similarity(query_emb, subset)
    elapsed = (time.time() - start) / 100
    print(f"{n:,} products: {elapsed*1000:.2f}ms per query")

**Solutions:**
- **ANN (Approximate Nearest Neighbors)**: FAISS, HNSW, Pinecone
- Trade exact results for a large speedup (the gain depends on the index and dataset, and grows with collection size)
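
To illustrate the idea behind approximate search, here is a small sketch of random-hyperplane hashing (a simple form of locality-sensitive hashing). This is a different and much simpler technique than the libraries listed above, chosen only because it fits in a few lines of NumPy; the data is random and all names are hypothetical.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Random unit vectors standing in for product embeddings
vectors = rng.normal(size=(2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Each random hyperplane contributes one bit: which side of the plane a vector is on
n_bits = 6
planes = rng.normal(size=(32, n_bits))

def signature(v):
    """Bucket key: the side of each hyperplane the vector falls on."""
    return tuple(((v @ planes) > 0).tolist())

# Offline step: group vectors by signature
buckets = defaultdict(list)
for idx, vec in enumerate(vectors):
    buckets[signature(vec)].append(idx)

def ann_search(query, k=5):
    """Compare the query only against vectors in its own bucket."""
    candidates = np.array(buckets.get(signature(query), []), dtype=int)
    if candidates.size == 0:
        return candidates
    sims = vectors[candidates] @ query  # unit vectors, so dot product = cosine
    return candidates[np.argsort(-sims)[:k]]

query = vectors[42]
approx = ann_search(query)
exact = np.argsort(-(vectors @ query))[:5]

n_compared = len(buckets[signature(query)])
print(f"Compared against {n_compared} of {len(vectors)} vectors")
print("Approximate top-5:", approx)
print("Exact top-5:      ", exact)
```

The approximate search scans only one bucket, so it may miss some true neighbors that landed in other buckets. That is the accuracy-for-speed trade-off in miniature.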

## 7.2 Other Challenges

| Challenge | Problem | Solution |
|-----------|---------|----------|
| **Hubness** | Some vectors match everything | Normalize, diversity sampling |
| **Filtering** | "Red sofas under $500" | Pre-filter then search |
| **Staleness** | Products change, embeddings don't | Re-embed pipeline |
| **Cold start** | New products have no signals | Use content-based embeddings |
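
The "pre-filter then search" row can be sketched as follows: apply the hard constraint first, then rank only the surviving products by similarity. The function `filtered_search`, the 2-D embeddings, and the prices are made up for illustration.

```python
import numpy as np

def filtered_search(query_emb, embeddings, keep_mask, k=3):
    """Rank only the rows where keep_mask is True; returns original row indices."""
    candidate_idx = np.flatnonzero(keep_mask)
    if candidate_idx.size == 0:
        return candidate_idx
    subset = embeddings[candidate_idx]
    sims = subset @ query_emb / (
        np.linalg.norm(subset, axis=1) * np.linalg.norm(query_emb)
    )
    return candidate_idx[np.argsort(-sims)[:k]]

# Made-up 2-D embeddings and prices for four products
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
prices = np.array([450.0, 900.0, 300.0, 499.0])
query_emb = np.array([1.0, 0.0])

hits = filtered_search(query_emb, embeddings, prices < 500, k=2)
print("Products under $500, ranked by similarity:", hits)
```

Product 1 is very similar to the query but costs $900, so it never enters the ranking.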

---

# 8. Summary

## What We Covered

| Concept | What It Is | Key Insight |
|---------|-----------|-------------|
| **Embedding** | Dense vector representing meaning | Similar items = close vectors |
| **Local vs API** | Hugging Face vs OpenAI | Trade-off: cost vs quality |
| **Cosine Similarity** | Measures angle between vectors | Range -1 to 1, direction matters |
| **Semantic Search** | Find by meaning, not keywords | Handles synonyms, paraphrases |
| **Similarity != Relevance** | Training data != your domain | Always evaluate with real labels! |
| **Hybrid Search** | BM25 + Semantic combined | Often beats either alone |

## Can You Do These?

- [ ] Get embeddings using both OpenAI API and local Hugging Face models
- [ ] Calculate cosine similarity between vectors
- [ ] Implement semantic search from scratch
- [ ] Explain why similarity is not the same as relevance
- [ ] Build hybrid search combining BM25 + embeddings
- [ ] Evaluate search quality using NDCG
- [ ] Choose between local and API embeddings based on requirements

## Troubleshooting

| Problem | Solution |
|---------|----------|
| Semantic search returns wrong product types | Consider hybrid search or filtering |
| Embeddings are slow | Use local model for development, batch operations |
| NDCG is low for semantic | Domain mismatch - consider fine-tuning |
| Model download fails | Check internet connection, disk space |

## Resources

- [Sentence Transformers Documentation](https://www.sbert.net/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Hugging Face Model Hub](https://huggingface.co/models)
- [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) - Compare embedding models

## Next Class

We'll explore **RAG (Retrieval Augmented Generation)** - combining search with LLMs to build intelligent Q&A systems!