# Hybrid Search Comparison: BM25 vs Semantic vs Hybrid

**Learning Objectives:**
- Compare BM25 (lexical), semantic (embedding-based), and hybrid search
- Understand when each retrieval method excels
- Implement Reciprocal Rank Fusion (RRF) for hybrid search
- Measure retrieval quality using precision metrics

**Execution Time:** <8 minutes (DEMO mode), ~20 minutes (FULL mode)  
**Cost Estimate:** $0.30 (DEMO), $1.50 (FULL)

---

## Setup

In [None]:
import sys
sys.path.append('..')

import json
import numpy as np
from rank_bm25 import BM25Okapi
from backend.semantic_retrieval import (
    generate_embeddings,
    build_vector_index,
    semantic_search,
    hybrid_search,
    reciprocal_rank_fusion
)
from backend.context_judges import ContextPrecisionJudge
import os

# Set execution mode: DEMO (cheap, fast) or FULL (comprehensive)
MODE = os.getenv("EXECUTION_MODE", "DEMO")  # Change to "FULL" for comprehensive evaluation
print(f"🔧 Execution Mode: {MODE}")

# Load recipe data
with open('../homeworks/hw4/data/processed_recipes.json', 'r') as f:
    recipes = json.load(f)

# Sample size based on mode
SAMPLE_SIZE = 50 if MODE == "DEMO" else 200
recipes = recipes[:SAMPLE_SIZE]

print(f"📚 Loaded {len(recipes)} recipes")

## Data Preparation

In [None]:
# Extract recipe documents (name + description + ingredients)
documents = []
for recipe in recipes:
    doc = f"{recipe['name']}. {recipe['description']}. Ingredients: {', '.join(recipe['ingredients'][:10])}"
    documents.append(doc)

print(f"📝 Created {len(documents)} recipe documents")
print(f"\n📄 Sample document:\n{documents[0][:200]}...")

## Build Search Indices
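
The BM25 cell in this section tokenizes with `doc.lower().split()`, which leaves punctuation attached to tokens. Because the recipe documents join their fields with `. ` and `, `, a token such as `lasagna.` will not match the query term `lasagna`. The sketch below contrasts whitespace splitting with a simple regex tokenizer. `regex_tokenize` is a hypothetical helper, not part of `backend.semantic_retrieval`, and the sample document is made up for illustration.

```python
import re

def whitespace_tokenize(text):
    """Tokenize the way the BM25 cell does: lowercase, split on whitespace."""
    return text.lower().split()

def regex_tokenize(text):
    """Hypothetical alternative: lowercase, keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up document in the same "name. description. Ingredients: ..." shape
doc = "Classic Lasagna. Layers of pasta and cheese. Ingredients: noodles, ricotta, mozzarella"

print(whitespace_tokenize(doc))
print(regex_tokenize(doc))

# The query term "lasagna" only matches once punctuation is stripped
print("lasagna" in whitespace_tokenize(doc))  # False: the token is "lasagna."
print("lasagna" in regex_tokenize(doc))       # True
```

If you switch tokenizers, apply the same function to both the corpus and the query so the two vocabularies stay aligned.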

In [None]:
# 1. Build BM25 index (lexical)
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25_index = BM25Okapi(tokenized_corpus)
print("✅ BM25 index built")

# 2. Build vector index (semantic)
print("\n🔄 Generating embeddings (this may take 2-5 minutes)...")
embeddings = generate_embeddings(documents)
vector_index = build_vector_index(embeddings)
print(f"✅ Vector index built: {embeddings.shape[0]} documents, {embeddings.shape[1]} dimensions")

## Test Queries

We'll test three query types:
1. **Exact match queries** (BM25 should excel)
2. **Semantic queries** (Embeddings should excel)
3. **Mixed queries** (Hybrid should excel)

In [None]:
test_queries = [
    # Exact match queries (favor BM25)
    {"query": "lasagna with cheese", "type": "exact", "expected_terms": ["lasagna", "cheese"]},
    {"query": "chicken recipe", "type": "exact", "expected_terms": ["chicken"]},
    
    # Semantic queries (favor embeddings)
    {"query": "healthy breakfast ideas", "type": "semantic", "expected_concepts": ["nutritious", "morning"]},
    {"query": "comfort food for cold weather", "type": "semantic", "expected_concepts": ["warm", "hearty"]},
    
    # Mixed queries (favor hybrid)
    {"query": "quick vegetarian pasta", "type": "mixed", "expected_terms": ["pasta"], "expected_concepts": ["fast", "meatless"]},
    {"query": "chocolate dessert easy to make", "type": "mixed", "expected_terms": ["chocolate"], "expected_concepts": ["simple", "sweet"]},
]

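# DEMO mode keeps only the first three queries (two exact, one semantic),
# so the mixed queries run only in FULL mode
if MODE == "DEMO":
    test_queries = test_queries[:3]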

print(f"🔍 Testing {len(test_queries)} queries")

## Run Comparison
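
Precision@k, the metric reported in this section, is the fraction of the top-k retrieved documents that are relevant. In this notebook the relevance decisions come from `ContextPrecisionJudge`. The sketch below assumes they are already available as a list of booleans; `precision_at_k` is an illustrative helper, not the judge's actual implementation, and the judgments are made up.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k results judged relevant.

    relevance: list of booleans, one per retrieved document, in rank order.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    # Divide by k (not len(top_k)) so missing results count as misses
    return sum(top_k) / k

# Hypothetical judgments for one query's top-5 results
judged = [True, True, False, True, False]
print(f"Precision@5: {precision_at_k(judged, 5):.2f}")  # 3 relevant out of 5 -> 0.60
print(f"Precision@3: {precision_at_k(judged, 3):.2f}")  # 2 relevant out of 3 -> 0.67
```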

In [None]:
# Initialize context precision judge
judge = ContextPrecisionJudge()

results = []
k = 5  # Top-5 retrieval

for i, test in enumerate(test_queries):
    query = test["query"]
    print(f"\n{'='*60}")
    print(f"Query {i+1}/{len(test_queries)}: '{query}' (Type: {test['type']})")
    print(f"{'='*60}")
    
    # 1. BM25 search
    tokenized_query = query.lower().split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    bm25_top_k = np.argsort(bm25_scores)[::-1][:k]
    bm25_results = [documents[idx] for idx in bm25_top_k]
    
    # 2. Semantic search
    query_embedding = generate_embeddings([query])[0]
    semantic_results_raw = semantic_search(query_embedding, vector_index, k=k)
    semantic_top_k = [idx for idx, score in semantic_results_raw]
    semantic_results = [documents[idx] for idx in semantic_top_k]
    
    # 3. Hybrid search (alpha=0.5: equal weight to BM25 and semantic)
    hybrid_top_k = hybrid_search(query, bm25_index, vector_index, alpha=0.5, k=k)
    hybrid_results = [documents[idx] for idx in hybrid_top_k]
    
    # Evaluate precision using AI judge
    print("\n🤖 Evaluating with AI judge...")
    bm25_eval = judge.evaluate(query, bm25_results)
    semantic_eval = judge.evaluate(query, semantic_results)
    hybrid_eval = judge.evaluate(query, hybrid_results)
    
    print(f"\n📊 Precision@{k}:")
    print(f"  BM25:     {bm25_eval['precision']:.2f}")
    print(f"  Semantic: {semantic_eval['precision']:.2f}")
    print(f"  Hybrid:   {hybrid_eval['precision']:.2f}")
    
    # Show top-1 result from each method
    print(f"\n🥇 Top-1 Results:")
    print(f"\n  BM25: {bm25_results[0][:150]}...")
    print(f"\n  Semantic: {semantic_results[0][:150]}...")
    print(f"\n  Hybrid: {hybrid_results[0][:150]}...")
    
    results.append({
        "query": query,
        "query_type": test["type"],
        "bm25_precision": bm25_eval['precision'],
        "semantic_precision": semantic_eval['precision'],
        "hybrid_precision": hybrid_eval['precision'],
        "bm25_top_k": bm25_top_k.tolist(),
        # Cast to plain ints so json.dump works even if the indices are NumPy integers
        "semantic_top_k": [int(idx) for idx in semantic_top_k],
        "hybrid_top_k": [int(idx) for idx in hybrid_top_k]
    })

## Analysis & Insights

In [None]:
import pandas as pd

# Create summary dataframe
summary = pd.DataFrame([
    {
        "Query": r["query"],
        "Type": r["query_type"],
        "BM25": r["bm25_precision"],
        "Semantic": r["semantic_precision"],
        "Hybrid": r["hybrid_precision"]
    }
    for r in results
])

print("\n" + "="*80)
print("📈 SUMMARY: Precision@5 by Query Type")
print("="*80)
print(summary.to_string(index=False))

# Aggregate by query type
print("\n" + "="*80)
print("📊 AVERAGE PRECISION BY QUERY TYPE")
print("="*80)
for qtype in ["exact", "semantic", "mixed"]:
    subset = summary[summary["Type"] == qtype]
    if len(subset) > 0:
        print(f"\n{qtype.upper()} queries:")
        print(f"  BM25:     {subset['BM25'].mean():.3f}")
        print(f"  Semantic: {subset['Semantic'].mean():.3f}")
        print(f"  Hybrid:   {subset['Hybrid'].mean():.3f}")

# Overall average
print("\n" + "="*80)
print("🏆 OVERALL AVERAGE PRECISION")
print("="*80)
print(f"  BM25:     {summary['BM25'].mean():.3f}")
print(f"  Semantic: {summary['Semantic'].mean():.3f}")
print(f"  Hybrid:   {summary['Hybrid'].mean():.3f}")

## Key Insights

**Expected Patterns:**

1. **BM25 (Lexical Search)**
   - ✅ Excels at exact term matching (e.g., "lasagna with cheese")
   - ❌ Struggles with semantic similarity (e.g., "comfort food")
   - ⚡ Fast: No embedding computation needed

2. **Semantic Search (Embeddings)**
   - ✅ Excels at conceptual matching (e.g., "healthy breakfast" → nutritious recipes)
   - ❌ May miss exact term matches if semantically distant
   - 🐢 Slower: Requires embedding generation

3. **Hybrid Search (RRF)**
   - ✅ Best of both worlds: Combines lexical + semantic signals
   - ✅ Most robust across diverse query types
   - ⚖️ Tunable with alpha parameter (0=pure semantic, 1=pure BM25)

**Recommendation:** Use hybrid search with `alpha=0.5` as a starting default for production RAG systems, then tune `alpha` against a representative query set from your own domain.
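
One common way to build an `alpha`-weighted hybrid is to min-max normalize each method's scores and blend them linearly. The sketch below follows the convention stated above (`alpha=1` is pure BM25, `alpha=0` is pure semantic). It is an assumption about how such a blend could work, not the implementation of `hybrid_search()` in `backend.semantic_retrieval`, and the scores are made up.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_hybrid(bm25_scores, semantic_scores, alpha=0.5, k=3):
    """Blend two per-document score lists and return the top-k document indices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    bm25_norm = min_max_normalize(bm25_scores)
    sem_norm = min_max_normalize(semantic_scores)
    combined = [alpha * b + (1 - alpha) * s for b, s in zip(bm25_norm, sem_norm)]
    # Sort document indices by combined score, highest first
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]

# Made-up scores for four documents (index = document id)
bm25_scores = [4.2, 0.0, 1.3, 2.9]
semantic_scores = [0.31, 0.82, 0.77, 0.40]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, weighted_hybrid(bm25_scores, semantic_scores, alpha=alpha))
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities fall in a fixed range, so a raw weighted sum would let one method dominate regardless of `alpha`.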

## Save Results

In [None]:
# Save results for dashboard
output = {
    "mode": MODE,
    "sample_size": SAMPLE_SIZE,
    "k": k,
    "queries": results,
    "summary": {
        "avg_bm25_precision": float(summary['BM25'].mean()),
        "avg_semantic_precision": float(summary['Semantic'].mean()),
        "avg_hybrid_precision": float(summary['Hybrid'].mean()),
        "improvement_over_bm25": float((summary['Hybrid'].mean() - summary['BM25'].mean()) / summary['BM25'].mean() * 100),
        "improvement_over_semantic": float((summary['Hybrid'].mean() - summary['Semantic'].mean()) / summary['Semantic'].mean() * 100)
    }
}

os.makedirs('results', exist_ok=True)
with open('results/hybrid_search_results.json', 'w') as f:
    json.dump(output, f, indent=2)

print("\n✅ Results saved to: lesson-12/results/hybrid_search_results.json")
print(f"\n🎯 Relative change in precision with hybrid search:")
print(f"  vs BM25:     {output['summary']['improvement_over_bm25']:+.1f}%")
print(f"  vs Semantic: {output['summary']['improvement_over_semantic']:+.1f}%")

---

## Next Steps

1. **Tune Alpha:** Experiment with different `alpha` values (0.3, 0.5, 0.7) in `hybrid_search()`
2. **Try RRF:** Use `reciprocal_rank_fusion()` to merge results by rank position rather than raw score, which avoids having to normalize BM25 and embedding scores
3. **Context Quality:** Complete `chunking_optimization.ipynb` to optimize chunk size
4. **Advanced Retrieval:** Study `lesson-12/hybrid_search_strategies.md` for query expansion techniques
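
Reciprocal Rank Fusion scores each document by summing `1 / (c + rank)` over every ranked list it appears in, so it needs only ranks and no score normalization. The sketch below is a standalone illustration using the commonly chosen constant `c = 60`. `rrf_merge` is a hypothetical name, and the signature of `reciprocal_rank_fusion()` in `backend.semantic_retrieval` may differ.

```python
from collections import defaultdict

def rrf_merge(rankings, c=60, k=5):
    """Fuse several ranked lists of document indices with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from best to worst.
    c: smoothing constant; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties broken by document index for determinism
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:k]

# Made-up top-4 lists from two retrievers
bm25_ranking = [7, 2, 9, 4]
semantic_ranking = [2, 5, 7, 1]

print(rrf_merge([bm25_ranking, semantic_ranking], k=3))  # [2, 7, 5]
```

Documents 2 and 7 appear in both lists, so they outrank document 5, which appears in only one list even though it sits at rank 2 there.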

**Related Tutorials:**
- [Concept: Hybrid Search Strategies](hybrid_search_strategies.md)
- [Concept: Context Quality Evaluation](context_quality_evaluation.md)
- [Homework 4: RAG Evaluation](../homeworks/hw4/TUTORIAL_INDEX.md)