# Traditional to Modern RAG: Evolution of Retrieval Methods (2020-2024)

This comprehensive tutorial traces the evolution of Retrieval-Augmented Generation (RAG) from the original 2020 paper to modern state-of-the-art implementations, with hands-on implementation of each advancement.

## 🎯 Learning Objectives

By the end of this tutorial, you will be able to:

1. **Understand RAG Fundamentals**
   - Implement the exact retrieval method from the original RAG paper (Lewis et al., 2020)
   - Understand Dense Passage Retrieval (DPR) and Maximum Inner Product Search (MIPS)
   - Recognize that DPR is an instance of the vector search that powers modern vector databases

2. **Diagnose Performance Issues**
   - Identify domain mismatch problems (single-hop vs multihop)
   - Analyze why strong models on one task struggle on another
   - Conduct error analysis and failure case studies

3. **Compare Modern Embedders**
   - Evaluate 4 embedding models: DPR (2020) → Contriever (2022) → E5/BGE (2023)
   - Understand model evolution and selection criteria
   - Make informed decisions about embedder choice for your task

4. **Build Production RAG Pipelines**
   - Implement cross-encoder reranking for improved precision
   - Combine sparse (BM25) and dense (embeddings) search for hybrid retrieval
   - Measure each component's contribution through experimental evaluation

5. **Evaluate and Optimize**
   - Apply 6 comprehensive metrics for multihop reasoning evaluation
   - Profile latency and optimize performance bottlenecks
   - Compare custom implementations with frameworks (LlamaIndex)

## 🗺️ Tutorial Roadmap

### **PART 1: Foundations (Sections 1-4)**
```
Setup → Data Loading → BM25 Baseline → RAG Paper DPR Baseline
└─ Build understanding of sparse and dense retrieval
```

### **PART 2: Analysis (Section 5)**
```
Why RAG Baseline Struggles on Multihop Tasks
└─ Learn from failures, understand domain mismatch
```

### **PART 3: Modern Embedders (Section 6)**
```
Compare 4 Embedders: DPR → Contriever → E5 → BGE
└─ See 3 years of embedding model evolution (2020-2023)
```

### **PART 4: Pipeline Enhancements (Sections 7-8)**
```
Add Reranking → Add Hybrid Search
└─ Build components incrementally, measure each contribution
```

## 🔍 Key Insight: RAG Paper Uses Vector Search!

:::{admonition} Understanding the RAG Paper Baseline
:class: important

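**The Original RAG Paper (Lewis et al., 2020) Uses:**
- **Dense Passage Retrieval (DPR)** - Bi-encoder architecture
- **Pre-training on single-hop factoid QA** - The paper's retriever was trained on Natural Questions and TriviaQA; this tutorial uses the NQ-only checkpoint
- **Frozen document encoder** - The paper fine-tunes only the query encoder (jointly with the generator); here both encoders stay frozen (this is our baseline!)
- **MIPS** - Maximum Inner Product Search for similarity
- **Top-k Retrieval** - Return the most similar document vectors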

**This IS vector search** - the same technology behind:
- OpenAI embeddings + Pinecone/Weaviate/Chroma
- LlamaIndex vector stores
- LangChain vector retrievers
- Modern production RAG systems

**Our Tutorial Journey:**
1. Implement RAG 2020 baseline (DPR-NQ)
2. Test on HotpotQA multihop → Measure actual performance
3. Analyze failures → Domain mismatch (trained on single-hop, testing on multihop)
4. Upgrade to modern embedders → Measure improvements
5. Add enhancements (reranking, hybrid) → Quantify gains
:::

## 📊 What We'll Build & Measure

We'll implement and compare retrieval components:

| Component | Baseline | Modern | Measured Via |
|-----------|----------|--------|--------------|
| **Embedder** | DPR-NQ (2020) | BGE/E5 (2023) | Document Recall@10, latency |
| **Retrieval** | Dense only | Hybrid (BM25 + Dense) | Document Recall@10 |
| **Reranking** | None | Cross-encoder | Precision improvement |
| **Evaluation** | Per-question | Comprehensive | 6 HotpotQA metrics |
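
To make the "Hybrid (BM25 + Dense)" row concrete, here is a minimal sketch of one common way to merge a sparse and a dense ranking: reciprocal rank fusion (RRF). This is an illustrative assumption, not necessarily the fusion method used later in this tutorial; the function name `reciprocal_rank_fusion`, the document ids, and the constant `k=60` (a widely used default) are all chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list using RRF."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks in any list help
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from a sparse and a dense retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and embedding similarities live on different scales.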

## 🔧 Learning Philosophy

**Learn by Implementing, Not Just Using**

While frameworks like LlamaIndex provide excellent abstractions, we implement from scratch to understand:
- ✅ How vector embeddings work mathematically
- ✅ Different similarity metrics (cosine vs inner product) and how MIPS searches over them
- ✅ Trade-offs between sparse and dense representations
- ✅ Why and when each component improves performance
- ✅ How to debug and optimize RAG systems
- ✅ When to use frameworks vs custom implementations

**Progressive Complexity**
- Start with RAG paper baseline (understand foundations)
- Analyze failures (learn from mistakes)
- Add one improvement at a time (measure each contribution)
- Build complete modern RAG (production-ready system)

**Comparative Learning**
- Side-by-side comparisons at every step
- Visualizations showing performance differences
- Understanding evolution from 2020 to 2024

Let's begin the journey from traditional to modern RAG! 🚀

In [None]:
# Install required packages
!pip install transformers datasets torch sentence-transformers rank-bm25 numpy scikit-learn matplotlib seaborn
!pip install llama-index  # For comparison later

import torch
import torch.nn.functional as F  # Used to L2-normalize embeddings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import (
    DPRQuestionEncoder, DPRContextEncoder,
    DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer,
    AutoTokenizer, AutoModelForCausalLM
)
from sentence_transformers import CrossEncoder
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity
import string
import re
from typing import List, Dict, Tuple, Optional
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✅ All packages installed and imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎯 CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🚀 Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 📊 Data Loading and Preprocessing

Let's load the HotpotQA dataset and prepare it for our vector search implementation.

In [None]:
# Load HotpotQA dataset
print("🔄 Loading HotpotQA dataset...")
dataset = load_dataset('hotpotqa/hotpot_qa', 'distractor')
train_data = dataset['train']
validation_data = dataset['validation']

print(f"📊 Dataset loaded successfully!")
print(f"   Training examples: {len(train_data):,}")
print(f"   Validation examples: {len(validation_data):,}")

# Take a smaller subset for faster processing during development
SAMPLE_SIZE = 100  # Increase this for full evaluation
train_sample = train_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(train_data))))
val_sample = validation_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(validation_data))))

print(f"🎯 Working with sample: {len(train_sample)} train, {len(val_sample)} validation")

# Inspect a sample to understand the structure
sample_example = train_sample[0]
print("\n📋 Sample HotpotQA Example Structure:")
print(f"   Question: {sample_example['question']}")
print(f"   Answer: {sample_example['answer']}")
print(f"   Type: {sample_example['type']}")
print(f"   Level: {sample_example['level']}")
# NOTE: the Hugging Face version of HotpotQA stores `context` and
# `supporting_facts` as dicts of parallel lists, not as lists of pairs
print(f"   Number of context paragraphs: {len(sample_example['context']['title'])}")
print(f"   Supporting facts: {len(sample_example['supporting_facts']['title'])}")

print("\n🔍 First few context titles:")
context_list = list(zip(sample_example['context']['title'], sample_example['context']['sentences']))
for i, (title, sentences) in enumerate(context_list[:3]):
    print(f"   {i+1}. {title} ({len(sentences)} sentences)")

print("\n📍 Supporting facts:")
facts_list = list(zip(sample_example['supporting_facts']['title'], sample_example['supporting_facts']['sent_id']))
for title, sent_idx in facts_list:
    print(f"   - {title}, sentence {sent_idx}")

In [None]:
# Preprocessing functions for BM25 and DPR
def preprocess_text_for_bm25(text):
    """Preprocess text for BM25 sparse retrieval"""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.split()

def extract_passages_from_example(example):
    """
    Extract individual passages from a SINGLE HotpotQA example
    
    KEY INSIGHT FOR HOTPOTQA:
    - Each example has 10 context paragraphs (2 gold + 8 distractors)
    - Retrieval happens WITHIN these 10 paragraphs per question
    - This simulates real-world multihop QA where we filter from a candidate set
    """
    passages = []
    passage_metadata = []
    
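    # The Hugging Face dataset stores context as {'title': [...], 'sentences': [[...], ...]},
    # so pair each title with its list of sentences
    context_list = list(zip(example['context']['title'], example['context']['sentences']))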
    
    for context_idx, (title, sentences) in enumerate(context_list):
        # Each sentence becomes a separate passage
        for sent_idx, sentence in enumerate(sentences):
            passage_text = sentence.strip()
            if passage_text:  # Only add non-empty passages
                passages.append(passage_text)
                passage_metadata.append({
                    'title': title,
                    'context_idx': context_idx,
                    'sentence_idx': sent_idx,
                    'full_passage': ' '.join(sentences)  # Full paragraph for context
                })
    
    return passages, passage_metadata

def simple_answer_extraction(question, retrieved_passages, top_k=3):
    """
    Simple heuristic answer extraction for tutorial purposes
    
    Strategy:
    1. Combine the top-k retrieved passages
    2. Collect candidate spans: capitalized phrases (proper nouns) and numbers
    3. Return the first candidate that is not already contained in the question
    4. Fall back to the first few words of the combined passages
    """
    if not retrieved_passages:
        return "Unable to answer"
    
    # Combine top-k passages
    combined_text = ' '.join(retrieved_passages[:top_k])
    
    # Question words (lowercased, punctuation stripped) used to filter candidates
    question_words = set(preprocess_text_for_bm25(question))
    
    # Candidate answer phrases: capitalized phrases (proper nouns), numbers, or years
    candidates = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b|\b\d+\b', combined_text)
    
    # Return the first candidate that adds information beyond the question itself
    for candidate in candidates:
        candidate_words = preprocess_text_for_bm25(candidate)
        if not all(word in question_words for word in candidate_words):
            return candidate
    
    # Fallback: return first few words of top passage
    words = combined_text.split()
    return ' '.join(words[:min(5, len(words))])

print("✅ Preprocessing functions ready!")
print("📝 Functions available:")
print("   - preprocess_text_for_bm25(): Clean text for sparse retrieval")
print("   - extract_passages_from_example(): Extract passages from ONE example")
print("   - simple_answer_extraction(): Generate answer from retrieved passages")

## 🔍 BM25 Sparse Retrieval Implementation

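BM25 is a probabilistic ranking function that scores each document by how often the query terms occur in it (term frequency), how rare those terms are across the collection (inverse document frequency), and a document-length normalization. It is a classic sparse method that remains a strong baseline and underlies many search engines.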
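
To show what that scoring looks like, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. The function name `bm25_scores` and the toy documents are made up for illustration, and `k1=1.5`, `b=0.75` are common defaults. Library implementations such as `rank_bm25` may differ in details like IDF smoothing, so treat this as a sketch rather than a drop-in replacement.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative): rare terms weigh more
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates as tf grows, controlled by k1
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus, already lowercased and tokenized
docs = [
    "inception is a film directed by christopher nolan".split(),
    "christopher nolan was born in 1970".split(),
    "the lakers won the 2020 nba championship".split(),
]
scores = bm25_scores("who directed inception".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first document shares terms with the query, so it is the only one with a non-zero score.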

### 🎯 Critical HotpotQA Setup Understanding

**Per-Question Retrieval (Correct Approach):**
- Each HotpotQA example provides 10 context paragraphs (2 gold + 8 distractors)
- The task: retrieve the correct 2 documents from these 10 candidates
- We build a separate BM25 index for each question's context
- This tests the model's ability to filter signal from noise

**Why NOT Global Corpus Retrieval:**
- ❌ Building one index from the training contexts and searching it with validation questions won't work
- ❌ Gold documents for validation questions are generally not among the training contexts
- ❌ Document Recall@10 would be close to 0% because most gold documents are missing from the index
- ✅ Per-question retrieval matches the actual HotpotQA benchmark setup
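
The Document Recall@k metric mentioned above can be sketched as a small helper. The function name `document_recall_at_k` and the example titles are assumptions made for illustration; the input is the list of paragraph titles of the retrieved passages, in rank order.

```python
def document_recall_at_k(ranked_titles, gold_titles, k=10):
    """Fraction of gold document titles found among the top-k retrieved passages."""
    gold = set(gold_titles)
    if not gold:
        return 0.0
    retrieved = set(ranked_titles[:k])
    return len(retrieved & gold) / len(gold)

# Titles of retrieved sentences in rank order. A title can repeat because
# several sentences from the same paragraph may be retrieved.
ranked = ["Inception", "Inception", "Interstellar", "Christopher Nolan"]
gold = ["Inception", "Christopher Nolan"]

print(document_recall_at_k(ranked, gold, k=2))  # → 0.5
print(document_recall_at_k(ranked, gold, k=4))  # → 1.0
```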

In [None]:
# Demonstrate BM25 with PER-QUESTION retrieval (correct approach for HotpotQA)
print("🔄 BM25 Per-Question Retrieval Demonstration")
print("="*60)

# Get a test example from validation set
test_example = val_sample[0]
test_question = test_example['question']

print(f"🎯 Test Question: {test_question}")
print(f"🎯 Gold Answer: {test_example['answer']}")

# Extract passages from THIS example's context (10 paragraphs)
example_passages, example_metadata = extract_passages_from_example(test_example)
# The Hugging Face dataset stores context/supporting_facts as dicts of parallel lists
context_titles = list(test_example['context']['title'])
print(f"\n📊 Extracted {len(example_passages)} passages from {len(context_titles)} context paragraphs")

# Titles of the gold supporting documents (computed once, reused below)
gold_titles = set(test_example['supporting_facts']['title'])

# Show context paragraph titles
print(f"\n📚 Available context paragraphs (2 gold + 8 distractors):")
for i, title in enumerate(context_titles):
    marker = "🟢 GOLD" if title in gold_titles else "🔴 DISTRACTOR"
    print(f"   {i+1}. {title} {marker}")

# Build BM25 index for THIS example's passages
print(f"\n🏗️ Building BM25 index for this example's passages...")
tokenized_passages = [preprocess_text_for_bm25(passage) for passage in example_passages]
bm25 = BM25Okapi(tokenized_passages)
print(f"✅ BM25 index built with {len(tokenized_passages)} passages")

# Tokenize query and search
test_query_tokens = preprocess_text_for_bm25(test_question)
print(f"\n🔍 Query tokens: {test_query_tokens[:10]}...")

# Get top-k BM25 scores
k = 10
bm25_scores = bm25.get_scores(test_query_tokens)
top_k_indices = np.argsort(bm25_scores)[::-1][:k]

print(f"\n📊 Top-{k} BM25 Retrieval Results:")
retrieved_titles_bm25 = []
for i, idx in enumerate(top_k_indices):
    score = bm25_scores[idx]
    passage = example_passages[idx][:100] + "..."
    title = example_metadata[idx]['title']
    
    # Mark if it's a gold document
    is_gold = "✓ GOLD" if title in gold_titles else ""
    print(f"   {i+1}. Score: {score:.3f} | {title} {is_gold}")
    print(f"      {passage}")
    
    if title not in retrieved_titles_bm25:
        retrieved_titles_bm25.append(title)

# Calculate document recall
gold_retrieved = set(retrieved_titles_bm25[:k]).intersection(gold_titles)
doc_recall = len(gold_retrieved) / len(gold_titles)
print(f"\n📈 Document Recall@{k}: {doc_recall:.3f} ({len(gold_retrieved)}/{len(gold_titles)} gold docs retrieved)")

print("\n✅ BM25 per-question retrieval demonstration complete!")
print("💡 Key insight: We retrieve from each question's 10 context paragraphs, not a global corpus!")

## 🎯 SECTION 4: RAG Paper Baseline - Dense Passage Retrieval (2020)

**This implements the retrieval method used in the original RAG paper (Lewis et al., 2020): a DPR bi-encoder with top-k vector search.**

### 📄 What is the RAG Paper Baseline?

The Retrieval-Augmented Generation (RAG) paper by Lewis et al. (2020) introduced a powerful paradigm: combine neural retrieval with neural generation. The retrieval component uses:

- **Dense Passage Retrieval (DPR)** - Bi-encoder architecture with separate encoders for questions and passages
- **Pre-training on single-hop factoid QA** - The paper's retriever was trained on Natural Questions (NQ) and TriviaQA; the checkpoint loaded below was trained on NQ only
- **A frozen document encoder** - The paper fine-tunes the query encoder jointly with the generator; in this baseline both encoders stay frozen
- **MIPS (Maximum Inner Product Search)** - Find the passages with the highest dot product with the query vector
- **Top-k retrieval** - Return the k most similar passages based on vector similarity

### 🔍 DPR IS Vector Search!

Dense Passage Retrieval is essentially **vector search** - the same underlying technology used in:
- OpenAI embeddings + Pinecone/Weaviate/Chroma
- LlamaIndex vector stores
- LangChain vector retrievers
- Most modern RAG systems

**The architecture:**
```
Question → BERT Encoder → 768-dim query vector
Passages → BERT Encoder → 768-dim passage vectors  
Similarity → Dot product (or cosine) between query and passage vectors
Retrieval → Top-k passages with highest similarity scores
```
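
The "dot product (or cosine)" choice in the diagram matters more than it may seem: the two scores can rank the same passages differently, because the inner product also rewards vector length. Here is a minimal sketch with hand-made 2-dimensional vectors (real DPR vectors have 768 dimensions); the helper names `dot`, `cosine`, and `top_k` are assumptions made for this example.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Inner product of the L2-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def top_k(query, passages, score_fn, k=1):
    """Return the indices of the k passages with the highest score."""
    order = sorted(range(len(passages)),
                   key=lambda i: score_fn(query, passages[i]),
                   reverse=True)
    return order[:k]

query = [1.0, 0.0]
passages = [
    [0.9, 0.1],  # short vector, almost parallel to the query
    [3.0, 3.0],  # long vector, 45 degrees away from the query
]

print(top_k(query, passages, dot))     # → [1], inner product favors the long vector
print(top_k(query, passages, cosine))  # → [0], cosine favors the best-aligned vector
```

When comparing embedders, it is worth using the similarity each model was trained with, since a model trained for inner-product search is not guaranteed to rank the same way under cosine similarity.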

### 📊 Task Characteristics

**Open-Domain QA Tasks in the Original RAG Paper (illustrative questions):**
- ✅ Natural Questions: "Who is the president of France?"
- ✅ TriviaQA: "What year did the Berlin Wall fall?"  
- ✅ WebQuestions: "When was Barack Obama born?"
- Most are **single-hop** questions answerable from one document

**Our Task - HotpotQA (different from training):**
- ❓ Multihop reasoning: "What year was the director of Inception born?"
  - Step 1: Find document about Inception → Learn director is Christopher Nolan
  - Step 2: Find document about Christopher Nolan → Find birth year 1970
- Requires **two documents** and reasoning across them

### 🎯 Why Implement This Baseline?

1. **Understand Foundations**: See how RAG paper's retriever actually works
2. **Learn from Failures**: Observe performance on mismatched task domains
3. **Appreciate Evolution**: Understand motivations for modern improvements
4. **Build Intuition**: Know when domain-specific training matters

### 📈 What We'll Measure

When we test DPR-NQ on HotpotQA multihop questions, we'll measure:
- Document Recall@10: Can it find both required documents?
- Supporting-Fact F1: Sentence-level retrieval accuracy
- Latency: Encoding and search time
- Comparison baseline for modern methods

The experiments will reveal how task mismatch affects performance. This isn't about DPR being "bad" - it's about understanding when and why models struggle on out-of-domain tasks.
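
As a concrete reading of the Supporting-Fact F1 metric listed above, here is a minimal sketch that treats each supporting fact as a `(title, sentence_index)` pair and computes set-based precision, recall, and F1. The function name `supporting_fact_f1` and the example pairs are assumptions for illustration; the official HotpotQA evaluation script may differ in details.

```python
def supporting_fact_f1(predicted, gold):
    """Precision, recall and F1 over sets of (title, sentence_index) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical gold facts and retrieved sentences for one question
gold = [("Inception", 0), ("Christopher Nolan", 1)]
predicted = [("Inception", 0), ("Inception", 2),
             ("Interstellar", 0), ("Christopher Nolan", 1)]

precision, recall, f1 = supporting_fact_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → precision=0.50 recall=1.00 f1=0.67
```

Retrieving more sentences tends to raise recall but lower precision, which is why F1 is reported alongside Document Recall@10.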

Let's implement the RAG paper baseline and run experiments!

In [None]:
# Load pre-trained DPR models (this is vector search!)
print("🔄 Loading pre-trained DPR models for vector search...")

# Question encoder (queries → vectors)
q_encoder = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

# Context encoder (passages → vectors) 
c_encoder = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
c_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

# Move to device
q_encoder.to(device)
c_encoder.to(device)

print(f"✅ DPR models loaded on {device}")
print("🔍 Vector Search Components Ready:")
print(f"   - Question Encoder: Transforms questions → 768-dim vectors")
print(f"   - Context Encoder: Transforms passages → 768-dim vectors")
print(f"   - Similarity: Cosine similarity / Inner product search")

def encode_questions_dpr(questions, batch_size=32):
    """Encode questions into dense vectors (Vector Search Step 1)"""
    q_encoder.eval()
    all_embeddings = []
    
    with torch.no_grad():
        for i in range(0, len(questions), batch_size):
            batch_questions = questions[i:i+batch_size]
            
            # Tokenize questions
            encoded = q_tokenizer(
                batch_questions,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )
            
            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)
            
            # Encode to vectors
            embeddings = q_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())
    
    return torch.cat(all_embeddings, dim=0)

def encode_passages_dpr(passages, batch_size=32):
    """Encode passages into dense vectors (Vector Search Step 2)"""
    c_encoder.eval()
    all_embeddings = []
    
    with torch.no_grad():
        for i in tqdm(range(0, len(passages), batch_size), desc="Encoding passages"):
            batch_passages = passages[i:i+batch_size]
            
            # Tokenize passages
            encoded = c_tokenizer(
                batch_passages,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )
            
            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)
            
            # Encode to vectors
            embeddings = c_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())
    
    return torch.cat(all_embeddings, dim=0)

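def vector_search_dpr(query_embedding, passage_embeddings, passages, metadata, top_k=10):
    """Perform vector search using cosine similarity (Vector Search Step 3)

    Note: DPR was trained with unnormalized inner-product scores (MIPS).
    L2-normalizing both sides turns the dot product into cosine similarity,
    which can rank passages differently than pure MIPS.
    """
    # Normalize embeddings for cosine similarity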
    query_embedding = F.normalize(query_embedding, p=2, dim=1)
    passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
    
    # Compute similarity matrix (this IS vector search!)
    similarities = torch.mm(query_embedding, passage_embeddings.transpose(0, 1))
    similarities = similarities.squeeze(0)  # Remove batch dimension
    
    # Get top-k most similar vectors
    top_k_scores, top_k_indices = torch.topk(similarities, min(top_k, len(similarities)))
    
    results = []
    for score, idx in zip(top_k_scores, top_k_indices):
        results.append({
            'idx': int(idx),
            'score': float(score),
            'passage': passages[int(idx)],
            'title': metadata[int(idx)]['title'],
            'metadata': metadata[int(idx)]
        })
    
    return results

print("\n✅ DPR Vector Search functions ready!")
print("🎯 Key insight: DPR = Vector Database functionality!")
print("   - encode_questions_dpr(): Query → Vector")
print("   - encode_passages_dpr(): Documents → Vectors") 
print("   - vector_search_dpr(): Similarity search (MIPS/Cosine)")

In [None]:
# Demonstrate DPR Vector Search with PER-QUESTION retrieval
print("🔄 DPR Vector Search Per-Question Demonstration")
print("="*60)
print("⚡ This demonstrates vector search on a per-question basis!")

# Use the same test example for comparison
print(f"🎯 Test Question: {test_question}")
print(f"🎯 Gold Answer: {test_example['answer']}")

# Build DPR vector embeddings for THIS example's passages
print(f"\n🧠 Encoding {len(example_passages)} passages into dense vectors...")
passage_embeddings = encode_passages_dpr(example_passages, batch_size=16)

print(f"✅ Vector index built!")
print(f"📊 Vector Statistics:")
print(f"   - Number of vectors: {passage_embeddings.shape[0]}")
print(f"   - Vector dimension: {passage_embeddings.shape[1]}")

# Encode query into vector
print(f"\n🎯 Encoding question into query vector...")
query_embedding = encode_questions_dpr([test_question])
print(f"   Query vector shape: {query_embedding.shape}")

# Perform vector search
print(f"\n🔍 Performing vector similarity search...")
vector_results = vector_search_dpr(query_embedding, passage_embeddings, example_passages, example_metadata, top_k=10)

print(f"\n📊 Top-10 Vector Search Results:")
retrieved_titles_dpr = []
for i, result in enumerate(vector_results):
    title = result['title']
    is_gold = "✓ GOLD" if title in gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    print(f"      {result['passage'][:100]}...")
    
    if title not in retrieved_titles_dpr:
        retrieved_titles_dpr.append(title)

# Calculate document recall for DPR
doc_recall_dpr = len(set(retrieved_titles_dpr[:10]).intersection(gold_titles)) / len(gold_titles)
print(f"\n📈 Document Recall@10: {doc_recall_dpr:.3f} ({len(set(retrieved_titles_dpr[:10]).intersection(gold_titles))}/{len(gold_titles)} gold docs retrieved)")

print("\n✅ DPR Vector Search demonstration complete!")
print("🔍 This demonstrates the core technology behind:")
print("   - OpenAI Embeddings + Vector DBs")
print("   - LlamaIndex vector stores") 
print("   - LangChain vector retrievers")
print("   - Pinecone, Weaviate, Chroma databases")

print("\n💡 Key Difference from Production:")
print("   - Production: One large vector DB with millions of documents")
print("   - HotpotQA: Per-question retrieval from 10 candidate paragraphs")
print("   - This setup tests multihop reasoning with controlled distractors")

## 🔬 SECTION 5: Why RAG Baseline Struggles - Multihop Analysis

Now that we've implemented the RAG paper baseline (DPR-NQ), let's analyze **why** it struggles on HotpotQA multihop questions.

### 🎯 Key Question: Why Does Strong Performance on NQ Become Poor on HotpotQA?

**Domain Mismatch in Action:**

The DPR model was trained on Natural Questions, where:
- Questions are **single-hop**: "Who won the 2020 NBA championship?"
- Answer requires **ONE document**: Lakers team page
- Encoder learns: "Find the most relevant single document"

HotpotQA requires **multihop reasoning**:
- Questions span **two+ documents**: "What position did the 2020 NBA Finals MVP play?"
- Step 1: Find the 2020 NBA Finals → Lakers won
- Step 2: Find the Finals MVP → LeBron James
- Step 3: Find LeBron's position → Small Forward
- Encoder needs: "Find MULTIPLE related but different documents"

### 📊 What We'll Analyze

1. **Failure Case Examples**: Show 3-5 questions where DPR-NQ retrieves wrong documents
2. **Error Categorization**: Classify types of retrieval failures
3. **Retrieval Overlap**: Visualize retrieved docs vs gold docs
4. **Patterns**: Identify which question types fail most
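
For the "Patterns" step, one simple approach is to group the per-question outcomes by question type and look at the share of each outcome. The sketch below assumes each analysis record is a dict with `question_type` and `failure_type` keys; the function name `failure_rates_by_question_type` and the records are hypothetical toy data, not measured results.

```python
from collections import Counter, defaultdict

def failure_rates_by_question_type(analysis_results):
    """Map each question type to the share of each outcome label within it."""
    counts = defaultdict(Counter)
    for record in analysis_results:
        counts[record["question_type"]][record["failure_type"]] += 1
    return {
        qtype: {label: n / sum(outcomes.values()) for label, n in outcomes.items()}
        for qtype, outcomes in counts.items()
    }

# Hypothetical records, for illustration only
records = [
    {"question_type": "bridge", "failure_type": "SUCCESS"},
    {"question_type": "bridge", "failure_type": "PARTIAL_RETRIEVAL"},
    {"question_type": "bridge", "failure_type": "COMPLETE_MISS"},
    {"question_type": "bridge", "failure_type": "SUCCESS"},
    {"question_type": "comparison", "failure_type": "SUCCESS"},
    {"question_type": "comparison", "failure_type": "PARTIAL_RETRIEVAL"},
]

rates = failure_rates_by_question_type(records)
for qtype, outcomes in rates.items():
    print(qtype, outcomes)
```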

This analysis will motivate our move to modern embedders in Section 6!

In [None]:
# Error Analysis Functions for DPR Baseline on Multihop Tasks

def analyze_retrieval_failure(example, retrieved_titles, method_name="DPR-NQ"):
    """
    Analyze why retrieval failed for a multihop question
    
    Returns failure type and explanation
    """
    # Hugging Face format: supporting_facts is {'title': [...], 'sent_id': [...]}
    gold_titles = sorted(set(example['supporting_facts']['title']))
    retrieved_set = set(retrieved_titles[:10])  # Top-10
    gold_set = set(gold_titles)
    
    overlap = len(retrieved_set.intersection(gold_set))
    
    # Categorize failure type
    if overlap == len(gold_set):
        failure_type = "SUCCESS"
        explanation = f"Retrieved all {len(gold_set)} required documents"
    elif overlap == 1 and len(gold_set) == 2:
        failure_type = "PARTIAL_RETRIEVAL"
        missing = gold_set - retrieved_set
        explanation = f"Found 1/2 gold docs. Missing: {list(missing)[0]}"
    elif overlap == 0:
        failure_type = "COMPLETE_MISS"
        explanation = f"Retrieved 0/{len(gold_set)} gold documents. Retrieved distractors instead."
    else:
        failure_type = "PARTIAL_RETRIEVAL"
        explanation = f"Retrieved {overlap}/{len(gold_set)} gold documents"
    
    return {
        'failure_type': failure_type,
        'overlap': overlap,
        'total_required': len(gold_set),
        'explanation': explanation,
        'gold_titles': gold_titles,
        'retrieved_titles': retrieved_titles[:10],
        'question': example['question'],
        'answer': example['answer'],
        'question_type': example['type'],
        'difficulty': example['level']
    }

def categorize_errors(analysis_results):
    """Categorize all errors by type"""
    from collections import Counter
    
    failure_types = [r['failure_type'] for r in analysis_results]
    counts = Counter(failure_types)
    
    return counts

def find_failure_examples(analysis_results, failure_type, n=5):
    """Find n examples of a specific failure type"""
    examples = [r for r in analysis_results if r['failure_type'] == failure_type]
    return examples[:n]

print("✅ Error analysis functions ready!")
print("📊 Functions available:")
print("   - analyze_retrieval_failure(): Categorize individual failures")
print("   - categorize_errors(): Count failure types")
print("   - find_failure_examples(): Get examples of specific failures")

In [None]:
# Run Error Analysis on DPR Baseline
print("🔬 Running Failure Analysis on DPR Baseline")
print("="*60)

# Analyze a subset for demonstration
analysis_subset_size = 50
analysis_subset = val_sample.select(range(min(analysis_subset_size, len(val_sample))))

dpr_failure_analysis = []

print(f"📊 Analyzing {len(analysis_subset)} validation examples...")
print()

for i, example in enumerate(tqdm(analysis_subset, desc="Analyzing DPR failures")):
    question = example['question']
    
    # Extract passages and encode with DPR
    example_passages, example_metadata = extract_passages_from_example(example)
    passage_embeddings = encode_passages_dpr(example_passages, batch_size=16)
    
    # DPR vector search
    query_embedding = encode_questions_dpr([question])
    vector_results = vector_search_dpr(query_embedding, passage_embeddings, 
                                       example_passages, example_metadata, top_k=10)
    
    # Extract retrieved titles
    retrieved_titles = []
    for result in vector_results:
        title = result['title']
        if title not in retrieved_titles:
            retrieved_titles.append(title)
    
    # Analyze this retrieval
    analysis = analyze_retrieval_failure(example, retrieved_titles, method_name="DPR-NQ")
    dpr_failure_analysis.append(analysis)

# Categorize all errors
error_counts = categorize_errors(dpr_failure_analysis)

print("\n📈 DPR-NQ Retrieval Performance Analysis")
print("="*60)
print(f"\n🎯 Overall Statistics:")
print(f"   Total examples analyzed: {len(dpr_failure_analysis)}")
print(f"\n📊 Failure Type Distribution:")
for failure_type, count in error_counts.most_common():
    percentage = (count / len(dpr_failure_analysis)) * 100
    print(f"   {failure_type}: {count} ({percentage:.1f}%)")

# Calculate average document recall
avg_recall = np.mean([a['overlap'] / a['total_required'] for a in dpr_failure_analysis])
print(f"\n📏 Average Document Recall@10: {avg_recall:.3f}")

print("\n🔍 Example Failures by Type:")
print("="*60)

# Show examples of each failure type
for failure_type in ['COMPLETE_MISS', 'PARTIAL_RETRIEVAL', 'SUCCESS']:
    examples = find_failure_examples(dpr_failure_analysis, failure_type, n=2)
    
    if examples:
        print(f"\n🔸 {failure_type} Examples:")
        for j, ex in enumerate(examples[:2], 1):
            print(f"\n   Example {j}:")
            print(f"   Question: {ex['question'][:100]}...")
            print(f"   Answer: {ex['answer']}")
            print(f"   Type: {ex['question_type']} | Difficulty: {ex['difficulty']}")
            print(f"   Gold docs: {ex['gold_titles']}")
            print(f"   Retrieved: {ex['retrieved_titles'][:3]}...")
            print(f"   {ex['explanation']}")

print("\n✅ Failure analysis complete!")

## 🚀 SECTION 6: Modern Embedder Comparison (2020 → 2023)

We've seen that DPR-NQ (2020) was trained on single-hop Natural Questions. Let's compare it with a modern embedder to understand the evolution.

### 📈 From 2020 to 2023

**2020: Dense Passage Retrieval (DPR)**
- One of the first dense retrievers to outperform BM25 on open-domain QA
- Task-specific training (Natural Questions)
- Limitation: Tuned to its training domain; may need fine-tuning for new tasks

**2023: BGE (BAAI)**
- Multi-task training across diverse datasets
- Reported state-of-the-art MTEB results at release
- General-purpose; designed to work without task-specific fine-tuning

### 🎯 Simple Comparison

We'll test 2 embedders on the SAME HotpotQA validation set:

| Model | Year | Training | Design Focus |
|-------|------|----------|--------------|
| **DPR-NQ** | 2020 | Natural Questions (single-hop) | Task-specific QA |
| **BGE-large** | 2023 | Multi-task (diverse datasets) | General-purpose SOTA |

**Questions to answer through experiments:**
- How does performance differ between 2020 and 2023 models?
- Does multi-task training help with multihop reasoning?
- What's the latency trade-off?

### 🔧 Implementation Strategy

We'll create a **unified embedder interface** that works with both:
- DPR (separate question/context encoders)
- SentenceTransformers (unified encoder)

Both will use the same `vector_search_dpr()` function for fair comparison.

### 📊 Metrics We'll Measure

For each embedder:
1. **Document Recall@10**: Can it find both required documents?
2. **Latency**: Encoding time
3. **Improvement**: BGE vs DPR-NQ baseline
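
Latency is easy to mis-measure, so here is a minimal timing sketch. The helper `time_call` and the stand-in `fake_encode` are hypothetical names for this example; in the notebook you would pass a real encoding function instead. It reports the median over a few repeats to dampen one-off spikes.

```python
import statistics
import time

def time_call(fn, *args, repeats=5, **kwargs):
    """Run fn several times; return (last result, median seconds per call)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)

def fake_encode(texts):
    """Stand-in for an embedder: sleeps briefly and returns dummy vectors."""
    time.sleep(0.01)
    return [[float(len(text))] for text in texts]

embeddings, seconds = time_call(fake_encode, ["a question", "another question"], repeats=3)
print(f"median latency: {seconds * 1000:.1f} ms for {len(embeddings)} texts")
```

On a GPU, CUDA kernels run asynchronously, so call `torch.cuda.synchronize()` before reading the timer. Otherwise the measurement may stop before the encoder has actually finished.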

Let's implement and run experiments!

In [None]:
# Unified Embedder Interface for Fair Comparison

from sentence_transformers import SentenceTransformer
import time

class UnifiedEmbedder:
    """
    Unified interface for different embedding models
    
    Supports:
    - DPR (separate question/context encoders)
    - SentenceTransformers (unified models like BGE)
    """
    
    def __init__(self, model_name, model_type='sentence-transformer'):
        """
        Initialize embedder
        
        Args:
            model_name: Model identifier
            model_type: 'dpr' or 'sentence-transformer'
        """
        self.model_name = model_name
        self.model_type = model_type
        
        if model_type == 'dpr':
            # Use existing DPR encoders (already loaded)
            self.q_encoder = q_encoder
            self.c_encoder = c_encoder
            self.q_tokenizer = q_tokenizer
            self.c_tokenizer = c_tokenizer
            print(f"✅ Using existing DPR encoders for {model_name}")
            
        elif model_type == 'sentence-transformer':
            # Load SentenceTransformer model
            print(f"🔄 Loading {model_name}...")
            self.model = SentenceTransformer(model_name)
            if torch.cuda.is_available():
                self.model = self.model.to(device)
            print(f"✅ Loaded {model_name} on {device}")
        
        else:
            raise ValueError(f"Unknown model_type: {model_type}")
    
    def encode_queries(self, queries, batch_size=32, show_progress=False):
        """Encode queries into vectors"""
        if self.model_type == 'dpr':
            return encode_questions_dpr(queries, batch_size)
        else:
            # SentenceTransformer
            embeddings = self.model.encode(
                queries,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                convert_to_tensor=True,
                device=device
            )
            return embeddings
    
    def encode_passages(self, passages, batch_size=32, show_progress=True):
        """Encode passages into vectors"""
        if self.model_type == 'dpr':
            return encode_passages_dpr(passages, batch_size)
        else:
            # SentenceTransformer
            embeddings = self.model.encode(
                passages,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                convert_to_tensor=True,
                device=device
            )
            return embeddings
    
    def search(self, query_embedding, passage_embeddings, passages, metadata, top_k=10):
        """Perform vector similarity search using the detailed vector_search_dpr function"""
        # Use the existing vector_search_dpr function (works for any embeddings)
        return vector_search_dpr(query_embedding, passage_embeddings, passages, metadata, top_k)

# Initialize 2 embedders for comparison
print("🚀 Initializing Embedders for Comparison")
print("="*60)

embedders = {}

# 1. DPR-NQ (RAG 2020 baseline) - already loaded
embedders['DPR-NQ (2020)'] = UnifiedEmbedder('facebook/dpr-nq', model_type='dpr')

# 2. BGE-large (2023 SOTA)
embedders['BGE-large (2023)'] = UnifiedEmbedder('BAAI/bge-large-en-v1.5', model_type='sentence-transformer')

print("\n✅ Both embedders loaded and ready!")
print("\n📊 Loaded Models:")
for name in embedders.keys():
    print(f"   - {name}")
print("\n🔍 Both use the same vector_search_dpr() function for fair comparison")

In [None]:
# Compare 2 Embedders on Same Validation Set
print("🏁 EMBEDDER COMPARISON: 2020 vs 2023")
print("="*70)

# Use a subset for comparison
comparison_size = 30
comparison_subset = val_sample.select(range(min(comparison_size, len(val_sample))))

print(f"📊 Evaluating {len(comparison_subset)} validation examples on 2 embedders")
print()

# Store results for each embedder
embedder_results = {name: [] for name in embedders.keys()}

# Evaluate each embedder
for embedder_name, embedder in embedders.items():
    print(f"\n{'='*70}")
    print(f"🔄 Testing: {embedder_name}")
    print(f"{'='*70}")
    
    start_total = time.time()
    
    for i, example in enumerate(tqdm(comparison_subset, desc=f"Evaluating {embedder_name}")):
        question = example['question']
        gold_supporting_facts = list(example['supporting_facts'])
        gold_titles = list(set([fact[0] for fact in gold_supporting_facts]))
        
        # Extract passages from this example
        example_passages, example_metadata = extract_passages_from_example(example)
        
        # Encode passages
        start_encode = time.time()
        passage_embeddings = embedder.encode_passages(example_passages, batch_size=16, show_progress=False)
        
        # Encode query
        query_embedding = embedder.encode_queries([question], show_progress=False)
        encode_time = time.time() - start_encode
        
        # Search using vector_search_dpr function
        start_search = time.time()
        if embedder.model_type == 'dpr':
            # DPR: query_embedding already 2D from encode_questions_dpr
            search_results = embedder.search(query_embedding, passage_embeddings, 
                                            example_passages, example_metadata, top_k=10)
        else:
            # SentenceTransformer: reshape to 2D
            query_embedding_2d = query_embedding.unsqueeze(0) if query_embedding.dim() == 1 else query_embedding
            search_results = embedder.search(query_embedding_2d, passage_embeddings,
                                            example_passages, example_metadata, top_k=10)
        search_time = time.time() - start_search
        
        # Extract retrieved titles
        retrieved_titles = []
        for result in search_results:
            title = result['title']
            if title not in retrieved_titles:
                retrieved_titles.append(title)
        
        # Calculate Document Recall@10
        retrieved_set = set(retrieved_titles[:10])
        gold_set = set(gold_titles)
        doc_recall = len(retrieved_set.intersection(gold_set)) / len(gold_set)
        
        # Store results
        embedder_results[embedder_name].append({
            'doc_recall@10': doc_recall,
            'encode_time': encode_time,
            'search_time': search_time,
            'total_time': encode_time + search_time,
            'retrieved_count': len(retrieved_set.intersection(gold_set)),
            'gold_count': len(gold_set)
        })
    
    total_time = time.time() - start_total
    avg_doc_recall = np.mean([r['doc_recall@10'] for r in embedder_results[embedder_name]])
    avg_latency = np.mean([r['total_time'] for r in embedder_results[embedder_name]])
    
    print(f"\n✅ {embedder_name} Results:")
    print(f"   Document Recall@10: {avg_doc_recall:.3f}")
    print(f"   Avg Latency: {avg_latency:.3f}s per question")
    print(f"   Total Time: {total_time:.1f}s")

# ========== COMPARISON SUMMARY ==========
print(f"\n\n{'='*70}")
print(f"📊 FINAL COMPARISON SUMMARY")
print(f"{'='*70}\n")

# Create comparison table
comparison_data = []
baseline_recall = np.mean([r['doc_recall@10'] for r in embedder_results['DPR-NQ (2020)']])

for embedder_name in embedders.keys():
    results = embedder_results[embedder_name]
    avg_recall = np.mean([r['doc_recall@10'] for r in results])
    avg_latency = np.mean([r['total_time'] for r in results])
    improvement = ((avg_recall - baseline_recall) / baseline_recall * 100) if baseline_recall > 0 else 0
    
    comparison_data.append({
        'Model': embedder_name,
        'Doc Recall@10': f"{avg_recall:.3f}",
        'Improvement': f"{improvement:+.1f}%",
        'Avg Latency (s)': f"{avg_latency:.3f}"
    })

# Print as table
import pandas as pd
df_comparison = pd.DataFrame(comparison_data)
print(df_comparison.to_string(index=False))

print(f"\n\n🎯 KEY OBSERVATIONS:")
print(f"="*50)

bge_recall = np.mean([r['doc_recall@10'] for r in embedder_results['BGE-large (2023)']])
improvement_pct = ((bge_recall - baseline_recall) / baseline_recall * 100) if baseline_recall > 0 else 0

print(f"\n1. **Performance Evolution:**")
print(f"   • DPR-NQ (2020): {baseline_recall:.3f} (baseline)")
print(f"   • BGE-large (2023): {bge_recall:.3f} ({improvement_pct:+.1f}% change)")

print(f"\n2. **Possible explanations:**")
print(f"   • DPR-NQ: Trained only on single-hop Natural Questions")
print(f"   • BGE-large: Multi-task training across diverse datasets")
print(f"   • Task match matters for retrieval performance")
print(f"   • With only {len(comparison_subset)} examples, treat the gap as indicative, not conclusive")

print(f"\n3. **Latency:**")
for embedder_name in embedders.keys():
    avg_latency = np.mean([r['total_time'] for r in embedder_results[embedder_name]])
    print(f"   • {embedder_name}: {avg_latency:.3f}s")

print(f"\n✅ Embedder comparison complete!")

## 🎯 SECTION 7: Cross-Encoder Reranking - Precision Boost

We've improved retrieval by upgrading our embedder (Section 6). Now let's add **reranking** - a second-stage refinement that can improve precision.

### 🔍 Bi-Encoder vs Cross-Encoder

**Bi-Encoder (What we've used so far):**
```
Question → Encoder A → Query vector [768 for DPR, 1024 for BGE-large]
Passage → Encoder B → Passage vector [same dimension as the query vector]
Similarity → Dot product of vectors
```
- ✅ **Fast**: Pre-compute passage vectors, quick dot product
- ✅ **Scalable**: Can index millions of passages
- ❌ **No interaction**: Question and passage never "see" each other
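
The bi-encoder scoring step can be sketched with plain Python lists. This is a toy illustration under stated assumptions: the 3-dimensional vectors and the helper `top_k_by_inner_product` are hypothetical stand-ins for real embeddings and for the notebook's own search function.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(query_vec, passage_vecs, k=2):
    """Score every passage against the query and return the k best (index, score) pairs."""
    scores = [inner_product(query_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Passage vectors are computed once, ahead of time, and reused for every query.
passage_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

results = top_k_by_inner_product(query_vec, passage_vecs, k=2)
print([i for i, _ in results])  # [1, 0]
```

Because the passage vectors never depend on the query, only the query needs a model forward pass at search time. That is what makes the approach fast and indexable.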

**Cross-Encoder (For reranking):**
```
[Question + Passage] → Joint Encoder → Relevance score [0-1]
```
- ✅ **Accurate**: Question and passage processed together (full attention)
- ✅ **Better relevance**: Direct relevance modeling
- ❌ **Slow**: Must process every question-passage pair separately
- ❌ **Not scalable**: Scores can't be pre-computed; every query needs one full transformer forward pass per candidate passage

### 🎯 Two-Stage Retrieval Strategy

Combine both for best results:

**Stage 1: Bi-Encoder (Fast Retrieval)**
- Retrieve top-100 candidates from full corpus
- Uses BGE/E5 embeddings
- Fast, scalable

**Stage 2: Cross-Encoder (Precise Reranking)**
- Rerank top-100 → top-10
- Uses cross-encoder for accurate scoring
- Slow but only on 100 candidates (acceptable)
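
The two stages can be sketched as a generic function that takes a cheap scorer and an expensive scorer. The scorers below (`word_overlap`, `phrase_match`) are hypothetical stand-ins chosen so the sketch runs without downloading models; in the tutorial the roles are played by the bi-encoder and the cross-encoder.

```python
def two_stage_retrieve(query, corpus, fast_score, precise_score,
                       n_candidates=100, top_k=10):
    """Retrieve with a cheap scorer, then rerank the survivors with an expensive one."""
    # Stage 1: score every passage with the cheap function and keep the best candidates.
    stage1 = sorted(corpus, key=lambda p: fast_score(query, p), reverse=True)
    candidates = stage1[:n_candidates]
    # Stage 2: the expensive scorer only sees the candidates, never the whole corpus.
    reranked = sorted(candidates, key=lambda p: precise_score(query, p), reverse=True)
    return reranked[:top_k]


# Hypothetical stand-in for the bi-encoder: counts shared words.
def word_overlap(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))


# Hypothetical stand-in for the cross-encoder: looks at query and passage jointly.
def phrase_match(query, passage):
    return 1.0 if query.lower() in passage.lower() else 0.0


corpus = [
    "paris is a city in france",
    "the capital of france is paris",
    "france borders spain",
    "berlin is the capital of germany",
]
top = two_stage_retrieve("capital of france", corpus, word_overlap, phrase_match,
                         n_candidates=3, top_k=1)
print(top)  # ['the capital of france is paris']
```

The cost of stage 2 grows with `n_candidates`, not with corpus size, which is why the candidate count is the main latency knob in this design.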

### 📊 What We'll Measure

We'll compare bi-encoder retrieval with and without reranking:
- Document Recall@10: Does reranking improve document selection?
- Rank changes: Which passages move up/down?
- Latency impact: Cost of reranking stage
- Precision@k: Quality of top-k results
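
Precision@k is listed above but not defined elsewhere in this section, so here is one possible sketch. It assumes a passage-level definition (a retrieved passage counts as a hit when its title is a gold document); the helper name and example titles are illustrative.

```python
def precision_at_k(retrieved_titles, gold_titles, k=10):
    """Fraction of the top-k retrieved passages whose title is a gold document."""
    top_k = retrieved_titles[:k]
    if not top_k:
        return 0.0
    gold = set(gold_titles)
    hits = sum(1 for title in top_k if title in gold)
    return hits / len(top_k)


gold = ["Document A", "Document B"]
before_rerank = ["Document C", "Document A", "Document D", "Document B"]
after_rerank = ["Document A", "Document B", "Document C", "Document D"]

print(precision_at_k(before_rerank, gold, k=2))  # 0.5
print(precision_at_k(after_rerank, gold, k=2))   # 1.0
```

Both rankings contain the same four passages, so recall over the full list is identical. Precision at a small k is what exposes the benefit of moving gold passages to the top.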

### 🔧 Implementation

We'll use `BAAI/bge-reranker-large`, a strong open-source reranker:
- Trained specifically for reranking
- Optimized for question-passage relevance
- Compatible with BGE embeddings (but works with any)

Experiments will show whether the latency cost is worth the accuracy gain.

In [None]:
# Load Cross-Encoder Reranker
print("🔄 Loading Cross-Encoder Reranker...")

from sentence_transformers import CrossEncoder

# Load BGE reranker
reranker = CrossEncoder('BAAI/bge-reranker-large')

print("✅ Cross-encoder reranker loaded!")
print(f"   Model: BAAI/bge-reranker-large")
print(f"   Purpose: Rerank retrieved passages for better precision")

def rerank_passages(question, search_results, top_k=10, rerank_from_k=100):
    """
    Rerank passages using cross-encoder
    
    Args:
        question: Query string
        search_results: List of search results from bi-encoder
        top_k: Number of final results to return
        rerank_from_k: Number of candidates to rerank
    
    Returns:
        Reranked search results (top_k)
    """
    # Take top rerank_from_k candidates from bi-encoder.
    # Copy each dict so the caller's search_results are not mutated below.
    candidates = [dict(result) for result in search_results[:rerank_from_k]]
    
    # Prepare question-passage pairs for cross-encoder
    pairs = [[question, result['passage']] for result in candidates]
    
    # Get cross-encoder scores
    ce_scores = reranker.predict(pairs)
    
    # Combine original results with new scores
    for i, result in enumerate(candidates):
        result['rerank_score'] = float(ce_scores[i])
        result['original_score'] = result['score']  # Keep bi-encoder score
        result['original_rank'] = i + 1
    
    # Sort by rerank scores
    reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)
    
    # Return top-k
    return reranked[:top_k]

print("\n✅ Reranking function ready!")
print("📊 Usage: rerank_passages(question, search_results, top_k=10)")

In [None]:
# Demonstrate Reranking Impact
print("🎯 RERANKING DEMONSTRATION")
print("="*70)

# Select a test example
demo_example = val_sample[0]
demo_question = demo_example['question']
demo_gold_titles = list(set([fact[0] for fact in demo_example['supporting_facts']]))

print(f"Question: {demo_question}")
print(f"Gold documents: {demo_gold_titles}")
print()

# Extract passages
demo_passages, demo_metadata = extract_passages_from_example(demo_example)

# Use best embedder (BGE-large)
best_embedder = embedders['BGE-large (2023)']

# Stage 1: Bi-encoder retrieval (top-20)
print("📍 STAGE 1: Bi-Encoder Retrieval (BGE-large)")
print("-"*70)

passage_embeddings = best_embedder.encode_passages(demo_passages, batch_size=16, show_progress=False)
query_embedding = best_embedder.encode_queries([demo_question], show_progress=False)

# Reshape if needed
if query_embedding.dim() == 1:
    query_embedding = query_embedding.unsqueeze(0)

initial_results = best_embedder.search(query_embedding, passage_embeddings, 
                                       demo_passages, demo_metadata, top_k=20)

print(f"\nTop-10 results from bi-encoder:")
bi_encoder_titles = []
for i, result in enumerate(initial_results[:10]):
    title = result['title']
    is_gold = "✅ GOLD" if title in demo_gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    if title not in bi_encoder_titles:
        bi_encoder_titles.append(title)

bi_encoder_recall = len(set(bi_encoder_titles[:10]).intersection(demo_gold_titles)) / len(demo_gold_titles)
print(f"\n📊 Bi-Encoder Document Recall@10: {bi_encoder_recall:.3f}")

# Stage 2: Cross-encoder reranking
print(f"\n📍 STAGE 2: Cross-Encoder Reranking")
print("-"*70)

reranked_results = rerank_passages(demo_question, initial_results, top_k=10, rerank_from_k=20)

print(f"\nTop-10 results after reranking:")
reranked_titles = []
rank_changes = []

for i, result in enumerate(reranked_results[:10]):
    title = result['title']
    is_gold = "✅ GOLD" if title in demo_gold_titles else ""
    rank_change = result['original_rank'] - (i + 1)  # Positive = moved up
    
    if rank_change > 0:
        change_indicator = f"↑{rank_change}"
    elif rank_change < 0:
        change_indicator = f"↓{abs(rank_change)}"
    else:
        change_indicator = "="
    
    print(f"   {i+1}. Rerank: {result['rerank_score']:.3f} | Original: {result['original_score']:.3f} | {title} {is_gold} ({change_indicator})")
    
    if title not in reranked_titles:
        reranked_titles.append(title)
    
    rank_changes.append(rank_change)

reranked_recall = len(set(reranked_titles[:10]).intersection(demo_gold_titles)) / len(demo_gold_titles)
print(f"\n📊 After Reranking Document Recall@10: {reranked_recall:.3f}")

# Show improvement
improvement = reranked_recall - bi_encoder_recall
print(f"\n{'='*70}")
print(f"📈 RERANKING IMPACT:")
print(f"   Before reranking: {bi_encoder_recall:.3f}")
print(f"   After reranking:  {reranked_recall:.3f}")
print(f"   Improvement:      {improvement:+.3f} ({(improvement/bi_encoder_recall*100):+.1f}%)" if bi_encoder_recall > 0 else "   Improvement:      N/A")

print(f"\n🔍 Observations:")
print(f"   • Cross-encoder reordered {sum(1 for c in rank_changes if c != 0)} passages")
print(f"   • Gold documents moved up: {sum(1 for i, r in enumerate(reranked_results[:10]) if r['title'] in demo_gold_titles and rank_changes[i] > 0)}")
print(f"   • Cross-encoder scores use joint attention over question and passage; confirm any gain on more than one example")

print(f"\n✅ Reranking demonstration complete!")

## ⚡ SECTION 8: Hybrid Search - Best of Both Worlds

We've seen two retrieval paradigms:
- **BM25 (Sparse)**: Keyword matching, exact term overlap
- **Dense (Embeddings)**: Semantic similarity, meaning-based

What if we **combine both**? This is hybrid search - leveraging complementary strengths!

### 🎯 Why Hybrid Search?

**BM25 Strengths:**
- ✅ Exact keyword matching (great for entities, names, dates)
- ✅ Fast, interpretable
- ✅ Works well on entity-heavy questions
- ❌ Misses semantic similarity

**Dense Retrieval Strengths:**
- ✅ Semantic understanding (handles paraphrases, synonyms)
- ✅ Captures meaning beyond exact words
- ✅ Better for conceptual questions
- ❌ Can miss exact entity matches

**Hybrid = Combine Both!**
- Use BM25 for entity recall
- Use dense for semantic relevance
- Weighted fusion of scores
- Potentially best of both worlds!

### 📊 Hybrid Search Formula

For each passage, compute a combined score:

```
final_score = α × normalized_bm25_score + (1-α) × normalized_dense_score
```

Where:
- α = weight for BM25 (typically 0.2-0.4)
- (1-α) = weight for dense (typically 0.6-0.8)
- Normalization ensures scores are comparable
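
A small worked example of the formula, using made-up scores for three passages. The helper names `min_max` and `fuse` are assumptions for illustration; the constant-score case returns all ones, mirroring the choice made in the implementation later in this section.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all ones."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(bm25_scores, dense_scores, alpha=0.3):
    """final = alpha * normalized BM25 + (1 - alpha) * normalized dense."""
    bm25_norm = min_max(bm25_scores)
    dense_norm = min_max(dense_scores)
    return [alpha * b + (1 - alpha) * d for b, d in zip(bm25_norm, dense_norm)]


# Hypothetical raw scores: BM25 is unbounded, dense similarity is roughly in [0, 1].
bm25_scores = [12.0, 3.0, 0.0]
dense_scores = [0.50, 0.75, 0.25]

fused = fuse(bm25_scores, dense_scores, alpha=0.3)
print([round(s, 3) for s in fused])  # [0.65, 0.775, 0.0]
```

Passage 0 wins on BM25 alone, but passage 1 wins after fusion because the dense score carries 70% of the weight. Without normalization the raw BM25 value of 12.0 would dominate regardless of α.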

### 🔧 Implementation Strategy

1. **Retrieve with both methods** separately
   - BM25: Get top-100 with BM25 scores
   - Dense: Get top-100 with cosine similarity scores

2. **Normalize scores** to [0, 1] range
   - Min-max normalization
   - Makes scores comparable

3. **Fuse scores** with weighted combination
   - Experiment with different α weights
   - Common: α=0.3 (30% BM25, 70% dense)

4. **Rerank** with cross-encoder (optional but recommended)
   - Apply cross-encoder to top-100 hybrid results
   - Final top-10 selection

### 📊 What We'll Measure

We'll compare three approaches on the same questions:
1. **BM25 only**: Pure sparse retrieval
2. **Dense only**: Pure semantic retrieval  
3. **Hybrid**: Weighted combination

Metrics:
- Document Recall@10 for each approach
- Which question types benefit from hybrid?
- Score distribution analysis
- Best weight combination (α tuning)
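
α tuning can be sketched as a sweep over candidate weights on scores that are already normalized. The scores, gold indices, and helper names below are hypothetical; in practice the sweep would average recall over many questions rather than one.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold passage ids found in the top-k ranked ids."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)


def sweep_alpha(bm25_norm, dense_norm, gold_ids, alphas, k=2):
    """Return {alpha: recall@k} for already-normalized per-passage scores."""
    results = {}
    for alpha in alphas:
        fused = [alpha * b + (1 - alpha) * d for b, d in zip(bm25_norm, dense_norm)]
        ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
        results[alpha] = recall_at_k(ranked, gold_ids, k)
    return results


# Hypothetical normalized scores for four passages; passages 0 and 3 are gold.
bm25_norm = [1.0, 0.6, 0.0, 0.3]
dense_norm = [0.2, 0.5, 0.0, 1.0]
gold_ids = [0, 3]

results = sweep_alpha(bm25_norm, dense_norm, gold_ids, alphas=[0.0, 0.5, 1.0], k=2)
print(results)  # {0.0: 0.5, 0.5: 1.0, 1.0: 0.5}
```

In this toy case each method alone finds one gold passage and the mixture finds both. To avoid an optimistic estimate, choose α on a held-out split and report results on a different one.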

The experiments will show whether combining methods provides robustness across different question types.

Let's implement and measure!

In [None]:
# Hybrid Search Implementation

def normalize_scores(scores):
    """Normalize scores to [0, 1] range using min-max normalization"""
    if len(scores) == 0:
        return np.array(scores, dtype=float)  # Keep the return type consistent
    
    scores_array = np.array(scores)
    min_score = scores_array.min()
    max_score = scores_array.max()
    
    if max_score == min_score:
        return np.ones_like(scores_array)
    
    normalized = (scores_array - min_score) / (max_score - min_score)
    return normalized

def hybrid_search(question, passages, metadata, embedder, 
                 bm25_weight=0.3, dense_weight=0.7, top_k=10):
    """
    Hybrid search combining BM25 (sparse) and dense retrieval
    
    Args:
        question: Query string
        passages: List of passage texts
        metadata: List of metadata dicts for passages
        embedder: UnifiedEmbedder instance
        bm25_weight: Weight for BM25 scores (default: 0.3)
        dense_weight: Weight for dense scores (default: 0.7)
        top_k: Number of results to return
    
    Returns:
        List of search results with hybrid scores
    """
    # Validate weights
    assert abs(bm25_weight + dense_weight - 1.0) < 1e-6, "Weights must sum to 1.0"
    
    # 1. BM25 Retrieval
    tokenized_passages = [preprocess_text_for_bm25(p) for p in passages]
    bm25 = BM25Okapi(tokenized_passages)
    query_tokens = preprocess_text_for_bm25(question)
    bm25_scores = bm25.get_scores(query_tokens)
    
    # 2. Dense Retrieval
    passage_embeddings = embedder.encode_passages(passages, batch_size=16, show_progress=False)
    query_embedding = embedder.encode_queries([question], show_progress=False)
    
    # Reshape if needed
    if query_embedding.dim() == 1:
        query_embedding = query_embedding.unsqueeze(0)
    
    dense_results = embedder.search(query_embedding, passage_embeddings, 
                                   passages, metadata, top_k=len(passages))
    
    # Extract dense scores
    dense_scores = np.array([r['score'] for r in dense_results])
    
    # 3. Normalize both sets of scores
    bm25_scores_norm = normalize_scores(bm25_scores)
    dense_scores_norm = normalize_scores(dense_scores)
    
    # 4. Map each passage index to its raw and normalized dense score
    #    (dense_results are ordered by score, not by passage index)
    dense_raw_map = {}
    dense_norm_map = {}
    for i, result in enumerate(dense_results):
        idx = result['idx']
        dense_raw_map[idx] = dense_scores[i]
        dense_norm_map[idx] = dense_scores_norm[i]
    
    # 5. Combine scores
    hybrid_results = []
    for idx in range(len(passages)):
        bm25_score_norm = bm25_scores_norm[idx]
        dense_score_norm = dense_norm_map.get(idx, 0.0)
        
        # Weighted combination
        hybrid_score = bm25_weight * bm25_score_norm + dense_weight * dense_score_norm
        
        hybrid_results.append({
            'idx': idx,
            'passage': passages[idx],
            'metadata': metadata[idx],
            'title': metadata[idx]['title'],
            'hybrid_score': float(hybrid_score),
            'bm25_score': float(bm25_scores[idx]),
            'dense_score': float(dense_raw_map.get(idx, 0.0)),  # Raw score, like bm25_score
            'bm25_score_norm': float(bm25_score_norm),
            'dense_score_norm': float(dense_score_norm)
        })
    
    # 6. Sort by hybrid score
    hybrid_results_sorted = sorted(hybrid_results, key=lambda x: x['hybrid_score'], reverse=True)
    
    # 7. Return top-k
    return hybrid_results_sorted[:top_k]

print("✅ Hybrid search implementation ready!")
print("📊 Combines BM25 (sparse) + Dense (semantic)")
print("   Default weights: BM25=0.3, Dense=0.7")
print(f"   Usage: hybrid_search(question, passages, metadata, embedder)")

In [None]:
# Demonstrate Hybrid Search vs BM25 vs Dense
print("🔬 HYBRID SEARCH COMPARISON")
print("="*70)
print("Comparing: BM25-only vs Dense-only vs Hybrid (BM25 + Dense)")
print()

# Use a different validation example than the reranking demo, for variety
hybrid_demo_example = val_sample[1]
hybrid_question = hybrid_demo_example['question']
hybrid_gold_titles = list(set([fact[0] for fact in hybrid_demo_example['supporting_facts']]))

print(f"Question: {hybrid_question}")
print(f"Gold documents: {hybrid_gold_titles}")
print()

# Extract passages
hybrid_passages, hybrid_metadata = extract_passages_from_example(hybrid_demo_example)

# Use BGE embedder
bge_embedder = embedders['BGE-large (2023)']

# Method 1: BM25 Only
print("📍 METHOD 1: BM25 Only (Sparse)")
print("-"*70)

tokenized = [preprocess_text_for_bm25(p) for p in hybrid_passages]
bm25_model = BM25Okapi(tokenized)
query_tokens = preprocess_text_for_bm25(hybrid_question)
bm25_only_scores = bm25_model.get_scores(query_tokens)
bm25_top_indices = np.argsort(bm25_only_scores)[::-1][:10]

bm25_only_titles = []
for i, idx in enumerate(bm25_top_indices):
    title = hybrid_metadata[idx]['title']
    is_gold = "✅" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Score: {bm25_only_scores[idx]:.3f} | {title} {is_gold}")
    if title not in bm25_only_titles:
        bm25_only_titles.append(title)

bm25_only_recall = len(set(bm25_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n📊 BM25-only Doc Recall@10: {bm25_only_recall:.3f}")

# Method 2: Dense Only
print(f"\n📍 METHOD 2: Dense Only (BGE Embeddings)")
print("-"*70)

passage_embs = bge_embedder.encode_passages(hybrid_passages, batch_size=16, show_progress=False)
query_emb = bge_embedder.encode_queries([hybrid_question], show_progress=False)

if query_emb.dim() == 1:
    query_emb = query_emb.unsqueeze(0)

dense_only_results = bge_embedder.search(query_emb, passage_embs, 
                                         hybrid_passages, hybrid_metadata, top_k=10)

dense_only_titles = []
for i, result in enumerate(dense_only_results):
    title = result['title']
    is_gold = "✅" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    if title not in dense_only_titles:
        dense_only_titles.append(title)

dense_only_recall = len(set(dense_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n📊 Dense-only Doc Recall@10: {dense_only_recall:.3f}")

# Method 3: Hybrid
print(f"\n📍 METHOD 3: Hybrid (30% BM25 + 70% Dense)")
print("-"*70)

hybrid_only_results = hybrid_search(hybrid_question, hybrid_passages, hybrid_metadata, 
                                   bge_embedder, bm25_weight=0.3, dense_weight=0.7, top_k=10)

hybrid_only_titles = []
for i, result in enumerate(hybrid_only_results):
    title = result['title']
    is_gold = "✅" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Hybrid: {result['hybrid_score']:.3f} | BM25: {result['bm25_score_norm']:.3f} | Dense: {result['dense_score_norm']:.3f} | {title} {is_gold}")
    if title not in hybrid_only_titles:
        hybrid_only_titles.append(title)

hybrid_only_recall = len(set(hybrid_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n📊 Hybrid Doc Recall@10: {hybrid_only_recall:.3f}")

# Summary Comparison
print(f"\n{'='*70}")
print(f"📈 COMPARISON SUMMARY")
print(f"{'='*70}")
print(f"\n   BM25-only:   {bm25_only_recall:.3f}")
print(f"   Dense-only:  {dense_only_recall:.3f}")
print(f"   Hybrid:      {hybrid_only_recall:.3f}")

best_method = max([
    ("BM25-only", bm25_only_recall),
    ("Dense-only", dense_only_recall),
    ("Hybrid", hybrid_only_recall)
], key=lambda x: x[1])

print(f"\n🏆 Best performer: {best_method[0]} ({best_method[1]:.3f})")

print(f"\n💡 Key Insights:")
if hybrid_only_recall >= max(bm25_only_recall, dense_only_recall):
    print(f"   • Hybrid search leverages strengths of both methods")
    print(f"   • BM25 helps with entity/keyword matching")
    print(f"   • Dense helps with semantic understanding")
print(f"   • The best approach often depends on question type")
print(f"   • This is a single example; evaluate on the full validation subset before concluding that hybrid is more robust")

print(f"\n✅ Hybrid search demonstration complete!")

## 🚀 LlamaIndex Comparison: Framework vs Custom Implementation

Let's compare our custom vector search implementation with LlamaIndex to show how modern frameworks abstract these concepts.

In [None]:
# LlamaIndex Implementation Example (for comparison)
print("🚀 LlamaIndex Vector Search Comparison")
print("=" * 50)

print("📚 Our Custom Implementation (learning-focused):")
print("   1. Manual passage extraction from HotpotQA")
print("   2. Explicit DPR model loading and tokenization")
print("   3. Custom vector encoding functions")
print("   4. Manual cosine similarity computation")
print("   5. Custom retrieval and ranking logic")
print("   6. Full control over similarity metrics")

print("\n🏗️ How LlamaIndex Would Abstract This:")
print("   1. SimpleDirectoryReader() - automatic document loading")
print("   2. VectorStoreIndex.from_documents() - auto-embedding")
print("   3. Built-in similarity search")
print("   4. Automatic retrieval and generation")
print("   5. Multiple vector database backends")
print("   6. Pre-configured pipelines")

print("\n💡 LlamaIndex Equivalent Code (pseudo-code):")
print("""
```python
# LlamaIndex - High-level abstraction
# (import paths follow llama-index >= 0.10; older releases import from `llama_index` directly)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load documents (abstracts our manual extraction)
documents = SimpleDirectoryReader('data/').load_data()

# Create vector index (abstracts our manual encoding).
# A single embed_model encodes both documents and queries, so use a
# shared-encoder model such as BGE rather than DPR's context-only encoder.
embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-large-en-v1.5')
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Retrieve only (abstracts our manual vector search); as_query_engine() would also need an LLM
retriever = index.as_retriever(similarity_top_k=10)
results = retriever.retrieve('What is the question?')
```
""")

print("🎯 Key Learning Points:")
print("   ✅ Our implementation: Understand the underlying math and operations")
print("   ✅ LlamaIndex: Production-ready with optimizations and abstractions")
print("   ✅ Both use the SAME vector search concepts:")
print("      - Document → Vector encoding")
print("      - Query → Vector encoding") 
print("      - Similarity search (cosine/inner product)")
print("      - Top-k retrieval")

print("\n🔍 Why Learn Custom Implementation First:")
print("   1. Understand vector search mathematics")
print("   2. Debug and optimize retrieval performance")
print("   3. Implement custom similarity metrics")
print("   4. Adapt to specific domain requirements")
print("   5. Build domain-specific evaluation metrics")

print("\n⚡ When to Use LlamaIndex:")
print("   1. Rapid prototyping and production deployment")
print("   2. Standard RAG pipelines without customization")
print("   3. Multiple vector database backend support")
print("   4. Built-in optimization and caching")
print("   5. Integration with LLM frameworks")

print("\n🏆 Best Practice: Learn fundamentals first, then use frameworks!")
print("📊 Our approach: Custom → Framework comparison → Production choice")

## 📊 Evaluation Framework

Let's implement our evaluation framework using the 6 chosen metrics for HotpotQA multihop reasoning.

In [None]:
# Evaluation framework from 03_Evaluation_Code.ipynb, defined inline here
class HotpotQAEvaluator:
    """Comprehensive evaluator for HotpotQA multihop reasoning"""
    
    def __init__(self):
        pass
    
    def normalize_answer(self, text):
        """Normalize answer text for comparison"""
        import re
        import string
        
        # Convert to lowercase
        text = text.lower()
        
        # Remove articles
        text = re.sub(r'\b(a|an|the)\b', ' ', text)
        
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text
    
    def answer_f1_score(self, prediction, ground_truth):
        """Calculate F1 score between prediction and ground truth"""
        from collections import Counter
        
        pred_tokens = self.normalize_answer(prediction).split()
        gold_tokens = self.normalize_answer(ground_truth).split()
        
        if len(pred_tokens) == 0 and len(gold_tokens) == 0:
            return 1.0
        if len(pred_tokens) == 0 or len(gold_tokens) == 0:
            return 0.0
        
        common_tokens = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common_tokens.values())
        
        if num_same == 0:
            return 0.0
        
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        
        return 2 * precision * recall / (precision + recall)
    
    def answer_exact_match(self, prediction, ground_truth):
        """Calculate exact match score"""
        return float(self.normalize_answer(prediction) == self.normalize_answer(ground_truth))
    
    def document_recall_at_k(self, retrieved_titles, gold_titles, k=10):
        """Calculate document recall@k"""
        if len(gold_titles) == 0:
            return 1.0
        
        retrieved_k = set(retrieved_titles[:k])
        gold_set = set(gold_titles)
        
        return len(retrieved_k.intersection(gold_set)) / len(gold_set)
    
    def supporting_fact_f1(self, predicted_facts, gold_facts):
        """Calculate supporting facts F1 score"""
        if len(gold_facts) == 0:
            return 1.0 if len(predicted_facts) == 0 else 0.0
        
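        # Convert to tuples so facts given as [title, sent_id] lists are hashable
        pred_set = {tuple(fact) for fact in predicted_facts}
        gold_set = {tuple(fact) for fact in gold_facts}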
        
        if len(pred_set) == 0:
            return 0.0
        
        intersection = pred_set.intersection(gold_set)
        precision = len(intersection) / len(pred_set)
        recall = len(intersection) / len(gold_set)
        
        if precision + recall == 0:
            return 0.0
        
        return 2 * precision * recall / (precision + recall)
    
    def joint_exact_match(self, pred_answer, gold_answer, pred_facts, gold_facts):
        """Calculate joint exact match (answer + supporting facts)"""
        answer_em = self.answer_exact_match(pred_answer, gold_answer)
        # Compare as sets of tuples so list-style facts are hashable
        facts_em = 1.0 if {tuple(f) for f in pred_facts} == {tuple(f) for f in gold_facts} else 0.0
        
        return float(answer_em == 1.0 and facts_em == 1.0)
    
    def evaluate_single(self, prediction_dict, gold_data, k=10, processing_time=None):
        """Evaluate a single prediction against gold data"""
        
        # Extract predictions
        pred_answer = prediction_dict.get('answer', '')
        pred_titles = prediction_dict.get('retrieved_titles', [])
        pred_facts = prediction_dict.get('supporting_facts', [])
        
        # Extract gold data
        gold_answer = gold_data.get('answer', '')
        gold_facts = []
        gold_titles = []
        
        # Extract gold supporting facts and titles
        if 'supporting_facts' in gold_data:
            facts_list = list(gold_data['supporting_facts'])
            for title, sent_id in facts_list:
                gold_facts.append((title, sent_id))
                if title not in gold_titles:
                    gold_titles.append(title)
        
        # Calculate metrics
        metrics = {
            'answer_f1': self.answer_f1_score(pred_answer, gold_answer),
            'answer_em': self.answer_exact_match(pred_answer, gold_answer),
            'document_recall@k': self.document_recall_at_k(pred_titles, gold_titles, k),
            'supporting_fact_f1': self.supporting_fact_f1(pred_facts, gold_facts),
            'joint_em': self.joint_exact_match(pred_answer, gold_answer, pred_facts, gold_facts),
            'latency': processing_time if processing_time is not None else 0.0
        }
        
        return metrics

# Initialize evaluator
evaluator = HotpotQAEvaluator()
print("✅ HotpotQA Evaluation Framework Ready!")
print("📊 Available metrics:")
print("   1. Answer F1 Score")
print("   2. Answer Exact Match")  
print("   3. Document Recall@k")
print("   4. Supporting-Fact F1")
print("   5. Joint Exact Match")
print("   6. Latency (processing time)")

# Test with sample data
sample_prediction = {
    'answer': 'test answer',
    'retrieved_titles': ['Title 1', 'Title 2', 'Title 3'],
    'supporting_facts': [('Title 1', 0), ('Title 2', 1)]
}

sample_gold = {
    'answer': 'test answer',
    'supporting_facts': [('Title 1', 0), ('Title 2', 1)]
}

test_metrics = evaluator.evaluate_single(sample_prediction, sample_gold, k=10, processing_time=0.5)
print(f"\n🧪 Test evaluation result: {test_metrics}")

## 🏃‍♂️ Run Complete Pipeline and Compare Methods

Now let's run both BM25 (sparse) and DPR (vector search) on a subset of validation data and compare their performance.
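
Before reading the numbers, it helps to see how a fixed top-5 cut-off for predicted supporting facts interacts with the metrics. The sketch below re-implements the set-based F1 from `supporting_fact_f1` as a standalone function (`set_f1` is a hypothetical name, and the titles are made-up toy data). Assuming a question with 2 gold facts and a retriever that always returns 5 facts, precision is capped at 2/5 even when both gold facts are found, and the set-equality check behind Joint EM cannot succeed.

```python
def set_f1(predicted, gold):
    """Set-based F1, mirroring the logic of supporting_fact_f1 above."""
    pred_set, gold_set = set(predicted), set(gold)
    if not gold_set:
        return 1.0 if not pred_set else 0.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Toy data (hypothetical titles): 2 gold facts, 5 predicted facts containing both
gold_facts = [("Title A", 0), ("Title B", 2)]
pred_facts = gold_facts + [("Title A", 1), ("Title C", 0), ("Title D", 3)]

f1 = set_f1(pred_facts, gold_facts)               # precision 2/5, recall 2/2
facts_match = set(pred_facts) == set(gold_facts)  # the condition joint EM requires

print(f"Supporting-fact F1: {f1:.3f}")
print(f"Exact fact-set match: {facts_match}")
```

Under these assumptions, a modest Supporting-Fact F1 or a Joint EM of 0 partly reflects the cut-off itself, not only retrieval quality.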

In [None]:
# Comprehensive evaluation on validation set with PER-QUESTION retrieval
import time  # needed for the latency measurements below
print("🔬 COMPREHENSIVE EVALUATION: BM25 vs DPR Vector Search")
print("="* 60)
print("✅ Using CORRECT per-question retrieval approach for HotpotQA!")
print()

# Use a subset of validation data for evaluation
eval_size = 20  # Adjust based on compute resources
val_subset = val_sample.select(range(min(eval_size, len(val_sample))))

print(f"📊 Evaluating on {len(val_subset)} validation examples")

# Initialize results storage
results = {
    'BM25': [],
    'DPR_Vector_Search': []
}

# Evaluate each method
for i, example in enumerate(val_subset):
    question = example['question']
    gold_answer = example['answer']
    
    # Extract gold supporting docs and facts
    gold_supporting_facts = list(example['supporting_facts'])
    gold_titles = list(set([fact[0] for fact in gold_supporting_facts]))
    
    gold_data = {
        'answer': gold_answer,
        'supporting_facts': gold_supporting_facts
    }
    
    print(f"\n📝 Example {i+1}/{len(val_subset)}")
    print(f"   Question: {question[:80]}...")
    print(f"   Gold answer: {gold_answer}")
    print(f"   Gold documents: {gold_titles}")
    
    # Extract passages from THIS example's 10 context paragraphs
    example_passages, example_metadata = extract_passages_from_example(example)
    
    # ========== BM25 Evaluation ==========
    print(f"\n   🔍 BM25 Sparse Retrieval:")
    start_time = time.time()
    
    # Build BM25 index for this example
    tokenized_passages = [preprocess_text_for_bm25(p) for p in example_passages]
    bm25 = BM25Okapi(tokenized_passages)
    
    # BM25 search
    query_tokens = preprocess_text_for_bm25(question)
    bm25_scores = bm25.get_scores(query_tokens)
    top_k_indices = np.argsort(bm25_scores)[::-1][:10]
    
    # Extract retrieved titles and facts
    retrieved_titles = []
    retrieved_facts = []
    retrieved_passages_text = []
    
    for idx in top_k_indices:
        title = example_metadata[idx]['title']
        if title not in retrieved_titles:
            retrieved_titles.append(title)
        # Add sentence-level facts
        retrieved_facts.append((title, example_metadata[idx]['sentence_idx']))
        retrieved_passages_text.append(example_passages[idx])
    
    # Generate answer using simple extraction
    bm25_answer = simple_answer_extraction(question, retrieved_passages_text, top_k=3)
    
    bm25_time = time.time() - start_time
    
    bm25_prediction = {
        'answer': bm25_answer,
        'retrieved_titles': retrieved_titles,
        'supporting_facts': retrieved_facts[:5]  # Top 5 facts
    }
    
    bm25_metrics = evaluator.evaluate_single(bm25_prediction, gold_data, k=10, processing_time=bm25_time)
    results['BM25'].append(bm25_metrics)
    
    print(f"      Answer: {bm25_answer}")
    print(f"      Retrieved docs: {retrieved_titles[:3]}...")
    print(f"      Document Recall@10: {bm25_metrics['document_recall@k']:.3f}")
    print(f"      Latency: {bm25_time:.3f}s")
    
    # ========== DPR Vector Search Evaluation ==========
    print(f"\n   🧠 DPR Vector Search:")
    start_time = time.time()
    
    # Build DPR embeddings for this example
    passage_embeddings = encode_passages_dpr(example_passages, batch_size=16)
    
    # Vector search
    query_embedding = encode_questions_dpr([question])
    vector_results = vector_search_dpr(query_embedding, passage_embeddings, example_passages, example_metadata, top_k=10)
    
    # Extract retrieved titles and facts
    dpr_titles = []
    dpr_facts = []
    dpr_passages_text = []
    
    for result in vector_results:
        title = result['title']
        if title not in dpr_titles:
            dpr_titles.append(title)
        dpr_facts.append((title, result['metadata']['sentence_idx']))
        dpr_passages_text.append(result['passage'])
    
    # Generate answer using simple extraction
    dpr_answer = simple_answer_extraction(question, dpr_passages_text, top_k=3)
    
    dpr_time = time.time() - start_time
    
    dpr_prediction = {
        'answer': dpr_answer,
        'retrieved_titles': dpr_titles,
        'supporting_facts': dpr_facts[:5]  # Top 5 facts
    }
    
    dpr_metrics = evaluator.evaluate_single(dpr_prediction, gold_data, k=10, processing_time=dpr_time)
    results['DPR_Vector_Search'].append(dpr_metrics)
    
    print(f"      Answer: {dpr_answer}")
    print(f"      Retrieved docs: {dpr_titles[:3]}...")
    print(f"      Document Recall@10: {dpr_metrics['document_recall@k']:.3f}")
    print(f"      Latency: {dpr_time:.3f}s")

print(f"\n📊 FINAL RESULTS COMPARISON")
print("="*40)

# Calculate average metrics
for method, method_results in results.items():
    print(f"\n🔸 {method}:")
    
    if method_results:
        avg_metrics = {}
        for metric in method_results[0].keys():
            avg_metrics[metric] = np.mean([r[metric] for r in method_results])
        
        print(f"   Document Recall@10: {avg_metrics['document_recall@k']:.3f}")
        print(f"   Supporting-Fact F1: {avg_metrics['supporting_fact_f1']:.3f}")
        print(f"   Answer F1: {avg_metrics['answer_f1']:.3f}")
        print(f"   Answer EM: {avg_metrics['answer_em']:.3f}")
        print(f"   Joint EM: {avg_metrics['joint_em']:.3f}")
        print(f"   Avg Latency: {avg_metrics['latency']:.3f}s")

print(f"\n🎯 KEY INSIGHTS:")
print("="*30)
print("🔍 BM25 (Sparse Retrieval):")
print("   ✅ Fast and lightweight")
print("   ✅ Exact keyword matching")
print("   ✅ Works well with entity-heavy questions")
print("   ❌ Limited semantic understanding")

print("\n🧠 DPR (Vector Search):")
print("   ✅ Semantic similarity")
print("   ✅ Handles paraphrases")
print("   ✅ Dense representations capture meaning")
print("   ❌ Slower encoding time")
print("   ❌ Less interpretable")

print(f"\n✅ Per-Question Retrieval Benefits:")
print("   • Realistic evaluation matching HotpotQA setup")
print("   • Tests ability to filter signal from noise (distractors)")
print("   • Meaningful Document Recall@10 scores")
print("   • Demonstrates multihop reasoning challenges")

print(f"\n✅ Evaluation complete on {len(val_subset)} examples!")

## 📝 Summary and Key Takeaways

This notebook showed that traditional retrieval methods are themselves forms of vector search, and that the retrieval setup must match the evaluation benchmark for the metrics to mean anything.

### 🎯 Critical Learning: Per-Question vs Global Retrieval

**The Bug We Fixed:**
```python
# ❌ WRONG: build one index from training data, then query it with validation questions
train_passages, train_metadata = extract_passages_from_hotpotqa(train_sample)
bm25_global = BM25Okapi([preprocess_text_for_bm25(p) for p in train_passages])
# Document Recall is ~0% because the validation gold docs are not in this index

# ✅ CORRECT: per-question retrieval from each example's own contexts
for example in val_subset:
    example_passages, metadata = extract_passages_from_example(example)
    bm25_local = BM25Okapi([preprocess_text_for_bm25(p) for p in example_passages])
    scores = bm25_local.get_scores(preprocess_text_for_bm25(example['question']))
    # Gold docs are among the candidates, so Document Recall is meaningful
```

**Why This Matters:**
- In the distractor setting used here, HotpotQA provides 10 context paragraphs per question (2 gold + 8 distractors)
- The challenge is filtering the correct documents from this candidate set
- Global retrieval would test "can you find documents in a corpus?", which is HotpotQA's separate fullwiki setting, not the task evaluated here
- Per-question retrieval tests "can you identify relevant docs from noisy candidates?" (correct task)
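
A quick way to catch this class of bug before running a full evaluation is to check whether the gold titles are present in the candidate pool at all. The following is a minimal sketch with made-up titles; `gold_coverage` is a hypothetical helper, not part of the notebook's code.

```python
def gold_coverage(candidate_titles, gold_titles):
    """Fraction of gold titles that appear anywhere in the candidate pool."""
    gold = set(gold_titles)
    if not gold:
        return 1.0
    return len(gold & set(candidate_titles)) / len(gold)

# Toy example (hypothetical titles)
gold_titles = ["Gold Doc 1", "Gold Doc 2"]
global_pool = ["Train Doc 1", "Train Doc 2", "Train Doc 3"]  # index built from other examples
per_question_pool = ["Gold Doc 1", "Distractor 1", "Gold Doc 2", "Distractor 2"]

print(gold_coverage(global_pool, gold_titles))        # no gold docs in the pool
print(gold_coverage(per_question_pool, gold_titles))  # all gold docs in the pool
```

If coverage is 0 for most questions, no retriever can score above 0 on Document Recall, so the index is the problem rather than the ranking method.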

In [None]:
print("🎓 SUMMARY: Vector Search Learning Journey + Critical Implementation Lessons")
print("="*70)

print("✅ What We Learned:")
print("\n1. **BM25 = Sparse Vector Search**")
print("   - Represents documents as sparse term-frequency vectors")
print("   - Uses statistical weighting (IDF) for relevance")
print("   - Fast, interpretable, keyword-focused")

print("\n2. **DPR = Dense Vector Search**") 
print("   - Transforms text into 768-dimensional dense vectors")
print("   - Uses neural networks (BERT) for semantic encoding")
print("   - Inner product (MIPS) for vector matching in the original DPR; cosine similarity is a common variant")
print("   - Same underlying idea as OpenAI embeddings + Pinecone!")

print("\n3. **Vector Search Fundamentals**")
print("   - Document encoding: Text → Vector representations")
print("   - Query encoding: Questions → Query vectors")
print("   - Similarity search: Find closest vectors (MIPS/Cosine)")
print("   - Top-k retrieval: Return most similar documents")

print("\n🔧 **CRITICAL IMPLEMENTATION LESSON**")
print("="*70)
print("⚠️  Per-Question vs Global Retrieval - THIS DETERMINES SUCCESS!")
print()
print("❌ WRONG APPROACH (gives ~0% performance):")
print("   • Build ONE global index from training data")
print("   • Search this index for validation questions")
print("   • Problem: Gold docs for validation don't exist in training!")
print("   • Result: Document Recall = 0%, everything fails")
print()
print("✅ CORRECT APPROACH (gives realistic performance):")
print("   • Each question comes with 10 context paragraphs")
print("   • Build a SEPARATE index for each question's contexts")
print("   • Search within those 10 candidates (2 gold + 8 distractors)")
print("   • Result: Meaningful metrics, proper evaluation")
print()
print("💡 KEY INSIGHT:")
print("   HotpotQA's distractor setting tests: 'Can you filter signal from noise?'")
print("   NOT: 'Can you find documents in a large corpus?' (that is the fullwiki setting)")

print("\n🎯 **Implementation vs Framework Comparison**")
print("="*50)
print("   Our Custom Implementation:")
print("   ✅ Full control over encoding and similarity metrics")
print("   ✅ Understanding of underlying mathematics")
print("   ✅ Custom evaluation for multihop reasoning")
print("   ✅ Learned critical per-question retrieval pattern")

print("\n   LlamaIndex Framework:")
print("   ✅ Production-ready optimizations")
print("   ✅ Multiple vector database backends")
print("   ✅ Built-in caching and indexing")
print("   ⚠️  Must still configure retrieval correctly!")

print("\n🚀 **Vector Search in Modern RAG Systems**")
print("="*40)
print("   • Pinecone, Weaviate, Chroma: Production vector databases")
print("   • OpenAI text-embedding-ada-002: Dense embeddings for general domain")
print("   • LlamaIndex/LangChain: High-level RAG frameworks")
print("   • Hybrid search: Combining sparse (BM25) + dense (embeddings)")
print("   • All require proper corpus setup for your use case!")

print("\n📊 **HotpotQA Multihop Evaluation**")
print("="*35)
print("   • 6 comprehensive metrics for multihop reasoning")
print("   • Document Recall@k: Did we retrieve the right docs?")
print("   • Supporting-Fact F1: Did we identify the right evidence?")
print("   • Joint EM: Perfect answer + reasoning path")

print("\n🎯 **Best Practices for Learning**")
print("="*30)
print("   1. ✅ Understand your dataset's evaluation setup FIRST")
print("   2. ✅ Match retrieval approach to benchmark requirements")
print("   3. ✅ Verify metrics are meaningful (not 0%!)")
print("   4. ✅ Start simple with clear baseline implementations")
print("   5. ✅ Use frameworks after understanding fundamentals")

print("\n💡 **Key Insight: Traditional Methods ARE Vector Search!**")
print("="*55)
print("   • BM25: Sparse vector similarity with TF-IDF-style weighting (term saturation + length normalization)")
print("   • DPR: Dense vector similarity with neural encoding")
print("   • Both use core concept: vector similarity search")
print("   • Modern vector databases scale and optimize these concepts")
print("   • BUT: Wrong retrieval setup = Wrong results, regardless of method!")

print("\n🔮 **Next Steps**")
print("="*15)
print("   • Experiment with different embedding models")
print("   • Try hybrid sparse + dense retrieval")
print("   • Implement production vector database integration")
print("   • Add reranking and generation components")
print("   • Scale to full HotpotQA dataset")

print("\n✨ Vector search is the foundation of modern RAG - now you understand it!")
print("🎓 AND you know how to implement it correctly for your specific task!")

In [None]:
# `time` is used by the evaluation loop above; run this cell first if the import is missing from your session
import time