# Traditional to Modern RAG: Evolution of Retrieval Methods (2020-2024)

This comprehensive tutorial traces the evolution of Retrieval-Augmented Generation (RAG) from the original 2020 paper to modern state-of-the-art implementations, with hands-on implementation of each advancement.

## üéØ Learning Objectives

By the end of this tutorial, you will be able to:

1. **Understand RAG Fundamentals**
   - Implement the exact retrieval method from the original RAG paper (Lewis et al., 2020)
   - Understand Dense Passage Retrieval (DPR) and Maximum Inner Product Search (MIPS)
   - Recognize that DPR = vector search used in modern vector databases

2. **Diagnose Performance Issues**
   - Identify domain mismatch problems (single-hop vs multihop)
   - Analyze why strong models on one task struggle on another
   - Conduct error analysis and failure case studies

3. **Compare Modern Embedders**
   - Evaluate 4 embedding models: DPR (2020) ‚Üí Contriever (2022) ‚Üí E5/BGE (2023)
   - Understand model evolution and selection criteria
   - Make informed decisions about embedder choice for your task

4. **Build Production RAG Pipelines**
   - Implement cross-encoder reranking for improved precision
   - Combine sparse (BM25) and dense (embeddings) search for hybrid retrieval
   - Measure each component's contribution through experimental evaluation

5. **Evaluate and Optimize**
   - Apply 6 comprehensive metrics for multihop reasoning evaluation
   - Profile latency and optimize performance bottlenecks
   - Compare custom implementations with frameworks (LlamaIndex)

## üó∫Ô∏è Tutorial Roadmap

### **PART 1: Foundations (Sections 1-4)**
```
Setup ‚Üí Data Loading ‚Üí BM25 Baseline ‚Üí RAG Paper DPR Baseline
‚îî‚îÄ Build understanding of sparse and dense retrieval
```

### **PART 2: Analysis (Section 5)**
```
Why RAG Baseline Struggles on Multihop Tasks
‚îî‚îÄ Learn from failures, understand domain mismatch
```

### **PART 3: Modern Embedders (Section 6)**
```
Compare 4 Embedders: DPR ‚Üí Contriever ‚Üí E5 ‚Üí BGE
‚îî‚îÄ See 3 years of embedding model evolution (2020-2023)
```

### **PART 4: Pipeline Enhancements (Sections 7-8)**
```
Add Reranking ‚Üí Add Hybrid Search
‚îî‚îÄ Build components incrementally, measure each contribution
```

## üîç Key Insight: RAG Paper Uses Vector Search!

:::{admonition} Understanding the RAG Paper Baseline
:class: important

**The Original RAG Paper (Lewis et al., 2020) Uses:**
- **Dense Passage Retrieval (DPR)** - Bi-encoder architecture
- **Pre-trained on Natural Questions** - Single-hop factoid QA
- **No Fine-tuning** - Frozen retriever (this is our baseline!)
- **MIPS** - Maximum Inner Product Search for similarity
- **Top-k Retrieval** - Return most similar document vectors

**This IS vector search** - the same technology behind:
- OpenAI embeddings + Pinecone/Weaviate/Chroma
- LlamaIndex vector stores
- LangChain vector retrievers
- Modern production RAG systems

**Our Tutorial Journey:**
1. Implement RAG 2020 baseline (DPR-NQ)
2. Test on HotpotQA multihop ‚Üí Measure actual performance
3. Analyze failures ‚Üí Domain mismatch (trained on single-hop, testing on multihop)
4. Upgrade to modern embedders ‚Üí Measure improvements
5. Add enhancements (reranking, hybrid) ‚Üí Quantify gains
:::

## üìä What We'll Build & Measure

We'll implement and compare retrieval components:

| Component | Baseline | Modern | Measured Via |
|-----------|----------|--------|--------------|
| **Embedder** | DPR-NQ (2020) | BGE/E5 (2023) | Experiments will show |
| **Retrieval** | Dense only | Hybrid (BM25 + Dense) | Document Recall@10 |
| **Reranking** | None | Cross-encoder | Precision improvement |
| **Evaluation** | Per-question | Comprehensive | 6 HotpotQA metrics |

## üîß Learning Philosophy

**Learn by Implementing, Not Just Using**

While frameworks like LlamaIndex provide excellent abstractions, we implement from scratch to understand:
- ‚úÖ How vector embeddings work mathematically
- ‚úÖ Different similarity metrics (cosine vs inner product vs MIPS)
- ‚úÖ Trade-offs between sparse and dense representations
- ‚úÖ Why and when each component improves performance
- ‚úÖ How to debug and optimize RAG systems
- ‚úÖ When to use frameworks vs custom implementations

**Progressive Complexity**
- Start with RAG paper baseline (understand foundations)
- Analyze failures (learn from mistakes)
- Add one improvement at a time (measure each contribution)
- Build complete modern RAG (production-ready system)

**Comparative Learning**
- Side-by-side comparisons at every step
- Visualizations showing performance differences
- Understanding evolution from 2020 to 2024

Let's begin the journey from traditional to modern RAG! üöÄ

In [1]:
# Install required packages
!pip install transformers datasets torch sentence-transformers rank-bm25 numpy scikit-learn matplotlib seaborn
!pip install llama-index  # For comparison later

import torch
import torch.nn.functional as F  # Added for DPR normalization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import (
    DPRQuestionEncoder, DPRContextEncoder,
    DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer,
    AutoTokenizer, AutoModelForCausalLM
)
from sentence_transformers import CrossEncoder
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity
import string
import re
from typing import List, Dict, Tuple, Optional
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print(" All packages installed and imported successfully!")
print(f" PyTorch version: {torch.__version__}")
print(f" CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f" Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2
Collecting llama-index
  Downloading llama_index-0.14.6-py3-none-any.whl.metadata (13 kB)
Collecting llama-index-cli<0.6,>=0.5.0 (from llama-index)
  Downloading llama_index_cli-0.5.3-py3-none-any.whl.metadata (1.4 kB)
Collecting llama-index-core<0.15.0,>=0.14.6 (from llama-index)
  Downloading llama_index_core-0.14.6-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.6,>=0.5.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.5.1-py3-none-any.whl.metadata (400 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.9.4-py3-none-any.whl.metadata (3.7 kB)
Collecting llama-index-llms-openai<0.7,>=0.6.0 (from llama-index)
  Downloading llama_index_l

 All packages installed and imported successfully!
 PyTorch version: 2.8.0+cu126
 CUDA available: True
 Using device: cuda


## üìä Data Loading and Preprocessing

Let's load the HotpotQA dataset and prepare it for our vector search implementation.

In [2]:
# Load HotpotQA dataset
print(" Loading HotpotQA dataset...")
dataset = load_dataset('hotpotqa/hotpot_qa', 'distractor')
train_data = dataset['train']
validation_data = dataset['validation']

print(f" Dataset loaded successfully!")
print(f"   Training examples: {len(train_data):,}")
print(f"   Validation examples: {len(validation_data):,}")

# Take a smaller subset for faster processing during development
SAMPLE_SIZE = 100  # Increase this for full evaluation
train_sample = train_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(train_data))))
val_sample = validation_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(validation_data))))

print(f" Working with sample: {len(train_sample)} train, {len(val_sample)} validation")

# Inspect a sample to understand the structure
sample_example = train_sample[0]
print("\n Sample HotpotQA Example Structure:")
print(f"   Question: {sample_example['question']}")
print(f"   Answer: {sample_example['answer']}")
print(f"   Type: {sample_example['type']}")
print(f"   Level: {sample_example['level']}")
print(f"   Number of context paragraphs: {len(list(sample_example['context']))}")
print(f"   Supporting facts: {len(list(sample_example['supporting_facts']))}")

print("\n First few context titles:")
context_titles = sample_example['context']['title']
context_sentences = sample_example['context']['sentences']
for i, (title, sentences) in enumerate(zip(context_titles[:3], context_sentences[:3])):
    print(f"   {i+1}. {title} ({len(sentences)} sentences)")

print("\n Supporting facts:")
for title, sent_idx in zip(sample_example['supporting_facts']['title'], sample_example['supporting_facts']['sent_id']):
    print(f"   - {title}, sentence {sent_idx}")

 Loading HotpotQA dataset...


README.md: 0.00B [00:00, ?B/s]

distractor/train-00000-of-00002.parquet:   0%|          | 0.00/166M [00:00<?, ?B/s]

distractor/train-00001-of-00002.parquet:   0%|          | 0.00/166M [00:00<?, ?B/s]

distractor/validation-00000-of-00001.par(‚Ä¶):   0%|          | 0.00/27.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/90447 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/7405 [00:00<?, ? examples/s]

 Dataset loaded successfully!
   Training examples: 90,447
   Validation examples: 7,405
 Working with sample: 100 train, 100 validation

 Sample HotpotQA Example Structure:
   Question: Which airport is located in Maine, Sacramento International Airport or Knox County Regional Airport?
   Answer: Knox County Regional Airport
   Type: comparison
   Level: medium
   Number of context paragraphs: 2
   Supporting facts: 2

 First few context titles:
   1. Vinalhaven, Maine (5 sentences)
   2. Owls Head, Maine (4 sentences)
   3. North Haven, Maine (4 sentences)

 Supporting facts:
   - Sacramento International Airport, sentence 0
   - Knox County Regional Airport, sentence 0


In [3]:
# Preprocessing functions for BM25 and DPR
def preprocess_text_for_bm25(text):
    """Preprocess text for BM25 sparse retrieval"""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.split()

def extract_passages_from_example(example):
    """
    Extract individual passages from a SINGLE HotpotQA example

    KEY INSIGHT FOR HOTPOTQA:
    - Each example has 10 context paragraphs (2 gold + 8 distractors)
    - Retrieval happens WITHIN these 10 paragraphs per question
    - This simulates real-world multihop QA where we filter from a candidate set
    """
    passages = []
    passage_metadata = []
    supporting_passages = []
    # CORRECT: Access dict keys directly
    context_titles = example['context']['title']
    context_sentences = example['context']['sentences']
    supporting_facts = example['supporting_facts']
    supporting_index = list(zip(supporting_facts['title'], supporting_facts['sent_id']))
    for context_idx, (title, sentences) in enumerate(zip(context_titles, context_sentences)):
        # Each sentence becomes a separate passage
        for sent_idx, sentence in enumerate(sentences):
            passage_text = sentence.strip()
            if passage_text:  # Only add non-empty passages
                passages.append(passage_text)
                passage_metadata.append({
                    'title': title,
                    'context_idx': context_idx,
                    'sentence_idx': sent_idx,
                    'full_passage': ' '.join(sentences)  # Full paragraph for context
                })
                if (title,sent_idx) in supporting_index:
                    supporting_passages.append(passage_text)

    return passages, passage_metadata, supporting_passages

def simple_answer_extraction(question, retrieved_passages, top_k=3):
    """
    Simple answer extraction strategy for tutorial purposes

    Strategy:
    1. Combine top-k retrieved passages
    2. Find the shortest span that appears in passages and overlaps with question entities
    3. For tutorial: just return key entities from top passage
    """
    if not retrieved_passages:
        return "Unable to answer"

    # Simple strategy: extract key entities from top passages
    # Combine top-k passages
    combined_text = ' '.join(retrieved_passages[:top_k])

    # Remove question words to find novel information
    question_words = set(preprocess_text_for_bm25(question))
    passage_words = preprocess_text_for_bm25(combined_text)

    # Find candidate answer phrases (simple heuristic: capitalized words or numbers)
    import re
    # Look for capitalized phrases (proper nouns) or numbers or years
    candidates = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b|\b\d+\b', combined_text)

    if candidates:
        # Return first candidate that's not in question
        for candidate in candidates:
            if candidate.lower() not in [q.lower() for q in question.split()]:
                return candidate

    # Fallback: return first few words of top passage
    words = combined_text.split()
    return ' '.join(words[:min(5, len(words))])

print(" Preprocessing functions ready!")
print(" Functions available:")
print("   - preprocess_text_for_bm25(): Clean text for sparse retrieval")
print("   - extract_passages_from_example(): Extract passages from ONE example")
print("   - simple_answer_extraction(): Generate answer from retrieved passages")

 Preprocessing functions ready!
 Functions available:
   - preprocess_text_for_bm25(): Clean text for sparse retrieval
   - extract_passages_from_example(): Extract passages from ONE example
   - simple_answer_extraction(): Generate answer from retrieved passages


## üîç BM25 Sparse Retrieval Implementation

BM25 is a probabilistic ranking function that scores documents based on query term frequency and document length. It's a traditional sparse method that forms the foundation of modern search engines.

### üéØ Critical HotpotQA Setup Understanding

**Per-Question Retrieval (Correct Approach):**
- Each HotpotQA example provides 10 context paragraphs (2 gold + 8 distractors)
- The task: retrieve the correct 2 documents from these 10 candidates
- We build a separate BM25 index for each question's context
- This tests the model's ability to filter signal from noise

**Why NOT Global Corpus Retrieval:**
- ‚ùå Building one index from training data and searching from validation won't work
- ‚ùå Gold documents for validation questions aren't in the training corpus
- ‚ùå Document Recall@10 would be 0% because documents don't exist!
- ‚úÖ Per-question retrieval matches the actual HotpotQA benchmark setup

In [4]:
# Demonstrate BM25 with PER-QUESTION retrieval (correct approach for HotpotQA)
print(" BM25 Per-Question Retrieval Demonstration")
print("="*60)

# Get a test example from validation set
test_example = val_sample[0]
test_question = test_example['question']

print(f" Test Question: {test_question}")
print(f" Gold Answer: {test_example['answer']}")

# Extract passages from THIS example's context (10 paragraphs)
example_passages, example_metadata, supporting_passages= extract_passages_from_example(test_example)
print(f"\n Extracted {len(example_passages)} passages from {len(test_example['context'])} context paragraphs")
print(f'\n Supporting passages (text): ', supporting_passages) # Changed message for clarity

# Create a set of gold supporting fact tuples (title, sent_id)
gold_supporting_facts_tuples = set(zip(test_example['supporting_facts']['title'], test_example['supporting_facts']['sent_id']))
gold_titles = set(test_example['supporting_facts']['title']) # Keep gold_titles for standard document recall check later if needed

# Show context paragraph titles
context_titles = test_example['context']['title']  # Direct dict access
print(f"\n Available context paragraphs (2 gold + 8 distractors):")
for i, title in enumerate(context_titles):
    # Check if this is a gold supporting document title
    marker = "üü¢ GOLD" if title in gold_titles else " DISTRACTOR"
    print(f"   {i+1}. {title} {marker}")

# Build BM25 index for THIS example's passages
print(f"\n Building BM25 index for this example's passages...")
tokenized_passages = [preprocess_text_for_bm25(passage) for passage in example_passages]
bm25 = BM25Okapi(tokenized_passages)
print(f" BM25 index built with {len(tokenized_passages)} passages")

# Tokenize query and search
test_query_tokens = preprocess_text_for_bm25(test_question)
print(f"\n Query tokens: {test_query_tokens[:10]}...")

# Get top-k BM25 scores
k = 10 # We will evaluate Passage Recall@10
bm25_scores = bm25.get_scores(test_query_tokens)
top_k_indices = np.argsort(bm25_scores)[::-1][:k]

print(f"\n Top-{k} BM25 Retrieval Results:")
retrieved_titles_bm25 = []
retrieved_facts_bm25 = [] # To store retrieved facts (title, sent_id)

for i, idx in enumerate(top_k_indices):
    score = bm25_scores[idx]
    passage = example_passages[idx][:100] + "..."
    title = example_metadata[idx]['title']
    sent_idx = example_metadata[idx]['sentence_idx']
    current_fact = (title, sent_idx)

    # Check if this retrieved passage is a gold supporting fact
    is_gold_fact = current_fact in gold_supporting_facts_tuples
    fact_marker = "‚≠ê GOLD FACT" if is_gold_fact else ""

    print(f"   {i+1}. Score: {score:.3f} | ({title},{sent_idx}) {fact_marker}")
    print(f"      {passage}")

    if title not in retrieved_titles_bm25:
        retrieved_titles_bm25.append(title)

    if is_gold_fact:
        retrieved_facts_bm25.append(current_fact)


# Calculate Passage Recall@k (based on (title, sent_id) tuples)
# How many of the gold supporting facts were retrieved in the top k
passage_recall_k = len(set(retrieved_facts_bm25).intersection(gold_supporting_facts_tuples)) / len(gold_supporting_facts_tuples) if len(gold_supporting_facts_tuples) > 0 else 1.0
print(f"\n Passage Recall@{k}: {passage_recall_k:.3f} ({len(set(retrieved_facts_bm25).intersection(gold_supporting_facts_tuples))}/{len(gold_supporting_facts_tuples)} gold facts retrieved in top {k})")

# Optional: Also show Document Recall@k for context, based on titles
# doc_recall_title_based = len(set(retrieved_titles_bm25[:k]).intersection(gold_titles)) / len(gold_titles) if len(gold_titles) > 0 else 1.0
# print(f" (Standard Document Recall@{k} (titles): {doc_recall_title_based:.3f})")


print("\n BM25 per-question retrieval demonstration complete!")
print(" Key insight: We retrieve from each question's 10 context paragraphs, not a global corpus!")

 BM25 Per-Question Retrieval Demonstration
 Test Question: What nationality was Oliver Reed's character in the film Royal Flash?
 Gold Answer: Prussian

 Extracted 38 passages from 2 context paragraphs

 Supporting passages (text):  ['Royal Flash is a 1975 film based on George MacDonald Fraser\'s second Flashman novel, "Royal Flash".', 'Additionally, Oliver Reed appeared in the role of Otto von Bismarck, Alan Bates as Rudi von Sternberg, and Florinda Bolkan played Lola Montez.', 'Otto Eduard Leopold, Prince of Bismarck, Duke of Lauenburg (1 April 1815 ‚Äì 30 July 1898), known as Otto von Bismarck (] ), was a conservative Prussian statesman who dominated German and European affairs from the 1860s until 1890.']

 Available context paragraphs (2 gold + 8 distractors):
   1. Robin Barton  DISTRACTOR
   2. Funny Bones  DISTRACTOR
   3. Ivan Dragomiloff  DISTRACTOR
   4. Oliver Reed  DISTRACTOR
   5. Harry Flashman  DISTRACTOR
   6. Royal Flash (film) üü¢ GOLD
   7. Royal Flash  DISTRACTOR


## üéØ SECTION 4: RAG Paper Baseline - Dense Passage Retrieval (2020)

**This implements the EXACT retrieval method from the original RAG paper (Lewis et al., 2020)!**

### üìÑ What is the RAG Paper Baseline?

The Retrieval-Augmented Generation (RAG) paper by Lewis et al. (2020) introduced a powerful paradigm: combine neural retrieval with neural generation. The retrieval component uses:

- **Dense Passage Retrieval (DPR)** - Bi-encoder architecture with separate encoders for questions and passages
- **Pre-trained on Natural Questions (NQ)** - Google's single-hop factoid QA dataset  
- **No fine-tuning** - Frozen retriever weights (this is the baseline configuration)
- **MIPS (Maximum Inner Product Search)** - Find passages with highest dot product to query vector
- **Top-k retrieval** - Return k most similar passages based on vector similarity

### üîç DPR IS Vector Search!

Dense Passage Retrieval is essentially **vector search** - the same underlying technology used in:
- OpenAI embeddings + Pinecone/Weaviate/Chroma
- LlamaIndex vector stores
- LangChain vector retrievers
- All modern RAG systems

**The architecture:**
```
Question ‚Üí BERT Encoder ‚Üí 768-dim query vector
Passages ‚Üí BERT Encoder ‚Üí 768-dim passage vectors  
Similarity ‚Üí Dot product (or cosine) between query and passage vectors
Retrieval ‚Üí Top-k passages with highest similarity scores
```

### üìä Task Characteristics

**Original RAG Paper Tasks (where DPR-NQ was trained and evaluated):**
- ‚úÖ Natural Questions: "Who is the president of France?"
- ‚úÖ TriviaQA: "What year did the Berlin Wall fall?"  
- ‚úÖ WebQuestions: "When was Barack Obama born?"
- All are **single-hop** questions requiring one document

**Our Task - HotpotQA (different from training):**
- ‚ùì Multihop reasoning: "What year was the director of Inception born?"
  - Step 1: Find document about Inception ‚Üí Learn director is Christopher Nolan
  - Step 2: Find document about Christopher Nolan ‚Üí Find birth year 1970
- Requires **two documents** and reasoning across them

### üéØ Why Implement This Baseline?

1. **Understand Foundations**: See how RAG paper's retriever actually works
2. **Learn from Failures**: Observe performance on mismatched task domains
3. **Appreciate Evolution**: Understand motivations for modern improvements
4. **Build Intuition**: Know when domain-specific training matters

### üìà What We'll Measure

When we test DPR-NQ on HotpotQA multihop questions, we'll measure:
- Document Recall@10: Can it find both required documents?
- Supporting-Fact F1: Sentence-level retrieval accuracy
- Latency: Encoding and search time
- Comparison baseline for modern methods

The experiments will reveal how task mismatch affects performance. This isn't about DPR being "bad" - it's about understanding when and why models struggle on out-of-domain tasks.

Let's implement the RAG paper baseline and run experiments!

In [5]:
# Load pre-trained DPR models (this is vector search!)
print(" Loading pre-trained DPR models for vector search...")

# Question encoder (queries ‚Üí vectors)
q_encoder = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

# Context encoder (passages ‚Üí vectors)
c_encoder = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
c_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

# Move to device
q_encoder.to(device)
c_encoder.to(device)

print(f" DPR models loaded on {device}")
print(" Vector Search Components Ready:")
print(f"   - Question Encoder: Transforms questions ‚Üí 768-dim vectors")
print(f"   - Context Encoder: Transforms passages ‚Üí 768-dim vectors")
print(f"   - Similarity: Cosine similarity / Inner product search")

def encode_questions_dpr(questions, batch_size=32):
    """Encode questions into dense vectors (Vector Search Step 1)"""
    q_encoder.eval()
    all_embeddings = []

    with torch.no_grad():
        for i in range(0, len(questions), batch_size):
            batch_questions = questions[i:i+batch_size]

            # Tokenize questions
            encoded = q_tokenizer(
                batch_questions,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )

            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)

            # Encode to vectors
            embeddings = q_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())

    return torch.cat(all_embeddings, dim=0)

def encode_passages_dpr(passages, batch_size=32):
    """Encode passages into dense vectors (Vector Search Step 2)"""
    c_encoder.eval()
    all_embeddings = []

    with torch.no_grad():
        for i in tqdm(range(0, len(passages), batch_size), desc="Encoding passages"):
            batch_passages = passages[i:i+batch_size]

            # Tokenize passages
            encoded = c_tokenizer(
                batch_passages,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )

            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)

            # Encode to vectors
            embeddings = c_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())

    return torch.cat(all_embeddings, dim=0)

def vector_search_dpr(query_embedding, passage_embeddings, passages, metadata, top_k=10):
    """Perform vector search using cosine similarity (Vector Search Step 3)"""
    # Normalize embeddings for cosine similarity
    query_embedding = F.normalize(query_embedding, p=2, dim=1)
    passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

    # Compute similarity matrix (this IS vector search!)
    similarities = torch.mm(query_embedding, passage_embeddings.transpose(0, 1))
    similarities = similarities.squeeze(0)  # Remove batch dimension

    # Get top-k most similar vectors
    top_k_scores, top_k_indices = torch.topk(similarities, min(top_k, len(similarities)))

    results = []
    for score, idx in zip(top_k_scores, top_k_indices):
        results.append({
            'idx': int(idx),
            'score': float(score),
            'passage': passages[int(idx)],
            'title': metadata[int(idx)]['title'],
            'metadata': metadata[int(idx)]
        })

    return results

print("\n DPR Vector Search functions ready!")
print(" Key insight: DPR = Vector Database functionality!")
print("   - encode_questions_dpr(): Query ‚Üí Vector")
print("   - encode_passages_dpr(): Documents ‚Üí Vectors")
print("   - vector_search_dpr(): Similarity search (MIPS/Cosine)")

 Loading pre-trained DPR models for vector search...


config.json:   0%|          | 0.00/493 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/492 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


 DPR models loaded on cuda
 Vector Search Components Ready:
   - Question Encoder: Transforms questions ‚Üí 768-dim vectors
   - Context Encoder: Transforms passages ‚Üí 768-dim vectors
   - Similarity: Cosine similarity / Inner product search

 DPR Vector Search functions ready!
 Key insight: DPR = Vector Database functionality!
   - encode_questions_dpr(): Query ‚Üí Vector
   - encode_passages_dpr(): Documents ‚Üí Vectors
   - vector_search_dpr(): Similarity search (MIPS/Cosine)


In [32]:
# Demonstrate DPR Vector Search with PER-QUESTION retrieval
print(" DPR Vector Search Per-Question Demonstration")
print("="*60)
print(" This demonstrates vector search on a per-question basis!")

# Use the same test example for comparison
print(f" Test Question: {test_question}")
print(f" Gold Answer: {test_example['answer']}")

# Build DPR vector embeddings for THIS example's passages
print(f"\n Encoding {len(example_passages)} passages into dense vectors...")
passage_embeddings = encode_passages_dpr(example_passages, batch_size=16)

print(f" Vector index built!")
print(f" Vector Statistics:")
print(f"   - Number of vectors: {passage_embeddings.shape[0]}")
print(f"   - Vector dimension: {passage_embeddings.shape[1]}")

# Encode query into vector
print(f"\n Encoding question into query vector...")
query_embedding = encode_questions_dpr([test_question])
print(f"   Query vector shape: {query_embedding.shape}")

top_k = 10
# Perform vector search
print(f"\n Performing vector similarity search...")
vector_results = vector_search_dpr(query_embedding, passage_embeddings, example_passages, example_metadata, top_k=top_k)

print(f"\n Top-{top_k} Vector Search Results:")
retrieved_titles_dpr = []
for i, result in enumerate(vector_results):
    title = result['title']
    is_gold = " GOLD" if title in gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    print(f"      {result['passage'][:100]}...")

    if title not in retrieved_titles_dpr:
        retrieved_titles_dpr.append(title)

# Calculate document recall for DPR
doc_recall_dpr = len(set(retrieved_titles_dpr[:top_k]).intersection(gold_titles)) / len(gold_titles)
print(f"\n Document Recall@10: {doc_recall_dpr:.3f} ({len(set(retrieved_titles_dpr[:top_k]).intersection(gold_titles))}/{len(gold_titles)} gold docs retrieved)")

print("\n DPR Vector Search demonstration complete!")
print(" This demonstrates the core technology behind:")
print("   - OpenAI Embeddings + Vector DBs")
print("   - LlamaIndex vector stores")
print("   - LangChain vector retrievers")
print("   - Pinecone, Weaviate, Chroma databases")

print("\n Key Difference from Production:")
print("   - Production: One large vector DB with millions of documents")
print("   - HotpotQA: Per-question retrieval from 10 candidate paragraphs")
print("   - This setup tests multihop reasoning with controlled distractors")

 DPR Vector Search Per-Question Demonstration
 This demonstrates vector search on a per-question basis!
 Test Question: What nationality was Oliver Reed's character in the film Royal Flash?
 Gold Answer: Prussian

 Encoding 23 passages into dense vectors...


Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:00<00:00, 14.36it/s]

 Vector index built!
 Vector Statistics:
   - Number of vectors: 23
   - Vector dimension: 768

 Encoding question into query vector...
   Query vector shape: torch.Size([1, 768])

 Performing vector similarity search...

 Top-10 Vector Search Results:
   1. Score: 0.466 | Fotos Politis 
      German educated Greek stage director Fotos Politis (Greek: Œ¶œéœÑŒøœÇ Œ†ŒøŒªŒØœÑŒ∑œÇ), 1890-1934, was one of the...
   2. Score: 0.452 | Ahmad Kamyabi Mask 
      He is a prominent scholar of French Avant-garde theater and influential in the study of Eug√®ne Iones...
   3. Score: 0.449 | DMPO's on Broadway 
      It was filmed on June 16, 1984, the last night the theater was open before it was torn down....
   4. Score: 0.444 | Ethel Winter  GOLD
      Winter was an early ballet dancer with the Martha Graham Dance Company from the 1940s to the 1960s, ...
   5. Score: 0.444 | Lovisa Simson 
      She was the first female theater director over a permanent theater (rather than a travelling theater..




## üî¨ SECTION 5: Why RAG Baseline Struggles - Multihop Analysis

Now that we've implemented the RAG paper baseline (DPR-NQ), let's analyze **why** it struggles on HotpotQA multihop questions.

### üéØ Key Question: Why Does Strong Performance on NQ Become Poor on HotpotQA?

**Domain Mismatch in Action:**

The DPR model was trained on Natural Questions, where:
- Questions are **single-hop**: "Who won the 2020 NBA championship?"
- Answer requires **ONE document**: Lakers team page
- Encoder learns: "Find the most relevant single document"

HotpotQA requires **multihop reasoning**:
- Questions span **two+ documents**: "What position did the 2020 NBA championship MVP play?"
- Step 1: Find 2020 NBA championship ‚Üí Lakers  
- Step 2: Find Lakers MVP ‚Üí LeBron James
- Step 3: Find LeBron's position ‚Üí Small Forward
- Encoder needs: "Find MULTIPLE related but different documents"

### üìä What We'll Analyze

1. **Failure Case Examples**: Show 3-5 questions where DPR-NQ retrieves wrong documents
2. **Error Categorization**: Classify types of retrieval failures
3. **Retrieval Overlap**: Visualize retrieved docs vs gold docs
4. **Patterns**: Identify which question types fail most

This analysis will motivate our move to modern embedders in Section 6!

In [11]:
# Error Analysis Functions for DPR Baseline on Multihop Tasks

def analyze_retrieval_failure(example, retrieved_titles, method_name="DPR-NQ"):
    """
    Analyze why retrieval failed for a multihop question

    Returns failure type and explanation
    """
    # Correctly access gold supporting facts from the example dictionary
    gold_supporting_facts = example['supporting_facts']
    gold_titles = list(set(gold_supporting_facts['title'])) # Extract unique gold titles

    retrieved_set = set(retrieved_titles[:10])  # Top-10 retrieved titles
    gold_set = set(gold_titles) # Set of gold titles

    overlap = len(retrieved_set.intersection(gold_set))

    # Categorize failure type
    if overlap == len(gold_set):
        failure_type = "SUCCESS"
        explanation = f"Retrieved all {len(gold_set)} required documents"
    elif overlap == 0:
        failure_type = "COMPLETE_MISS"
        explanation = f"Retrieved 0/{len(gold_set)} gold documents. Retrieved distractors instead."
    else:
        failure_type = "PARTIAL_RETRIEVAL"
        missing = gold_set - retrieved_set
        explanation = f"Found {overlap}/{len(gold_set)} gold docs. Missing: {list(missing)}" # List missing titles
        if overlap == 1 and len(gold_set) == 2: # More specific message for 1/2 case
             explanation = f"Found 1/2 gold docs. Missing: {list(missing)[0]}"


    return {
        'failure_type': failure_type,
        'overlap': overlap,
        'total_required': len(gold_set),
        'explanation': explanation,
        'gold_titles': gold_titles,
        'retrieved_titles': retrieved_titles[:10],
        'question': example['question'],
        'answer': example['answer'],
        'question_type': example['type'],
        'difficulty': example['level']
    }

def categorize_errors(analysis_results):
    """Categorize all errors by type"""
    from collections import Counter

    failure_types = [r['failure_type'] for r in analysis_results]
    counts = Counter(failure_types)

    return counts

def find_failure_examples(analysis_results, failure_type, n=5):
    """Find n examples of a specific failure type"""
    examples = [r for r in analysis_results if r['failure_type'] == failure_type]
    return examples[:n]

print(" Error analysis functions ready!")
print(" Functions available:")
print("   - analyze_retrieval_failure(): Categorize individual failures")
print("   - categorize_errors(): Count failure types")
print("   - find_failure_examples(): Get examples of specific failures")

 Error analysis functions ready!
 Functions available:
   - analyze_retrieval_failure(): Categorize individual failures
   - categorize_errors(): Count failure types
   - find_failure_examples(): Get examples of specific failures


In [12]:
# Run Error Analysis on DPR Baseline
print(" Running Failure Analysis on DPR Baseline")
print("="*60)

# Analyze a subset for demonstration
analysis_subset_size = 50
analysis_subset = val_sample.select(range(min(analysis_subset_size, len(val_sample))))

dpr_failure_analysis = []

print(f" Analyzing {len(analysis_subset)} validation examples...")
print()

for i, example in enumerate(tqdm(analysis_subset, desc="Analyzing DPR failures")):
    question = example['question']

    # Extract passages and encode with DPR
    # Fix: extract_passages_from_example returns 3 values, unpack accordingly
    example_passages, example_metadata, supporting_passages = extract_passages_from_example(example)
    passage_embeddings = encode_passages_dpr(example_passages, batch_size=16)

    # DPR vector search
    query_embedding = encode_questions_dpr([question])
    vector_results = vector_search_dpr(query_embedding, passage_embeddings,
                                       example_passages, example_metadata, top_k=10)

    # Extract retrieved titles
    retrieved_titles = []
    for result in vector_results:
        title = result['title']
        if title not in retrieved_titles:
            retrieved_titles.append(title)

    # Analyze this retrieval
    analysis = analyze_retrieval_failure(example, retrieved_titles, method_name="DPR-NQ")
    dpr_failure_analysis.append(analysis)

# Categorize all errors
error_counts = categorize_errors(dpr_failure_analysis)

print("\n DPR-NQ Retrieval Performance Analysis")
print("="*60)
print(f"\n Overall Statistics:")
print(f"   Total examples analyzed: {len(dpr_failure_analysis)}")
print(f"\n Failure Type Distribution:")
for failure_type, count in error_counts.most_common():
    percentage = (count / len(dpr_failure_analysis)) * 100
    print(f"   {failure_type}: {count} ({percentage:.1f}%)")

# Calculate average document recall (using titles)
avg_recall_title = np.mean([a['overlap'] / a['total_required'] for a in dpr_failure_analysis])
print(f"\n Average Document Recall@10 (Titles): {avg_recall_title:.3f}")

# Note: Supporting Fact Recall/F1 will be calculated later with the comprehensive evaluator

print("\n Example Failures by Type:")
print("="*60)

# Show examples of each failure type
for failure_type in ['COMPLETE_MISS', 'PARTIAL_RETRIEVAL', 'SUCCESS']:
    examples = find_failure_examples(dpr_failure_analysis, failure_type, n=2)

    if examples:
        print(f"\n {failure_type} Examples:")
        for j, ex in enumerate(examples[:2], 1):
            print(f"\n   Example {j}:")
            print(f"   Question: {ex['question'][:100]}...")
            print(f"   Answer: {ex['answer']}")
            print(f"   Type: {ex['question_type']} | Difficulty: {ex['difficulty']}")
            print(f"   Gold docs (Titles): {ex['gold_titles']}")
            print(f"   Retrieved docs (Titles): {ex['retrieved_titles'][:3]}...")
            print(f"   {ex['explanation']}")

print("\n Failure analysis complete!")

 Running Failure Analysis on DPR Baseline
 Analyzing 50 validation examples...



Analyzing DPR failures:   0%|          | 0/50 [00:00<?, ?it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages:  33%|‚ñà‚ñà‚ñà‚ñé      | 1/3 [00:00<00:00,  7.70it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 11.94it/s]
Analyzing DPR failures:   2%|‚ñè         | 1/50 [00:00<00:13,  3.71it/s]
Encoding passages:   0%|          | 0/5 [00:00<?, ?it/s][A
Encoding passages:  40%|‚ñà‚ñà‚ñà‚ñà      | 2/5 [00:00<00:00, 15.19it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:00<00:00, 19.71it/s]
Analyzing DPR failures:   4%|‚ñç         | 2/50 [00:00<00:12,  3.70it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 24.91it/s]
Analyzing DPR failures:   6%|‚ñå         | 3/50 [00:00<00:09,  4.75it/s]
Encoding passages:   0%|          | 0/4 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 4/4 [00:00<00:0


 DPR-NQ Retrieval Performance Analysis

 Overall Statistics:
   Total examples analyzed: 50

 Failure Type Distribution:
   SUCCESS: 29 (58.0%)
   PARTIAL_RETRIEVAL: 20 (40.0%)
   COMPLETE_MISS: 1 (2.0%)

 Average Document Recall@10 (Titles): 0.780

 Example Failures by Type:

 COMPLETE_MISS Examples:

   Example 1:
   Question: After his curacy at the village that is a suburb of Scunthorpe, who was Industrial Chaplain to the B...
   Answer: Dudman
   Type: bridge | Difficulty: hard
   Gold docs (Titles): ['Frodingham, Lincolnshire', 'Bill Dudman']
   Retrieved docs (Titles): ['John Braddocke', 'Ernest Holmes (priest)', 'Alfred Hurley']...
   Retrieved 0/2 gold documents. Retrieved distractors instead.

 PARTIAL_RETRIEVAL Examples:

   Example 1:
   Question: What nationality was Oliver Reed's character in the film Royal Flash?...
   Answer: Prussian
   Type: bridge | Difficulty: hard
   Gold docs (Titles): ['Royal Flash (film)', 'Otto von Bismarck']
   Retrieved docs (Titles): ['Ivan




##  SECTION 6: Modern Embedder Comparison (2020 ‚Üí 2023)

We've seen that DPR-NQ (2020) was trained on single-hop Natural Questions. Let's compare it with a modern embedder to understand the evolution.

###  From 2020 to 2023

**2020: Dense Passage Retrieval (DPR)**
- First successful dense retrieval for open-domain QA
- Task-specific training (Natural Questions)
- Limitation: Domain-specific, requires fine-tuning for new tasks

**2023: BGE (BAAI)**
- Multi-task training across diverse datasets
- State-of-the-art on MTEB benchmark
- General-purpose, no task-specific fine-tuning needed

###  Simple Comparison

We'll test 2 embedders on the SAME HotpotQA validation set:

| Model | Year | Training | Design Focus |
|-------|------|----------|--------------|
| **DPR-NQ** | 2020 | Natural Questions (single-hop) | Task-specific QA |
| **BGE-large** | 2023 | Multi-task (diverse datasets) | General-purpose SOTA |

**Questions to answer through experiments:**
- How does performance differ between 2020 and 2023 models?
- Does multi-task training help with multihop reasoning?
- What's the latency trade-off?

###  Implementation Strategy

We'll create a **unified embedder interface** that works with both:
- DPR (separate question/context encoders)
- SentenceTransformers (unified encoder)

Both will use the same `vector_search_dpr()` function for fair comparison.

###  Metrics We'll Measure

For each embedder:
1. **Document Recall@10**: Can it find both required documents?
2. **Latency**: Encoding time
3. **Improvement**: BGE vs DPR-NQ baseline

Let's implement and run experiments!

In [28]:
# Unified Embedder Interface for Fair Comparison

from sentence_transformers import SentenceTransformer
import time
import torch.nn.functional as F # Import F for normalize

class UnifiedEmbedder:
    """
    Unified interface for different embedding models

    Supports:
    - DPR (separate question/context encoders)
    - SentenceTransformers (unified models like BGE)
    """

    def __init__(self, model_name, model_type='sentence-transformer'):
        """
        Initialize embedder

        Args:
            model_name: Model identifier
            model_type: 'dpr' or 'sentence-transformer'
        """
        self.model_name = model_name
        self.model_type = model_type

        if model_type == 'dpr':
            # Use existing DPR encoders (already loaded)
            # Ensure q_encoder, c_encoder, q_tokenizer, c_tokenizer are globally available or passed
            # Assuming they are globally available from a previous cell run
            global q_encoder, c_encoder, q_tokenizer, c_tokenizer, device, encode_questions_dpr, encode_passages_dpr, vector_search_dpr
            self.q_encoder = q_encoder
            self.c_encoder = c_encoder
            self.q_tokenizer = q_tokenizer
            self.c_tokenizer = c_tokenizer
            print(f" Using existing DPR encoders for {model_name}")

        elif model_type == 'sentence-transformer':
            # Load SentenceTransformer model
            print(f" Loading {model_name}...")
            self.model = SentenceTransformer(model_name)
            if torch.cuda.is_available():
                self.model = self.model.to(device)
            print(f" Loaded {model_name} on {device}")

        else:
            raise ValueError(f"Unknown model_type: {model_type}")

    def encode_queries(self, queries, batch_size=32, show_progress=False):
        """Encode queries into vectors"""
        if self.model_type == 'dpr':
            # DPR encode_questions_dpr does not have show_progress
            return encode_questions_dpr(queries, batch_size)
        else:
            # SentenceTransformer
            embeddings = self.model.encode(
                queries,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                convert_to_tensor=True,
                device=device
            )
            return embeddings

    def encode_passages(self, passages, batch_size=32, show_progress=True):
        """Encode passages into vectors"""
        if self.model_type == 'dpr':
            # DPR encode_passages_dpr does not have show_progress
            return encode_passages_dpr(passages, batch_size) # Removed show_progress=show_progress
        else:
            # SentenceTransformer
            embeddings = self.model.encode(
                passages,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                convert_to_tensor=True,
                device=device
            )
            return embeddings

    def search(self, query_embedding, passage_embeddings, passages, metadata, top_k=10):
        """Perform vector similarity search using the detailed vector_search_dpr function"""
        # Use the existing vector_search_dpr function (works for any embeddings)
        # vector_search_dpr is assumed to be globally available
        global vector_search_dpr
        return vector_search_dpr(query_embedding, passage_embeddings, passages, metadata, top_k)


# Initialize 2 embedders for comparison
print(" Initializing Embedders for Comparison")
print("="*60)

embedders = {}

# 1. DPR-NQ (RAG 2020 baseline) - already loaded
embedders['DPR-NQ (2020)'] = UnifiedEmbedder('facebook/dpr-nq', model_type='dpr')

# 2. BGE-large (2023 SOTA)
embedders['BGE-large (2023)'] = UnifiedEmbedder('BAAI/bge-large-en-v1.5', model_type='sentence-transformer')

print("\n Both embedders loaded and ready!")
print("\n Loaded Models:")
for name in embedders.keys():
    print(f"   - {name}")
print("\n Both use the same vector_search_dpr() function for fair comparison")

 Initializing Embedders for Comparison
 Using existing DPR encoders for facebook/dpr-nq
 Loading BAAI/bge-large-en-v1.5...
 Loaded BAAI/bge-large-en-v1.5 on cuda

 Both embedders loaded and ready!

 Loaded Models:
   - DPR-NQ (2020)
   - BGE-large (2023)

 Both use the same vector_search_dpr() function for fair comparison


In [17]:
# Compare 2 Embedders on Same Validation Set
print(" EMBEDDER COMPARISON: 2020 vs 2023")
print("="*70)

# Use a subset for comparison
comparison_size = 30
comparison_subset = val_sample.select(range(min(comparison_size, len(val_sample))))

print(f" Evaluating {len(comparison_subset)} validation examples on 2 embedders")
print()

# Store results for each embedder
embedder_results = {name: [] for name in embedders.keys()}

# Evaluate each embedder
for embedder_name, embedder in embedders.items():
    print(f"\n{'='*70}")
    print(f" Testing: {embedder_name}")
    print(f"{'='*70}")

    start_total = time.time()

    for i, example in enumerate(tqdm(comparison_subset, desc=f"Evaluating {embedder_name}")):
        question = example['question']
        gold_supporting_facts = example['supporting_facts']  # Keep as dict
        gold_titles = list(set(gold_supporting_facts['title']))

        # Extract passages from this example
        # Fix: extract_passages_from_example returns 3 values, unpack accordingly
        example_passages, example_metadata, supporting_passages = extract_passages_from_example(example)

        # Encode passages
        start_encode = time.time()
        passage_embeddings = embedder.encode_passages(example_passages, batch_size=16, show_progress=False)

        # Encode query
        query_embedding = embedder.encode_queries([question], show_progress=False)
        encode_time = time.time() - start_encode

        # Search using vector_search_dpr function
        start_search = time.time()
        if embedder.model_type == 'dpr':
            # DPR: query_embedding already 2D from encode_questions_dpr
            search_results = embedder.search(query_embedding, passage_embeddings,
                                            example_passages, example_metadata, top_k=10)
        else:
            # SentenceTransformer: reshape to 2D
            query_embedding_2d = query_embedding.unsqueeze(0) if query_embedding.dim() == 1 else query_embedding
            search_results = embedder.search(query_embedding_2d, passage_embeddings,
                                            example_passages, example_metadata, top_k=10)
        search_time = time.time() - start_search

        # Extract retrieved titles
        retrieved_titles = []
        for result in search_results:
            title = result['title']
            if title not in retrieved_titles:
                retrieved_titles.append(title)

        # Calculate Document Recall@10
        retrieved_set = set(retrieved_titles[:10])
        gold_set = set(gold_titles)
        doc_recall = len(retrieved_set.intersection(gold_set)) / len(gold_set)

        # Store results
        embedder_results[embedder_name].append({
            'doc_recall@10': doc_recall,
            'encode_time': encode_time,
            'search_time': search_time,
            'total_time': encode_time + search_time,
            'retrieved_count': len(retrieved_set.intersection(gold_set)),
            'gold_count': len(gold_set)
        })

    total_time = time.time() - start_total
    avg_doc_recall = np.mean([r['doc_recall@10'] for r in embedder_results[embedder_name]])
    avg_latency = np.mean([r['total_time'] for r in embedder_results[embedder_name]])

    print(f"\n {embedder_name} Results:")
    print(f"   Document Recall@10: {avg_doc_recall:.3f}")
    print(f"   Avg Latency: {avg_latency:.3f}s per question")
    print(f"   Total Time: {total_time:.1f}s")

# ========== COMPARISON SUMMARY ==========
print(f"\n\n{'='*70}")
print(f" FINAL COMPARISON SUMMARY")
print(f"{'='*70}\n")

# Create comparison table
comparison_data = []
baseline_recall = np.mean([r['doc_recall@10'] for r in embedder_results['DPR-NQ (2020)']])

for embedder_name in embedders.keys():
    results = embedder_results[embedder_name]
    avg_recall = np.mean([r['doc_recall@10'] for r in results])
    avg_latency = np.mean([r['total_time'] for r in results])
    improvement = ((avg_recall - baseline_recall) / baseline_recall * 100) if baseline_recall > 0 else 0

    comparison_data.append({
        'Model': embedder_name,
        'Doc Recall@10': f"{avg_recall:.3f}",
        'Improvement': f"{improvement:+.1f}%",
        'Avg Latency (s)': f"{avg_latency:.3f}"
    })

# Print as table
import pandas as pd
df_comparison = pd.DataFrame(comparison_data)
print(df_comparison.to_string(index=False))

print(f"\n\n KEY OBSERVATIONS:")
print(f"="*50)

bge_recall = np.mean([r['doc_recall@10'] for r in embedder_results['BGE-large (2023)']])
improvement_pct = ((bge_recall - baseline_recall) / baseline_recall * 100) if baseline_recall > 0 else 0

print(f"\n1. **Performance Evolution:**")
print(f"   ‚Ä¢ DPR-NQ (2020): {baseline_recall:.3f} (baseline)")
print(f"   ‚Ä¢ BGE-large (2023): {bge_recall:.3f} ({improvement_pct:+.1f}% change)")

print(f"\n2. **Why the difference?**")
print(f"   ‚Ä¢ DPR-NQ: Trained only on single-hop Natural Questions")
print(f"   ‚Ä¢ BGE-large: Multi-task training across diverse datasets")
print(f"   ‚Ä¢ Task match matters for retrieval performance")

print(f"\n3. **Latency:**")
for embedder_name in embedders.keys():
    avg_latency = np.mean([r['total_time'] for r in embedder_results[embedder_name]])
    print(f"   ‚Ä¢ {embedder_name}: {avg_latency:.3f}s")

print(f"\n Embedder comparison complete!")

 EMBEDDER COMPARISON: 2020 vs 2023
 Evaluating 30 validation examples on 2 embedders


 Testing: DPR-NQ (2020)


Evaluating DPR-NQ (2020):   0%|          | 0/30 [00:00<?, ?it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages:  33%|‚ñà‚ñà‚ñà‚ñé      | 1/3 [00:00<00:00,  7.74it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 10.71it/s]
Evaluating DPR-NQ (2020):   3%|‚ñé         | 1/30 [00:00<00:08,  3.29it/s]
Encoding passages:   0%|          | 0/5 [00:00<?, ?it/s][A
Encoding passages:  40%|‚ñà‚ñà‚ñà‚ñà      | 2/5 [00:00<00:00, 13.58it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:00<00:00, 17.56it/s]
Evaluating DPR-NQ (2020):   7%|‚ñã         | 2/30 [00:00<00:08,  3.26it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 21.85it/s]
Evaluating DPR-NQ (2020):  10%|‚ñà         | 3/30 [00:00<00:06,  4.19it/s]
Encoding passages:   0%|          | 0/4 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 4/4 [00


 DPR-NQ (2020) Results:
   Document Recall@10: 0.783
   Avg Latency: 0.174s per question
   Total Time: 5.2s

 Testing: BGE-large (2023)


Evaluating BGE-large (2023): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 30/30 [00:11<00:00,  2.63it/s]


 BGE-large (2023) Results:
   Document Recall@10: 0.967
   Avg Latency: 0.379s per question
   Total Time: 11.4s


 FINAL COMPARISON SUMMARY

           Model Doc Recall@10 Improvement Avg Latency (s)
   DPR-NQ (2020)         0.783       +0.0%           0.174
BGE-large (2023)         0.967      +23.4%           0.379


 KEY OBSERVATIONS:

1. **Performance Evolution:**
   ‚Ä¢ DPR-NQ (2020): 0.783 (baseline)
   ‚Ä¢ BGE-large (2023): 0.967 (+23.4% change)

2. **Why the difference?**
   ‚Ä¢ DPR-NQ: Trained only on single-hop Natural Questions
   ‚Ä¢ BGE-large: Multi-task training across diverse datasets
   ‚Ä¢ Task match matters for retrieval performance

3. **Latency:**
   ‚Ä¢ DPR-NQ (2020): 0.174s
   ‚Ä¢ BGE-large (2023): 0.379s

 Embedder comparison complete!





## üéØ SECTION 7: Cross-Encoder Reranking - Precision Boost

We've improved retrieval by upgrading our embedder (Section 6). Now let's add **reranking** - a second-stage refinement that can improve precision.

### üîç Bi-Encoder vs Cross-Encoder

**Bi-Encoder (What we've used so far):**
```
Question ‚Üí Encoder A ‚Üí Query vector [768]
Passage ‚Üí Encoder B ‚Üí Passage vector [768]
Similarity ‚Üí Dot product of vectors
```
- ‚úÖ **Fast**: Pre-compute passage vectors, quick dot product
- ‚úÖ **Scalable**: Can index millions of passages
- ‚ùå **No interaction**: Question and passage never "see" each other

**Cross-Encoder (For reranking):**
```
[Question + Passage] ‚Üí Joint Encoder ‚Üí Relevance score [0-1]
```
- ‚úÖ **Accurate**: Question and passage processed together (full attention)
- ‚úÖ **Better relevance**: Direct relevance modeling
- ‚ùå **Slow**: Must process every question-passage pair separately
- ‚ùå **Not scalable**: Can't pre-compute, O(n) complexity

### üéØ Two-Stage Retrieval Strategy

Combine both for best results:

**Stage 1: Bi-Encoder (Fast Retrieval)**
- Retrieve top-100 candidates from full corpus
- Uses BGE/E5 embeddings
- Fast, scalable

**Stage 2: Cross-Encoder (Precise Reranking)**
- Rerank top-100 ‚Üí top-10
- Uses cross-encoder for accurate scoring
- Slow but only on 100 candidates (acceptable)

### üìä What We'll Measure

We'll compare bi-encoder retrieval with and without reranking:
- Document Recall@10: Does reranking improve document selection?
- Rank changes: Which passages move up/down?
- Latency impact: Cost of reranking stage
- Precision@k: Quality of top-k results

### üîß Implementation

We'll use `BAAI/bge-reranker-large` - current SOTA reranker:
- Trained specifically for reranking
- Optimized for question-passage relevance
- Compatible with BGE embeddings (but works with any)

Experiments will show whether the latency cost is worth the accuracy gain.

In [16]:
# Load Cross-Encoder Reranker
print(" Loading Cross-Encoder Reranker...")

from sentence_transformers import CrossEncoder

# Load BGE reranker (current SOTA)
reranker = CrossEncoder('BAAI/bge-reranker-large')

print(" Cross-encoder reranker loaded!")
print(f"   Model: BAAI/bge-reranker-large")
print(f"   Purpose: Rerank retrieved passages for better precision")

def rerank_passages(question, search_results, top_k=10, rerank_from_k=100):
    """
    Rerank passages using cross-encoder

    Args:
        question: Query string
        search_results: List of search results from bi-encoder
        top_k: Number of final results to return
        rerank_from_k: Number of candidates to rerank

    Returns:
        Reranked search results (top_k)
    """
    # Take top rerank_from_k candidates from bi-encoder
    candidates = search_results[:min(rerank_from_k, len(search_results))]

    # Prepare question-passage pairs for cross-encoder
    pairs = [[question, result['passage']] for result in candidates]

    # Get cross-encoder scores
    ce_scores = reranker.predict(pairs)

    # Combine original results with new scores
    for i, result in enumerate(candidates):
        result['rerank_score'] = float(ce_scores[i])
        result['original_score'] = result['score']  # Keep bi-encoder score
        result['original_rank'] = i + 1

    # Sort by rerank scores
    reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)

    # Return top-k
    return reranked[:top_k]

print("\n Reranking function ready!")
print(" Usage: rerank_passages(question, search_results, top_k=10)")

 Loading Cross-Encoder Reranker...


config.json:   0%|          | 0.00/801 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/279 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

 Cross-encoder reranker loaded!
   Model: BAAI/bge-reranker-large
   Purpose: Rerank retrieved passages for better precision

 Reranking function ready!
 Usage: rerank_passages(question, search_results, top_k=10)


In [19]:
# Demonstrate Reranking Impact
print(" RERANKING DEMONSTRATION")
print("="*70)

# Select a test example
demo_example = val_sample[0]
demo_question = demo_example['question']
demo_gold_titles = list(set(demo_example['supporting_facts']['title']))

print(f"Question: {demo_question}")
print(f"Gold documents: {demo_gold_titles}")
print()

# Extract passages
# Fix: extract_passages_from_example returns 3 values, unpack accordingly
demo_passages, demo_metadata, supporting_passages = extract_passages_from_example(demo_example)

# Use best embedder (BGE-large)
best_embedder = embedders['BGE-large (2023)']

# Stage 1: Bi-encoder retrieval (top-20)
print(" STAGE 1: Bi-Encoder Retrieval (BGE-large)")
print("-"*70)

passage_embeddings = best_embedder.encode_passages(demo_passages, batch_size=16, show_progress=False)
query_embedding = best_embedder.encode_queries([demo_question], show_progress=False)

# Reshape if needed
if query_embedding.dim() == 1:
    query_embedding = query_embedding.unsqueeze(0)

initial_results = best_embedder.search(query_embedding, passage_embeddings,
                                       demo_passages, demo_metadata, top_k=20)

print(f"\nTop-10 results from bi-encoder:")
bi_encoder_titles = []
for i, result in enumerate(initial_results[:10]):
    title = result['title']
    is_gold = " GOLD" if title in demo_gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    if title not in bi_encoder_titles:
        bi_encoder_titles.append(title)

bi_encoder_recall = len(set(bi_encoder_titles[:10]).intersection(demo_gold_titles)) / len(demo_gold_titles)
print(f"\n Bi-Encoder Document Recall@10: {bi_encoder_recall:.3f}")

# Stage 2: Cross-encoder reranking
print(f"\n STAGE 2: Cross-Encoder Reranking")
print("-"*70)

reranked_results = rerank_passages(demo_question, initial_results, top_k=10, rerank_from_k=20)

print(f"\nTop-10 results after reranking:")
reranked_titles = []
rank_changes = []

for i, result in enumerate(reranked_results[:10]):
    title = result['title']
    is_gold = " GOLD" if title in demo_gold_titles else ""
    rank_change = result['original_rank'] - (i + 1)  # Positive = moved up

    if rank_change > 0:
        change_indicator = f"‚Üë{rank_change}"
    elif rank_change < 0:
        change_indicator = f"‚Üì{abs(rank_change)}"
    else:
        change_indicator = "="

    print(f"   {i+1}. Rerank: {result['rerank_score']:.3f} | Original: {result['original_score']:.3f} | {title} {is_gold} ({change_indicator})")

    if title not in reranked_titles:
        reranked_titles.append(title)

    rank_changes.append(rank_change)

reranked_recall = len(set(reranked_titles[:10]).intersection(demo_gold_titles)) / len(demo_gold_titles)
print(f"\n After Reranking Document Recall@10: {reranked_recall:.3f}")

# Show improvement
improvement = reranked_recall - bi_encoder_recall
print(f"\n{'='*70}")
print(f" RERANKING IMPACT:")
print(f"   Before reranking: {bi_encoder_recall:.3f}")
print(f"   After reranking:  {reranked_recall:.3f}")
print(f"   Improvement:      {improvement:+.3f} ({(improvement/bi_encoder_recall*100):+.1f}%)" if bi_encoder_recall > 0 else "   Improvement:      N/A")

print(f"\n Observations:")
print(f"   ‚Ä¢ Cross-encoder reordered {sum(1 for c in rank_changes if c != 0)} passages")
print(f"   ‚Ä¢ Gold documents moved up: {sum(1 for i, r in enumerate(reranked_results[:10]) if r['title'] in demo_gold_titles and rank_changes[i] > 0)}")
print(f"   ‚Ä¢ Reranking provides more accurate relevance scoring")

print(f"\n Reranking demonstration complete!")

 RERANKING DEMONSTRATION
Question: What nationality was Oliver Reed's character in the film Royal Flash?
Gold documents: ['Royal Flash (film)', 'Otto von Bismarck']

 STAGE 1: Bi-Encoder Retrieval (BGE-large)
----------------------------------------------------------------------

Top-10 results from bi-encoder:
   1. Score: 0.712 | Ivan Dragomiloff 
   2. Score: 0.649 | Oliver Reed 
   3. Score: 0.646 | Harry Flashman 
   4. Score: 0.628 | Royal Flash (film)  GOLD
   5. Score: 0.613 | Royal Flash 
   6. Score: 0.596 | Lion of the Desert 
   7. Score: 0.566 | Funny Bones 
   8. Score: 0.566 | Royal Flash 
   9. Score: 0.559 | Royal Flash (film)  GOLD
   10. Score: 0.552 | Royal Flash (film)  GOLD

 Bi-Encoder Document Recall@10: 0.500

 STAGE 2: Cross-Encoder Reranking
----------------------------------------------------------------------

Top-10 results after reranking:
   1. Rerank: 0.984 | Original: 0.712 | Ivan Dragomiloff  (=)
   2. Rerank: 0.697 | Original: 0.649 | Oliver Reed  (=

## ‚ö° SECTION 8: Hybrid Search - Best of Both Worlds

We've seen two retrieval paradigms:
- **BM25 (Sparse)**: Keyword matching, exact term overlap
- **Dense (Embeddings)**: Semantic similarity, meaning-based

What if we **combine both**? This is hybrid search - leveraging complementary strengths!

### üéØ Why Hybrid Search?

**BM25 Strengths:**
- ‚úÖ Exact keyword matching (great for entities, names, dates)
- ‚úÖ Fast, interpretable
- ‚úÖ Works well on entity-heavy questions
- ‚ùå Misses semantic similarity

**Dense Retrieval Strengths:**
- ‚úÖ Semantic understanding (handles paraphrases, synonyms)
- ‚úÖ Captures meaning beyond exact words
- ‚úÖ Better for conceptual questions
- ‚ùå Can miss exact entity matches

**Hybrid = Combine Both!**
- Use BM25 for entity recall
- Use dense for semantic relevance
- Weighted fusion of scores
- Potentially best of both worlds!

### üìä Hybrid Search Formula

For each passage, compute a combined score:

```
final_score = Œ± √ó normalized_bm25_score + (1-Œ±) √ó normalized_dense_score
```

Where:
- Œ± = weight for BM25 (typically 0.2-0.4)
- (1-Œ±) = weight for dense (typically 0.6-0.8)
- Normalization ensures scores are comparable

### üîß Implementation Strategy

1. **Retrieve with both methods** separately
   - BM25: Get top-100 with BM25 scores
   - Dense: Get top-100 with cosine similarity scores

2. **Normalize scores** to [0, 1] range
   - Min-max normalization
   - Makes scores comparable

3. **Fuse scores** with weighted combination
   - Experiment with different Œ± weights
   - Common: Œ±=0.3 (30% BM25, 70% dense)

4. **Rerank** with cross-encoder (optional but recommended)
   - Apply cross-encoder to top-100 hybrid results
   - Final top-10 selection

### üìä What We'll Measure

We'll compare three approaches on the same questions:
1. **BM25 only**: Pure sparse retrieval
2. **Dense only**: Pure semantic retrieval  
3. **Hybrid**: Weighted combination

Metrics:
- Document Recall@10 for each approach
- Which question types benefit from hybrid?
- Score distribution analysis
- Best weight combination (Œ± tuning)

The experiments will show whether combining methods provides robustness across different question types.

Let's implement and measure!

In [20]:
# Hybrid Search Implementation

def normalize_scores(scores):
    """Normalize scores to [0, 1] range using min-max normalization"""
    if len(scores) == 0:
        return scores

    scores_array = np.array(scores)
    min_score = scores_array.min()
    max_score = scores_array.max()

    if max_score == min_score:
        return np.ones_like(scores_array)

    normalized = (scores_array - min_score) / (max_score - min_score)
    return normalized

def hybrid_search(question, passages, metadata, embedder,
                 bm25_weight=0.3, dense_weight=0.7, top_k=10):
    """
    Hybrid search combining BM25 (sparse) and dense retrieval

    Args:
        question: Query string
        passages: List of passage texts
        metadata: List of metadata dicts for passages
        embedder: UnifiedEmbedder instance
        bm25_weight: Weight for BM25 scores (default: 0.3)
        dense_weight: Weight for dense scores (default: 0.7)
        top_k: Number of results to return

    Returns:
        List of search results with hybrid scores
    """
    # Validate weights
    assert abs(bm25_weight + dense_weight - 1.0) < 1e-6, "Weights must sum to 1.0"

    # 1. BM25 Retrieval
    tokenized_passages = [preprocess_text_for_bm25(p) for p in passages]
    bm25 = BM25Okapi(tokenized_passages)
    query_tokens = preprocess_text_for_bm25(question)
    bm25_scores = bm25.get_scores(query_tokens)

    # 2. Dense Retrieval
    passage_embeddings = embedder.encode_passages(passages, batch_size=16, show_progress=False)
    query_embedding = embedder.encode_queries([question], show_progress=False)

    # Reshape if needed
    if query_embedding.dim() == 1:
        query_embedding = query_embedding.unsqueeze(0)

    dense_results = embedder.search(query_embedding, passage_embeddings,
                                   passages, metadata, top_k=len(passages))

    # Extract dense scores
    dense_scores = np.array([r['score'] for r in dense_results])

    # 3. Normalize both sets of scores
    bm25_scores_norm = normalize_scores(bm25_scores)
    dense_scores_norm = normalize_scores(dense_scores)

    # 4. Create mapping from passage index to dense score
    dense_score_map = {}
    for i, result in enumerate(dense_results):
        idx = result['idx']
        dense_score_map[idx] = dense_scores_norm[i]

    # 5. Combine scores
    hybrid_results = []
    for idx in range(len(passages)):
        bm25_score_norm = bm25_scores_norm[idx]
        dense_score_norm = dense_score_map.get(idx, 0.0)

        # Weighted combination
        hybrid_score = bm25_weight * bm25_score_norm + dense_weight * dense_score_norm

        hybrid_results.append({
            'idx': idx,
            'passage': passages[idx],
            'metadata': metadata[idx],
            'title': metadata[idx]['title'],
            'hybrid_score': float(hybrid_score),
            'bm25_score': float(bm25_scores[idx]),
            'dense_score': float(dense_score_map.get(idx, 0.0)),
            'bm25_score_norm': float(bm25_score_norm),
            'dense_score_norm': float(dense_score_norm)
        })

    # 6. Sort by hybrid score
    hybrid_results_sorted = sorted(hybrid_results, key=lambda x: x['hybrid_score'], reverse=True)

    # 7. Return top-k
    return hybrid_results_sorted[:top_k]

print(" Hybrid search implementation ready!")
print(" Combines BM25 (sparse) + Dense (semantic)")
print(f"   Default weights: BM25={0.3}, Dense={0.7}")
print(f"   Usage: hybrid_search(question, passages, metadata, embedder)")

 Hybrid search implementation ready!
 Combines BM25 (sparse) + Dense (semantic)
   Default weights: BM25=0.3, Dense=0.7
   Usage: hybrid_search(question, passages, metadata, embedder)


In [22]:
# Demonstrate Hybrid Search vs BM25 vs Dense
print(" HYBRID SEARCH COMPARISON")
print("="*70)
print("Comparing: BM25-only vs Dense-only vs Hybrid (BM25 + Dense)")
print()

# Use same demo example
hybrid_demo_example = val_sample[1]  # Different example for variety
hybrid_question = hybrid_demo_example['question']
hybrid_gold_titles = list(set(hybrid_demo_example['supporting_facts']['title']))

print(f"Question: {hybrid_question}")
print(f"Gold documents: {hybrid_gold_titles}")
print()

# Extract passages
# Fix: extract_passages_from_example returns 3 values, unpack accordingly
hybrid_passages, hybrid_metadata, supporting_passages = extract_passages_from_example(hybrid_demo_example)

# Use BGE embedder
bge_embedder = embedders['BGE-large (2023)']

# Method 1: BM25 Only
print(" METHOD 1: BM25 Only (Sparse)")
print("-"*70)

tokenized = [preprocess_text_for_bm25(p) for p in hybrid_passages]
bm25_model = BM25Okapi(tokenized)
query_tokens = preprocess_text_for_bm25(hybrid_question)
bm25_only_scores = bm25_model.get_scores(query_tokens)
bm25_top_indices = np.argsort(bm25_only_scores)[::-1][:10]

bm25_only_titles = []
for i, idx in enumerate(bm25_top_indices):
    title = hybrid_metadata[idx]['title']
    is_gold = "" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Score: {bm25_only_scores[idx]:.3f} | {title} {is_gold}")
    if title not in bm25_only_titles:
        bm25_only_titles.append(title)

bm25_only_recall = len(set(bm25_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n BM25-only Doc Recall@10: {bm25_only_recall:.3f}")

# Method 2: Dense Only
print(f"\n METHOD 2: Dense Only (BGE Embeddings)")
print("-"*70)

passage_embs = bge_embedder.encode_passages(hybrid_passages, batch_size=16, show_progress=False)
query_emb = bge_embedder.encode_queries([hybrid_question], show_progress=False)

if query_emb.dim() == 1:
    query_emb = query_emb.unsqueeze(0)

dense_only_results = bge_embedder.search(query_emb, passage_embs,
                                         hybrid_passages, hybrid_metadata, top_k=10)

dense_only_titles = []
for i, result in enumerate(dense_only_results):
    title = result['title']
    is_gold = "" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Score: {result['score']:.3f} | {title} {is_gold}")
    if title not in dense_only_titles:
        dense_only_titles.append(title)

dense_only_recall = len(set(dense_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n Dense-only Doc Recall@10: {dense_only_recall:.3f}")

# Method 3: Hybrid
print(f"\n METHOD 3: Hybrid (30% BM25 + 70% Dense)")
print("-"*70)

hybrid_only_results = hybrid_search(hybrid_question, hybrid_passages, hybrid_metadata,
                                   bge_embedder, bm25_weight=0.3, dense_weight=0.7, top_k=10)

hybrid_only_titles = []
for i, result in enumerate(hybrid_only_results):
    title = result['title']
    is_gold = "" if title in hybrid_gold_titles else ""
    print(f"   {i+1}. Hybrid: {result['hybrid_score']:.3f} | BM25: {result['bm25_score_norm']:.3f} | Dense: {result['dense_score_norm']:.3f} | {title} {is_gold}")
    if title not in hybrid_only_titles:
        hybrid_only_titles.append(title)

hybrid_only_recall = len(set(hybrid_only_titles).intersection(hybrid_gold_titles)) / len(hybrid_gold_titles)
print(f"\n Hybrid Doc Recall@10: {hybrid_only_recall:.3f}")

# Summary Comparison
print(f"\n{'='*70}")
print(f" COMPARISON SUMMARY")
print(f"{'='*70}")
print(f"\n   BM25-only:   {bm25_only_recall:.3f}")
print(f"   Dense-only:  {dense_only_recall:.3f}")
print(f"   Hybrid:      {hybrid_only_recall:.3f}")

best_method = max([
    ("BM25-only", bm25_only_recall),
    ("Dense-only", dense_only_recall),
    ("Hybrid", hybrid_only_recall)
], key=lambda x: x[1])

print(f"\n Best performer: {best_method[0]} ({best_method[1]:.3f})")

print(f"\n Key Insights:")
if hybrid_only_recall >= max(bm25_only_recall, dense_only_recall):
    print(f"   ‚Ä¢ Hybrid search leverages strengths of both methods")
    print(f"   ‚Ä¢ BM25 helps with entity/keyword matching")
    print(f"   ‚Ä¢ Dense helps with semantic understanding")
print(f"   ‚Ä¢ The best approach often depends on question type")
print(f"   ‚Ä¢ Hybrid provides robustness across different question types")

print(f"\n Hybrid search demonstration complete!")

 HYBRID SEARCH COMPARISON
Comparing: BM25-only vs Dense-only vs Hybrid (BM25 + Dense)

Question: Pacific Mozart Ensemble performed which German composer's Der Lindberghflug in 2002?
Gold documents: ['Kurt Weill', 'Pacific Mozart Ensemble']

 METHOD 1: BM25 Only (Sparse)
----------------------------------------------------------------------
   1. Score: 11.070 | Pacific Mozart Ensemble 
   2. Score: 10.288 | Pacific Mozart Ensemble 
   3. Score: 6.957 | The Tutor (Brecht) 
   4. Score: 6.179 | Conservatory String Quartet 
   5. Score: 5.327 | The Flight Across the Ocean 
   6. Score: 4.974 | Der Widersp√§nstigen Z√§hmung 
   7. Score: 4.377 | The Tutor (Brecht) 
   8. Score: 4.069 | Zeitoper 
   9. Score: 3.841 | Martin Boykan 
   10. Score: 3.795 | The Flight Across the Ocean 

 BM25-only Doc Recall@10: 0.500

 METHOD 2: Dense Only (BGE Embeddings)
----------------------------------------------------------------------
   1. Score: 0.636 | Pacific Mozart Ensemble 
   2. Score: 0.569 | K

##  LlamaIndex Comparison: Framework vs Custom Implementation

Let's compare our custom vector search implementation with LlamaIndex to show how modern frameworks abstract these concepts.

# LlamaIndex Implementation Example (for comparison)

**Our Custom Implementation (learning-focused):**
1. Manual passage extraction from HotpotQA
2. Explicit DPR model loading and tokenization
3. Custom vector encoding functions
4. Manual cosine similarity computation
5. Custom retrieval and ranking logic
6. Full control over similarity metrics

**How LlamaIndex Would Abstract This:**
1. SimpleDirectoryReader() - automatic document loading
2. VectorStoreIndex.from_documents() - auto-embedding
3. Built-in similarity search
4. Automatic retrieval and generation
5. Multiple vector database backends
6. Pre-configured pipelines

**LlamaIndex Equivalent Code (pseudo-code):**
```python
# LlamaIndex - High-level abstraction
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import HuggingFaceEmbedding

# Load documents (abstracts our manual extraction)
documents = SimpleDirectoryReader('data/').load_data()

# Create vector index (abstracts our manual DPR encoding)
embed_model = HuggingFaceEmbedding(model_name='facebook/dpr-ctx_encoder-single-nq-base')
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Query (abstracts our manual vector search)
query_engine = index.as_query_engine()
response = query_engine.query('What is the question?')
```

**Key Learning Points:**
- Our implementation: Understand the underlying math and operations
- LlamaIndex: Production-ready with optimizations and abstractions
- Both use the SAME vector search concepts:
  - Document ‚Üí Vector encoding
  - Query ‚Üí Vector encoding
  - Similarity search (cosine/inner product)
  - Top-k retrieval

**Why Learn Custom Implementation First:**
1. Understand vector search mathematics
2. Debug and optimize retrieval performance
3. Implement custom similarity metrics
4. Adapt to specific domain requirements
5. Build domain-specific evaluation metrics

**When to Use LlamaIndex:**
1. Rapid prototyping and production deployment
2. Standard RAG pipelines without customization
3. Multiple vector database backend support
4. Built-in optimization and caching
5. Integration with LLM frameworks

**Best Practice:** Learn fundamentals first, then use frameworks!  
Our approach: Custom ‚Üí Framework comparison ‚Üí Production choice



##  Evaluation Framework

Let's implement our evaluation framework using the 6 chosen metrics for HotpotQA multihop reasoning.

In [25]:
# Import evaluation framework from @03_Evaluation_Code.ipynb
class HotpotQAEvaluator:
    """Comprehensive evaluator for HotpotQA multihop reasoning"""

    def __init__(self):
        pass

    def normalize_answer(self, text):
        """Normalize answer text for comparison"""
        import re
        import string

        # Convert to lowercase
        text = text.lower()

        # Remove articles
        text = re.sub(r'\b(a|an|the)\b', ' ', text)

        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # Remove extra whitespace
        text = ' '.join(text.split())

        return text

    def answer_f1_score(self, prediction, ground_truth):
        """Calculate F1 score between prediction and ground truth"""
        from collections import Counter

        pred_tokens = self.normalize_answer(prediction).split()
        gold_tokens = self.normalize_answer(ground_truth).split()

        if len(pred_tokens) == 0 and len(gold_tokens) == 0:
            return 1.0
        if len(pred_tokens) == 0 or len(gold_tokens) == 0:
            return 0.0

        common_tokens = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common_tokens.values())

        if num_same == 0:
            return 0.0

        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)

        return 2 * precision * recall / (precision + recall)

    def answer_exact_match(self, prediction, ground_truth):
        """Calculate exact match score"""
        return float(self.normalize_answer(prediction) == self.normalize_answer(ground_truth))

    def document_recall_at_k(self, retrieved_titles, gold_titles, k=10):
        """Calculate document recall@k"""
        if len(gold_titles) == 0:
            return 1.0

        retrieved_k = set(retrieved_titles[:k])
        gold_set = set(gold_titles)

        return len(retrieved_k.intersection(gold_set)) / len(gold_set)

    def supporting_fact_f1(self, predicted_facts, gold_facts):
        """Calculate supporting facts F1 score"""
        if len(gold_facts) == 0:
            return 1.0 if len(predicted_facts) == 0 else 0.0

        pred_set = set(predicted_facts)
        gold_set = set(gold_facts)

        if len(pred_set) == 0:
            return 0.0

        intersection = pred_set.intersection(gold_set)
        precision = len(intersection) / len(pred_set)
        recall = len(intersection) / len(gold_set)

        if precision + recall == 0:
            return 0.0

        return 2 * precision * recall / (precision + recall)

    def joint_exact_match(self, pred_answer, gold_answer, pred_facts, gold_facts):
        """Calculate joint exact match (answer + supporting facts)"""
        answer_em = self.answer_exact_match(pred_answer, gold_answer)
        facts_em = 1.0 if set(pred_facts) == set(gold_facts) else 0.0

        return float(answer_em == 1.0 and facts_em == 1.0)

    def evaluate_single(self, prediction_dict, gold_data, k=10, processing_time=None):
        """Evaluate a single prediction against gold data"""

        # Extract predictions
        pred_answer = prediction_dict.get('answer', '')
        pred_titles = prediction_dict.get('retrieved_titles', [])
        pred_facts = prediction_dict.get('supporting_facts', [])

        # Extract gold data
        gold_answer = gold_data.get('answer', '')
        gold_facts = []
        gold_titles = []

        # Extract gold supporting facts and titles
        if 'supporting_facts' in gold_data and gold_data['supporting_facts']: # Added check for empty supporting_facts
            # Ensure supporting_facts is a dictionary with 'title' and 'sent_id' keys
            if isinstance(gold_data['supporting_facts'], dict) and 'title' in gold_data['supporting_facts'] and 'sent_id' in gold_data['supporting_facts']:
                 for title, sent_id in zip(gold_data['supporting_facts']['title'], gold_data['supporting_facts']['sent_id']):
                    gold_facts.append((title, sent_id))
                    if title not in gold_titles:
                        gold_titles.append(title)
            # Handle case where supporting_facts might be a list of tuples (like in the test sample)
            elif isinstance(gold_data['supporting_facts'], list):
                 for title, sent_id in gold_data['supporting_facts']:
                    gold_facts.append((title, sent_id))
                    if title not in gold_titles:
                        gold_titles.append(title)


        # Calculate metrics
        metrics = {
            'answer_f1': self.answer_f1_score(pred_answer, gold_answer),
            'answer_em': self.answer_exact_match(pred_answer, gold_answer),
            'document_recall@k': self.document_recall_at_k(pred_titles, gold_titles, k),
            'supporting_fact_f1': self.supporting_fact_f1(pred_facts, gold_facts),
            'joint_em': self.joint_exact_match(pred_answer, gold_answer, pred_facts, gold_facts),
            'latency': processing_time if processing_time is not None else 0.0
        }

        return metrics

# Initialize evaluator
evaluator = HotpotQAEvaluator()
print(" HotpotQA Evaluation Framework Ready!")
print(" Available metrics:")
print("   1. Answer F1 Score")
print("   2. Answer Exact Match")
print("   3. Document Recall@k")
print("   4. Supporting-Fact F1")
print("   5. Joint Exact Match")
print("   6. Latency (processing time)")

# Test with sample data - Correcting sample_gold structure
sample_prediction = {
    'answer': 'test answer',
    'retrieved_titles': ['Title 1', 'Title 2', 'Title 3'],
    'supporting_facts': [('Title 1', 0), ('Title 2', 1)]
}

sample_gold = {
    'answer': 'test answer',
    # Corrected structure to match expected input for evaluate_single
    'supporting_facts': {'title': ['Title 1', 'Title 2'], 'sent_id': [0, 1]}
}

test_metrics = evaluator.evaluate_single(sample_prediction, sample_gold, k=10, processing_time=0.5)
print(f"\n Test evaluation result: {test_metrics}")

 HotpotQA Evaluation Framework Ready!
 Available metrics:
   1. Answer F1 Score
   2. Answer Exact Match
   3. Document Recall@k
   4. Supporting-Fact F1
   5. Joint Exact Match
   6. Latency (processing time)

 Test evaluation result: {'answer_f1': 1.0, 'answer_em': 1.0, 'document_recall@k': 1.0, 'supporting_fact_f1': 1.0, 'joint_em': 1.0, 'latency': 0.5}


## üèÉ‚Äç‚ôÇÔ∏è Run Complete Pipeline and Compare Methods

Now let's run both BM25 (sparse) and DPR (vector search) on a subset of validation data and compare their performance.

In [31]:
# Comprehensive evaluation on validation set with PER-QUESTION retrieval
print(" COMPREHENSIVE EVALUATION: BM25 vs DPR Vector Search")
print("="* 60)
print(" Using CORRECT per-question retrieval approach for HotpotQA!")
print()

# IMPORTANT: Ensure cells defining DPR functions (3cpew9hag5o) and UnifiedEmbedder (cwb3szhol7w) have been run!

# Import tqdm for progress bar
from tqdm import tqdm

# Use a subset of validation data for evaluation
eval_size = 200  # Adjust based on compute resources
val_subset = val_sample.select(range(min(eval_size, len(val_sample))))

print(f" Evaluating on {len(val_subset)} validation examples")

# Initialize results storage
results = {
    'BM25': [],
    'DPR_Vector_Search': []
}

# Ensure evaluator and embedders are available globally
# global evaluator, embedders, encode_passages_dpr, encode_questions_dpr, vector_search_dpr, preprocess_text_for_bm25, simple_answer_extraction

# Get the DPR embedder instance from the globally available 'embedders' dictionary
# This dictionary is populated by running cell cwb3szhol7w
dpr_embedder = embedders['DPR-NQ (2020)']

# Evaluate each example
for i, example in enumerate(tqdm(val_subset, desc="Evaluating examples")):
    question = example['question']
    gold_answer = example['answer']

    # Extract gold supporting docs and facts
    gold_supporting_facts = example['supporting_facts']  # Keep as dict
    gold_titles = list(set(gold_supporting_facts['title']))

    gold_data = {
        'answer': gold_answer,
        'supporting_facts': gold_supporting_facts # This is now correctly structured for the evaluator
    }

    # print(f"\n Example {i+1}/{len(val_subset)}") # Commented out verbose printing per example
    # print(f"   Question: {question[:80]}...")
    # print(f"   Gold answer: {gold_answer}")
    # print(f"   Gold documents: {gold_titles}")

    # Extract passages from THIS example's 10 context paragraphs
    # Note: extract_passages_from_example returns 3 values, ensure it's handled
    example_passages, example_metadata, supporting_passages_text = extract_passages_from_example(example)


    # ========== BM25 Evaluation ==========
    # print(f"\n    BM25 Sparse Retrieval:") # Commented out verbose printing
    start_time = time.time()

    # Build BM25 index for this example
    tokenized_passages = [preprocess_text_for_bm25(p) for p in example_passages]
    bm25 = BM25Okapi(tokenized_passages)

    # BM25 search
    query_tokens = preprocess_text_for_bm25(question)
    bm25_scores = bm25.get_scores(query_tokens)
    top_k_indices = np.argsort(bm25_scores)[::-1][:10]

    # Extract retrieved titles and facts for evaluation
    retrieved_titles = []
    retrieved_facts = [] # To store (title, sent_id) tuples
    retrieved_passages_text = []

    for idx in top_k_indices:
        title = example_metadata[idx]['title']
        sent_idx = example_metadata[idx]['sentence_idx']
        retrieved_facts.append((title, sent_idx)) # Collect the (title, sent_id) tuple
        if title not in retrieved_titles:
            retrieved_titles.append(title)
        retrieved_passages_text.append(example_passages[idx])


    # Generate answer using simple extraction (uses retrieved_passages_text)
    bm25_answer = simple_answer_extraction(question, retrieved_passages_text, top_k=3)

    bm25_time = time.time() - start_time

    bm25_prediction = {
        'answer': bm25_answer,
        'retrieved_titles': retrieved_titles, # Use unique titles for Document Recall
        'supporting_facts': retrieved_facts   # Use (title, sent_id) for Supporting Fact F1
    }

    bm25_metrics = evaluator.evaluate_single(bm25_prediction, gold_data, k=10, processing_time=bm25_time)
    results['BM25'].append(bm25_metrics)

    # print(f"      Answer: {bm25_answer}") # Commented out verbose printing
    # print(f"      Retrieved docs (titles): {retrieved_titles[:3]}...")
    # print(f"      Retrieved facts ((title, sent_id)): {retrieved_facts[:3]}...")
    # print(f"      Document Recall@10: {bm25_metrics['document_recall@k']:.3f}")
    # print(f"      Supporting-Fact F1: {bm25_metrics['supporting_fact_f1']:.3f}")
    # print(f"      Latency: {bm25_time:.3f}s")

    # ========== DPR Vector Search Evaluation ==========
    # print(f"\n    DPR Vector Search:") # Commented out verbose printing
    start_time = time.time()

    # Use the UnifiedEmbedder encode_passages method
    # The UnifiedEmbedder handles whether to pass show_progress based on model type
    passage_embeddings = dpr_embedder.encode_passages(example_passages, batch_size=16, show_progress=False) # show_progress=False is handled by UnifiedEmbedder now

    # Use the UnifiedEmbedder encode_queries method
    query_embedding = dpr_embedder.encode_queries([question], show_progress=False)

    # Vector search using the UnifiedEmbedder search method (which uses vector_search_dpr)
    # Ensure query_embedding is 2D for search method
    if query_embedding.dim() == 1:
        query_embedding_2d = query_embedding.unsqueeze(0)
    else:
        query_embedding_2d = query_embedding

    vector_results = dpr_embedder.search(query_embedding_2d, passage_embeddings, example_passages, example_metadata, top_k=10)


    # Extract retrieved titles and facts for evaluation
    dpr_titles = []
    dpr_facts = [] # To store (title, sent_id) tuples
    dpr_passages_text = []

    for result in vector_results:
        title = result['title']
        sent_idx = result['metadata']['sentence_idx']
        dpr_facts.append((title, sent_idx)) # Collect the (title, sent_id) tuple
        if title not in dpr_titles:
            dpr_titles.append(title)
        dpr_passages_text.append(result['passage'])

    # Generate answer using simple extraction (uses retrieved_passages_text)
    dpr_answer = simple_answer_extraction(question, dpr_passages_text, top_k=3)


    dpr_time = time.time() - start_time

    dpr_prediction = {
        'answer': dpr_answer,
        'retrieved_titles': dpr_titles, # Use unique titles for Document Recall
        'supporting_facts': dpr_facts   # Use (title, sent_id) for Supporting Fact F1
    }

    dpr_metrics = evaluator.evaluate_single(dpr_prediction, gold_data, k=10, processing_time=dpr_time)
    results['DPR_Vector_Search'].append(dpr_metrics)

    # print(f"      Answer: {dpr_answer}") # Commented out verbose printing
    # print(f"      Retrieved docs (titles): {dpr_titles[:3]}...")
    # print(f"      Retrieved facts ((title, sent_id)): {dpr_facts[:3]}...")
    # print(f"      Document Recall@10: {dpr_metrics['document_recall@k']:.3f}")
    # print(f"      Supporting-Fact F1: {dpr_metrics['supporting_fact_f1']:.3f}")
    # print(f"      Latency: {dpr_time:.3f}s")


print(f"\n FINAL RESULTS COMPARISON")
print("="*40)

# Calculate average metrics
for method, method_results in results.items():
    print(f"\n {method}:")

    if method_results:
        avg_metrics = {}
        for metric in method_results[0].keys():
            avg_metrics[metric] = np.mean([r[metric] for r in method_results])

        print(f"   Document Recall@10: {avg_metrics['document_recall@k']:.3f}")
        print(f"   Supporting-Fact F1: {avg_metrics['supporting_fact_f1']:.3f}")
        print(f"   Answer F1: {avg_metrics['answer_f1']:.3f}")
        print(f"   Answer EM: {avg_metrics['answer_em']:.3f}")
        print(f"   Joint EM: {avg_metrics['joint_em']:.3f}")
        print(f"   Avg Latency: {avg_metrics['latency']:.3f}s")

print(f"\n KEY INSIGHTS:")
print("="*30)
print(" BM25 (Sparse Retrieval):")
print("    Fast and lightweight")
print("    Exact keyword matching")
print("    Works well with entity-heavy questions")
print("    Limited semantic understanding")

print("\n DPR (Vector Search):")
print("    Semantic similarity")
print("    Handles paraphrases")
print("    Dense representations capture meaning")
print("    Slower encoding time")
print("    Less interpretable")

print(f"\n Per-Question Retrieval Benefits:")
print("   ‚Ä¢ Realistic evaluation matching HotpotQA setup")
print("   ‚Ä¢ Tests ability to filter signal from noise (distractors)")
print("   ‚Ä¢ Meaningful Document Recall@10 scores")
print("   ‚Ä¢ Demonstrates multihop reasoning challenges")

print(f"\n Evaluation complete on {len(val_subset)} examples!")

 COMPREHENSIVE EVALUATION: BM25 vs DPR Vector Search
 Using CORRECT per-question retrieval approach for HotpotQA!

 Evaluating on 100 validation examples


Evaluating examples:   0%|          | 0/100 [00:00<?, ?it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages:  33%|‚ñà‚ñà‚ñà‚ñé      | 1/3 [00:00<00:00,  8.09it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 11.30it/s]
Evaluating examples:   1%|          | 1/100 [00:00<00:28,  3.53it/s]
Encoding passages:   0%|          | 0/5 [00:00<?, ?it/s][A
Encoding passages:  40%|‚ñà‚ñà‚ñà‚ñà      | 2/5 [00:00<00:00, 14.62it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:00<00:00, 19.30it/s]
Evaluating examples:   2%|‚ñè         | 2/100 [00:00<00:27,  3.56it/s]
Encoding passages:   0%|          | 0/3 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:00<00:00, 20.92it/s]
Evaluating examples:   3%|‚ñé         | 3/100 [00:00<00:23,  4.21it/s]
Encoding passages:   0%|          | 0/4 [00:00<?, ?it/s][A
Encoding passages: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 4/4 [00:00<00:00, 19.89it


 FINAL RESULTS COMPARISON

 BM25:
   Document Recall@10: 0.885
   Supporting-Fact F1: 0.291
   Answer F1: 0.110
   Answer EM: 0.070
   Joint EM: 0.000
   Avg Latency: 0.002s

 DPR_Vector_Search:
   Document Recall@10: 0.780
   Supporting-Fact F1: 0.233
   Answer F1: 0.036
   Answer EM: 0.020
   Joint EM: 0.000
   Avg Latency: 0.179s

 KEY INSIGHTS:
 BM25 (Sparse Retrieval):
    Fast and lightweight
    Exact keyword matching
    Works well with entity-heavy questions
    Limited semantic understanding

 DPR (Vector Search):
    Semantic similarity
    Handles paraphrases
    Dense representations capture meaning
    Slower encoding time
    Less interpretable

 Per-Question Retrieval Benefits:
   ‚Ä¢ Realistic evaluation matching HotpotQA setup
   ‚Ä¢ Tests ability to filter signal from noise (distractors)
   ‚Ä¢ Meaningful Document Recall@10 scores
   ‚Ä¢ Demonstrates multihop reasoning challenges

 Evaluation complete on 100 examples!





## üìù Summary and Key Takeaways

This notebook demonstrated how traditional methods implement vector search concepts AND the critical importance of matching your retrieval setup to the evaluation benchmark.

### üéØ Critical Learning: Per-Question vs Global Retrieval

**The Bug We Fixed:**
```python
# ‚ùå WRONG: Build index from training, search from validation
train_passages, train_metadata = extract_passages_from_hotpotqa(train_sample)
bm25_global = BM25Okapi(train_passages)
# This gives 0% Document Recall because gold docs don't exist!

# ‚úÖ CORRECT: Per-question retrieval from each example's contexts
for example in validation:
    example_passages, metadata = extract_passages_from_example(example)
    bm25_local = BM25Okapi(example_passages)
    results = bm25_local.search(example['question'])
    # Now Document Recall is meaningful!
```

**Why This Matters:**
- HotpotQA provides 10 context paragraphs per question (2 gold + 8 distractors)
- The challenge is filtering correct documents from this candidate set
- Global retrieval would test "can you find documents in a corpus?" (wrong task)
- Per-question retrieval tests "can you identify relevant docs from noisy candidates?" (correct task)