# Traditional Methods Implementation: Vector Search in Action

This notebook implements traditional information retrieval methods for HotpotQA multihop reasoning, with special emphasis on **vector search concepts** and comparison with modern frameworks like LlamaIndex.

## 🔍 Vector Search in Traditional Methods

:::{admonition} Key Insight: Traditional Methods Use Vector Search!
:class: important

**Dense Passage Retrieval (DPR) IS Vector Search!**
- **Query Encoding**: Transform questions into dense vector embeddings
- **Document Encoding**: Transform passages into dense vector embeddings  
- **Similarity Search**: Use Maximum Inner Product Search (MIPS) or cosine similarity. The two produce the same ranking only when all passage vectors have the same length (for example, after L2 normalization).
- **Top-k Retrieval**: Return most similar document vectors to query vector

This is the same underlying technology as modern vector databases (Pinecone, Chroma, Weaviate).
:::
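
The difference between the two metrics is easy to see on toy data. The sketch below uses made-up 2-D vectors (not DPR outputs; the names `query`, `docs`, and both helper functions are illustrative) to show that inner product can favor a long vector while cosine similarity favors the better-aligned one.

```python
import numpy as np

# Toy vectors for illustration only (not real DPR embeddings)
query = np.array([1.0, 0.0])
docs = np.array([
    [3.0, 3.0],   # long vector, 45 degrees away from the query
    [1.0, 0.1],   # short vector, almost parallel to the query
])

def inner_product_scores(q, d):
    """Raw dot product: sensitive to vector length."""
    return d @ q

def cosine_scores(q, d):
    """Dot product of L2-normalized vectors: depends on direction only."""
    q_unit = q / np.linalg.norm(q)
    d_unit = d / np.linalg.norm(d, axis=1, keepdims=True)
    return d_unit @ q_unit

ip = inner_product_scores(query, docs)
cos = cosine_scores(query, docs)

ip_best = int(np.argmax(ip))     # index of the MIPS winner
cos_best = int(np.argmax(cos))   # index of the cosine winner
print("inner product:", ip, "best:", ip_best)
print("cosine:", cos, "best:", cos_best)
```

Here the long vector wins under inner product and the nearly parallel vector wins under cosine similarity, which is why the choice of metric should match how the encoder was trained.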

## 📋 Implementation Overview

We'll implement:
1. **BM25 Sparse Retrieval**: Statistical term weighting (traditional baseline)
2. **DPR Vector Search**: Dense embeddings with similarity search (same as modern vector DBs)
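3. **BGE Cross-Encoder Reranking**: Advanced relevance scoring (left as a next step; not implemented below)
4. **Mistral-7B Answer Generation**: Final reasoning and synthesis (left as a next step; the evaluation below uses a simplified top-passage answer)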
5. **LlamaIndex Comparison**: Show how modern frameworks abstract the same concepts

## 🔧 Learning Focus

While frameworks like **LlamaIndex** provide excellent abstractions for vector search and RAG, we implement from scratch to understand:
- How vector embeddings work under the hood
- Different similarity metrics (cosine vs inner product)
- Trade-offs between sparse and dense representations
- Custom evaluation metrics for multihop reasoning

In [None]:
# Install required packages
!pip install transformers datasets torch sentence-transformers rank-bm25 numpy scikit-learn matplotlib seaborn
!pip install llama-index  # For comparison later

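import time  # used for latency measurements in the evaluation loop
import torch
import torch.nn.functional as F  # used by vector_search_dpr for L2 normalization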
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import (
    DPRQuestionEncoder, DPRContextEncoder,
    DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer,
    AutoTokenizer, AutoModelForCausalLM
)
from sentence_transformers import CrossEncoder
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity
import string
import re
from typing import List, Dict, Tuple, Optional
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✅ All packages installed and imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎯 CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🚀 Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 📊 Data Loading and Preprocessing

Let's load the HotpotQA dataset and prepare it for our vector search implementation.
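
Before loading the real data, it can help to see the record layout on a small example. The sketch below uses a hand-written toy record that mirrors the layout assumed for the Hugging Face release (`context` and `supporting_facts` as dicts of parallel lists); the record contents and the helper `flatten_context` are illustrative, not taken from the dataset.

```python
# Hand-written toy record mirroring the assumed dict-of-lists layout
toy_example = {
    "question": "In which city is the tower named after Gustave Eiffel?",
    "answer": "Paris",
    "context": {
        "title": ["Eiffel Tower", "Gustave Eiffel"],
        "sentences": [
            ["The Eiffel Tower is in Paris.", "It opened in 1889."],
            ["Gustave Eiffel was a French engineer."],
        ],
    },
    "supporting_facts": {"title": ["Eiffel Tower"], "sent_id": [0]},
}

def flatten_context(example):
    """Yield (title, sentence_index, sentence) for every sentence in the context."""
    ctx = example["context"]
    for title, sentences in zip(ctx["title"], ctx["sentences"]):
        for sent_idx, sentence in enumerate(sentences):
            yield title, sent_idx, sentence

rows = list(flatten_context(toy_example))
gold = set(zip(toy_example["supporting_facts"]["title"],
               toy_example["supporting_facts"]["sent_id"]))

# Mark the sentences that are annotated as supporting facts
for title, sent_idx, sentence in rows:
    marker = "*" if (title, sent_idx) in gold else " "
    print(f"{marker} {title}[{sent_idx}]: {sentence}")
```

Each `(title, sentence_index)` pair is the key that links a retrieved sentence back to the gold supporting facts.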

In [None]:
# Load HotpotQA dataset
print("🔄 Loading HotpotQA dataset...")
dataset = load_dataset('hotpotqa/hotpot_qa', 'distractor')
train_data = dataset['train']
validation_data = dataset['validation']

print(f"📊 Dataset loaded successfully!")
print(f"   Training examples: {len(train_data):,}")
print(f"   Validation examples: {len(validation_data):,}")

# Take a smaller subset for faster processing during development
SAMPLE_SIZE = 100  # Increase this for full evaluation
train_sample = train_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(train_data))))
val_sample = validation_data.shuffle(seed=42).select(range(min(SAMPLE_SIZE, len(validation_data))))

print(f"🎯 Working with sample: {len(train_sample)} train, {len(val_sample)} validation")

# Inspect a sample to understand the structure
sample_example = train_sample[0]
print("\n📋 Sample HotpotQA Example Structure:")
print(f"   Question: {sample_example['question']}")
print(f"   Answer: {sample_example['answer']}")
print(f"   Type: {sample_example['type']}")
print(f"   Level: {sample_example['level']}")
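print(f"   Number of context paragraphs: {len(sample_example['context']['title'])}")
print(f"   Supporting facts: {len(sample_example['supporting_facts']['title'])}")

print("\n🔍 First few context titles:")
# The Hugging Face release stores `context` as parallel 'title' / 'sentences' lists,
# so zip them to get (title, sentences) pairs
context = sample_example['context']
context_list = list(zip(context['title'], context['sentences']))
for i, (title, sentences) in enumerate(context_list[:3]):
    print(f"   {i+1}. {title} ({len(sentences)} sentences)")

print("\n📍 Supporting facts:")
# `supporting_facts` uses the same layout: parallel 'title' / 'sent_id' lists
facts = sample_example['supporting_facts']
facts_list = list(zip(facts['title'], facts['sent_id']))
for title, sent_idx in facts_list:
    print(f"   - {title}, sentence {sent_idx}")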

In [None]:
# Preprocessing functions for BM25 and DPR
def preprocess_text_for_bm25(text):
    """Preprocess text for BM25 sparse retrieval"""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.split()

def extract_passages_from_hotpotqa(examples):
    """Extract individual passages from HotpotQA context structure"""
    passages = []
    passage_metadata = []
    
    for example_idx, example in enumerate(examples):
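        # `context` holds parallel 'title' and 'sentences' lists in the
        # Hugging Face release, so zip them into (title, sentences) pairs
        context = example['context']
        context_list = list(zip(context['title'], context['sentences']))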
        
        for context_idx, (title, sentences) in enumerate(context_list):
            # Each sentence becomes a separate passage
            for sent_idx, sentence in enumerate(sentences):
                passage_text = sentence.strip()
                if passage_text:  # Only add non-empty passages
                    passages.append(passage_text)
                    passage_metadata.append({
                        'example_idx': example_idx,
                        'title': title,
                        'context_idx': context_idx,
                        'sentence_idx': sent_idx,
                        'full_passage': ' '.join(sentences)  # Full paragraph for context
                    })
    
    return passages, passage_metadata

print("✅ Preprocessing functions ready!")
print("📝 Functions available:")
print("   - preprocess_text_for_bm25(): Clean text for sparse retrieval")
print("   - extract_passages_from_hotpotqa(): Convert HotpotQA format to passages")

## 🔍 BM25 Sparse Retrieval Implementation

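BM25 is a probabilistic ranking function that scores each document by how often the query terms occur in it, how rare those terms are across the corpus (IDF), and how long the document is relative to the average. It is a traditional sparse method and remains a common default in full-text search engines.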
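
To make the "sparse vector" view concrete, here is a minimal sketch that stores each document as a sparse vector of BM25 term weights and scores a query with a dot product. It assumes the common defaults `k1=1.5` and `b=0.75` and a non-negative IDF variant; library implementations such as `rank_bm25` differ in details, so the numbers will not necessarily match `BM25Okapi`. All function names and the toy corpus are illustrative.

```python
import math
from collections import Counter

def bm25_doc_vectors(corpus_tokens, k1=1.5, b=0.75):
    """Return one sparse vector (dict: term -> BM25 weight) per document."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    vectors = []
    for doc in corpus_tokens:
        # Longer-than-average documents are penalized through this factor
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        vec = {}
        for term, tf in Counter(doc).items():
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates: repeated terms add less and less
            vec[term] = idf * tf * (k1 + 1) / (tf + length_norm)
        vectors.append(vec)
    return vectors

def sparse_dot(query_vec, doc_vec):
    """Dot product between two sparse vectors stored as dicts."""
    return sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())

toy_corpus = [
    "the eiffel tower is in paris".split(),
    "paris is the capital of france".split(),
    "mount everest is the highest mountain".split(),
]
doc_vectors = bm25_doc_vectors(toy_corpus)
query_vec = Counter("capital of france".split())   # query as a term-count vector

sparse_scores = [sparse_dot(query_vec, vec) for vec in doc_vectors]
best_doc = max(range(len(toy_corpus)), key=sparse_scores.__getitem__)
print(sparse_scores, "best:", best_doc)
```

Only documents that share at least one term with the query get a non-zero score, which is the key contrast with dense retrieval.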

In [None]:
# Build BM25 index from HotpotQA passages
print("🔄 Building BM25 corpus from HotpotQA training data...")

# Extract passages from training data
train_passages, train_metadata = extract_passages_from_hotpotqa(train_sample)
print(f"📊 Extracted {len(train_passages):,} passages from {len(train_sample)} training examples")

# Show sample passages
print("\n📝 Sample passages:")
for i in range(min(3, len(train_passages))):
    print(f"   {i+1}. {train_passages[i][:100]}...")
    print(f"      Title: {train_metadata[i]['title']}")

# Preprocess passages for BM25
print("\n🔄 Preprocessing passages for BM25...")
tokenized_passages = [preprocess_text_for_bm25(passage) for passage in train_passages]

# Build BM25 index
print("🏗️ Building BM25 index...")
bm25 = BM25Okapi(tokenized_passages)
print(f"✅ BM25 index built with {len(tokenized_passages):,} documents")

# Test BM25 with a sample query
test_question = train_sample[0]['question']
print(f"\n🎯 Testing BM25 with sample question:")
print(f"   Question: {test_question}")

# Tokenize query and search
test_query_tokens = preprocess_text_for_bm25(test_question)
print(f"   Query tokens: {test_query_tokens[:10]}...")

# Get top-k BM25 scores
k = 10
bm25_scores = bm25.get_scores(test_query_tokens)
top_k_indices = np.argsort(bm25_scores)[::-1][:k]

print(f"\n📊 Top-{k} BM25 results:")
for i, idx in enumerate(top_k_indices):
    score = bm25_scores[idx]
    passage = train_passages[idx][:100] + "..."
    title = train_metadata[idx]['title']
    print(f"   {i+1}. Score: {score:.3f} | Title: {title}")
    print(f"      {passage}")

print("\n✅ BM25 retrieval system ready!")

## 🧠 DPR Vector Search Implementation

Dense Passage Retrieval (DPR) represents the **core of modern vector search**. It transforms questions and passages into dense embeddings and finds similar passages using vector similarity metrics.
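
Stripped of the encoders, the retrieval step is a matrix-vector product followed by a top-k selection. The sketch below runs exact (brute-force) search over random stand-in vectors; the function name `top_k_inner_product` and the synthetic data are assumptions for illustration, not part of DPR.

```python
import numpy as np

def top_k_inner_product(query_vec, passage_matrix, k=3):
    """Exact top-k search by inner product over all passages."""
    scores = passage_matrix @ query_vec              # one score per passage
    k = min(k, len(scores))
    # argpartition selects the k largest scores without a full sort
    candidate_idx = np.argpartition(-scores, k - 1)[:k]
    # Sort only the k candidates, best first
    ordered = candidate_idx[np.argsort(-scores[candidate_idx])]
    return [(int(i), float(scores[i])) for i in ordered]

rng = np.random.default_rng(0)
# Random stand-in for 1,000 encoded passages of dimension 768
passage_matrix = rng.normal(size=(1000, 768)).astype(np.float32)
# A query that is a slightly perturbed copy of passage 42
query_vec = passage_matrix[42] + 0.01 * rng.normal(size=768).astype(np.float32)

hits = top_k_inner_product(query_vec, passage_matrix, k=3)
print(hits)
```

Production vector databases typically replace this exhaustive scan with approximate nearest-neighbor indexes, trading a little accuracy for much lower latency on large corpora.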

In [None]:
# Load pre-trained DPR models (this is vector search!)
print("🔄 Loading pre-trained DPR models for vector search...")

# Question encoder (queries → vectors)
q_encoder = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

# Context encoder (passages → vectors) 
c_encoder = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
c_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

# Move to device
q_encoder.to(device)
c_encoder.to(device)

print(f"✅ DPR models loaded on {device}")
print("🔍 Vector Search Components Ready:")
print(f"   - Question Encoder: Transforms questions → 768-dim vectors")
print(f"   - Context Encoder: Transforms passages → 768-dim vectors")
print(f"   - Similarity: Cosine similarity / Inner product search")

def encode_questions_dpr(questions, batch_size=32):
    """Encode questions into dense vectors (Vector Search Step 1)"""
    q_encoder.eval()
    all_embeddings = []
    
    with torch.no_grad():
        for i in range(0, len(questions), batch_size):
            batch_questions = questions[i:i+batch_size]
            
            # Tokenize questions
            encoded = q_tokenizer(
                batch_questions,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )
            
            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)
            
            # Encode to vectors
            embeddings = q_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())
    
    return torch.cat(all_embeddings, dim=0)

def encode_passages_dpr(passages, batch_size=32):
    """Encode passages into dense vectors (Vector Search Step 2)"""
    c_encoder.eval()
    all_embeddings = []
    
    with torch.no_grad():
        for i in tqdm(range(0, len(passages), batch_size), desc="Encoding passages"):
            batch_passages = passages[i:i+batch_size]
            
            # Tokenize passages
            encoded = c_tokenizer(
                batch_passages,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )
            
            # Move to device
            input_ids = encoded['input_ids'].to(device)
            attention_mask = encoded['attention_mask'].to(device)
            
            # Encode to vectors
            embeddings = c_encoder(input_ids=input_ids, attention_mask=attention_mask)
            all_embeddings.append(embeddings.pooler_output.cpu())
    
    return torch.cat(all_embeddings, dim=0)

def vector_search_dpr(query_embedding, passage_embeddings, passages, metadata, top_k=10):
    """Perform vector search using cosine similarity (Vector Search Step 3)"""
    # Normalize embeddings for cosine similarity
    query_embedding = F.normalize(query_embedding, p=2, dim=1)
    passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
    
    # Compute similarity matrix (this IS vector search!)
    similarities = torch.mm(query_embedding, passage_embeddings.transpose(0, 1))
    similarities = similarities.squeeze(0)  # Remove batch dimension
    
    # Get top-k most similar vectors
    top_k_scores, top_k_indices = torch.topk(similarities, min(top_k, len(similarities)))
    
    results = []
    for score, idx in zip(top_k_scores, top_k_indices):
        results.append({
            'idx': int(idx),
            'score': float(score),
            'passage': passages[int(idx)],
            'title': metadata[int(idx)]['title'],
            'metadata': metadata[int(idx)]
        })
    
    return results

print("\n✅ DPR Vector Search functions ready!")
print("🎯 Key insight: DPR = Vector Database functionality!")
print("   - encode_questions_dpr(): Query → Vector")
print("   - encode_passages_dpr(): Documents → Vectors") 
print("   - vector_search_dpr(): Similarity search (MIPS/Cosine)")

In [None]:
# Build DPR vector index (same as building a vector database!)
print("🔄 Building DPR vector embeddings for passage corpus...")
print("⚡ Vector databases like Pinecone/Chroma store and search embeddings like these (usually with approximate indexes)!")

# Encode all passages into vectors (this takes a few minutes)
passage_embeddings = encode_passages_dpr(train_passages, batch_size=16)

print(f"✅ Vector index built!")
print(f"📊 Vector Statistics:")
print(f"   - Number of vectors: {passage_embeddings.shape[0]:,}")
print(f"   - Vector dimension: {passage_embeddings.shape[1]}")
print(f"   - Total stored values: {passage_embeddings.shape[0] * passage_embeddings.shape[1]:,}")
print(f"   - Memory usage: ~{passage_embeddings.numel() * 4 / 1024 / 1024:.1f} MB")

# Test vector search with the same question
print(f"\n🎯 Testing Vector Search with same question:")
print(f"   Question: {test_question}")

# Encode query into vector
query_embedding = encode_questions_dpr([test_question])
print(f"   Query vector shape: {query_embedding.shape}")

# Perform vector search (this is the core of RAG!)
vector_results = vector_search_dpr(query_embedding, passage_embeddings, train_passages, train_metadata, top_k=10)

print(f"\n📊 Top-10 Vector Search Results:")
for i, result in enumerate(vector_results):
    print(f"   {i+1}. Score: {result['score']:.3f} | Title: {result['title']}")
    print(f"      {result['passage'][:100]}...")

print("\n✅ Vector Search system ready!")
print("🔍 This demonstrates the core technology behind:")
print("   - OpenAI Embeddings + Vector DBs")
print("   - LlamaIndex vector stores") 
print("   - LangChain vector retrievers")
print("   - Pinecone, Weaviate, Chroma databases")

## 🚀 LlamaIndex Comparison: Framework vs Custom Implementation

Let's compare our custom vector search implementation with LlamaIndex to show how modern frameworks abstract these concepts.

In [None]:
# LlamaIndex Implementation Example (for comparison)
print("🚀 LlamaIndex Vector Search Comparison")
print("=" * 50)

print("📚 Our Custom Implementation (learning-focused):")
print("   1. Manual passage extraction from HotpotQA")
print("   2. Explicit DPR model loading and tokenization")
print("   3. Custom vector encoding functions")
print("   4. Manual cosine similarity computation")
print("   5. Custom retrieval and ranking logic")
print("   6. Full control over similarity metrics")

print("\n🏗️ How LlamaIndex Would Abstract This:")
print("   1. SimpleDirectoryReader() - automatic document loading")
print("   2. VectorStoreIndex.from_documents() - auto-embedding")
print("   3. Built-in similarity search")
print("   4. Automatic retrieval and generation")
print("   5. Multiple vector database backends")
print("   6. Pre-configured pipelines")

print("\n💡 LlamaIndex Equivalent Code (pseudo-code):")
print("""
```python
# LlamaIndex - High-level abstraction
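# Import paths for llama-index >= 0.10 (older releases used `from llama_index import ...`)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding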

# Load documents (abstracts our manual extraction)
documents = SimpleDirectoryReader('data/').load_data()

# Create vector index (abstracts our manual DPR encoding)
embed_model = HuggingFaceEmbedding(model_name='facebook/dpr-ctx_encoder-single-nq-base')
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Query (abstracts our manual vector search)
query_engine = index.as_query_engine()
response = query_engine.query('What is the question?')
```
""")

print("🎯 Key Learning Points:")
print("   ✅ Our implementation: Understand the underlying math and operations")
print("   ✅ LlamaIndex: Production-ready with optimizations and abstractions")
print("   ✅ Both use the SAME vector search concepts:")
print("      - Document → Vector encoding")
print("      - Query → Vector encoding") 
print("      - Similarity search (cosine/inner product)")
print("      - Top-k retrieval")

print("\n🔍 Why Learn Custom Implementation First:")
print("   1. Understand vector search mathematics")
print("   2. Debug and optimize retrieval performance")
print("   3. Implement custom similarity metrics")
print("   4. Adapt to specific domain requirements")
print("   5. Build domain-specific evaluation metrics")

print("\n⚡ When to Use LlamaIndex:")
print("   1. Rapid prototyping and production deployment")
print("   2. Standard RAG pipelines without customization")
print("   3. Multiple vector database backend support")
print("   4. Built-in optimization and caching")
print("   5. Integration with LLM frameworks")

print("\n🏆 Best Practice: Learn fundamentals first, then use frameworks!")
print("📊 Our approach: Custom → Framework comparison → Production choice")

## 📊 Evaluation Framework

Let's implement our evaluation framework using the 6 chosen metrics for HotpotQA multihop reasoning.
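
As a worked example of the answer-level metric, the sketch below computes token-level F1 for one prediction. It uses a simplified normalization similar to the evaluator's (lowercasing, removing punctuation and articles); the helper names `normalize` and `token_f1` and the example strings are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Prediction tokens: eiffel, tower, in, paris (4); gold tokens: eiffel, tower (2)
# precision = 2/4, recall = 2/2, so F1 = 2 * 0.5 * 1.0 / 1.5
f1 = token_f1("The Eiffel Tower in Paris", "Eiffel Tower")
print(f"F1 = {f1:.3f}")
```

Exact match would score this prediction 0, while F1 gives partial credit for the overlapping tokens, which is why both are reported.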

In [None]:
# Evaluation framework, defined inline (originally from 03_Evaluation_Code.ipynb)
class HotpotQAEvaluator:
    """Comprehensive evaluator for HotpotQA multihop reasoning"""
    
    def __init__(self):
        pass
    
    def normalize_answer(self, text):
        """Normalize answer text for comparison"""
        import re
        import string
        
        # Convert to lowercase
        text = text.lower()
        
        # Remove articles
        text = re.sub(r'\b(a|an|the)\b', ' ', text)
        
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text
    
    def answer_f1_score(self, prediction, ground_truth):
        """Calculate F1 score between prediction and ground truth"""
        from collections import Counter
        
        pred_tokens = self.normalize_answer(prediction).split()
        gold_tokens = self.normalize_answer(ground_truth).split()
        
        if len(pred_tokens) == 0 and len(gold_tokens) == 0:
            return 1.0
        if len(pred_tokens) == 0 or len(gold_tokens) == 0:
            return 0.0
        
        common_tokens = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common_tokens.values())
        
        if num_same == 0:
            return 0.0
        
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        
        return 2 * precision * recall / (precision + recall)
    
    def answer_exact_match(self, prediction, ground_truth):
        """Calculate exact match score"""
        return float(self.normalize_answer(prediction) == self.normalize_answer(ground_truth))
    
    def document_recall_at_k(self, retrieved_titles, gold_titles, k=10):
        """Calculate document recall@k"""
        if len(gold_titles) == 0:
            return 1.0
        
        retrieved_k = set(retrieved_titles[:k])
        gold_set = set(gold_titles)
        
        return len(retrieved_k.intersection(gold_set)) / len(gold_set)
    
    def supporting_fact_f1(self, predicted_facts, gold_facts):
        """Calculate supporting facts F1 score"""
        if len(gold_facts) == 0:
            return 1.0 if len(predicted_facts) == 0 else 0.0
        
        pred_set = set(predicted_facts)
        gold_set = set(gold_facts)
        
        if len(pred_set) == 0:
            return 0.0
        
        intersection = pred_set.intersection(gold_set)
        precision = len(intersection) / len(pred_set)
        recall = len(intersection) / len(gold_set)
        
        if precision + recall == 0:
            return 0.0
        
        return 2 * precision * recall / (precision + recall)
    
    def joint_exact_match(self, pred_answer, gold_answer, pred_facts, gold_facts):
        """Calculate joint exact match (answer + supporting facts)"""
        answer_em = self.answer_exact_match(pred_answer, gold_answer)
        facts_em = 1.0 if set(pred_facts) == set(gold_facts) else 0.0
        
        return float(answer_em == 1.0 and facts_em == 1.0)
    
    def evaluate_single(self, prediction_dict, gold_data, k=10, processing_time=None):
        """Evaluate a single prediction against gold data"""
        
        # Extract predictions
        pred_answer = prediction_dict.get('answer', '')
        pred_titles = prediction_dict.get('retrieved_titles', [])
        pred_facts = prediction_dict.get('supporting_facts', [])
        
        # Extract gold data
        gold_answer = gold_data.get('answer', '')
        gold_facts = []
        gold_titles = []
        
        # Extract gold supporting facts and titles
        if 'supporting_facts' in gold_data:
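            raw_facts = gold_data['supporting_facts']
            # Accept both the Hugging Face dict-of-lists layout and a list of
            # (title, sent_id) pairs
            if isinstance(raw_facts, dict):
                facts_list = list(zip(raw_facts['title'], raw_facts['sent_id']))
            else:
                facts_list = list(raw_facts)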
            for title, sent_id in facts_list:
                gold_facts.append((title, sent_id))
                if title not in gold_titles:
                    gold_titles.append(title)
        
        # Calculate metrics
        metrics = {
            'answer_f1': self.answer_f1_score(pred_answer, gold_answer),
            'answer_em': self.answer_exact_match(pred_answer, gold_answer),
            'document_recall@k': self.document_recall_at_k(pred_titles, gold_titles, k),
            'supporting_fact_f1': self.supporting_fact_f1(pred_facts, gold_facts),
            'joint_em': self.joint_exact_match(pred_answer, gold_answer, pred_facts, gold_facts),
            'latency': processing_time if processing_time is not None else 0.0
        }
        
        return metrics

# Initialize evaluator
evaluator = HotpotQAEvaluator()
print("✅ HotpotQA Evaluation Framework Ready!")
print("📊 Available metrics:")
print("   1. Answer F1 Score")
print("   2. Answer Exact Match")  
print("   3. Document Recall@k")
print("   4. Supporting-Fact F1")
print("   5. Joint Exact Match")
print("   6. Latency (processing time)")

# Test with sample data
sample_prediction = {
    'answer': 'test answer',
    'retrieved_titles': ['Title 1', 'Title 2', 'Title 3'],
    'supporting_facts': [('Title 1', 0), ('Title 2', 1)]
}

sample_gold = {
    'answer': 'test answer',
    'supporting_facts': [('Title 1', 0), ('Title 2', 1)]
}

test_metrics = evaluator.evaluate_single(sample_prediction, sample_gold, k=10, processing_time=0.5)
print(f"\n🧪 Test evaluation result: {test_metrics}")

## 🏃‍♂️ Run Complete Pipeline and Compare Methods

Now let's run both BM25 (sparse) and DPR (vector search) on a subset of validation data and compare their performance.
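
Both retrievers return sentence-level hits, while Document Recall@k is measured over paragraph titles. The sketch below shows that conversion on hypothetical hits; the helper names and the example titles are illustrative.

```python
def titles_from_hits(hit_titles):
    """Collapse sentence-level hits to unique titles, keeping rank order."""
    return list(dict.fromkeys(hit_titles))

def recall_at_k(retrieved_titles, gold_titles, k=10):
    """Fraction of gold titles found among the top-k retrieved titles."""
    gold = set(gold_titles)
    if not gold:
        return 1.0
    return len(gold & set(retrieved_titles[:k])) / len(gold)

# Hypothetical hits: the title of each retrieved sentence, best first
hit_titles = ["Eiffel Tower", "Paris", "Eiffel Tower", "France"]
gold_titles = ["Eiffel Tower", "Gustave Eiffel"]

ranked_titles = titles_from_hits(hit_titles)
recall = recall_at_k(ranked_titles, gold_titles, k=10)
print(ranked_titles, recall)
```

For a two-hop question, a recall of 0.5 means one of the two required paragraphs was never retrieved, so the answer cannot be fully supported no matter how good the reader is.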

In [None]:
# Comprehensive evaluation on validation set
print("🔬 COMPREHENSIVE EVALUATION: BM25 vs DPR Vector Search")
print("=" * 60)

import time  # used for the latency measurements below

# Use a subset of validation data for evaluation
eval_size = 20  # Adjust based on compute resources
val_subset = val_sample.select(range(min(eval_size, len(val_sample))))

print(f"📊 Evaluating on {len(val_subset)} validation examples")

# Build the retrieval corpus from the validation subset itself. The indexes built
# earlier cover only train_sample, so they cannot contain the gold paragraphs for
# these questions.
eval_passages, eval_metadata = extract_passages_from_hotpotqa(val_subset)
eval_bm25 = BM25Okapi([preprocess_text_for_bm25(p) for p in eval_passages])
eval_embeddings = encode_passages_dpr(eval_passages, batch_size=16)

# Initialize results storage
results = {
    'BM25': [],
    'DPR_Vector_Search': []
}

# Evaluate each method
for i, example in enumerate(val_subset):
    question = example['question']
    gold_answer = example['answer']
    gold_facts = example['supporting_facts']
    gold_data = {
        'answer': gold_answer,
        # Convert the dict-of-lists layout into (title, sent_id) pairs for the evaluator
        'supporting_facts': list(zip(gold_facts['title'], gold_facts['sent_id']))
    }
    
    print(f"\n📝 Example {i+1}/{len(val_subset)}: {question[:80]}...")
    print(f"🎯 Gold answer: {gold_answer}")
    
    # BM25 Evaluation
    print("   🔍 BM25 Sparse Retrieval:")
    start_time = time.time()
    
    # BM25 search
    query_tokens = preprocess_text_for_bm25(question)
    bm25_scores = eval_bm25.get_scores(query_tokens)
    top_k_indices = np.argsort(bm25_scores)[::-1][:10]
    
    # Extract retrieved titles and facts
    retrieved_titles = []
    retrieved_facts = []
    
    for idx in top_k_indices:
        title = eval_metadata[idx]['title']
        if title not in retrieved_titles:
            retrieved_titles.append(title)
        # Add sentence-level facts
        retrieved_facts.append((title, eval_metadata[idx]['sentence_idx']))
    
    # Simple answer extraction (use top passage)
    bm25_answer = eval_passages[top_k_indices[0]][:50] + "..."  # Simplified
    
    bm25_time = time.time() - start_time
    
    bm25_prediction = {
        'answer': bm25_answer,
        'retrieved_titles': retrieved_titles,
        'supporting_facts': retrieved_facts[:5]  # Top 5 facts
    }
    
    bm25_metrics = evaluator.evaluate_single(bm25_prediction, gold_data, k=10, processing_time=bm25_time)
    results['BM25'].append(bm25_metrics)
    
    print(f"      Answer: {bm25_answer}")
    print(f"      Document Recall@10: {bm25_metrics['document_recall@k']:.3f}")
    print(f"      Latency: {bm25_time:.3f}s")
    
    # DPR Vector Search Evaluation  
    print("   🧠 DPR Vector Search:")
    start_time = time.time()
    
    # Vector search
    query_embedding = encode_questions_dpr([question])
    vector_results = vector_search_dpr(query_embedding, eval_embeddings, eval_passages, eval_metadata, top_k=10)
    
    # Extract retrieved titles and facts
    dpr_titles = []
    dpr_facts = []
    
    for result in vector_results:
        title = result['title']
        if title not in dpr_titles:
            dpr_titles.append(title)
        dpr_facts.append((title, result['metadata']['sentence_idx']))
    
    # Simple answer extraction (use top passage)
    dpr_answer = vector_results[0]['passage'][:50] + "..."  # Simplified
    
    dpr_time = time.time() - start_time
    
    dpr_prediction = {
        'answer': dpr_answer,
        'retrieved_titles': dpr_titles,
        'supporting_facts': dpr_facts[:5]  # Top 5 facts
    }
    
    dpr_metrics = evaluator.evaluate_single(dpr_prediction, gold_data, k=10, processing_time=dpr_time)
    results['DPR_Vector_Search'].append(dpr_metrics)
    
    print(f"      Answer: {dpr_answer}")
    print(f"      Document Recall@10: {dpr_metrics['document_recall@k']:.3f}")
    print(f"      Latency: {dpr_time:.3f}s")

print("\n📊 FINAL RESULTS COMPARISON")
print("=" * 40)

# Calculate average metrics
for method, method_results in results.items():
    print(f"\n🔸 {method}:")
    
    if method_results:
        avg_metrics = {}
        for metric in method_results[0].keys():
            avg_metrics[metric] = np.mean([r[metric] for r in method_results])
        
        print(f"   Document Recall@10: {avg_metrics['document_recall@k']:.3f}")
        print(f"   Supporting-Fact F1: {avg_metrics['supporting_fact_f1']:.3f}")
        print(f"   Answer F1: {avg_metrics['answer_f1']:.3f}")
        print(f"   Answer EM: {avg_metrics['answer_em']:.3f}")
        print(f"   Joint EM: {avg_metrics['joint_em']:.3f}")
        print(f"   Avg Latency: {avg_metrics['latency']:.3f}s")

print("\n🎯 KEY INSIGHTS:")
print("=" * 30)
print("🔍 BM25 (Sparse Retrieval):")
print("   ✅ Fast and lightweight")
print("   ✅ Exact keyword matching")
print("   ✅ Interpretable scoring")
print("   ❌ Limited semantic understanding")

print("\n🧠 DPR (Vector Search):")
print("   ✅ Semantic similarity")
print("   ✅ Handles paraphrases")
print("   ✅ Dense representations")
print("   ❌ Slower encoding time")
print("   ❌ Less interpretable")

print("\n🚀 Vector Search in Production:")
print("   • Both methods demonstrate core retrieval concepts")
print("   • DPR shows vector search fundamentals used in:")
print("     - OpenAI embeddings + Pinecone")
print("     - LlamaIndex vector stores")
print("     - LangChain retrievers")
print("   • Hybrid approaches often work best in practice")

print(f"\n✅ Evaluation complete on {len(val_subset)} examples!")

## 📝 Summary and Key Takeaways

This notebook demonstrated how traditional methods actually implement vector search concepts that modern RAG systems build upon.

In [None]:
print("🎓 SUMMARY: Vector Search Learning Journey")
print("=" * 45)

print("✅ What We Learned:")
print("1. **BM25 = Sparse Vector Search**")
print("   - Represents documents as sparse term-frequency vectors")
print("   - Uses statistical weighting (IDF) for relevance")
print("   - Fast, interpretable, keyword-focused")

print("\n2. **DPR = Dense Vector Search**") 
print("   - Transforms text into 768-dimensional dense vectors")
print("   - Uses neural networks (BERT) for semantic encoding")
print("   - Ranked here by cosine similarity (DPR itself is trained with inner product)")
print("   - Same technology as OpenAI embeddings + Pinecone!")

print("\n3. **Vector Search Fundamentals**")
print("   - Document encoding: Text → Vector representations")
print("   - Query encoding: Questions → Query vectors")
print("   - Similarity search: Find closest vectors (MIPS/Cosine)")
print("   - Top-k retrieval: Return most similar documents")

print("\n🔧 **Implementation vs Framework Comparison**")
print("   Our Custom Implementation:")
print("   ✅ Full control over encoding and similarity metrics")
print("   ✅ Understanding of underlying mathematics")
print("   ✅ Custom evaluation for multihop reasoning")
print("   ✅ Adaptable to domain-specific requirements")

print("\n   LlamaIndex Framework:")
print("   ✅ Production-ready optimizations")
print("   ✅ Multiple vector database backends")
print("   ✅ Built-in caching and indexing")
print("   ✅ Rapid prototyping capabilities")

print("\n🚀 **Vector Search in Modern RAG Systems**")
print("   • Pinecone, Weaviate, Chroma: Production vector databases")
print("   • OpenAI Ada-002: Dense embeddings for general domain")
print("   • LlamaIndex/LangChain: High-level RAG frameworks")
print("   • Hybrid search: Combining sparse (BM25) + dense (embeddings)")

print("\n📊 **HotpotQA Multihop Evaluation**")
print("   • 6 comprehensive metrics for multihop reasoning")
print("   • Document Recall@k: Retrieval effectiveness")
print("   • Supporting-Fact F1: Evidence identification")
print("   • Joint EM: Combined answer + reasoning accuracy")

print("\n🎯 **Best Practices for Learning**")
print("   1. Start with custom implementation to understand fundamentals")
print("   2. Compare sparse vs dense approaches on your domain")
print("   3. Use frameworks for production deployment")
print("   4. Implement comprehensive evaluation metrics")
print("   5. Consider hybrid approaches for best performance")

print("\n💡 **Key Insight: Traditional Methods ARE Vector Search!**")
print("   • BM25: Sparse vector similarity with IDF weighting and term-frequency saturation")
print("   • DPR: Dense vector similarity with neural encoding")
print("   • Both use the same core concept: vector similarity search")
print("   • Modern vector databases scale and optimize these concepts")

print("\n🔮 **Next Steps**")
print("   • Experiment with different embedding models")
print("   • Try hybrid sparse + dense retrieval")
print("   • Implement production vector database integration")
print("   • Add reranking and generation components")
print("   • Scale to full HotpotQA dataset for comprehensive evaluation")

print("\n✨ Vector search is the foundation of modern RAG - now you understand it!")

In [None]:
# Add missing import
import time