# Information Retrieval: From Keywords to Transformers

## Educational Demonstration: Traditional vs Semantic Search

**Learning Objectives:**
- Understand the limitations of keyword-based retrieval (BM25)
- Explore how transformers enable semantic understanding
- Compare retrieval performance with real-world queries
- Demonstrate the evolution from sequence models to transformers

**Key Concepts:**
- **Traditional IR**: Keyword matching, TF-IDF, BM25
- **Sequence Models**: RNNs, LSTMs for text understanding
- **Transformers**: Self-attention, contextual embeddings
- **Semantic Search**: Dense embeddings, cosine similarity

## 🛠️ Setup and Imports

In [None]:
# Core libraries
import sys
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import logging
import warnings

# Suppress verbose logging from gensim and word2vec
logging.getLogger('gensim').setLevel(logging.WARNING)
logging.getLogger('word2vec_retriever').setLevel(logging.WARNING)


# Suppress gensim warnings and exceptions
warnings.filterwarnings("ignore", category=UserWarning, module="gensim")
warnings.filterwarnings("ignore", message=".*gensim.models.word2vec_inner.*")

# Add src to path for our utilities
sys.path.append('../src')

# Our utility modules
from data_loader import DataLoader
from bm25_retriever import BM25Retriever
from transformer_retriever import TransformerRetriever
from evaluator import IRMetrics
from utils import create_comparison_visualization, print_comparison_table

print("✅ Libraries and utilities loaded")
print("📚 Ready to explore traditional vs semantic retrieval!")

## 📥 Load Natural Questions Dataset

We'll use the **Natural Questions** dataset - real Google search queries with Wikipedia answers.
This provides a realistic evaluation of how different retrieval methods handle real-world information needs.

In [None]:
print("📥 Loading Natural Questions dataset with smart sampling...\n")

# Suppress verbose logging from data_loader
import logging
logging.getLogger('data_loader').setLevel(logging.WARNING)

# Load with manageable sample for educational demonstration
loader = DataLoader("BeIR/nq")
dataset = loader.load_dataset(split="test", query_sample_size=2000, random_seed=42)
corpus_texts, query_texts, qrels_dict = loader.prepare_retrieval_data()

print(f"✅ Dataset loaded:")
print(f"  📄 Documents: {len(corpus_texts):,}")
print(f"  ❓ Queries: {len(query_texts):,}")
print(f"  🔗 Query-document pairs: {len(qrels_dict):,}")

# Show examples to understand the task
print(f"\n💡 Example Query-Document Pair:")
print(f"  Query: '{query_texts[0]}'")
if 0 in qrels_dict:
    rel_doc_id = list(qrels_dict[0].keys())[0]
    print(f"  Relevant doc: '{corpus_texts[rel_doc_id][:120]}...'")

## 📊 Evaluation Metrics Explained

**Recall@k**: Found relevant docs / Total relevant docs  
*Example: 3 found out of 5 relevant → Recall@5 = 60%*  
**Why**: Measures completeness - did we miss important information?

**Precision@k**: Relevant docs in top-k / k  
*Example: 3 relevant out of 5 returned → Precision@5 = 60%*  
**Why**: Measures quality - are results actually useful?

**MRR (Mean Reciprocal Rank)**: Average of 1/rank of first relevant doc  
*Example: First relevant at position 2 → RR = 0.5*  
**Why**: Measures efficiency - how fast do users find answers?

**k values**: k=1 (the single top result, e.g. on a mobile screen), k=5 (results visible above the fold), k=10 (roughly the most results users typically scan)
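
The three definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names and document IDs are made up for the example, and the `IRMetrics` class used later in this notebook may compute the metrics differently.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for one query: 5 results, 5 relevant docs in total
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d1", "d5", "d8"}

print(f"Recall@5:    {recall_at_k(ranked, relevant, 5):.2f}")
print(f"Precision@5: {precision_at_k(ranked, relevant, 5):.2f}")
print(f"RR:          {reciprocal_rank(ranked, relevant):.2f}")
```

MRR is then the mean of `reciprocal_rank` over all queries.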

# Retrieval Method Comparisons

## Method 1: Traditional Keyword-Based Retrieval (BM25)

**The Challenge with Keywords:**
- Exact word matching only
- No understanding of synonyms or context
- Struggles with semantic similarity

**How BM25 Works:**
1. **Term Frequency (TF)**: How often does a word appear in a document?
2. **Inverse Document Frequency (IDF)**: How rare is the word across the corpus?
3. **Document Length Normalization**: Adjust for document length differences
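
To make the three ingredients concrete, here is a minimal sketch of one common BM25 variant. It assumes whitespace tokenization, `k1=1.5`, `b=0.75`, and a non-negative IDF form; none of these are confirmed to match what `BM25Retriever` does internally, and `bm25_scores` is a name invented for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact matching only: unseen terms contribute nothing
            # IDF: rarer terms get a larger weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # TF with saturation (k1) and document length normalization (b)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
]
scores = bm25_scores("cat mat", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Note that the second document scores zero even though it mentions "cats": without stemming or semantic knowledge, "cat" and "cats" are different terms, which is the keyword limitation described above.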

In [None]:
print("🔤 Building BM25 Index...")

# Initialize and build BM25 index
bm25_retriever = BM25Retriever()
bm25_retriever.build_index(corpus_texts)

# Run retrieval
bm25_results = bm25_retriever.retrieve(query_texts, k=20)

# Evaluate performance
bm25_metrics = IRMetrics.evaluate_retrieval(bm25_results, qrels_dict)
IRMetrics.print_metrics(bm25_metrics, "BM25 (Keyword-Based) Results")

## 📊 Method 2: Word2Vec Static Embeddings

**The Bridge Between Keywords and Context:**
- **Static word embeddings**: Each word has one fixed vector representation
- **Limitations**: No context awareness - "bank" gets the same vector in "I deposit my money in the bank" and in "I like to gaze at the stars from the bank of the river"

**How Word2Vec Works:**
1. **Training**: Learn word relationships from text corpus using context windows
2. **Word Vectors**: Map words to dense vector space (e.g., 100-300 dimensions)

**Key Insight**: This represents the **pre-transformer era** of semantic search!
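
The "bank" limitation can be illustrated with a toy example. The 3-dimensional vectors below are hand-picked and purely hypothetical (a trained Word2Vec model would learn 100-300 dimensions from data), and mean pooling is one common way to turn word vectors into a text vector, assumed here rather than taken from `Word2VecRetriever`.

```python
# Hypothetical static vectors, chosen by hand for illustration only
toy_vectors = {
    "bank": [0.5, 0.5, 0.0],
    "money": [0.9, 0.1, 0.0],
    "deposit": [0.8, 0.2, 0.1],
    "river": [0.1, 0.9, 0.0],
    "stars": [0.0, 0.7, 0.6],
}


def embed(text, vectors):
    """Mean-pool the static vectors of all known words in the text."""
    found = [vectors[word] for word in text.lower().split() if word in vectors]
    dim = len(next(iter(vectors.values())))
    if not found:
        return [0.0] * dim
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


finance = "i deposit my money in the bank"
nature = "i gaze at the stars from the bank of the river"

# The lookup for "bank" does not depend on the sentence at all:
# both texts receive exactly toy_vectors["bank"] for that word.
print("bank vector:", toy_vectors["bank"])
print("finance text:", [round(x, 3) for x in embed(finance, toy_vectors)])
print("nature text: ", [round(x, 3) for x in embed(nature, toy_vectors)])
```

The two text vectors differ only because of the surrounding words; the contribution of "bank" itself is identical in both, which is what contextual models later address.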

In [None]:
print("📊 Building Word2Vec Index...")


# Reload the Word2Vec retriever module to get latest changes
import importlib
if 'word2vec_retriever' in sys.modules:
    importlib.reload(sys.modules['word2vec_retriever'])

# Import our Word2Vec retriever
from word2vec_retriever import Word2VecRetriever

# Initialize Word2Vec retriever with improved parameters
word2vec_retriever = Word2VecRetriever(vector_size=200, window=10, min_count=1)
word2vec_retriever.build_index(corpus_texts)

# Show vocabulary statistics
stats = word2vec_retriever.get_vocabulary_stats()
print(f"📚 Word2Vec vocabulary: {stats['vocabulary_size']:,} words")
print(f"📏 Vector size: {word2vec_retriever.vector_size}")
print(f"🪟 Context window: {word2vec_retriever.window}")

# Run retrieval
word2vec_results = word2vec_retriever.retrieve(query_texts, k=20)

# Evaluate performance
word2vec_metrics = IRMetrics.evaluate_retrieval(word2vec_results, qrels_dict)
IRMetrics.print_metrics(word2vec_metrics, "Word2Vec (Static Embeddings) Results")

### 🔧 Word2Vec Parameter Optimization

**The Challenge with Word2Vec:**
The current Word2Vec performance is suboptimal. This is common when:
- The training corpus is small
- Default parameters aren't optimized for the specific task
- Model architecture choices aren't ideal for retrieval

**Parameters to Optimize:**
1. **`vector_size`**: Embedding dimensions (100, 200, 300)
2. **`window`**: Context window size (5 or 20 words)
3. **`min_count`**: Minimum word frequency threshold (1, 2)
4. **`sg`**: Algorithm choice (0=CBOW, 1=Skip-gram)
5. **`epochs`**: Training iterations (25 or 100)

**Strategy:**
- Random search over 10 of the 48 possible grid combinations
- Focus on MRR improvement as primary metric
- Update global variables with best configuration

**Educational Note:** This demonstrates why embeddings often need task-specific tuning!
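
Sampling a handful of configurations from a parameter grid could look roughly like the sketch below. The implementation of `random_param_combinations` in `utils` is not shown in this notebook, so `sample_param_combinations` and its `seed` argument are hypothetical stand-ins, not the actual helper.

```python
import itertools
import random


def sample_param_combinations(param_grid, n_samples, seed=42):
    """Draw n_samples distinct configurations from the full Cartesian grid."""
    keys = list(param_grid)
    all_combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(param_grid[key] for key in keys))
    ]
    if n_samples >= len(all_combos):
        return all_combos  # grid is small enough to test exhaustively
    # A seeded generator keeps the sample reproducible across runs
    return random.Random(seed).sample(all_combos, n_samples)


grid = {
    "vector_size": [100, 200, 300],
    "window": [5, 20],
    "min_count": [1, 2],
    "sg": [0, 1],
    "epochs": [25, 100],
}
combos = sample_param_combinations(grid, n_samples=10)
print(f"Sampled {len(combos)} configurations, for example: {combos[0]}")
```

Sampling trades coverage for time: each configuration requires retraining Word2Vec on the corpus, so testing only a subset keeps the search quick at the risk of missing the best setting.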

In [None]:
import logging
import warnings
import sys
from contextlib import redirect_stderr
from io import StringIO

warnings.filterwarnings("ignore")

# Reload module if needed
import importlib
if 'word2vec_retriever' in sys.modules:
    importlib.reload(sys.modules['word2vec_retriever'])

from word2vec_retriever import Word2VecRetriever
from evaluator import IRMetrics
from utils import random_param_combinations

# Define parameter grid
param_grid = {
    'vector_size': [100, 200, 300],
    'window': [5, 20],
    'min_count': [1, 2],
    'sg': [0, 1],
    'epochs': [25, 100]
}

param_combinations = random_param_combinations(param_grid, n_samples=10)

print(f"Testing {len(param_combinations)} parameter combinations...")

# Run optimization
# Wrap the entire optimization call in stderr suppression
with redirect_stderr(StringIO()):
    results = Word2VecRetriever.optimize_parameters(
        param_combinations=param_combinations,
        corpus_texts=corpus_texts,
        query_texts=query_texts,
        qrels_dict=qrels_dict,
        evaluator_class=IRMetrics
    )

# Show top 3 configurations
if results['results_log']:
    print("\nTop 3 Configurations:")
    for i, r in enumerate(sorted(results['results_log'], key=lambda x: x['mrr'], reverse=True)[:3], 1):
        print(f"{i}. MRR: {r['mrr']:.4f} | {r['config']}")

# Apply best configuration if found
if results['best_config']:
    print(f"\nBest Config: {results['best_config']}")
    
    orig_mrr = word2vec_metrics['MRR']
    orig_r5 = word2vec_metrics['Recall@5']
    new_mrr = results['best_score']
    new_r5 = results['best_metrics']['Recall@5']
    
    print(f"MRR: {orig_mrr:.4f} → {new_mrr:.4f} ({((new_mrr - orig_mrr) / orig_mrr) * 100:+.1f}%)")
    print(f"Recall@5: {orig_r5:.4f} → {new_r5:.4f} ({((new_r5 - orig_r5) / orig_r5) * 100:+.1f}%)")
    
    # IRMetrics.print_metrics(results['best_metrics'], "Optimized Word2Vec")
    
    # Update global variables
    word2vec_retriever = Word2VecRetriever(**results['best_config'])
    word2vec_retriever.build_index(corpus_texts)
    word2vec_results = word2vec_retriever.retrieve(query_texts, k=20)
    word2vec_metrics = results['best_metrics']
    
    print("✓ Updated with optimized configuration")
else:
    print("No improvement found")

# Clean up with suppressed stderr
import gc
with redirect_stderr(StringIO()):
    gc.collect()

## 🧠 Method 3: Transformer-Based Semantic Retrieval

**The Power of Semantic Understanding:**
- Contextual word embeddings
- Understanding of synonyms and paraphrases
- Captures semantic meaning beyond exact words

**How Transformer Retrieval Works:**
1. **Encode**: Transform text into dense vector representations
2. **Compare**: Use cosine similarity between query and document vectors
3. **Retrieve**: Find documents with highest semantic similarity
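
The compare and retrieve steps can be sketched without any model. The vectors below are hypothetical stand-ins for encoder output (a real sentence encoder such as `all-MiniLM-L6-v2` produces a few hundred dimensions), and `top_k` is a name invented for this example rather than part of `TransformerRetriever`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; in practice these come from the encode step
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

ranking = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this is linear in the corpus size; larger collections typically rely on vectorized matrix products or approximate nearest-neighbor indexes instead.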

**From Sequence Models to Transformers:**
- **RNNs/LSTMs**: Sequential processing, limited context
- **Transformers**: Parallel processing, global attention, rich context

In [None]:
print("🧠 Building Transformer-Based Semantic Index...")

# Initialize transformer retriever with a lightweight model
transformer_retriever = TransformerRetriever(model_name="all-MiniLM-L6-v2")
transformer_retriever.build_index(corpus_texts)

# Run semantic retrieval
transformer_results = transformer_retriever.retrieve(query_texts, k=20)

# Evaluate performance
transformer_metrics = IRMetrics.evaluate_retrieval(transformer_results, qrels_dict)
IRMetrics.print_metrics(transformer_metrics, "Transformer (Semantic) Results")

## 📊 Comparison: Evolution of Text Representations

Let's compare how well each method performs, showing the evolution from keywords → static embeddings → contextual embeddings.

In [None]:
print("📊 Comparing All Three Methods")
print("=" * 70)

# Create comparison using utility function
comparison_metrics = {
    'BM25 (Keywords)': bm25_metrics,
    'Word2Vec (Static)': word2vec_metrics,
    'Transformer (Contextual)': transformer_metrics
}

# Print comparison table
print_comparison_table(comparison_metrics, title="All Methods Comparison")

# Create visualization
create_comparison_visualization(
    comparison_metrics, 
    title="Evolution of Text Representation in Information Retrieval",
    figsize=(16, 6),
    show_values=True,
    show_radar=True
)

print("\n🎯 Evolution Summary:")
print("   📊 BM25: Keyword matching - fast but limited semantic understanding")
print("   📊 Word2Vec: Static embeddings - learns word relationships but struggles with small corpus")
print("   📊 Transformer: Contextual embeddings - full semantic and contextual understanding")
print("   • Higher scores are better for all metrics")
print("   • Recall@k: What fraction of relevant docs are found in top k results?")
print("   • Precision@k: What fraction of top k results are actually relevant?")
print("   • MRR (Mean Reciprocal Rank): How well does the method rank relevant docs?")

## 🔍 Example Analysis: Evolution in Action

Let's examine specific examples to understand the progression from keywords → static embeddings → contextual embeddings.

In [None]:
from utils import compare_retrieval_methods

results_dict = {
    'BM25': bm25_results,
    'Word2Vec': word2vec_results,
    'Transformer': transformer_results
}

stats = compare_retrieval_methods(
    query_texts=query_texts,
    corpus_texts=corpus_texts,
    qrels_dict=qrels_dict,
    results_dict=results_dict,
    n_examples=10
)


## 💭 Key Insights: The Evolution of Text Understanding

### The Three Eras of Information Retrieval:

#### 1️⃣ **BM25 (Keywords Era)**
- **Strength**: Fast, interpretable, exact term matching, refined over decades of IR research
- **Weakness**: No semantic understanding, struggles with synonyms

#### 2️⃣ **Word2Vec (Static Embeddings Era)**  
- **Strength**: Captures word relationships, understands synonyms and word similarities
- **Weakness**: No context awareness, and needs a large training corpus for good performance

#### 3️⃣ **Transformers (Contextual Era)**
- **Strength**: Full contextual understanding, handles complex semantics, pre-trained on massive data
- **Weakness**: Computationally expensive, needs more resources

**This shows why the field moved from corpus-trained Word2Vec → large pre-trained static embeddings (e.g. GloVe) → Transformers!**


# 🔬 Deep Dive: Transformer Model Comparison

## Advanced Analysis: How Different Transformers Perform

Now that we've established transformers as the clear winner, let's dive deeper and compare different transformer architectures to understand how model design affects retrieval performance.

**🎯 Research Questions:**
- How do different transformer sizes affect retrieval quality?
- Do domain-specific vs. general-purpose models differ?
- What's the performance vs. efficiency trade-off?

**🏗️ Models to Compare:**
1. **MiniLM-L6** (Small & Fast): Lightweight model, good efficiency
2. **MPNet-base** (Medium): General-purpose sentence embedding model, balanced performance
3. **BGE-small** (Retrieval-Optimized): Specifically trained for retrieval tasks

**📊 What We'll Measure:**
- Retrieval quality (Recall, Precision, MRR)
- Model characteristics (parameters, speed)
- Specialization effects (general vs. retrieval-specific models)

In [None]:
print("🔬 Transformer Model Comparison")
print("=" * 70)

# Define transformer models to compare
transformer_models = [
    {
        'name': 'MiniLM-L6-v2',
        'model_id': 'all-MiniLM-L6-v2', 
        'description': 'Small & Fast - Lightweight model optimized for speed',
        'characteristics': 'Parameters: ~23M, Fast inference, Good balance'
    },
    {
        'name': 'MPNet-base',
        'model_id': 'sentence-transformers/all-mpnet-base-v2',
        'description': 'Medium - High-quality semantic embeddings, balanced performance',  
        'characteristics': 'Parameters: ~109M, Microsoft MPNet, Strong semantic understanding'
    },
    {
        'name': 'BGE-small',
        'model_id': 'BAAI/bge-small-en-v1.5',
        'description': 'Retrieval-Optimized - Specifically trained for search tasks',
        'characteristics': 'Parameters: ~33M, Retrieval-focused, State-of-the-art'
    }
]

# Store results for all models
transformer_results_dict = {}
transformer_metrics_dict = {}

print(f"🧪 Testing {len(transformer_models)} different transformer models...")
print(f"📄 Using {len(corpus_texts):,} documents and {len(query_texts)} queries\n")

import time

for i, model_config in enumerate(transformer_models, 1):
    print(f"\n{'='*60}")
    print(f"🤖 Model {i}: {model_config['name']}")
    print(f"📝 {model_config['description']}")
    print(f"⚙️  {model_config['characteristics']}")
    print(f"{'='*60}")
    
    try:
        # Initialize model
        print(f"📥 Loading {model_config['name']}...")
        start_time = time.time()
        
        # Create new retriever instance  
        model_retriever = TransformerRetriever(model_name=model_config['model_id'])
        model_retriever.build_index(corpus_texts)
        
        build_time = time.time() - start_time
        print(f"⏱️  Index build time: {build_time:.1f} seconds")
        
        # Run retrieval
        print("🔍 Running retrieval...")
        start_time = time.time()
        
        results = model_retriever.retrieve(query_texts, k=20)
        
        retrieval_time = time.time() - start_time
        print(f"⏱️  Retrieval time: {retrieval_time:.1f} seconds")
        
        # Evaluate performance
        metrics = IRMetrics.evaluate_retrieval(results, qrels_dict)
        
        # Store results
        transformer_results_dict[model_config['name']] = results
        transformer_metrics_dict[model_config['name']] = metrics
        
        # Print metrics
        print(f"\n📊 {model_config['name']} Results:")
        IRMetrics.print_metrics(metrics, f"{model_config['name']} Performance")
        
        print(f"⚡ Speed: {retrieval_time:.1f}s retrieval, {build_time:.1f}s build")
        
    except Exception as e:
        print(f"❌ Error with {model_config['name']}: {e}")

        if model_config['model_id'] == "all-MiniLM-L6-v2":
            # Same model as the earlier transformer run, so its results can be reused
            print("   Reusing results from the earlier all-MiniLM-L6-v2 run...")
            transformer_results_dict[model_config['name']] = transformer_results
            transformer_metrics_dict[model_config['name']] = transformer_metrics
        else:
            # Skip failed models: placeholder scores would distort the comparison
            print("   Skipping this model in the comparison.")

print("\n✅ Transformer model comparison complete!")

In [None]:
print("📊 Transformer Model Comparison Analysis")
print("=" * 70)

# Create comprehensive comparison using utility functions
if len(transformer_metrics_dict) >= 2:
    # Print detailed comparison table
    print_comparison_table(transformer_metrics_dict, title="Transformer Models Performance")
    
    # Create visualization
    create_comparison_visualization(
        transformer_metrics_dict,
        title="Transformer Models Performance Comparison", 
        figsize=(16, 6),
        show_values=True,
        show_radar=True
    )
    
    # Performance insights
    print("\n🎯 Key Performance Insights:")
    model_names = list(transformer_metrics_dict.keys())
    metrics_to_compare = ['Recall@1', 'Recall@5', 'Recall@10', 'Precision@5', 'MRR']
    
    # Find best performing model for each metric
    for metric in metrics_to_compare:
        if all(metric in transformer_metrics_dict[name] for name in model_names):
            best_model = max(model_names, key=lambda x: transformer_metrics_dict[x][metric])
            best_score = transformer_metrics_dict[best_model][metric]
            print(f"   🥇 {metric}: {best_model} ({best_score:.4f})")
    
    # Overall winner
    overall_scores = {}
    for model_name in model_names:
        avg_score = np.mean([transformer_metrics_dict[model_name][metric] 
                           for metric in metrics_to_compare 
                           if metric in transformer_metrics_dict[model_name]])
        overall_scores[model_name] = avg_score
    
    if overall_scores:
        overall_winner = max(overall_scores.keys(), key=lambda x: overall_scores[x])
        print(f"\n🏆 Overall Best: {overall_winner} (Average Score: {overall_scores[overall_winner]:.4f})")
    
    # Performance vs baseline comparison
    print("\n📈 Improvement over BM25 baseline:")
    baseline_mrr = bm25_metrics['MRR']
    
    for model_name in model_names:
        if 'MRR' in transformer_metrics_dict[model_name]:
            model_mrr = transformer_metrics_dict[model_name]['MRR']
            improvement = ((model_mrr - baseline_mrr) / baseline_mrr) * 100
            print(f"   🚀 {model_name}: {improvement:+.1f}% improvement in MRR")

else:
    print("⚠️  Not enough models loaded for comparison. Please run the previous cell successfully.")

## Model Architecture Comparison

### MiniLM-L6-v2 (Small & Fast)
- **Size**: 23M parameters, 6 layers
- **Training**: Knowledge distillation from BERT
- **Strength**: Speed and efficiency
- **Use Case**: Real-time applications, resource-constrained environments

### MPNet-base (Balanced)
- **Size**: ~109M parameters, 12 layers
- **Training**: Masked and permuted language modeling (MPNet), then fine-tuned on sentence pairs for `all-mpnet-base-v2`
- **Strength**: Strong general-purpose semantic embeddings
- **Use Case**: Applications that can trade some speed for embedding quality

### BGE-small (Retrieval-Optimized)
- **Size**: 33M parameters
- **Training**: Contrastive learning on query-document pairs
- **Strength**: Optimized for retrieval and similarity
- **Use Case**: Search and semantic retrieval tasks


### Key Takeaway
Transformer architectures make different trade-offs between size, speed, and task-specific performance. Smaller models can match or exceed larger ones when optimized for specific tasks.