# Learning Arc: Advanced Reranking Strategies

## Purpose
This notebook teaches you to implement four complementary reranking strategies that improve search result quality beyond basic cross-encoder models. You'll learn when to apply ensemble voting, diversity algorithms (MMR), temporal boosting, and user personalization to production RAG systems while respecting performance budgets.

## Concepts Covered
- **Ensemble Reranking**: Combining multiple cross-encoder models using weighted averaging, Borda voting, or confidence fusion to reduce bias and improve accuracy by 8-12%
- **Maximal Marginal Relevance (MMR)**: Balancing relevance versus diversity through iterative document selection with the formula `λ×relevance - (1-λ)×max_similarity`
- **Temporal Boosting**: Applying exponential decay to document scores based on age, with linguistic cue detection for time-sensitive queries
- **User Preference Learning**: Extracting features (source, type, depth, length) from documents and predicting user-specific appeal scores
- **Pipeline Orchestration**: Combining all strategies while staying under 200ms P95 latency SLA

## After Completing
You will be able to design production reranking pipelines that handle accuracy requirements (ensemble), diversity needs (MMR), time-sensitive queries (temporal), and personalization. You'll understand when NOT to use advanced reranking (first-pass retrieval <60% precision, low traffic, simple queries) and how to debug common failure modes like ensemble overconfidence, MMR over-diversification, recency bias overwhelming relevance, preference overfitting, and latency bottlenecks.

## Context in Track
This is **Module 9.4** in the Level 3 advanced retrieval track, building on M9.1-M9.3 (Query Decomposition, Multi-Hop Retrieval, HyDE) and Level 1 M1.4 (basic cross-encoder reranking). It prepares you for M10 (Production Monitoring), M11 (Cost Optimization), and M12 (Multi-Modal RAG). The module assumes you have a working single-model reranking system and understand the limitations of optimizing purely for similarity without considering recency, diversity, or user preferences.

# Module 9.4: Advanced Reranking Strategies

## Overview

This module teaches Level 3 learners to rerank retrieval results with more than a single cross-encoder model. The notebook covers four complementary strategies for improving search result quality in production RAG systems:

1. **Ensemble Reranking with Voting** - Multiple cross-encoder models reduce bias
2. **Maximal Marginal Relevance (MMR)** - Balance relevance vs diversity
3. **Temporal/Recency Boosting** - Time-aware scoring
4. **User Preference Learning** - Personalization based on implicit feedback

> **Key Insight**: "A single cross-encoder makes one judgment call. It doesn't know about recency, optimizes for similarity not diversity, and has no idea what this particular user cares about."

## Setup and Imports

First, we'll import the required modules and load sample data.

In [None]:
# Offline mode detection
import os
OFFLINE = not bool(os.getenv("OPENAI_API_KEY"))
if OFFLINE:
    print("⚠️ Offline mode: model/network calls will be skipped or use mock data.")

### Configuration and Data Loading

We import the reranking classes and load configuration from environment variables. The example data includes 8 documents about ML frameworks with timestamps, metadata (source, type, technical depth), and a user profile with 150 interactions for testing personalization.

In [None]:
import sys
import os
import json
import numpy as np

# Add project root to path for imports
# `__file__` is undefined inside a notebook, so use the working directory
# (assumes the notebook is run from its own folder)
notebook_dir = os.getcwd()
project_root = os.path.dirname(notebook_dir)
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.l3_m9_advanced_reranking.l3_m9_advanced_reranking_strategies import (
    Document,
    EnsembleReranker,
    MMRReranker,
    TemporalReranker,
    PersonalizationReranker,
    AdvancedReranker
)
from src.l3_m9_advanced_reranking.config import get_config

# Load configuration
config = get_config()
print("✓ Configuration loaded")

# Load example data
example_data_path = os.path.join(project_root, "example_data.json")
with open(example_data_path, "r") as f:
    data = json.load(f)

query = data["query"]
raw_documents = data["documents"]
user_profile = data["user_profile"]

print(f"✓ Loaded {len(raw_documents)} documents")
print(f"✓ Query: '{query}'")

## Strategy 1: Ensemble Reranking with Voting

Multiple cross-encoder models score documents independently, and their results are then aggregated through weighted averaging, rank-based voting, or confidence fusion. This approach reduces individual model bias; the module cites an accuracy improvement of 8-12%, with precision rising from 78% to 86-88% (an absolute gain of 8-10 points).

**Implementation Architecture:**
- `EnsembleReranker` class accepts three configured models with normalized weights
- `_aggregate_scores()` method supports:
  - **Weighted average**: Direct score combination
  - **Voting**: Borda count ranking system
  - **Confidence fusion**: Magnitude-based weighting

**Performance Budget**: 200-400ms latency
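
The three aggregation modes listed above can be sketched with NumPy as follows. This is a minimal illustration, not the code behind `_aggregate_scores()`: the function names are hypothetical, scores are assumed to lie in [0, 1], and "confidence fusion" is interpreted here as weighting each model's score by its distance from a neutral 0.5, which is an assumption about what magnitude-based weighting means.

```python
import numpy as np

def weighted_average(scores, weights):
    """Combine raw scores. `scores` has shape (n_models, n_docs)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    return w @ np.asarray(scores, dtype=float)

def borda_count(scores):
    """Rank-based voting: each model gives n_docs-1 points to its top
    document and 0 points to its lowest-scored document."""
    scores = np.asarray(scores, dtype=float)
    # argsort applied twice turns scores into ascending ranks (0 = lowest)
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return ranks.sum(axis=0)

def confidence_fusion(scores):
    """Assumed interpretation: a score far from 0.5 counts as more
    confident and therefore gets more weight in the average."""
    scores = np.asarray(scores, dtype=float)
    conf = np.abs(scores - 0.5) + 1e-9  # epsilon avoids division by zero
    return (conf * scores).sum(axis=0) / conf.sum(axis=0)

# Three hypothetical models scoring three documents
scores = [[0.9, 0.2, 0.6],
          [0.7, 0.4, 0.8],
          [0.8, 0.1, 0.5]]

print(weighted_average(scores, weights=[0.5, 0.3, 0.2]))
print(borda_count(scores))
print(confidence_fusion(scores))
```

Borda voting only looks at each model's ordering, so it is less sensitive to one model producing systematically larger scores than the others.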

In [None]:
# Convert to Document objects
documents = [
    Document(id=d["id"], text=d["text"], metadata=d["metadata"], score=0.5)
    for d in raw_documents
]

# Initialize ensemble reranker with single model (to avoid long load times)
# ⚠️ Skipping API calls if models unavailable
try:
    ensemble = EnsembleReranker(
        model_names=[config.RERANKER_MODELS[0]],  # Use first model only
        aggregation="weighted"
    )
    
    # Perform ensemble reranking
    result = ensemble.rerank(query, documents[:], top_k=3)
    
    print(f"Latency: {result.latency_ms:.2f}ms")
    print(f"Top 3 documents (ensemble):")
    for i, doc in enumerate(result.documents, 1):
        print(f"  {i}. {doc.id}: {doc.score:.4f}")
except Exception as e:
    print(f"⚠️ Skipping ensemble (model unavailable): {e}")
    
# Expected: Top 3 reranked documents with scores

## Strategy 2: Maximal Marginal Relevance (MMR)

This algorithm balances relevance against diversity using the formula:

```
score = λ × relevance - (1-λ) × max_similarity_to_selected
```

Documents are selected iteratively, ensuring results contain varied perspectives rather than redundant content.

**Key Parameters:**
- `λ = 1.0`: Pure relevance (no diversity)
- `λ = 0.0`: Pure diversity (may sacrifice relevance)
- `λ = 0.7`: Recommended balance

**Performance Budget**: 10-20ms latency
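
The iterative selection can be sketched as a greedy loop over the formula above. This is a minimal sketch, not the `MMRReranker` implementation: `mmr_select` is a hypothetical helper, and it assumes relevance scores and a pairwise document similarity matrix have already been computed.

```python
import numpy as np

def mmr_select(relevance, similarity, lambda_param=0.7, top_k=3):
    """Greedy MMR: repeatedly pick the candidate with the best
    lambda * relevance - (1 - lambda) * max_similarity_to_selected."""
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < top_k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # Nothing is selected on the first pass, so the penalty is 0
            max_sim = max((similarity[i, j] for j in selected), default=0.0)
            score = lambda_param * relevance[i] - (1 - lambda_param) * max_sim
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical data: documents 0 and 1 are near-duplicates
relevance = [0.90, 0.88, 0.70]
similarity = np.array([[1.00, 0.95, 0.10],
                       [0.95, 1.00, 0.10],
                       [0.10, 0.10, 1.00]])

print(mmr_select(relevance, similarity, lambda_param=0.7, top_k=2))
print(mmr_select(relevance, similarity, lambda_param=1.0, top_k=2))
```

With `λ = 0.7` the near-duplicate document 1 is passed over in favor of the less relevant but distinct document 2; with `λ = 1.0` the selection falls back to pure relevance order.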

In [None]:
# Assign placeholder relevance scores (seeded so the run is reproducible)
rng = np.random.default_rng(42)
for doc in documents:
    doc.score = rng.uniform(0.6, 0.9)

# Initialize MMR reranker
mmr = MMRReranker(lambda_param=config.MMR_LAMBDA)

# Perform MMR reranking
result = mmr.rerank(documents[:], top_k=3)

print(f"Latency: {result.latency_ms:.2f}ms")
print(f"Lambda (relevance/diversity balance): {config.MMR_LAMBDA}")
print(f"\nTop 3 diverse documents (MMR):")
for i, doc in enumerate(result.documents, 1):
    print(f"  {i}. {doc.id} - {doc.text[:60]}...")

# Expected: Top 3 diverse documents selected iteratively

## Strategy 3: Temporal/Recency Boosting

For time-sensitive queries, older documents are penalized with exponential decay. The reranker detects such queries through linguistic cues ("latest," "current," "recent") and, when a cue is present, multiplies each document's base relevance score by a recency multiplier.

**Formula:**
```
recency_multiplier = boost_factor × exp(-decay_rate × age_days)
where decay_rate = ln(2) / decay_days (half-life)
```

**Key Parameters:**
- `decay_days`: Half-life for exponential decay (default: 30 days)
- `boost_factor`: Maximum boost for recent documents (default: 1.5)

**Performance Budget**: 5ms latency
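
To make the formula concrete, here is a small sketch that evaluates it with the default parameters. The helper names (`has_temporal_cue`, `recency_multiplier`, `boosted_score`) are hypothetical and are not the methods of `TemporalReranker`; the cue list contains only the three words mentioned above, and whole-word matching is an assumption.

```python
import math
import re

# Only the cues named in the text; a real detector may use a longer list
TEMPORAL_CUES = re.compile(r"\b(latest|current|recent)\b", re.IGNORECASE)

def has_temporal_cue(query: str) -> bool:
    """Return True if the query contains one of the temporal cue words."""
    return TEMPORAL_CUES.search(query) is not None

def recency_multiplier(age_days, decay_days=30.0, boost_factor=1.5):
    """boost_factor * exp(-decay_rate * age_days), with decay_days as half-life."""
    decay_rate = math.log(2) / decay_days
    return boost_factor * math.exp(-decay_rate * age_days)

def boosted_score(base_score, age_days, query, **decay_params):
    """Apply the multiplier only when the query looks time-sensitive."""
    if not has_temporal_cue(query):
        return base_score
    return base_score * recency_multiplier(age_days, **decay_params)

for age in (0, 30, 60):
    print(age, round(recency_multiplier(age), 3))
```

With the defaults, a brand-new document is multiplied by 1.5 and a 30-day-old document by 0.75. The multiplier drops below 1.0 after roughly 17.5 days (30 × log2(1.5)), so documents older than that are penalized rather than boosted.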

In [None]:
# Reset document scores
for doc in documents:
    doc.score = 0.7

# Initialize temporal reranker
temporal = TemporalReranker(
    decay_days=config.RECENCY_DECAY_DAYS,
    boost_factor=config.RECENCY_BOOST_FACTOR
)

# Check if query is temporal
is_temporal = temporal.is_temporal_query(query)
print(f"Query detected as temporal: {is_temporal}")
print(f"Query: '{query}'")

# Perform temporal reranking
result = temporal.rerank(query, documents[:])

print(f"\nLatency: {result.latency_ms:.2f}ms")
print(f"Top 3 documents (with recency boost):")
for i, doc in enumerate(result.documents[:3], 1):
    timestamp = doc.metadata.get("timestamp", "N/A")
    print(f"  {i}. {doc.id}: {doc.score:.4f} (date: {timestamp[:10]})")

# Expected: Recent documents boosted in ranking

## Strategy 4: User Preference Learning

Implicit click feedback trains a lightweight model that predicts how appealing each document is to a given user. Features include document type, source, length, and technical depth, and the prediction is turned into a personalized score multiplier.

**Feature Extraction:**
- Document type match (tutorial, research, reference, opinion)
- Technical depth alignment
- Length preferences
- Source preferences (blog, documentation, research papers)

**Key Parameters:**
- `min_interactions`: Minimum user interactions required (default: 100)

**Performance Budget**: 15-30ms latency
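
As a rough illustration of how feature matches can become a score multiplier, the sketch below uses a hand-written rule instead of a trained model. It is not the `PersonalizationReranker` logic: `preference_multiplier`, `max_boost`, and the `preferred_types` profile key are hypothetical, and only two of the four features (source and document type) are used.

```python
def preference_multiplier(metadata, profile, min_interactions=100, max_boost=0.2):
    """Return a multiplier in [1.0, 1.0 + max_boost] based on feature matches.

    Users with too little history get a neutral multiplier of 1.0.
    """
    if profile.get("interaction_count", 0) < min_interactions:
        return 1.0
    prefs = profile.get("preferences", {})
    matches = [
        metadata.get("source") in prefs.get("preferred_sources", []),
        metadata.get("doc_type") in prefs.get("preferred_types", []),
    ]
    # Fraction of matching features scales the boost linearly
    return 1.0 + max_boost * sum(matches) / len(matches)

# Hypothetical profile and documents
profile = {
    "interaction_count": 150,
    "preferences": {
        "preferred_sources": ["documentation"],
        "preferred_types": ["tutorial"],
    },
}
full_match = {"source": "documentation", "doc_type": "tutorial"}
half_match = {"source": "blog", "doc_type": "tutorial"}

print(preference_multiplier(full_match, profile))
print(preference_multiplier(half_match, profile))
```

Capping the boost keeps personalization from overriding base relevance, which is the conservative blending that helps against preference overfitting.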

In [None]:
# Reset document scores
for doc in documents:
    doc.score = 0.7

# Initialize personalization reranker
personalization = PersonalizationReranker(min_interactions=config.MIN_USER_INTERACTIONS)

# Display user profile
print(f"User: {user_profile['user_id']}")
print(f"Interactions: {user_profile['interaction_count']}")
print(f"Preferred sources: {user_profile['preferences']['preferred_sources']}")

# Perform personalization reranking
result = personalization.rerank(documents[:], user_profile)

print(f"\nLatency: {result.latency_ms:.2f}ms")
print(f"Personalized: {result.debug_info['personalized']}")
print(f"\nTop 3 personalized documents:")
for i, doc in enumerate(result.documents[:3], 1):
    source = doc.metadata.get("source", "N/A")
    doc_type = doc.metadata.get("doc_type", "N/A")
    print(f"  {i}. {doc.id}: {doc.score:.4f} (source: {source}, type: {doc_type})")

# Expected: Documents matching user preferences ranked higher

## Combined Pipeline: All Strategies Together

The `AdvancedReranker` orchestrates all four strategies with performance budgets. Combined approaches must stay under 200ms P95 response time for SLA compliance.

**Pipeline Order:**
1. Ensemble Reranking (200-400ms) - Initial scoring
2. Temporal Boosting (5ms) - Time-aware adjustment
3. Personalization (15-30ms) - User-specific weighting
4. MMR Diversity (10-20ms) - Final selection

**Total Budget**: <200ms P95 (may require disabling ensemble for speed)
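
The arithmetic behind the budget can be checked directly. The sketch below sums the per-stage ranges from the pipeline order above, assuming the stages run one after another with no overlap; the names `STAGE_BUDGET_MS` and `latency_range_ms` are illustrative only.

```python
# Per-stage latency ranges in ms (low, high), copied from the pipeline order
STAGE_BUDGET_MS = {
    "ensemble": (200, 400),
    "temporal": (5, 5),
    "personalization": (15, 30),
    "mmr": (10, 20),
}
SLA_P95_MS = 200

def latency_range_ms(enabled_stages):
    """Sum the lower and upper bounds of the enabled stages (sequential run)."""
    low = sum(STAGE_BUDGET_MS[s][0] for s in enabled_stages)
    high = sum(STAGE_BUDGET_MS[s][1] for s in enabled_stages)
    return low, high

full = latency_range_ms(["ensemble", "temporal", "personalization", "mmr"])
fast = latency_range_ms(["temporal", "personalization", "mmr"])

print("all stages:", full, "within SLA:", full[1] <= SLA_P95_MS)
print("no ensemble:", fast, "within SLA:", fast[1] <= SLA_P95_MS)
```

With all four stages the range is 230-455ms, so even the best case misses the 200ms SLA. Without the ensemble the range is 30-55ms, which is why the ensemble is disabled in the cell that follows.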

In [None]:
# Reset documents
documents = [
    Document(id=d["id"], text=d["text"], metadata=d["metadata"], score=0.5)
    for d in raw_documents
]

# Initialize advanced reranker (disable ensemble for speed)
advanced = AdvancedReranker(
    enable_ensemble=False,  # Disabled to meet latency budget
    enable_mmr=True,
    enable_temporal=True,
    enable_personalization=True,
    config={
        "mmr_lambda": config.MMR_LAMBDA,
        "decay_days": config.RECENCY_DECAY_DAYS,
        "boost_factor": config.RECENCY_BOOST_FACTOR,
        "min_interactions": config.MIN_USER_INTERACTIONS
    }
)

# Run full pipeline
result = advanced.rerank(query, documents, user_profile=user_profile, top_k=3)

print(f"Total Pipeline Latency: {result.latency_ms:.2f}ms")
print(f"\nPipeline Steps:")
for step in result.debug_info["pipeline_steps"]:
    print(f"  - {step['strategy']}: {step['latency_ms']:.2f}ms")

print(f"\nFinal Top 3 Results:")
for i, doc in enumerate(result.documents, 1):
    print(f"  {i}. {doc.id}: {doc.score:.4f}")

# Expected: All strategies applied, total latency <200ms

## Common Failure Modes

Understanding when advanced reranking breaks is crucial for production deployments.

### 1. Ensemble Overconfidence
**When**: Multiple models agree on an incorrect ranking because they were all trained similarly  
**Fix**: Use diverse model architectures, switch to voting aggregation

### 2. MMR Trade-offs
**When**: An excessive diversity penalty sacrifices relevance  
**Fix**: Increase lambda parameter (e.g., from 0.5 to 0.8)

### 3. Recency Bias Overwhelming
**When**: The recency boost drowns out the actual relevance signal  
**Fix**: Reduce boost_factor or increase decay_days

### 4. Preference Learning Overfitting
**When**: The model memorizes individual user quirks rather than generalizing  
**Fix**: Increase min_interactions threshold, blend with base relevance more conservatively

### 5. Latency Bottlenecks
**When**: Running three models sequentially exceeds the latency budget  
**Fix**: Disable ensemble or use single lightweight model

## Decision Card: When to Use Advanced Reranking

### ✅ USE Advanced Reranking When:
- First-pass retrieval has ≥60% precision
- Query complexity requires multiple perspectives
- User requests diverse results
- Time-sensitive information matters
- Personalization improves user engagement
- Production traffic >1,000 queries/day
- Accuracy improvements justify latency cost

### ❌ DO NOT USE When:
- **First-pass retrieval has <60% precision** (fix retrieval first!)
- Simple, unambiguous queries
- Low traffic (<1,000 queries/day)
- Latency budget <100ms total
- No user interaction data for personalization
- Cost/complexity exceeds value

### Strategy Selection Matrix

| Strategy | When to Apply | Skip If |
|----------|---------------|---------|
| **Ensemble** | High-stakes queries, need confidence | Latency critical, simple queries |
| **MMR** | User wants diverse perspectives | Single answer expected |
| **Temporal** | Query contains temporal keywords | Evergreen content queries |
| **Personalization** | User has ≥100 interactions | New user, no history |
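
The matrix can be read as a set of per-query switches. The sketch below is one hedged way to encode it: `select_strategies` and its arguments are hypothetical, and the signals it takes (whether a query is high-stakes, whether the user wants diversity) are assumed to be determined elsewhere.

```python
def select_strategies(*, high_stakes, latency_critical, wants_diversity,
                      is_temporal, interaction_count, min_interactions=100):
    """Map the rows of the selection matrix to on/off flags."""
    return {
        # Ensemble: high-stakes queries, unless latency is critical
        "ensemble": high_stakes and not latency_critical,
        # MMR: only when the user wants diverse perspectives
        "mmr": wants_diversity,
        # Temporal: only when the query contains temporal keywords
        "temporal": is_temporal,
        # Personalization: only with enough interaction history
        "personalization": interaction_count >= min_interactions,
    }

flags = select_strategies(
    high_stakes=True,
    latency_critical=True,
    wants_diversity=True,
    is_temporal=False,
    interaction_count=150,
)
print(flags)
```

The resulting flags line up with the `enable_ensemble`, `enable_mmr`, `enable_temporal`, and `enable_personalization` arguments used when constructing `AdvancedReranker` earlier in the notebook.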

### Production Checklist
1. ✓ Measure first-pass retrieval precision
2. ✓ Profile latency budget per strategy
3. ✓ Monitor P95 latency in production
4. ✓ A/B test accuracy improvements
5. ✓ Track user engagement metrics
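
The first and third checklist items can be measured with a few lines of code. This is a minimal sketch with hypothetical helper names; it assumes you have labeled relevant document IDs per query and a log of per-request latencies, and it uses `numpy.percentile` with its default linear interpolation.

```python
import numpy as np

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def p95_latency_ms(latencies_ms):
    """95th percentile of the observed request latencies."""
    return float(np.percentile(latencies_ms, 95))

# Hypothetical first-pass results and relevance labels for one query
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e", "z"}
precision = precision_at_k(retrieved, relevant, k=5)

print(f"precision@5 = {precision:.2f}")
print("meets 60% threshold:", precision >= 0.60)
print("P95 latency:", p95_latency_ms([12, 15, 18, 22, 180]))
```

In practice the precision figure should be averaged over a representative set of labeled queries before comparing it with the 60% threshold.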

## Summary and Next Steps

### What You Learned
✓ **Ensemble Reranking**: Combining multiple models reduces bias and improves accuracy by 8-12%  
✓ **MMR Diversity**: Balancing relevance vs diversity for varied perspectives  
✓ **Temporal Boosting**: Time-aware scoring with exponential decay  
✓ **Personalization**: User preference learning from implicit feedback  
✓ **Combined Pipeline**: Orchestrating all strategies with performance budgets

### Key Takeaways
1. **Fix retrieval first**: Advanced reranking only helps when first-pass retrieval has ≥60% precision
2. **Mind the latency**: Combined pipeline must stay under 200ms P95 for SLA compliance
3. **Choose strategies wisely**: Not all queries need all strategies—use the decision matrix
4. **Monitor production**: Track latency, accuracy improvements, and user engagement
5. **Handle failures gracefully**: Understand common failure modes and their fixes

### Prerequisites for This Module
- Level 1 M1.4: Basic Cross-Encoder Reranking
- M9.1-M9.3: Query Decomposition, Multi-Hop Retrieval, HyDE
- Working single-model reranking system

### Next Modules
- **M10**: Production Monitoring and Observability
- **M11**: Cost Optimization and Caching Strategies
- **M12**: Multi-Modal RAG Systems

### Additional Resources
- [Sentence Transformers Cross-Encoders](https://www.sbert.net/examples/applications/cross-encoder/README.html)
- [MMR Original Paper](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf)
- README.md for detailed documentation and troubleshooting