### Hybrid Search with Reranking

**Core Concept**: Combine the speed of keyword matching (BM25) with the semantic understanding of dense embeddings, then use reranking to surface the most relevant results.

**The Pipeline**:
1. Stage 1 - Hybrid Retrieval: Cast a wide net using both BM25 and vector similarity
2. Stage 2 - Reranking: Apply a cross-encoder or LLM to deeply score query-document pairs

**Why This Matters**:
- BM25 catches exact term matches (good for technical queries)
- Dense retrieval catches semantic similarity (good for conceptual queries)
- Reranking ensures the *best* results bubble to the top, not just *good* results

**What You'll Learn**:
- Building hybrid retrievers with ensemble methods
- Cross-encoder reranking for quality improvements
- LLM-based reranking for complex reasoning
- Production patterns and cost considerations

In [11]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain_classic.retrievers import EnsembleRetriever
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.document_loaders import WikipediaLoader
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

import os
from dotenv import load_dotenv

load_dotenv()
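# load_dotenv() already copies the key into os.environ.
# Fail early with a clear message if it is missing
# (assigning None to os.environ would raise a TypeError).
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")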

### Data Preparation

Load the documents. Chunking strategy matters for retrieval quality:
- Smaller chunks (200-300 tokens): Better precision, more granular matching
- Larger chunks (500-1000 tokens): More context, but can dilute relevance signals

For hybrid search, moderate chunk sizes with overlap work well. To keep the demo small, the cells below index each loaded article as a single document rather than splitting it.
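
As a rough illustration of what "moderate chunk sizes with overlap" means, here is a minimal pure-Python sketch. It is a stand-in, not the splitter this notebook imports: `chunk_words` is a hypothetical helper, it approximates tokens with whitespace-separated words, and the sizes (250 words, 50 overlap) are assumptions chosen only for the example.

```python
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 600-word text, used only to show the window arithmetic
sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(sample, chunk_size=250, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk repeats the last 50 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. In a LangChain pipeline, `RecursiveCharacterTextSplitter` (imported above) plays this role, with sizes measured in characters by default.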

In [12]:
loader = WikipediaLoader(query="Transformer (deep learning)", load_max_docs=8)
docs = loader.load()
print(f"Loaded {len(docs)} articles")
print("First doc preview:\n", docs[0].page_content[:220], "...")

Loaded 7 articles
First doc preview:
 In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is convert ...






### Building the Hybrid Retriever

**BM25 Retriever**: Statistical, keyword-based. Fast but misses synonyms.
- Scores based on term frequency and inverse document frequency
- Good for exact matches like "LangChain agents" or "FAISS indexing"

**Dense Retriever**: Embedding-based, semantic. Slower but understands meaning.
- Scores based on cosine similarity in vector space
- Good for "how to build stateful assistants" → matches "memory in LangChain"

**EnsembleRetriever**: Combines both using weighted scoring (Reciprocal Rank Fusion by default).
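
To make the BM25 side concrete, the sketch below scores a toy corpus with the textbook Okapi BM25 formula. It is an illustration under assumptions, not the code `BM25Retriever` runs: `bm25_scores` is a hypothetical helper, tokenization is a plain lowercase split, and `k1=1.5`, `b=0.75` are common defaults that may differ from the library's implementation.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "faiss indexing stores dense vectors",
    "memory lets an assistant keep state across turns",
    "bm25 ranks documents by keyword overlap",
]
exact = bm25_scores("faiss indexing", corpus)
paraphrase = bm25_scores("how to build stateful assistants", corpus)
print(exact)
print(paraphrase)
```

The exact-term query scores only the first document, while the paraphrased query scores nothing at all: "assistants" and "stateful" never literally appear in the corpus. That gap is what the dense retriever is meant to cover.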

In [13]:
# BM25 Retriever - keyword-based
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10  # Retrieve top 10 from BM25

# Dense Retriever - embedding-based
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Ensemble combines both
# Weights: [BM25, Dense] - adjust based on your use case
# 0.5/0.5 = balanced, 0.7/0.3 = favor keywords, 0.3/0.7 = favor semantics
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]  # Equal weight for demonstration
)

print("Hybrid retriever created")
print("Strategy: Reciprocal Rank Fusion (RRF)")

Hybrid retriever created
Strategy: Reciprocal Rank Fusion (RRF)


### Understanding Reciprocal Rank Fusion (RRF)

RRF is the default scoring method in EnsembleRetriever. It works by:

```
For each document:
    RRF_score = Σ_i (weight_i / (k + rank_i))
    where rank_i is the document's 1-based rank in retriever i
    and k=60 (default constant; EnsembleRetriever exposes it as `c`)
```

**Example**:
- Doc A: BM25 rank=1, Dense rank=5
  - RRF = 0.5/(60+1) + 0.5/(60+5) = 0.0082 + 0.0077 = 0.0159
- Doc B: BM25 rank=10, Dense rank=2
  - RRF = 0.5/(60+10) + 0.5/(60+2) = 0.0071 + 0.0081 = 0.0152

Doc A narrowly wins (0.0159 vs 0.0152) despite its worse dense rank, because its BM25 rank is much better.
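
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of weighted RRF as described in this section, not the library's source; `weighted_rrf` is a hypothetical helper and the rank dictionaries simply restate the Doc A / Doc B example.

```python
def weighted_rrf(rankings: list[dict[str, int]], weights: list[float], k: int = 60) -> dict[str, float]:
    """Fuse per-retriever ranks (1 = best) into one weighted RRF score per document."""
    scores: dict[str, float] = {}
    for ranks, weight in zip(rankings, weights):
        for doc_id, rank in ranks.items():
            # A document missing from a retriever's list simply gets no term from it
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return scores


bm25_ranks = {"A": 1, "B": 10}
dense_ranks = {"A": 5, "B": 2}
fused = weighted_rrf([bm25_ranks, dense_ranks], weights=[0.5, 0.5])
for doc_id in sorted(fused, key=fused.get, reverse=True):
    print(doc_id, round(fused[doc_id], 4))
# A 0.0159
# B 0.0152
```

Because only ranks enter the formula, BM25 scores and embedding similarities never need to be put on a common scale, which is the main reason RRF is a convenient fusion method.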

In [14]:
# Test hybrid retrieval
query = "What is transformer and why is better than LSTM??"

hybrid_results = hybrid_retriever.invoke(query)

print(f"Query: {query}")
print(f"\nRetrieved {len(hybrid_results)} documents\n")

for i, doc in enumerate(hybrid_results[:3], 1):
    print(f"Result {i}:")
    print(doc.page_content[:150])
    print()

Query: What is transformer and why is better than LSTM??

Retrieved 7 documents

Result 1:
In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted

Result 2:
A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches (rather than text i

Result 3:
In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that seq



### Stage 2: Cross-Encoder Reranking

**The Problem**: Hybrid retrieval gives us candidates, but they're not perfectly ordered. A document ranked #8 might actually be more relevant than #1.

**The Solution**: Cross-encoders process [query, document] pairs jointly and output a relevance score. Unlike bi-encoders (used in dense retrieval), they see both inputs together, enabling deeper understanding.

**Trade-off**: 
- Much slower than bi-encoders (can't pre-compute)
- Much more accurate (full attention between query and doc)

**Production Pattern**: Retrieve 50-100 candidates with hybrid search, rerank with cross-encoder to get top 5-10.

In [15]:
# Cross-encoder model - trained specifically for reranking
# ms-marco-MiniLM is fast and effective for most use cases
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

# Reranker wrapper
reranker = CrossEncoderReranker(
    model=cross_encoder,
    top_n=5  # Return only top 5 after reranking all candidates
)

# Compression retriever orchestrates: retrieve → rerank → return
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever
)

print("Reranking pipeline ready")
print(f"Flow: Hybrid retrieval → Cross-encoder scoring → Top {reranker.top_n} results")

Reranking pipeline ready
Flow: Hybrid retrieval → Cross-encoder scoring → Top 5 results


In [16]:
# Compare: Before and After Reranking
query = "What is transformer and why is better than LSTM??"

print("=" * 70)
print("BEFORE RERANKING (Hybrid Retrieval Only)")
print("=" * 70)

hybrid_results = hybrid_retriever.invoke(query)
for i, doc in enumerate(hybrid_results[:5], 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

print("\n" + "=" * 70)
print("AFTER RERANKING (Cross-Encoder Refined)")
print("=" * 70)

reranked_results = compression_retriever.invoke(query)
for i, doc in enumerate(reranked_results, 1):
    print(f"\n[{i}] {doc.page_content[:120]}...")

print("\n" + "=" * 70)
print("Notice how reranking reordered results based on deeper relevance")

BEFORE RERANKING (Hybrid Retrieval Only)

[1] In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechani...

[2] A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series ...

[3] In machine learning, attention is a method that determines the importance of each component in a sequence relative to th...

[4] Noam Shazeer (born 1975 or 1976) is an American computer scientist and entrepreneur known for his contributions to the f...

[5] A generative pre-trained transformer (GPT) is a type of large language model (LLM) that is widely used in generative AI ...

AFTER RERANKING (Cross-Encoder Refined)

[1] In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechani...

[2] A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series ...

[3] A generati

### LLM-Based Reranking

When cross-encoders aren't enough, use an LLM to rerank. This is powerful when you need:
- Complex reasoning ("Find docs that *contradict* the query")
- Domain-specific criteria ("Prioritize recent medical studies over older ones")
- Explanations ("Why is this doc ranked first?")

**Cost Warning**: LLMs are 10-100x more expensive than cross-encoders for reranking. Use sparingly.

**Pattern**: Cross-encoder for initial filtering (100→20), then LLM for final refinement (20→5).

In [17]:
# Initialize LLM for reranking
llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0  # Deterministic for consistent ranking
)

# Reranking prompt with clear instructions
rerank_prompt = PromptTemplate.from_template("""
You are a document relevance expert. Your task is to rank documents by relevance to a query.

Query: "{query}"

Documents:
{documents}

Instructions:
1. Analyze each document's relevance to the query
2. Consider semantic meaning, not just keyword matching
3. Prioritize documents that directly answer the query
4. Return a comma-separated list of document numbers in ranked order (most relevant first)

Output format: 3,1,5,2,4 (just the numbers, no explanations)
""")

rerank_chain = rerank_prompt | llm | StrOutputParser()

print("LLM reranking chain created")

LLM reranking chain created


In [18]:
def llm_rerank(query: str, documents: list, top_n: int = 5):
    """
    Use LLM to rerank documents.
    
    Args:
        query: User query
        documents: List of Document objects
        top_n: Number of results to return
    
    Returns:
        List of reranked documents
    """
    # Format documents for the prompt
    doc_texts = [
        f"{i+1}. {doc.page_content}"
        for i, doc in enumerate(documents)
    ]
    formatted_docs = "\n\n".join(doc_texts)
    
    # Get rankings from LLM
    response = rerank_chain.invoke({
        "query": query,
        "documents": formatted_docs
    })
    
    # Parse rankings: keep valid, unique, in-range document numbers.
    # Newlines are treated like commas and a trailing period is tolerated.
    seen = set()
    indices = []
    for token in response.replace("\n", ",").split(","):
        token = token.strip().rstrip(".")
        if not token.isdigit():
            continue
        idx = int(token) - 1  # Convert to 0-indexed
        if 0 <= idx < len(documents) and idx not in seen:
            seen.add(idx)
            indices.append(idx)
    
    if not indices:
        # Nothing parseable came back: fall back to the original order
        print(f"Could not parse rankings from LLM response: {response!r}")
        return documents[:top_n]
    
    # Reorder documents based on rankings
    return [documents[i] for i in indices][:top_n]

print("LLM reranking function ready")

LLM reranking function ready


In [19]:
# Test LLM reranking
query = "What is transformer and why is better than LSTM??"

# Get initial results from hybrid retrieval
candidates = hybrid_retriever.invoke(query)

print("=" * 70)
print("LLM RERANKING")
print("=" * 70)

llm_reranked = llm_rerank(query, candidates, top_n=5)

for i, doc in enumerate(llm_reranked, 1):
    print(f"\n[{i}] {doc.page_content[:150]}...")

print("\n" + "=" * 70)

LLM RERANKING

[1] In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted...

[2] In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that seq...

[3] A generative pre-trained transformer (GPT) is a type of large language model (LLM) that is widely used in generative AI chatbots. GPTs are based on a ...

[4] A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches (rather than text i...

[5] Ashish Vaswani (born 1986) is an Indian computer scientist. He worked as a research scientist at Google Brain and Information Sciences Institute. 
Vas...



### Three-Way Comparison

Let's compare all three approaches side by side:
1. Hybrid retrieval only (fast, decent quality)
2. Hybrid + Cross-encoder (balanced speed/quality)
3. Hybrid + LLM reranking (best quality, slowest/most expensive)
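
When timing these approaches, a single call can be dominated by one-off costs such as lazy model loading or cache fills, so a later call may look faster than an earlier one regardless of how much work it does. A minimal sketch of a steadier measurement, assuming a hypothetical `time_call` helper that discards warm-up runs and reports the median:

```python
import statistics
import time
from typing import Any, Callable


def time_call(fn: Callable[[], Any], warmup: int = 1, repeats: int = 5) -> float:
    """Return the median wall-clock seconds of `fn()` after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # absorb one-off costs such as lazy loading or cache fills
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in this notebook it could be
# lambda: hybrid_retriever.invoke(query)
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median: {elapsed:.6f}s")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.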

In [20]:
import time

query = "What is transformer and why is better than LSTM??"

print("\n" + "=" * 80)
print("COMPARISON: Three Reranking Approaches")
print("=" * 80)

# Approach 1: Hybrid only
start = time.time()
hybrid_only = hybrid_retriever.invoke(query)[:5]
time_hybrid = time.time() - start

print("\n[1] HYBRID RETRIEVAL ONLY")
print(f"Time: {time_hybrid:.3f}s")
for i, doc in enumerate(hybrid_only, 1):
    print(f"  {i}. {doc.page_content[:100]}...")

# Approach 2: Hybrid + Cross-encoder
start = time.time()
cross_encoder_reranked = compression_retriever.invoke(query)
time_cross = time.time() - start

print(f"\n[2] HYBRID + CROSS-ENCODER")
print(f"Time: {time_cross:.3f}s ({time_cross/time_hybrid:.1f}x slower)")
for i, doc in enumerate(cross_encoder_reranked, 1):
    print(f"  {i}. {doc.page_content[:100]}...")

# Approach 3: Hybrid + LLM
start = time.time()
candidates = hybrid_retriever.invoke(query)
llm_reranked_results = llm_rerank(query, candidates, top_n=5)
time_llm = time.time() - start

print(f"\n[3] HYBRID + LLM RERANKING")
print(f"Time: {time_llm:.3f}s ({time_llm/time_hybrid:.1f}x slower)")
for i, doc in enumerate(llm_reranked_results, 1):
    print(f"  {i}. {doc.page_content[:100]}...")

print("\n" + "=" * 80)
print("Key Takeaways:")
print("- Cross-encoder adds minimal latency with significant quality gain")
print("- LLM reranking is much slower but can handle complex reasoning")
print("- Choose based on your quality/speed/cost requirements")
print("=" * 80)


COMPARISON: Three Reranking Approaches

[1] HYBRID RETRIEVAL ONLY
Time: 0.433s
  1. In deep learning, the transformer is an artificial neural network architecture based on the multi-he...
  2. A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input ...
  3. In machine learning, attention is a method that determines the importance of each component in a seq...
  4. Noam Shazeer (born 1975 or 1976) is an American computer scientist and entrepreneur known for his co...
  5. A generative pre-trained transformer (GPT) is a type of large language model (LLM) that is widely us...

[2] HYBRID + CROSS-ENCODER
Time: 0.226s (0.5x slower)
  1. In deep learning, the transformer is an artificial neural network architecture based on the multi-he...
  2. A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input ...
  3. A generative pre-trained transformer (GPT) is a type of large language model (LLM) that is widely

### Production Reranking Pattern

For most production systems, use a staged approach:

```
Stage 1: Hybrid Retrieval (BM25 + Dense)
  ↓ 200 candidates (fast, broad coverage)

Stage 2: Cross-Encoder Reranking
  ↓ 20 results (accurate, still efficient)

Stage 3: LLM Reranking (optional)
  ↓ 5 final results (deep reasoning, expensive)
```

**Why This Works**:
- Stage 1 ensures recall (don't miss relevant docs)
- Stage 2 improves precision efficiently
- Stage 3 handles edge cases requiring complex logic
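
The funnel itself is independent of which models do the scoring. The sketch below shows only that control flow, under loud assumptions: `funnel`, `rerank`, `overlap`, and `overlap_ratio` are hypothetical names, and the two word-overlap scorers are toy stand-ins for a cross-encoder and an LLM, not substitutes for them.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def rerank(query: str, docs: list[str], scorer: Scorer, top_n: int) -> list[str]:
    """Score every (query, doc) pair and keep the top_n highest-scoring docs."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:top_n]


def funnel(query: str, docs: list[str], stages: list[tuple[Scorer, int]]) -> list[str]:
    """Apply each (scorer, keep) stage to the survivors of the previous one."""
    for scorer, keep in stages:
        docs = rerank(query, docs, scorer, keep)
    return docs


def overlap(query: str, doc: str) -> float:
    """Cheap stand-in scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Stand-in for a costlier scorer: shared words normalized by document length."""
    words = doc.lower().split()
    return overlap(query, doc) / len(words) if words else 0.0


docs = [
    "transformer attention replaces recurrence",
    "lstm networks process tokens one step at a time in sequence order",
    "transformer versus lstm",
    "vision models for images",
]
result = funnel("transformer lstm", docs, stages=[(overlap, 3), (overlap_ratio, 1)])
print(result)
```

Each stage sees fewer documents than the one before it, so the most expensive scorer runs on the smallest set. That ordering is what keeps the overall cost bounded.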

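**Cost Analysis** (rough orders of magnitude; actual numbers depend on hardware, corpus size, and model):
- Hybrid retrieval: ~1ms at best, negligible cost (runs in this notebook ranged from 0.007s to 0.433s)
- Cross-encoder (200 docs): ~50-100ms, negligible cost
- LLM reranking (20 docs): ~1-3s, $0.001-0.01 per query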

Skip Stage 3 for most queries; use it only when:
- User explicitly needs best possible results
- Query is ambiguous or requires reasoning
- You're building for low-volume, high-value use cases
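
The per-query LLM cost is mostly a function of how many candidate tokens go into the prompt. A back-of-the-envelope sketch follows; every number in it is a hypothetical assumption (candidate length, prompt overhead, and especially the per-million-token prices), so substitute your provider's actual rates.

```python
def llm_rerank_cost(
    num_docs: int,
    tokens_per_doc: int,
    prompt_overhead_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the dollar cost of one LLM reranking call."""
    input_tokens = num_docs * tokens_per_doc + prompt_overhead_tokens
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical inputs: 20 candidates of ~500 tokens each, 200 tokens of
# instructions, 50 output tokens, priced at $0.50 / $1.00 per million tokens
per_query = llm_rerank_cost(20, 500, 200, 50, 0.50, 1.00)
print(f"${per_query:.5f} per query, ${per_query * 1000:.2f} per 1000 queries")
```

Under these assumed inputs the estimate is about half a cent per query. Halving the number of candidates or truncating each one roughly halves the cost, which is why the cross-encoder stage runs first.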

In [24]:
def production_retrieve_and_rerank(
    query: str,
    use_llm_rerank: bool = False,
    final_top_k: int = 5
):
    """
    Production-grade retrieval with optional LLM reranking.
    
    Args:
        query: User query
        use_llm_rerank: Whether to apply expensive LLM reranking
        final_top_k: Number of results to return
    
    Returns:
        Dict with the final documents, cumulative timings, and stages used
    """
    start_time = time.time()
    
    # Stage 1: Hybrid retrieval
    # Reuses the k=10 retrievers defined earlier; raise their k for wider recall
    hybrid_retriever_wide = EnsembleRetriever(
        retrievers=[bm25_retriever, dense_retriever],
        weights=[0.5, 0.5]
    )
    
    candidates = hybrid_retriever_wide.invoke(query)
    stage1_time = time.time() - start_time
    
    # Stage 2: Cross-encoder reranking
    # Keep extra results when an LLM stage follows, so it has something to reorder
    intermediate_k = max(10, final_top_k) if use_llm_rerank else final_top_k
    
    reranker_stage2 = CrossEncoderReranker(
        model=cross_encoder,
        top_n=intermediate_k
    )
    
    # Rerank the Stage 1 candidates directly instead of retrieving a second time
    stage2_results = list(reranker_stage2.compress_documents(candidates, query))
    stage2_time = time.time() - start_time  # Cumulative: includes Stage 1
    
    # Stage 3: Optional LLM reranking
    if use_llm_rerank:
        final_results = llm_rerank(query, stage2_results, top_n=final_top_k)
        stage3_time = time.time() - start_time
    else:
        final_results = stage2_results[:final_top_k]
        stage3_time = stage2_time
    
    # Package results with metadata
    return {
        "documents": final_results,
        "timing": {
            "stage1_hybrid": f"{stage1_time:.3f}s",
            "stage2_cross_encoder": f"{stage2_time:.3f}s",
            "total": f"{stage3_time:.3f}s"
        },
        "stages_used": 3 if use_llm_rerank else 2
    }

# Test both configurations
query = "What is transformer and why is better than LSTM??"

print("\nPRODUCTION CONFIGURATION 1: Cross-encoder only")
result1 = production_retrieve_and_rerank(query, use_llm_rerank=False)
print(f"Stages used: {result1['stages_used']}")
print(f"Timing: {result1['timing']}")

print("\nPRODUCTION CONFIGURATION 2: Full pipeline with LLM")
result2 = production_retrieve_and_rerank(query, use_llm_rerank=True)
print(f"Stages used: {result2['stages_used']}")
print(f"Timing: {result2['timing']}")

print("\nTop result (LLM-reranked):")
print(result2['documents'][0].page_content[:500])


PRODUCTION CONFIGURATION 1: Cross-encoder only
Stages used: 2
Timing: {'stage1_hybrid': '0.173s', 'stage2_cross_encoder': '0.337s', 'total': '0.337s'}

PRODUCTION CONFIGURATION 2: Full pipeline with LLM
Stages used: 3
Timing: {'stage1_hybrid': '0.007s', 'stage2_cross_encoder': '0.094s', 'total': '0.633s'}

Top result (LLM-reranked):
In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less


### Key Takeaways

**When to Use Each Approach**:

1. **Hybrid Only**: 
   - High-volume applications (>1000 QPS)
   - Latency critical (<50ms requirement)
   - Good enough quality for most queries

2. **Hybrid + Cross-Encoder**:
   - Production default for most applications
   - Moderate latency acceptable (100-200ms)
   - Significant quality improvement for minimal cost
   - Best bang for buck

3. **Hybrid + Cross-Encoder + LLM**:
   - Complex queries requiring reasoning
   - Low volume, high value use cases
   - User-facing features where quality matters most
   - Budget for LLM API costs

**Cost Comparison** (rough estimates per 1000 queries):
- Hybrid only: ~$0
- + Cross-encoder: ~$0 (self-hosted) or ~$0.10 (API)
- + LLM rerank: ~$5-50 depending on model

**Quality Improvements** (rough estimates in NDCG@5; not measured in this notebook):
- Hybrid only: Baseline
- + Cross-encoder: +10-20% improvement
- + LLM rerank: +5-15% additional improvement
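
To measure these gains on your own data, NDCG@5 can be computed from graded relevance labels in a few lines. The sketch below uses the linear-gain form of DCG and assumes every relevant document appears somewhere in the ranked list, since the ideal ordering is derived from the same labels. The labels are hypothetical, not results from this notebook.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (3 = highly relevant, 0 = irrelevant), in ranked order
before = [3, 0, 2, 0, 1]
after = [3, 2, 1, 0, 0]
print(f"before: {ndcg_at_k(before):.3f}  after: {ndcg_at_k(after):.3f}")
```

Averaging this score over a labeled query set for each configuration gives a like-for-like comparison of hybrid-only, cross-encoder, and LLM reranking.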

**Production Recommendation**: 
Start with Hybrid + Cross-encoder. Add LLM reranking selectively based on query complexity signals (question length, ambiguity, user tier).
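
One way such routing could look is sketched below. Everything here is a hypothetical illustration: the `QuerySignals` fields, the tier names, the thresholds, and the use of the gap between the top two cross-encoder scores as an ambiguity signal are all assumptions to be tuned against real traffic.

```python
from dataclasses import dataclass


@dataclass
class QuerySignals:
    text: str
    user_tier: str = "free"      # hypothetical tiers: "free" or "premium"
    top_score_gap: float = 1.0   # gap between the top two cross-encoder scores


def should_use_llm_rerank(
    signals: QuerySignals,
    min_words: int = 12,
    ambiguity_gap: float = 0.05,
) -> bool:
    """Return True when a query looks complex enough to justify LLM reranking."""
    if signals.user_tier != "premium":
        return False  # reserve the expensive stage for high-value traffic
    is_long = len(signals.text.split()) >= min_words
    is_ambiguous = signals.top_score_gap < ambiguity_gap
    return is_long or is_ambiguous


short = QuerySignals("what is a transformer", user_tier="premium")
close_call = QuerySignals("what is a transformer", user_tier="premium", top_score_gap=0.01)
print(should_use_llm_rerank(short), should_use_llm_rerank(close_call))
# False True
```

The returned flag could be passed as `use_llm_rerank` to `production_retrieve_and_rerank`, so the LLM stage runs only for the queries that trip a signal.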