# Notebook 5: Hybrid Search & Reranking

**Difficulty:** Advanced | **Estimated Time:** 150-180 minutes

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Understand dense vs sparse vectors and BM25 algorithm
2. ✅ Implement hybrid search combining semantic and keyword matching
3. ✅ Configure alpha parameter for optimal score fusion
4. ✅ Apply reranking models (Cohere, SentenceTransformer cross-encoders)


## Prerequisites

- Completed Notebooks 1-4
- Understanding of embeddings and retrieval
- Optional: Cohere API key for reranking (free tier available)

## Curriculum Coverage

- **Section 4.1:** Hybrid Search Fundamentals
- **Section 4.2:** Implementing Hybrid Search
- **Section 4.5:** Reranking

---

## 1. Setup & Imports

In [1]:
# Core LlamaIndex
from llama_index.core import (
    VectorStoreIndex,
    Settings,
    Document,
    StorageContext,
)
from llama_index.core.query_engine import BaseQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import QueryBundle, NodeWithScore

# Reranking
from llama_index.core.postprocessor import SentenceTransformerRerank
try:
    from llama_index.postprocessor.cohere_rerank import CohereRerank
    COHERE_AVAILABLE = True
except ImportError:
    COHERE_AVAILABLE = False
    print("⚠️  Cohere rerank not available. Install with: pip install llama-index-postprocessor-cohere-rerank")

# Vector Stores (for hybrid search)
try:
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient
    QDRANT_AVAILABLE = True
except ImportError:
    QDRANT_AVAILABLE = False
    print("⚠️  Qdrant not available. Install with: pip install llama-index-vector-stores-qdrant qdrant-client")

# LLM and Embeddings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Utilities
from dotenv import load_dotenv
import os
import time
from typing import List, Optional, Any
import warnings
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

✅ Imports successful!


In [2]:
# Configure Settings
load_dotenv()

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=1536
)

print("✅ Settings configured")

✅ Settings configured


---

## 2. Understanding Hybrid Search

### Dense vs Sparse Vectors

**Dense Vectors (Semantic):**
- Generated by neural embedding models
- Nearly all dimensions hold non-zero values
- Capture semantic meaning
- Example: `[0.12, -0.34, 0.56, ..., 0.89]` (1536 dims)

**Sparse Vectors (Keyword/BM25):**
- Based on term frequency (TF-IDF, BM25)
- Most dimensions are zero
- Capture exact term matches
- Example: `{"vector": 0.8, "search": 0.6, "database": 0.4}`
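To make the sparse side concrete, here is a minimal sketch that builds a term-weight dictionary from raw text and compares two of them. The helper names `sparse_term_vector` and `sparse_dot` are hypothetical, and relative term frequency is a deliberately simple weighting chosen for illustration, not BM25.

```python
from collections import Counter


def sparse_term_vector(text: str) -> dict[str, float]:
    """Map each distinct token to its relative frequency in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product over the terms the two sparse vectors share."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


doc = sparse_term_vector("vector search in a vector database")
query = sparse_term_vector("vector database")

print(doc)                     # only terms that occur get an entry
print(sparse_dot(doc, query))  # non-zero only because terms overlap
```

Only terms that actually occur are stored, which is why a sparse vector over a large vocabulary is mostly zeros. A query with no shared terms scores exactly zero, the failure mode that dense embeddings are meant to cover.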

### Why Hybrid?

**Strengths of Each:**

| Aspect | Dense (Semantic) | Sparse (Keyword) |
|--------|-----------------|------------------|
| **Synonyms** | ✅ Excellent | ❌ Poor |
| **Exact terms** | ⚠️  Good | ✅ Excellent |
| **Context** | ✅ Excellent | ❌ None |
| **Rare terms** | ⚠️  Good | ✅ Excellent |
| **Speed** | Fast | Very Fast |

**Hybrid combines both for best results!**

### BM25 Algorithm Overview

**BM25 (Best Matching 25)** is a ranking function:

```
BM25(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))
```

Where:
- `D`: Document
- `Q`: Query
- `f(qi, D)`: Term frequency of query term qi in document D
- `|D|`: Length of document D
- `avgdl`: Average document length
- `k1`: Term frequency saturation (typically 1.2-2.0)
- `b`: Length normalization (typically 0.75)
- `IDF(qi)`: Inverse document frequency

**Key Properties:**
- **Diminishing returns**: Additional term occurrences matter less
- **Length normalization**: Long documents aren't unfairly favored just for containing more words
- **IDF weighting**: Rare terms score higher
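The formula and properties above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code any particular search engine uses. Tokenization is a simple whitespace split, `k1=1.5` is picked from the typical range, and the IDF uses a common smoothed variant (with `+ 1` inside the logarithm) so that it stays non-negative. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant (assumption): rarer terms get larger weights.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Denominator applies saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "bm25 is a ranking function",
    "vector databases store embeddings",
    "bm25 bm25 bm25 bm25 is repeated here",
]
scores = bm25_scores("bm25 ranking", corpus)
for text, score in zip(corpus, scores):
    print(f"{score:.3f}  {text}")
```

The third document repeats `bm25` four times, yet its score for that term is well under four times the first document's. That is the diminishing-returns property in action.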

---

## 3. Prepare Sample Documents

In [3]:
# Create comprehensive documents for hybrid search testing
documents = [
    Document(
        text="""Vector databases are specialized systems for storing and querying high-dimensional vectors. 
        Popular vector databases include Qdrant, Pinecone, Weaviate, and Milvus. They use algorithms like 
        HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search. Vector databases 
        are essential for semantic search, recommendation systems, and RAG applications.""",
        metadata={"topic": "vector_databases", "doc_type": "overview", "keywords": "vector database HNSW"}
    ),
    Document(
        text="""BM25 is a probabilistic ranking function used for information retrieval. It improves upon 
        TF-IDF by adding term frequency saturation and document length normalization. The formula includes 
        parameters k1 (controls term frequency saturation) and b (controls length normalization). BM25 is 
        widely used in search engines like Elasticsearch and provides excellent keyword-based retrieval.""",
        metadata={"topic": "bm25", "doc_type": "technical", "keywords": "BM25 ranking TF-IDF"}
    ),
    Document(
        text="""Hybrid search combines semantic vector search with keyword-based methods like BM25. This 
        approach leverages the strengths of both: semantic search handles synonyms and context, while 
        keyword search ensures exact term matches. The results are typically fused using reciprocal rank 
        fusion or weighted score combination with an alpha parameter.""",
        metadata={"topic": "hybrid_search", "doc_type": "technical", "keywords": "hybrid search semantic keyword"}
    ),
    Document(
        text="""Reranking is a two-stage retrieval process where an initial set of candidates is retrieved 
        cheaply, then reranked using a more expensive but accurate model. Cross-encoder models like those 
        from sentence-transformers are popular for reranking. Cohere provides a reranking API that achieves 
        excellent results. Reranking significantly improves retrieval quality at modest cost increase.""",
        metadata={"topic": "reranking", "doc_type": "technical", "keywords": "reranking cross-encoder Cohere"}
    ),
    Document(
        text="""Qdrant is a vector database written in Rust that supports both dense and sparse vectors. 
        It enables hybrid search by combining semantic similarity with BM25 keyword matching. Qdrant uses 
        HNSW indexing for fast approximate nearest neighbor search and supports filtering, quantization, 
        and distributed deployments. It's particularly well-suited for production RAG systems.""",
        metadata={"topic": "qdrant", "doc_type": "product", "keywords": "Qdrant vector database hybrid"}
    ),
    Document(
        text="""The alpha parameter in hybrid search controls the balance between semantic and keyword scores. 
        Alpha=0 means pure keyword (BM25) search. Alpha=1 means pure semantic search. Alpha=0.5 gives equal 
        weight to both. The optimal alpha depends on your data and queries - typically between 0.3-0.7. 
        You should tune alpha on a validation set for best results.""",
        metadata={"topic": "alpha_tuning", "doc_type": "guide", "keywords": "alpha parameter hybrid tuning"}
    ),
]

print(f"✅ Created {len(documents)} documents for hybrid search")
for doc in documents:
    print(f"  - {doc.metadata['topic']} ({doc.metadata['doc_type']})")

✅ Created 6 documents for hybrid search
  - vector_databases (overview)
  - bm25 (technical)
  - hybrid_search (technical)
  - reranking (technical)
  - qdrant (product)
  - alpha_tuning (guide)


---

## 4. Simulating Hybrid Search (Conceptual)

**Note**: True hybrid search requires a vector store that can index sparse (keyword-style) vectors alongside dense ones, such as Qdrant. This section first builds a plain dense index as a baseline, then shows the Qdrant implementation pattern.

In [4]:
# Create standard vector index
index = VectorStoreIndex.from_documents(documents, show_progress=True)
print("✅ Vector index created")

Parsing nodes:   0%|          | 0/6 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/6 [00:00<?, ?it/s]

✅ Vector index created


### Qdrant Hybrid Search (Code Example)

**With Qdrant installed:**

In [6]:
!uv pip install fastembed

Using Python 3.12.9 environment at: /Users/sourangshupal/Downloads/llama-index-tutorials/.venv
Resolved 27 packages in 1.51s
Prepared 1 package in 372ms
Uninstalled 1 package in 16ms
Installed 4 packages in 21ms
 + fastembed==0.7.4
 + loguru==0.7.3
 - pillow==12.0.0
 + pillow==11.3.0
 + py-rust-stemmers==0.1.5


In [5]:
if QDRANT_AVAILABLE:
    # Initialize Qdrant client
    qdrant_client = QdrantClient(location=":memory:")
    
    # Create Qdrant vector store with hybrid search enabled
    qdrant_vector_store = QdrantVectorStore(
        client=qdrant_client,
        collection_name="hybrid_search_demo",
        enable_hybrid=True,  # Store dense + sparse (keyword-style) vectors; the sparse encoder is loaded via fastembed
        batch_size=20,
    )
    
    storage_context = StorageContext.from_defaults(vector_store=qdrant_vector_store)
    
    # Create index
    hybrid_index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )
    
    # Query with hybrid search
    hybrid_query_engine = hybrid_index.as_query_engine(
        similarity_top_k=3,
        vector_store_query_mode="hybrid",  # Use hybrid search
        alpha=0.5,  # 50% semantic, 50% keyword
    )
    
    print("✅ Qdrant hybrid search enabled")
    print("   Mode: hybrid (semantic + BM25)")
    print("   Alpha: 0.5 (equal weighting)")
else:
    print("⚠️  Qdrant not available - skipping hybrid search demo")
    hybrid_query_engine = index.as_query_engine(similarity_top_k=3)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

model.onnx:   0%|          | 0.00/532M [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/755 [00:00<?, ?B/s]

Parsing nodes:   0%|          | 0/6 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/6 [00:00<?, ?it/s]

✅ Qdrant hybrid search enabled
   Mode: hybrid (semantic + BM25)
   Alpha: 0.5 (equal weighting)


### Testing Hybrid Search Benefits

In [6]:
# Test queries that benefit from hybrid search
test_queries = [
    "What is BM25?",  # Exact term match important
    "How do you balance semantic and keyword search?",  # Conceptual + keyword
    "vector database for production",  # Keywords important
]

print("Hybrid Search Test Queries:\n")
for query in test_queries:
    response = hybrid_query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {str(response)[:200]}...")
    print(f"Top source: {response.source_nodes[0].metadata.get('topic')}")
    print("-" * 80)

Hybrid Search Test Queries:

Query: What is BM25?
Response: BM25 is a probabilistic ranking function utilized for information retrieval. It enhances the traditional TF-IDF method by incorporating term frequency saturation and document length normalization. The...
Top source: bm25
--------------------------------------------------------------------------------
Query: How do you balance semantic and keyword search?
Response: To balance semantic and keyword search, you can adjust the alpha parameter in hybrid search. This parameter determines the weight given to each search method. An alpha value of 0 represents a pure key...
Top source: hybrid_search
--------------------------------------------------------------------------------
Query: vector database for production
Response: A suitable option for a production vector database is Qdrant. It is designed to handle both dense and sparse vectors and supports hybrid search, which combines semantic similarity with traditional key...
Top source

### 🎯 ML Engineering Note: Alpha Parameter Tuning

**Alpha controls dense/sparse balance:**

```python
final_score = alpha * semantic_score + (1 - alpha) * bm25_score
```
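The weighted sum only makes sense when both scores live on comparable scales: BM25 scores are unbounded, while cosine similarities are bounded. A minimal sketch of one way to handle this is to min-max normalize each score set before mixing. The helper names `min_max` and `fuse_scores` are hypothetical, the sample scores are invented for illustration, and a given vector store may normalize differently.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    semantic: dict[str, float], bm25: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Combine normalized scores: alpha * semantic + (1 - alpha) * bm25."""
    sem, kw = min_max(semantic), min_max(bm25)
    # A document missing from one result list contributes 0 for that side.
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Invented scores for illustration only.
semantic_hits = {"hybrid_search": 0.72, "reranking": 0.42, "alpha_tuning": 0.41}
bm25_hits = {"bm25": 7.1, "hybrid_search": 3.4, "alpha_tuning": 1.2}

for alpha in (0.0, 0.5, 1.0):
    top_id, top_score = fuse_scores(semantic_hits, bm25_hits, alpha)[0]
    print(f"alpha={alpha}: top={top_id} ({top_score:.3f})")
```

At `alpha=0.0` the ranking follows the keyword scores alone, and at `alpha=1.0` the semantic scores alone, matching the endpoints described in this notebook.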

**Tuning Strategy:**

1. **Create validation set**: 20-50 queries with known relevant docs
2. **Test alphas**: [0.0, 0.1, 0.2, ..., 0.9, 1.0]
3. **Measure metrics**: Precision@K, Recall@K, MRR
4. **Select optimal**: Highest metric value
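The four steps above can be sketched as a small grid search scored by MRR. Everything here is a toy under stated assumptions: `TOY_SCORES` holds invented, already-normalized scores, and `toy_rank` stands in for a real hybrid retriever call. In practice the validation set would hold the 20-50 labeled queries mentioned above.

```python
from typing import Callable


def reciprocal_rank(ranked_ids: list[str], relevant_id: str) -> float:
    """1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def sweep_alpha(
    validation: list[tuple[str, str]],
    rank_fn: Callable[[str, float], list[str]],
    alphas: list[float],
) -> dict[float, float]:
    """Return the mean reciprocal rank (MRR) achieved at each alpha."""
    results = {}
    for alpha in alphas:
        rr = [reciprocal_rank(rank_fn(query, alpha), rel) for query, rel in validation]
        results[alpha] = sum(rr) / len(rr)
    return results


# Invented, pre-normalized scores per query (illustration only).
TOY_SCORES = {
    "what is bm25": {
        "semantic": {"bm25": 0.6, "hybrid_search": 0.9},
        "keyword": {"bm25": 1.0, "hybrid_search": 0.2},
    },
    "combine retrieval methods": {
        "semantic": {"hybrid_search": 0.9, "bm25": 0.3},
        "keyword": {"hybrid_search": 0.4, "bm25": 0.5},
    },
}


def toy_rank(query: str, alpha: float) -> list[str]:
    """Stand-in for a hybrid retriever: fuse the toy scores and sort."""
    s = TOY_SCORES[query]
    ids = set(s["semantic"]) | set(s["keyword"])
    fused = {
        d: alpha * s["semantic"].get(d, 0.0) + (1 - alpha) * s["keyword"].get(d, 0.0)
        for d in ids
    }
    return sorted(fused, key=fused.get, reverse=True)


validation = [("what is bm25", "bm25"), ("combine retrieval methods", "hybrid_search")]
alphas = [round(0.1 * i, 1) for i in range(11)]
mrr_by_alpha = sweep_alpha(validation, toy_rank, alphas)
best_alpha = max(mrr_by_alpha, key=mrr_by_alpha.get)
print(mrr_by_alpha)
print("best alpha:", best_alpha)
```

When several alphas tie on the metric, as they do in this toy data, `max` returns the first one. With a real validation set it is worth inspecting the whole curve rather than a single winner.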

**Common Patterns:**
- **Technical docs**: α ≈ 0.3-0.4 (favor keywords for code, APIs)
- **General knowledge**: α ≈ 0.6-0.7 (favor semantic)
- **Product search**: α ≈ 0.4-0.5 (balanced)
- **Scientific papers**: α ≈ 0.5-0.6 (balanced, slight semantic)

---

## 5. Reranking Models

### Why Rerank?

**Two-stage retrieval:**
1. **Stage 1 (Retrieval)**: Fast, recall-focused (get 50-100 candidates)
2. **Stage 2 (Reranking)**: Slow, precision-focused (rescore the candidates and keep the top 10-20)

**Cost-Performance Trade-off:**
- Initial retrieval: Cheap (vector similarity or BM25)
- Reranking: Expensive (cross-encoder models)
- Result: Best of both worlds
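The two-stage pattern can be sketched without any model at all. In this toy version `overlap` stands in for the cheap first-stage scorer (vector similarity or BM25) and `jaccard` stands in for the expensive cross-encoder. Both are hypothetical placeholders, chosen only so the example runs anywhere.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    candidates_k: int = 50,
    final_k: int = 3,
) -> list[str]:
    """Retrieve candidates with a cheap scorer, then rerank them with a costly one."""
    # Stage 1: score the whole corpus cheaply and keep the best candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:candidates_k]
    # Stage 2: run the expensive scorer on the candidates only.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]


def overlap(query: str, doc: str) -> float:
    """Cheap stand-in: number of shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Costlier stand-in: shared tokens relative to all tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


toy_corpus = [
    "hybrid search combines semantic and keyword search",
    "keyword search with bm25",
    "reranking uses a cross-encoder",
    "fast vector search index",
]
results = two_stage_search(
    "keyword search", toy_corpus, overlap, jaccard, candidates_k=3, final_k=2
)
print(results)
```

The expensive scorer runs on `candidates_k` documents rather than the whole corpus, which is where the cost saving comes from. The trade-off is that a relevant document dropped in stage 1 can never be recovered in stage 2.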

### 5.1 SentenceTransformer Cross-Encoder Reranking

In [7]:
# Create cross-encoder reranker
print("Loading cross-encoder model (first run may take a moment)...")

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",  # Fast, lightweight
    top_n=3,  # Return top 3 after reranking
)

# Create query engine with reranking
reranked_query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve 10 candidates
    node_postprocessors=[reranker],  # Rerank to top 3
)

print("\n✅ Reranking query engine created")
print("   Initial retrieval: top 10")
print("   After reranking: top 3")
print("   Model: cross-encoder/ms-marco-MiniLM-L-2-v2")

Loading cross-encoder model (first run may take a moment)...


config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/62.5M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]


✅ Reranking query engine created
   Initial retrieval: top 10
   After reranking: top 3
   Model: cross-encoder/ms-marco-MiniLM-L-2-v2


In [8]:
# Compare with and without reranking
test_query = "How does hybrid search combine different retrieval methods?"

print(f"Query: {test_query}\n")
print("=" * 80)

# Without reranking
print("\nWithout Reranking (similarity only):")
no_rerank_engine = index.as_query_engine(similarity_top_k=3)
no_rerank_response = no_rerank_engine.query(test_query)
print("Top 3 sources:")
for i, node in enumerate(no_rerank_response.source_nodes, 1):
    print(f"  {i}. {node.metadata.get('topic')} (score: {node.score:.4f})")

# With reranking
print("\nWith Cross-Encoder Reranking:")
reranked_response = reranked_query_engine.query(test_query)
print("Top 3 sources (after reranking):")
for i, node in enumerate(reranked_response.source_nodes, 1):
    print(f"  {i}. {node.metadata.get('topic')} (score: {node.score:.4f})")

Query: How does hybrid search combine different retrieval methods?


Without Reranking (similarity only):
Top 3 sources:
  1. hybrid_search (score: 0.7169)
  2. reranking (score: 0.4180)
  3. alpha_tuning (score: 0.4074)

With Cross-Encoder Reranking:
Top 3 sources (after reranking):
  1. hybrid_search (score: 2.9632)
  2. alpha_tuning (score: -3.7781)
  3. qdrant (score: -4.4877)
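The two listings are on different scales, so their scores should not be compared with each other. The first holds embedding similarities, while the negative values in the second suggest unnormalized cross-encoder logits. If a 0-1 value is easier to reason about, a logistic sigmoid is one common mapping. This is a sketch that assumes the reranker scores are logits. It changes the scale, not the ordering.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a real-valued logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Reranker scores copied from the output above.
reranked_logits = {"hybrid_search": 2.9632, "alpha_tuning": -3.7781, "qdrant": -4.4877}

for topic, logit in reranked_logits.items():
    print(f"{topic}: logit={logit:+.4f} -> {sigmoid(logit):.4f}")
```

Read this way, only `hybrid_search` lands clearly above 0.5, which points to a single strongly relevant passage even though three are returned.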


### 5.2 Cohere Rerank (API-based)

In [None]:
if COHERE_AVAILABLE and os.getenv("COHERE_API_KEY"):
    # Create Cohere reranker
    cohere_reranker = CohereRerank(
        api_key=os.getenv("COHERE_API_KEY"),
        top_n=3,
        model="rerank-english-v3.0",  # Cohere English rerank model; check Cohere's docs for newer versions
    )
    
    # Create query engine with Cohere reranking
    cohere_query_engine = index.as_query_engine(
        similarity_top_k=10,
        node_postprocessors=[cohere_reranker],
    )
    
    print("✅ Cohere reranking enabled")
    print("   Model: rerank-english-v3.0")
    
    # Test Cohere reranking
    cohere_response = cohere_query_engine.query(test_query)
    print("\nCohere Reranked Results:")
    for i, node in enumerate(cohere_response.source_nodes, 1):
        print(f"  {i}. {node.metadata.get('topic')} (score: {node.score:.4f})")
else:
    print("⚠️  Cohere reranking not available (missing API key or package)")
    print("   Get free API key at: https://dashboard.cohere.com/")

### Reranking Model Comparison

| Model | Quality | Speed | Cost | Hosting |
|-------|---------|-------|------|--------|
| **Cross-Encoder (local)** | Good | Fast (GPU) | Free | Self-hosted |
| **Cohere Rerank** | Excellent | Fast | $1/1000 searches | API |
| **Custom fine-tuned** | Variable | Variable | Training cost | Self-hosted |

**Recommendation:**
- **Development**: Cross-encoder (free, fast iteration)
- **Production (budget)**: Cross-encoder with GPU
- **Production (quality)**: Cohere Rerank
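Speed ratings like those in the table depend heavily on hardware and batch size, so it can help to measure latency on your own setup before choosing. The sketch below uses a hypothetical helper, `median_latency_ms`, that times any zero-argument callable. The workload shown is a stand-in, not a real reranker call.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], runs: int = 5) -> float:
    """Call fn several times and return the median wall-clock time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in workload. In this notebook you might instead pass
# lambda: reranked_query_engine.query(test_query)
latency = median_latency_ms(lambda: sum(range(100_000)))
print(f"median latency: {latency:.2f} ms")
```

The median is used rather than the mean so that a single slow first call, such as a model download or warm-up, does not dominate the result.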