```{contents}
```
## Reranking

### What Reranking Is

**Reranking** is a **second-stage retrieval step** that **reorders candidate documents** using a **more accurate but more expensive model** after an initial retrieval step.

> First retrieve fast and broad, then rerank slow and precise.

By passing only the most relevant passages to the LLM, reranking typically improves **answer quality** and can **reduce hallucinations** caused by irrelevant context.

---

### Why Reranking Is Needed

Initial retrieval (vector / hybrid search):

* Optimized for speed
* Approximate
* No deep query–document interaction

Reranking:

* Uses deep semantic matching
* Considers query + document together
* Produces higher-precision results

---

### Where Reranking Fits in RAG

```
User Query
   ↓
Initial Retrieval (Vector / Hybrid)
   ↓
Top-N Candidates (e.g., 20)
   ↓
Reranker (Cross-Encoder)
   ↓
Top-K Results (e.g., 5)
   ↓
LLM
```

Reranking happens at **query time**.
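
The flow above can be sketched in plain Python. This is a minimal illustration, not a production implementation: `cheap_score` and `expensive_score` are hypothetical stand-ins for a vector similarity and a cross-encoder, and the two toy scorers exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,      # stand-in for vector / hybrid similarity
    expensive_score: Scorer,  # stand-in for a cross-encoder
    n: int = 20,
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top-k (document, score) pairs after two-stage retrieval."""
    # Stage 1: cheap scoring over the whole corpus, keep the top-N candidates.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n]
    # Stage 2: expensive scoring over the N candidates only, keep the top-K.
    rescored = [(doc, expensive_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy second-stage scorer: 1.0 if the query appears as an exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The expensive scorer runs at most N times per query, regardless of corpus size, which is what keeps the second stage affordable.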

---

### Core Idea (Two-Stage Retrieval)

| Stage   | Purpose   | Model Type               |
| ------- | --------- | ------------------------ |
| Stage 1 | Recall    | Bi-encoder (embeddings)  |
| Stage 2 | Precision | Cross-encoder (reranker) |

---

### How Reranking Works

#### Bi-Encoder (Retriever)

* Query and documents embedded separately
* Fast ANN search
* Coarse ranking
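
The key property is that documents are embedded once, without seeing the query. The sketch below illustrates this with a toy bag-of-words "embedding" and an exhaustive scan; a real system would use a neural embedding model and an ANN index instead. All names (`embed`, `dot`, `build_index`, `search`) are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy embedding: L2-normalised bag-of-words vector stored as a sparse dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two sparse vectors (cosine similarity, since both are normalised)."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def build_index(docs: List[str]) -> List[Tuple[str, Vector]]:
    """Offline step: embed every document once, independently of any query."""
    return [(doc, embed(doc)) for doc in docs]


def search(query: str, index: List[Tuple[str, Vector]], k: int) -> List[Tuple[str, float]]:
    """Query-time step: embed the query alone, then compare against stored vectors."""
    q = embed(query)
    scored = [(doc, dot(q, vec)) for doc, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because query and document never meet inside the model, the score can only reflect how close their two independent vectors are, which is why the ranking is coarse.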

---

#### Cross-Encoder (Reranker)

* Query and document passed **together**
* Deep semantic scoring
* Accurate ranking
* Much slower

Example input to reranker:

```
[QUERY] How to reset Jira password?
[DOC] Password reset steps for Jira admin users...
```

---

### Types of Rerankers

#### 1. Cross-Encoder Models (Most Common)

* BERT-style models
* Input: (query, document)
* Output: relevance score

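Examples:

* Cohere Rerank (hosted API)
* BGE Reranker (e.g., `BAAI/bge-reranker-base`)
* MS MARCO MiniLM cross-encoders (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`)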

---

#### 2. LLM-Based Reranking

Uses an LLM to:

* Score relevance
* Select best chunks
* Justify ranking

Often more flexible, and sometimes more accurate, than a cross-encoder, but:

* Higher latency
* Higher cost
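
A rough sketch of the scoring variant: the LLM is asked for one relevance score per passage and the reply is parsed defensively. The `llm` callable is a hypothetical wrapper (prompt in, text out) around whichever model client is in use, and the prompt wording and `<index>: <score>` reply format are assumptions made for this example.

```python
import re
from typing import Callable, Dict, List


def build_prompt(query: str, docs: List[str]) -> str:
    """Ask for one '<index>: <score>' line per passage (wording is illustrative)."""
    lines = [
        "Rate how relevant each passage is to the query on a 0-10 scale.",
        "Reply with one line per passage in the form '<index>: <score>'.",
        f"Query: {query}",
    ]
    lines += [f"[{i}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(lines)


def llm_rerank(
    query: str,
    docs: List[str],
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    top_k: int = 3,
) -> List[str]:
    reply = llm(build_prompt(query, docs))
    scores: Dict[int, float] = {}
    for line in reply.splitlines():
        match = re.match(r"\s*\[?(\d+)\]?\s*:\s*(\d+(?:\.\d+)?)\s*$", line)
        if match and int(match.group(1)) < len(docs):
            scores[int(match.group(1))] = float(match.group(2))
    # Passages the model failed to score sort last instead of raising an error.
    order = sorted(
        range(len(docs)),
        key=lambda i: scores.get(i, float("-inf")),
        reverse=True,
    )
    return [docs[i] for i in order[:top_k]]
```

Parsing free-form output is the fragile part of this approach, so malformed lines are ignored rather than trusted.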

---

### Reranking Demonstration (LangChain)

#### Step 1: Initial Retrieval

```python
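# `vectorstore` is assumed to be an existing LangChain vector store (e.g., FAISS or Chroma).
# Retrieve a broad candidate set (top 20) for the reranker to narrow down.
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 20}
)

# `invoke` replaces the deprecated `get_relevant_documents` in recent LangChain versions.
docs = retriever.invoke(
    "How does Jira ticket escalation work?"
)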
```

---

#### Step 2: Rerank Documents

```python
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load an open-source cross-encoder and keep only the 5 best-scoring documents.
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(
        model_name="BAAI/bge-reranker-base"
    ),
    top_n=5
)

# Score every (query, document) pair and return the documents sorted by score.
reranked_docs = reranker.compress_documents(
    docs,
    query="How does Jira ticket escalation work?"
)
```

Only the top 5 documents remain, ordered by reranker score.

---

#### Reranking with Contextual Compression

In LangChain, a reranker is typically plugged in as the compressor in **Contextual Compression**:

```
Retriever → Reranker → LLM
```

LangChain abstraction:

```python
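from langchain.retrievers import ContextualCompressionRetriever

# Wrap retrieval and reranking behind a single retriever interface.
compression_retriever = ContextualCompressionRetriever(
    base_retriever=retriever,
    base_compressor=reranker
)

# One call now retrieves the candidates and returns only the reranked top results.
reranked_docs = compression_retriever.invoke(
    "How does Jira ticket escalation work?"
)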
```

---

### Reranking vs MMR

| Aspect        | Reranking     | MMR       |
| ------------- | ------------- | --------- |
| Purpose       | Precision     | Diversity |
| Model         | Cross-encoder | Heuristic |
| Cost          | High          | Low       |
| Used together | ✅             | ✅         |
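
To make the contrast concrete, here is a minimal sketch of the MMR heuristic. It needs no model call at selection time, only precomputed similarities: each step picks the document that maximizes `lam * relevance - (1 - lam) * redundancy`. The function name and input layout are choices made for this example.

```python
from typing import List, Sequence


def mmr(
    query_sims: Sequence[float],          # query_sims[i]: similarity of doc i to the query
    doc_sims: Sequence[Sequence[float]],  # doc_sims[i][j]: similarity between docs i and j
    k: int,
    lam: float = 0.5,                     # 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Return the indices of up to k documents chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    remaining = list(range(len(query_sims)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate top hits, MMR skips the second duplicate in favor of a less similar document, whereas a reranker would happily keep both if both are relevant.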

---

### Reranking vs Similarity Thresholds

| Aspect                        | Reranking | Threshold         |
| ----------------------------- | --------- | ----------------- |
| Accuracy                      | High      | Low               |
| Adapts to each query          | Yes       | No (fixed cutoff) |
| Sufficient alone in production | ✅         | ❌                 |
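
The "adaptive" row can be illustrated with a small sketch. The similarity scores below are made up for the example: a fixed cutoff returns everything for one query and nothing for another, while ranking and keeping the best K always returns the strongest available candidates.

```python
from typing import List, Tuple

Scored = List[Tuple[str, float]]


def threshold_filter(scored: Scored, threshold: float) -> Scored:
    """Keep every document whose score clears a fixed cutoff."""
    return [(doc, score) for doc, score in scored if score >= threshold]


def top_k(scored: Scored, k: int) -> Scored:
    """Keep the k best documents, whatever their absolute scores are."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical scores for two different queries against the same index.
easy_query: Scored = [("a", 0.95), ("b", 0.91), ("c", 0.90), ("d", 0.88)]
hard_query: Scored = [("e", 0.42), ("f", 0.38), ("g", 0.15)]

kept_easy = threshold_filter(easy_query, 0.7)  # all four documents pass
kept_hard = threshold_filter(hard_query, 0.7)  # nothing passes
best_hard = top_k(hard_query, 2)               # still returns the two best
```

In a reranking setup the scores fed to `top_k` would come from the reranker, so the selection reflects relative relevance for that specific query.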

---

### Production-Grade Reranking Concepts

#### 1. Candidate Size (N)

Typical:

* Retrieve N = 20–50
* Rerank to K = 3–10

Tradeoff:

* Larger N → better recall
* Larger N → higher reranking cost and latency

---

#### 2. Latency Budget

Typical targets (actual numbers depend on model size, hardware, and N):

* Reranking: 50–200 ms
* Total RAG: < 500 ms

Optimization:

* Batch reranker calls
* Use smaller cross-encoders
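
Batching can be sketched as follows. `score_batch` is a hypothetical callable that scores a list of (query, document) pairs in one model call; the point is that N candidates cost `ceil(N / batch_size)` calls instead of N.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def score_in_batches(
    query: str,
    docs: List[str],
    score_batch: Callable[[List[Pair]], List[float]],  # hypothetical batched scorer
    batch_size: int = 16,
) -> List[float]:
    """Score all (query, doc) pairs using as few model calls as the batch size allows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pairs = [(query, doc) for doc in docs]
    scores: List[float] = []
    for start in range(0, len(pairs), batch_size):
        # One model call per slice of at most `batch_size` pairs.
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return scores
```

The returned scores stay aligned with the input order of `docs`, so they can be zipped back onto the candidates before sorting.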

---

#### 3. Cost Control

Strategies:

* Only rerank when confidence is low
* Skip reranking for short queries
* Cache reranker results
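
Two of these strategies, the confidence gate and the cache, can be combined in a small wrapper. This is a sketch under stated assumptions: "confidence" is approximated by the score gap between the first-stage top hit and the runner-up, and the `margin` value is arbitrary. Both are illustrative choices, not established defaults.

```python
from typing import Callable, Dict, List, Tuple


class CachedReranker:
    def __init__(self, score: Callable[[str, str], float], margin: float = 0.2):
        self.score = score      # stand-in for an expensive reranker call
        self.margin = margin    # assumed confidence gap, tune per system
        self.cache: Dict[Tuple[str, str], float] = {}
        self.model_calls = 0

    def _cached_score(self, query: str, doc: str) -> float:
        key = (query, doc)
        if key not in self.cache:
            self.cache[key] = self.score(query, doc)
            self.model_calls += 1
        return self.cache[key]

    def rerank(self, query: str, candidates: List[Tuple[str, float]], k: int) -> List[str]:
        """`candidates` holds (doc, first_stage_score) pairs sorted best-first."""
        # Confidence gate: if the top hit clearly beats the runner-up, skip reranking.
        if len(candidates) < 2 or candidates[0][1] - candidates[1][1] >= self.margin:
            return [doc for doc, _ in candidates[:k]]
        ranked = sorted(
            candidates,
            key=lambda cand: self._cached_score(query, cand[0]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]
```

An in-memory dict is enough to show the idea; a shared cache with expiry would be the more realistic choice when documents change.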

---

#### 4. Metadata-Aware Reranking

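Apply metadata filters during first-stage retrieval, then rerank only the documents that pass:

```python
# Passed to `as_retriever`; filter syntax and support vary by vector store.
search_kwargs = {"k": 20, "filter": {"source": "jira"}}
```

This reduces the number of candidates the reranker has to score.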
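
The effect on reranker load can be shown without any vector store. In this sketch the document shape (`{"text": ..., "metadata": {...}}`) and the exact-match filter semantics are assumptions for illustration; `score` stands in for the reranker.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]  # assumed shape: {"text": str, "metadata": {...}}


def matches(doc: Doc, metadata_filter: Dict[str, Any]) -> bool:
    """True if every filter key equals the document's metadata value."""
    return all(doc["metadata"].get(key) == value for key, value in metadata_filter.items())


def filter_then_rerank(
    query: str,
    docs: List[Doc],
    metadata_filter: Dict[str, Any],
    score: Callable[[str, str], float],  # stand-in for the reranker
    k: int,
) -> List[Doc]:
    # Cheap metadata check first, so the reranker never sees excluded documents.
    survivors = [doc for doc in docs if matches(doc, metadata_filter)]
    ranked = sorted(survivors, key=lambda doc: score(query, doc["text"]), reverse=True)
    return ranked[:k]
```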

---

### Common Mistakes

#### Reranking everything

❌ Unnecessary cost

#### Too few initial candidates

❌ Misses relevant context

#### Large documents

❌ Input exceeds the reranker's maximum sequence length and gets truncated
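
One common workaround is to split a long document into overlapping windows, score each window, and keep the best score. The sketch below counts whitespace-separated words as a rough proxy for tokens, which is an assumption: a real reranker's tokenizer will count differently, so leave headroom. `score` again stands in for the reranker.

```python
from typing import Callable, List


def split_into_windows(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows of at most `max_words` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    windows: List[str] = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the document
    return windows


def score_long_document(
    query: str,
    text: str,
    score: Callable[[str, str], float],  # stand-in for the reranker
    max_words: int = 200,
    overlap: int = 50,
) -> float:
    """Score a document as the maximum score over its windows."""
    return max(score(query, window) for window in split_into_windows(text, max_words, overlap))
```

Taking the maximum treats a document as relevant if any part of it is; averaging is an alternative when overall topical fit matters more.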

#### No reranking in prod

❌ Noisy LLM context

---

### When to Use Reranking

* Enterprise RAG
* IT support / ticketing
* Legal / compliance docs
* Precision-critical QA

---

### When NOT to Use Reranking

* Small datasets
* Low-traffic prototypes
* Exploratory search

---

### Best Practices

* Use hybrid search before reranking
* Keep chunks small enough to fit the reranker's input limit
* Limit top-N
* Monitor precision gains
* Log reranking decisions
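
Monitoring precision gains needs a labeled set of relevant documents per query. Given that, a minimal sketch is to compare precision@K of the ranking before and after the reranker. The function names and the document IDs used with them are illustrative.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def precision_gain(
    before: List[str], after: List[str], relevant_ids: Set[str], k: int
) -> float:
    """Positive values mean the reranker moved relevant documents into the top k."""
    return precision_at_k(after, relevant_ids, k) - precision_at_k(before, relevant_ids, k)
```

Logging this value per query, alongside the before and after orderings, makes it possible to check whether the reranker is earning its latency.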

---

### Interview-Ready Summary

> “Reranking is a second-stage retrieval step that reorders initially retrieved documents using a more accurate cross-encoder or LLM. It improves precision and answer quality in production RAG systems.”

---

### Rule of Thumb

* **Retriever = recall**
* **Reranker = precision**
* **LLM = reasoning**
* **Production RAG = all three**

