# Contextual AI Reranker Integration with DSPy

This tutorial demonstrates how to integrate Contextual AI's instruction-following reranker with DSPy to improve retrieval-augmented generation (RAG) performance. We'll wrap the reranker as a DSPy retriever, plug it into a RAG module, and use DSPy's evaluation tools to measure the effect.


## Setup


In [None]:
import dspy
import os
import requests
import ujson
import random
from dspy.utils import download
from dspy.evaluate import SemanticF1

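# Set API keys (read from the environment if present; otherwise fill in the placeholders)
# Get the Contextual AI key at http://app.contextual.ai/
CONTEXTUAL_API_KEY = os.environ.get("CONTEXTUAL_API_KEY", "your_contextual_api_key")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your_openai_api_key")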

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

lm = dspy.LM(model="openai/gpt-4o-mini", api_key=OPENAI_API_KEY)
dspy.settings.configure(lm=lm)


## Contextual AI Reranker Implementation


In [36]:
class ContextualReranker(dspy.Retrieve):
    def __init__(self, api_key, base_retriever=None, k=5, rerank_instructions=""):
        super().__init__(k=k)
        self.api_key = api_key
        self.base_retriever = base_retriever
        self.rerank_instructions = rerank_instructions
    
    def forward(self, query_or_queries):
        if self.base_retriever:
            # Get initial documents from base retriever
            initial_docs = self.base_retriever(query_or_queries)
            documents = initial_docs.passages
        else:
            # No base retriever: the input itself is used as the candidate list
            documents = query_or_queries if isinstance(query_or_queries, list) else [query_or_queries]
        
        url = "https://api.contextual.ai/v1/rerank"
        headers = {
            "accept": "application/json",
            "content-type": "application/json",
            "authorization": f"Bearer {self.api_key}"
        }
        
        payload = {
            "query": query_or_queries if isinstance(query_or_queries, str) else query_or_queries[0],
            "documents": documents,
            "model": "ctxl-rerank-v2-instruct-multilingual",
            "top_n": self.k,
            "instruction": self.rerank_instructions
        }
        
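        try:
            # A timeout keeps a stalled request from hanging the whole evaluation
            response = requests.post(url, headers=headers, json=payload, timeout=30)
            result = response.json()
        except (requests.RequestException, ValueError) as exc:
            # Network failure or non-JSON body: fall back to the first-stage order
            print("Contextual API request failed:", exc)
            return dspy.Prediction(passages=documents[:self.k])
        
        if 'results' not in result:
            # Error payload from the API: fall back to the first-stage order
            print("Contextual API Error:", result)
            return dspy.Prediction(passages=documents[:self.k])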
        
        reranked_results = result['results']
        reranked_results.sort(key=lambda x: x['relevance_score'], reverse=True)
        
        top_docs = []
        for item in reranked_results[:self.k]:
            doc = documents[item['index']]
            if isinstance(doc, str):
                top_docs.append(doc)
            else:
                top_docs.append(doc.get('content', doc))
        
        return dspy.Prediction(passages=top_docs)
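
The last step of `forward` maps each result's `index` back to the original document list. The sketch below isolates that step so it can be checked without calling the API. The helper name `select_top_docs` and the mock response are hypothetical; the mock only mirrors the two fields (`index`, `relevance_score`) that the class above reads.

```python
def select_top_docs(documents, results, k):
    """Map reranker results back to the original documents, best first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    top_docs = []
    for item in ranked[:k]:
        doc = documents[item["index"]]
        # Plain strings pass through; dict-like docs contribute their 'content' field.
        top_docs.append(doc if isinstance(doc, str) else doc.get("content", doc))
    return top_docs

# Hypothetical response body, shaped like the fields the class above reads.
mock_results = [
    {"index": 2, "relevance_score": 0.91},
    {"index": 0, "relevance_score": 0.15},
    {"index": 1, "relevance_score": 0.47},
]
docs = ["reboot the Mac", {"content": "check journald.conf"}, "run fsck on the partition"]
print(select_top_docs(docs, mock_results, k=2))
```

Because the results refer to documents by position, the `documents` list sent in the payload and the list used for the lookup must be the same object in the same order.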


## Load Dataset


In [48]:
# Download and load the RAG-QA Arena Tech dataset
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus_lines = f.readlines()[:500]

corpus = []
for line in corpus_lines:
    data = ujson.loads(line)
    corpus.append(data['text'])

print(f"Loaded {len(corpus)} documents")


Loaded 500 documents


In [38]:
# Load question-answer pairs for evaluation
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

data = [dspy.Example(**d).with_inputs('question') for d in data]

# Split data for training, validation, and testing
random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

print(f"Training: {len(trainset)}, Dev: {len(devset)}, Test: {len(testset)}")


Downloading 'ragqa_arena_tech_examples.jsonl'...
Training: 200, Dev: 300, Test: 500


## Setup Base Retriever


In [None]:
# Create base embedding retriever
embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512, api_key=OPENAI_API_KEY)
base_search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=10)
base_search_wide = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=50)


## Create RAG Systems


In [None]:
class RAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.respond = dspy.ChainOfThought('context, question -> response')
    
    def forward(self, question):
        if hasattr(question, 'question'):
            question = question.question
        elif isinstance(question, dict) and 'question' in question:
            question = question['question']
        context = self.retriever(question).passages
        return self.respond(context=context, question=question)

base_rag = RAG(base_search)

# RAG with Contextual AI reranking; changing rerank_instructions can yield different results
contextual_reranker = ContextualReranker(
    api_key=CONTEXTUAL_API_KEY,
    base_retriever=base_search_wide,
    k=10,
    rerank_instructions="Prioritize documents that provide specific, actionable technical solutions and step-by-step instructions."
)
reranked_rag = RAG(contextual_reranker)


## Evaluation Setup


In [41]:
# Setup evaluation metric
metric = SemanticF1(decompositional=True)
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)



## Test the Reranker

Let's test our reranker with a simple example to see how it works.


In [51]:
# Test query - something relevant to the tech corpus
test_query = "How do I fix my Linux system"

print(f"Query: {test_query}\n")

base_results = base_search(test_query)
print("Base retrieval results:")
for i, doc in enumerate(base_results.passages[:3]):
    print(f"{i+1}. {doc[:200]}...\n")

reranked_results = contextual_reranker(test_query)
print("Reranked results:")
for i, doc in enumerate(reranked_results.passages[:3]):
    print(f"{i+1}. {doc[:200]}...\n")


Query: How do I fix my Linux system

Base retrieval results:
1. For some reason the disk my Mac was booting from was the macOS Installer Disk, all I had to do: Hold down the option (-alt) key before the Apple logo shows. Then just select your Macintosh HD (or how ...

2. In addition to the preceding answers, which mention only Windows, and since theres a dup-closed question Does WannaCry infect Linux? pointing to this one, Id like to add that Linux machines can get in...

3. Riffing off of Prabhats question above, I had this issue in macos high sierra when I stranded an encfs process, rebooting solved it, but this ps -ef | grep name-of-busy-dir Showed me the process and t...

Reranked results:
1. You dont typically clear the journal yourself. That is managed by systemd itself and old logs are rotated out as new data comes in. The correct thing to do would be to schedule journald to only keep a...

2. You need to give permission to the path. run this in command line and you will be fine

### Analyzing the Results

Notice the difference in quality between the two sets of retrieved passages:

**Base Retrieval Issues:**
- Retrieved documents about macOS booting issues (not Linux-specific)
- Included content about WannaCry malware (not relevant to fixing systems)
- Retrieved macOS High Sierra troubleshooting (wrong OS)

**Reranked Retrieval Improvements:**
- Focused on systemd journal management (actual Linux system administration)
- Provided file permission fixes with specific commands (`sudo chown`)
- Delivered actionable solutions for common Linux issues

**Impact on Response Quality** (see the RAG responses in the next two cells):
- **Base RAG**: Gave generic troubleshooting steps not specific to the retrieved context
- **Reranked RAG**: Provided concrete, actionable Linux commands with file paths and configuration details

For this query, the reranker filtered out irrelevant OS-specific content and prioritized documents with "specific, actionable technical solutions", as specified in our rerank instructions.
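
One way to make this comparison less anecdotal is to list the passages that the reranker surfaced from below the first-stage top 10. The sketch below is a minimal illustration with toy data; `promoted_passages` is a hypothetical helper, and the two lists stand in for the `.passages` attributes returned by `base_search_wide` and `contextual_reranker`.

```python
def promoted_passages(base_passages, reranked_passages, base_k=10):
    """Return reranked passages that were not in the first-stage top `base_k`."""
    base_top = set(base_passages[:base_k])
    return [p for p in reranked_passages if p not in base_top]

# Toy stand-ins for the `.passages` lists returned by the two retrievers.
wide = [f"doc{i}" for i in range(50)]          # first-stage order, k=50
reranked = ["doc17", "doc3", "doc42", "doc0"]  # hypothetical reranker order
print(promoted_passages(wide, reranked))
```

With the objects defined in this notebook, the call would be `promoted_passages(base_search_wide(test_query).passages, contextual_reranker(test_query).passages)`.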


In [52]:
# Same Query
base_response = base_rag(question=test_query)
reranked_response = reranked_rag(question=test_query)

print(f"Base RAG Response:{base_response.response}")


Base RAG Response:To fix your Linux system, start by identifying any running processes that might be causing issues. You can use the command `ps aux` to list all processes. If you suspect a specific process is problematic, you can terminate it using `sudo kill -15 <PID>` where `<PID>` is the process ID. Additionally, check your disk partitions with `df -Th` to ensure they are correctly formatted and mounted. If you encounter file system errors, running `fsck` can help resolve them. If these steps do not resolve the issue, consider backing up your data and performing a clean installation of the operating system.


In [53]:
print(f"Reranked RAG Response:{reranked_response.response}")

Reranked RAG Response:To fix your Linux system, first identify the specific issue you're encountering. If it's a permission problem, use `sudo chown -R $USER /path/to/directory` to change ownership. For network issues, ensure your network interfaces are configured correctly in `/etc/network/interfaces` and restart the networking service. If you're having trouble with packages, use `dpkg-deb` to manage them. For log management, adjust the settings in `/etc/systemd/journald.conf` to control log size. If you provide more details about the specific problem, I can offer more targeted advice.


## Quantitative Evaluation Results

Now let's compare the performance of different RAG configurations on our development set.


In [None]:
evaluate(base_rag)

Average Metric: 121.75 / 300 (40.6%): 100%|██████████| 300/300 [16:48<00:00,  3.36s/it]

2025/10/13 12:36:26 INFO dspy.evaluate.evaluate: Average Metric: 121.75266058090479 / 300 (40.6%)





EvaluationResult(score=40.58, results=<list of 300 results>)

In [54]:
print("Evaluating Contextual AI reranked RAG system...")
reranked_score = evaluate(reranked_rag)


Evaluating Contextual AI reranked RAG system...
Average Metric: 127.27 / 300 (42.4%): 100%|██████████| 300/300 [20:48<00:00,  4.16s/it]

2025/10/13 13:44:46 INFO dspy.evaluate.evaluate: Average Metric: 127.27461854472807 / 300 (42.4%)





### Results Comparison

| Configuration | SemanticF1 Score | Improvement | Description |
|--------------|------------------|-------------|-------------|
| **Base RAG (CoT)** | **40.6%** | Baseline | Embeddings retrieval + Chain-of-Thought |
| **Reranked RAG** | **42.4%** | **+1.8 pp** | Contextual AI reranking + Chain-of-Thought |
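
The figures in the table can be re-derived from the totals printed in the two evaluation logs. The short calculation below separates the absolute gain (in percentage points) from the relative gain, and also reads the time per item off the progress bars. Note that with `num_threads=4` the progress-bar figure reflects overall throughput rather than the latency of a single request.

```python
# Totals copied from the two evaluation logs above.
base_total, reranked_total, n = 121.75266058090479, 127.27461854472807, 300

base_pct = 100 * base_total / n
reranked_pct = 100 * reranked_total / n
abs_gain_pp = reranked_pct - base_pct  # absolute gain, in percentage points
rel_gain_pct = 100 * (reranked_total - base_total) / base_total  # relative gain

# Seconds per item as reported by the progress bars.
base_s, reranked_s = 3.36, 4.16
extra_time_s = reranked_s - base_s

print(f"{base_pct:.2f}% -> {reranked_pct:.2f}% "
      f"(+{abs_gain_pp:.2f} pp, +{rel_gain_pct:.1f}% relative)")
print(f"extra time per item: {extra_time_s:.2f} s")
```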

### Understanding the Results

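**Why scores seem low:** `SemanticF1` is strict. It uses an LM judge to estimate how many of the reference answer's key ideas the response covers (recall) and how much of the response the reference supports (precision), so a response that is reasonable but emphasizes different points than the single reference answer still loses credit. Scores in the 40-45% range are therefore not unusual for this setup.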
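
As a reminder of how an F1-style score combines its two components, the sketch below computes the harmonic mean of precision and recall. The precision and recall values are hypothetical; in `SemanticF1` they come from an LM judge, which is not reproduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical judgments: the response covers half of the reference's key ideas
# (recall 0.5), and everything it states is supported by the reference (precision 1.0).
print(round(f1_score(1.0, 0.5), 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a response that misses half of the reference's points scores about 0.67 even when nothing it says is wrong.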

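**What matters:**
The corpus here is truncated to 500 documents, so the retriever has limited material to work with. Even so, reranking raised the average SemanticF1 score on the 300 dev-set questions from 40.6% to 42.4%, a gain of 1.8 percentage points. This is an average of per-question scores, not a count of additional correct answers. The cost is one extra API call per query; the evaluation progress bars show about 4.16 s per example with reranking versus 3.36 s for the baseline.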

**Why it works:**
- Reranker examines 50 documents instead of 10
- Finds better technical content ranked lower by embeddings alone
- Uses instructions to prioritize actionable, step-by-step solutions
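
The first two points can be illustrated with a toy two-stage pipeline. In this sketch both scorers are hypothetical heuristics: `overlap_score` stands in for the embedding retriever and `careful_score` for the reranker, and neither reflects how the real models score documents. The point is only that a wider first-stage pool gives the second stage a chance to promote a document the first stage ranked low.

```python
def overlap_score(query: str, doc: str) -> int:
    """Cheap first-stage scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def careful_score(query: str, doc: str) -> float:
    """Stand-in second-stage scorer (hypothetical heuristic):
    token overlap plus a bonus for docs containing a command in backticks."""
    return overlap_score(query, doc) + (3.0 if "`" in doc else 0.0)

def retrieve_then_rerank(query, corpus, wide_k, top_n):
    # Stage 1: keep a candidate pool of size wide_k using the cheap scorer.
    pool = sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)[:wide_k]
    # Stage 2: reorder only that pool with the more expensive scorer.
    return sorted(pool, key=lambda d: careful_score(query, d), reverse=True)[:top_n]

corpus = [
    "hold the option key to fix my mac boot disk",
    "wannacry can also reach my linux system",
    "linux permissions: run `sudo chown -R $USER /path/to/directory`",
    "notes on mechanical keyboards",
]
query = "how do i fix my linux system"

narrow = retrieve_then_rerank(query, corpus, wide_k=2, top_n=1)
wide = retrieve_then_rerank(query, corpus, wide_k=4, top_n=1)
print("narrow pool:", narrow[0])
print("wide pool:  ", wide[0])
```

With the narrow pool the command-bearing document never reaches the second stage, so it cannot be promoted; with the wide pool it ends up on top.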

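**Note:** As a rough rule of thumb, improvements of 2-5 percentage points are considered meaningful when building on a strong baseline. Larger corpora (1000+ documents) may leave more room for a reranker to help, on the order of 5-10 points, though neither figure was measured in this tutorial.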

---

**Key Insights:**

1. **Instruction-Following Advantage**: Contextual AI's reranker lets you:
   - Customize ranking through natural language instructions
   - Adapt to domains without retraining
   - Understand context (e.g., "fix" means troubleshooting, not just semantic similarity)

2. **When to Use Reranking:**
   - Domain-specific requirements (e.g., "prioritize official docs" or "prefer recent content")
   - Initial retrieval returns too many marginally relevant documents (**in this case there were only 500 documents**)
   - Need filtering by criteria embeddings can't capture (metadata, recency, source authority)
   - Working with larger corpora where improvement potential is higher

The reranker's instruction-following capability means you can adapt to different use cases by changing the instructions alone, with no retraining required.
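
To see what "changing the instructions" amounts to at the request level, the sketch below builds two request bodies with the same fields that `ContextualReranker.forward` sends. The helper `build_rerank_payload` and the second instruction string are hypothetical; only the `instruction` field differs between the two payloads.

```python
def build_rerank_payload(query, documents, instruction, top_n=10,
                         model="ctxl-rerank-v2-instruct-multilingual"):
    """Assemble a JSON body with the same fields used in ContextualReranker.forward."""
    return {
        "query": query,
        "documents": documents,
        "model": model,
        "top_n": top_n,
        "instruction": instruction,
    }

# Hypothetical instruction variants for the same query and candidate documents.
instructions = {
    "actionable": "Prioritize documents that provide specific, actionable "
                  "technical solutions and step-by-step instructions.",
    "official": "Prioritize official documentation over forum answers.",
}
payloads = {
    name: build_rerank_payload("How do I fix my Linux system",
                               ["doc a", "doc b"], text, top_n=2)
    for name, text in instructions.items()
}
for name, body in payloads.items():
    print(name, "->", body["instruction"])
```

In the notebook itself, the equivalent change is to construct a second `ContextualReranker` with a different `rerank_instructions` argument and evaluate it with the same `evaluate` call.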