# GEPA Hands-On: Optimized Listwise Reranker

This notebook will illustrate how to use the **GEPA optimizer in DSPy to train a Listwise Reranker**!

We find a jump in Recall @ 1 from **32% to 45%** with a GEPA-optimized prompt. 

This result comes from a small-scale optimization run of 500 metric calls that took about an hour and a half. We think a longer optimization run would very likely yield a larger gain, as the Pareto frontier indicates.

### What are Rerankers?
Rerankers take two inputs: (1) a user's query and (2) a list of candidate documents identified by a first-stage ranking algorithm, such as Hybrid Search.

Rerankers then predict a **new ranking** of the candidate documents.

There are two common forms of rerankers: Cross Encoders and Listwise Rerankers. A Cross Encoder scores each (query, candidate document) pair in isolation, whereas a Listwise Reranker takes the query and all candidate documents as input at once.
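To make the difference in interface concrete, here is a minimal sketch. It does not use any real model: `overlap_score` is a toy stand-in for a cross encoder and `toy_listwise_ranker` is a toy stand-in for an LLM-based listwise reranker. All function names and the example documents are hypothetical.

```python
from typing import Callable, List


def overlap_score(query: str, doc: str) -> int:
    """Toy stand-in for a cross encoder: counts words shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def pointwise_rerank(
    query: str, docs: List[str], score_pair: Callable[[str, str], float]
) -> List[int]:
    """Cross-encoder style: score each (query, doc) pair in isolation, then sort."""
    scores = [score_pair(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def listwise_rerank(
    query: str, docs: List[str], rank_all: Callable[[str, List[str]], List[int]]
) -> List[int]:
    """Listwise style: a single call sees the query and every candidate at once."""
    return rank_all(query, docs)


def toy_listwise_ranker(query: str, docs: List[str]) -> List[int]:
    """Toy stand-in for an LLM: ranks by overlap, breaking ties in favor of the
    candidate with more words that no other candidate contains. The tie-break
    is only possible because the function sees the whole list."""

    def unique_words(i: int) -> int:
        others = set().union(
            *(set(d.lower().split()) for j, d in enumerate(docs) if j != i)
        )
        return len(set(docs[i].lower().split()) - others)

    return sorted(
        range(len(docs)),
        key=lambda i: (overlap_score(query, docs[i]), unique_words(i)),
        reverse=True,
    )


query = "when are responses due"
docs = [
    "responses are due friday",
    "responses are due october 4",
    "lunch menu for friday",
]

print(pointwise_rerank(query, docs, overlap_score))
print(listwise_rerank(query, docs, toy_listwise_ranker))
```

The pointwise ranker cannot separate the first two documents because each is scored without knowledge of the other. The listwise ranker can compare candidates against each other, which is the property this notebook relies on.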

This notebook simplifies this task a bit, focusing on identifying the **most** relevant document, or **best match**, rather than a ranking of the entire input set.

### GEPA Optimizer

The primary focus of this notebook is to illustrate how you can use the GEPA optimizer to develop AI systems.

GEPA introduces several innovations for prompt optimization such as **Reflective Prompt Mutation**, **System-Aware Merge**, and **Pareto-Optimal Candidate Selection**.

For a conceptual explanation of GEPA, we hope you will find [this video](https://www.youtube.com/watch?v=czy7hvXIImE) useful, along with our [interview with Lakshya A. Agrawal](https://www.youtube.com/watch?v=fREQrxhBSk0). You can also find the [research publication introducing GEPA here](https://arxiv.org/pdf/2507.19457).

In this example, we aim to clarify these concepts further by sharing our experience using GEPA!

### There are 6 steps in this notebook:

1. Set Up DSPy Program `BestMatchReranker`

2. Load Training Dataset (**EnronQA**)

3. Define Metric with Natural Language Feedback

4. Evaluate Unoptimized `BestMatchReranker`

5. Run GEPA Optimizer

6. Evaluate Optimized `BestMatchReranker`

## 1. Set Up DSPy Program `BestMatchReranker`

In [None]:
import os
from pydantic import BaseModel
from typing import Optional, List, Dict, Any

import dspy

### CONFIGURE ###

os.environ["OPENAI_API_KEY"] = "sk-proj-foobar"

### MODELS ###

class SearchResult(BaseModel):
    id: int
    content: str
    dataset_id: Optional[str]

class Source(BaseModel):
    object_id: str

class DSPyAgentRAGResponse(dspy.Prediction):
    def __init__(self, final_answer: str = "", sources: Optional[List[Source]] = None, 
                 searches: Optional[List[str]] = None, aggregations: Optional[List] = None,
                 usage: Optional[Dict[str, Any]] = None, **kwargs):
        super().__init__(**kwargs)
        self.final_answer = final_answer
        self.sources = sources or []
        self.searches = searches
        self.aggregations = aggregations
        self.usage = usage or {}

### SIGNATURES ###

class BestMatchRanker(dspy.Signature):
    """Identify the single most relevant passage to the query.
    
    Your task is to analyze ALL passages simultaneously and identify the one passage 
    that is most relevant for answering the query.
    
    Instructions:
    1. Read the query carefully and understand the information need
    2. Evaluate each passage for:
       - Direct relevance to answering the query
       - Factual accuracy and completeness
       - Information quality and clarity
    3. Compare passages against each other (not just individually)
    4. Return the ID of the single most relevant passage
    
    CRITICAL: You must return exactly 1 passage ID - the best match.
    """
    
    query: str = dspy.InputField(
        desc="The user's question or information need"
    )
    search_results: list[SearchResult] = dspy.InputField(
        desc="List of passages to analyze. Each contains: id, content"
    )
    best_match_id: int = dspy.OutputField(
        desc="The ID of the single most relevant passage. Must match an ID from search_results."
    )

### DSPy Language Program ###

class BestMatchReranker(dspy.Module):
    def __init__(
        self,
        verbose: bool = False,
    ):
        super().__init__()
        # init LLM
        self.lm = dspy.LM("openai/gpt-4.1-mini", api_key=os.getenv("OPENAI_API_KEY"))
        dspy.configure(lm=self.lm, track_usage=True)

        self.verbose = verbose
        self.reranker = dspy.ChainOfThought(BestMatchRanker) # update to send rationale through to metric

    def forward(self, question: str, candidates: list[SearchResult]) -> DSPyAgentRAGResponse:
        # Perform reranking
        rerank_pred = self.reranker(
            query=question,
            search_results=candidates,
        )
        
        # Find the best match result based on the returned ID
        best_match_result = None
        for candidate in candidates:
            if candidate.id == rerank_pred.best_match_id:
                best_match_result = candidate
                break
        
        reranked_sources = [best_match_result] if best_match_result else []
        
        if self.verbose:
            print(f"\033[96mReranked: Returning {len(reranked_sources)} Sources!\033[0m")
            if best_match_result:
                # Find the original position of this result in candidates
                original_rank = candidates.index(best_match_result) + 1  # +1 for 1-based ranking
                print(f"Best match ID: {rerank_pred.best_match_id} (was rank {original_rank})")
        
        # Get usage from reranker
        usage = rerank_pred.get_lm_usage() or {}
        
        return DSPyAgentRAGResponse(
            final_answer="",
            sources=reranked_sources,
            searches=[question],
            aggregations=None,
            usage=usage,
        )
    
    async def aforward(self, question: str) -> DSPyAgentRAGResponse:
        pass

reranker = BestMatchReranker()

## 2. Load Dataset

The dataset we will use can be found on [Weaviate's HuggingFace Repo](https://huggingface.co/datasets/weaviate/hard-questions-enronqa).

It contains 138 questions from the EnronQA dataset for which state-of-the-art pre-trained Rerankers placed the correct document in the top 5 (Recall @ 5) but not at rank 1 (Recall @ 1).

This presents an interesting opportunity to see if LLM-based rerankers can get us that extra mile to close the gap between Recall @ 5 and Recall @ 1.
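As a quick illustration of the gap between the two metrics, here is a minimal sketch of Recall @ k for a single query. The function name and the document IDs are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[str], ground_truth_ids: Sequence[str], k: int
) -> float:
    """Fraction of ground-truth IDs that appear in the top-k ranked IDs."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for gt in ground_truth_ids if gt in top_k)
    return hits / len(ground_truth_ids)


# Hypothetical ranking where the correct document sits at rank 3
ranked = ["doc_b", "doc_d", "doc_a", "doc_e", "doc_c"]
ground_truths = ["doc_a"]

print("Recall @ 5:", recall_at_k(ranked, ground_truths, k=5))
print("Recall @ 1:", recall_at_k(ranked, ground_truths, k=1))
```

Every question in this dataset looks like the example above for the pre-trained rerankers: the correct document is in the top 5 but not in first place.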



In [10]:
from datasets import load_dataset

all_samples = load_dataset("weaviate/hard-questions-enronqa")['train']

In [11]:
all_samples[0]

{'question': "According to Sarah Novosel's email, what is the timeframe for submitting responses to the CPUC's data requests?",
 'shortlisted_candidates': [{'content': "The passage directly addresses the query by providing the timeframe for submitting responses to the CPUC's data requests, indicating that responses are due on the Friday following the email date. It clearly reflects Sarah Novosel's communication about the deadline, fulfilling the query's intent. The only limitation is the lack of an explicit calendar date for the Friday deadline, but the context allows for a reasonable inference.",
   'dataset_id': '4001',
   'id': 36},
  {'content': "The passage directly answers the query by stating that responses to the CPUC's data requests are due on Friday, according to Sarah Novosel's email. It effectively provides the timeframe for submission, fulfilling the query's intent. However, it does not specify the exact calendar date for the Friday deadline, which could be a minor limitat

### Dataset Preprocessing

We will quickly add the `ground_truth_content` to our samples so that GEPA can use it in the metric feedback.

(Sorry this isn't already in the dataset on HuggingFace)

In [12]:
# convert these samples to `dspy.Example` objects

import dspy

all_samples_cleaned = []

for sample in all_samples:
    candidates = []
    for idx, c in enumerate(sample["shortlisted_candidates"]):
        new_candidate = SearchResult(
            id=idx,
            content=c["content"],
            dataset_id=c["dataset_id"]
        )
        candidates.append(new_candidate)
    
    # Find ground truth content by matching dataset_id
    ground_truth_content = None
    for ground_truth_id in sample["ground_truths"]:
        for c in sample["shortlisted_candidates"]:
            if c["dataset_id"] == str(ground_truth_id):
                ground_truth_content = c["content"]
                break
        if ground_truth_content:
            break

    ex = dspy.Example().with_inputs("question", "candidates")
    ex["question"] = sample["question"]
    ex["candidates"] = candidates
    ex["ground_truths"] = sample["ground_truths"]
    ex["ground_truth_content"] = ground_truth_content

    all_samples_cleaned.append(ex)

print(len(all_samples_cleaned))

138


In [None]:
all_samples_cleaned[0]["ground_truth_content"]

"The passage directly answers the query by stating that responses to the CPUC's data requests are due on Friday, according to Sarah Novosel's email. It effectively provides the timeframe for submission, fulfilling the query's intent. However, it does not specify the exact calendar date for the Friday deadline, which could be a minor limitation for precise scheduling."

In [17]:
reranker.forward(**all_samples_cleaned[0].inputs())



Prediction(
    final_answer='',
    sources=[SearchResult(id=2, content="The passage directly addresses the query by providing the specific deadlines for submitting responses to the CPUC's data requests as communicated in the email. It clearly states the final date for submitting information (October 4, 2000) and the earlier deadline for submitting claims of privilege or confidentiality (September 29, 2000). This makes the passage highly relevant and complete in answering the query about the timeframe.", dataset_id='2386')],
    searches=["According to Sarah Novosel's email, what is the timeframe for submitting responses to the CPUC's data requests?"],
    aggregations=None,
    usage={'openai/gpt-4.1-mini': {'completion_tokens': 146, 'prompt_tokens': 904, 'total_tokens': 1050, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0, 'text_tokens': None}, 'prompt_tokens_details': {'audio_tokens': 0, 'cach

In [18]:
reranker.lm.inspect_history()

[2025-08-21T16:27:21.911466]

System message:

Your input fields are:
1. `query` (str): The user's question or information need
2. `search_results` (list[SearchResult]): List of passages to analyze. Each contains: id, text, initial_rank, and hybrid_score
Your output fields are:
1. `reasoning` (str): 
2. `best_match_id` (int): The ID of the single most relevant passage. Must match an ID from search_results.
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## query ## ]]
{query}

[[ ## search_results ## ]]
{search_results}

[[ ## reasoning ## ]]
{reasoning}

[[ ## best_match_id ## ]]
{best_match_id}        # note: the value you produce must be a single int value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Identify the single most relevant passage to the query.
        
        Your task is to analyze ALL passages simultaneously and identify the one passage 
        that is most

## 3. Metric with Feedback

One of the key innovations in GEPA is to provide the Prompt Proposer with more detailed information about input-output pairs. Rather than returning only a scalar reward, such as "0.7", the metric also sends back natural language feedback about what went wrong or right with this inference.

In our case, we will show the ground truth document that should have been ranked first, so that the Prompt Proposer can try to reverse engineer why the current instructions selected the document they did instead of the correct one.

In [19]:
# define the metric
from dspy import Example, Prediction

def recall_metric_with_feedback(
    example: Example,
    prediction,
    trace=None,
    pred_name=None,
    pred_trace=None,
) -> Prediction:
    question = example.question
    ground_truth_content = example.ground_truth_content

    # The reranker returns no sources when the predicted ID matches no candidate
    if not prediction.sources:
        return Prediction(
            score=0.0,
            feedback=f"No valid passage ID was returned for the query: {question}. The correct answer was: {ground_truth_content}."
        )

    retrieved_id = prediction.sources[0].dataset_id  # only reranking to 1 result for now
    ground_truth = str(example.ground_truths[0])

    if retrieved_id == ground_truth:
        return Prediction(
            score=1.0,
            feedback="Awesome! The system correctly predicted the top document."
        )

    predicted_content = prediction.sources[0].content
    return Prediction(
        score=0.0,
        feedback=f"Incorrect document selected for the query: {question}. The correct answer was: {ground_truth_content}. The system incorrectly predicted {predicted_content}."
    )

## 4. Run Unoptimized Eval

Evaluate our `BestMatchReranker` before optimizing it with GEPA!

In [21]:
trainset, testset = all_samples_cleaned[:100], all_samples_cleaned[100:]

In [22]:
# run the evaluator on the test set to understand the performance of the zero-shot listwise reranker
evaluator = dspy.Evaluate(
    devset=testset,
    metric=recall_metric_with_feedback,
    num_threads=5,  # evaluate 5 examples in parallel
    display_progress=True,
    max_errors=1,
    provide_traceback=True
)

evaluator(reranker)

Average Metric: 12.00 / 38 (31.6%): 100%|██████████| 38/38 [00:20<00:00,  1.86it/s]

2025/08/21 16:27:53 INFO dspy.evaluate.evaluate: Average Metric: 12.0 / 38 (31.6%)





EvaluationResult(score=31.58, results=<list of 38 results>)

`Note:` Reranking over summarized query <> candidate document relevance descriptions, rather than the raw emails, improves this unoptimized evaluation from 4/38 to 12/38.

## 5. GEPA Optimization

As a quick reminder, GEPA uses a **Pareto frontier** to sample candidates.

Each candidate on the frontier achieves the best score among all candidates on at least one of your validation samples.

This is critical to understanding how the optimizer works:

![pareto_frontier](./pareto-sampling.png)
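The idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not GEPA's actual implementation: the candidate names and scores are made up, and the sketch omits details such as pruning dominated candidates. A candidate joins the frontier if it ties for the best score on at least one validation sample, and it is sampled in proportion to the number of samples it wins.

```python
import random
from collections import Counter
from typing import Dict, List


def frontier_win_counts(scores: Dict[str, List[float]]) -> Counter:
    """For each validation sample, credit every candidate that ties for the
    best score on that sample. Candidates with at least one win form the
    frontier in this simplified sketch."""
    num_samples = len(next(iter(scores.values())))
    wins: Counter = Counter()
    for i in range(num_samples):
        best = max(per_sample[i] for per_sample in scores.values())
        for name, per_sample in scores.items():
            if per_sample[i] == best:
                wins[name] += 1
    return wins


def sample_candidate(wins: Counter, rng: random.Random) -> str:
    """Pick a frontier candidate with probability proportional to its wins."""
    names = list(wins)
    weights = [wins[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical per-sample scores for four prompt candidates on four validation samples
scores = {
    "seed":   [1.0, 0.0, 0.0, 1.0],
    "cand_a": [1.0, 1.0, 0.0, 0.0],
    "cand_b": [0.0, 0.0, 1.0, 0.0],
    "cand_c": [0.0, 0.0, 0.0, 0.0],
}

wins = frontier_win_counts(scores)
print(dict(wins))
print(sample_candidate(wins, random.Random(0)))
```

Note that `cand_b` has the lowest aggregate score of the three frontier candidates but still stays on the frontier, because it is the only candidate that solves the third sample. Keeping such candidates available for further mutation is what lets the search explore diverse strategies rather than only refining the current best prompt.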

Before constructing and running the GEPA optimizer, here are a couple of tips we recommend for monitoring the optimization run.

1. Set up Weights & Biases Logging!

This is already built into the GEPA optimizer. It will log things that make it easy to keep track of your GEPA run such as the best score on your entire validation set, the Pareto frontier score, and the iteration of the training run, amongst others.


### Weights & Biases Logging

#### Best Overall Score
![agg_score](./wandb-gepa-2.png)

#### Pareto Frontier
![pareto](./wandb-gepa-1.png)

2. Check in with Gemini

We found it very helpful to copy and paste the output from GEPA and ask Gemini --

```
Can you analyze this prompt optimization run? How's it going?

{paste your GEPA output here}
```

### Code to set up the GEPA Optimizer Run

In [None]:
# optimize!!

import dspy

import logging

# Simple setup for Jupyter
logging.basicConfig(level=logging.INFO, force=True)
logging.getLogger('dspy.teleprompt.gepa').setLevel(logging.INFO)
logging.getLogger('gepa').setLevel(logging.INFO)

# SILENCE the noisy HTTP loggers
logging.getLogger('httpx').setLevel(logging.WARNING)  # Only warnings and errors
logging.getLogger('openai').setLevel(logging.WARNING)
logging.getLogger('weaviate').setLevel(logging.WARNING)
logging.getLogger('httpcore').setLevel(logging.WARNING)

reflection_lm = dspy.LM(
    model="gpt-5",
    temperature=1.0,
    max_tokens=32_000
)

wandb_config = {
    "project": "gepa-best-match-ranker-run",
    "name": "run-500-calls-val25-merge10",
    "notes": "Increasing budget and merge invocations, reduce val set size.",
    "config": {
        "max_metric_calls": 500,
        "reflection_minibatch_size": 5,
        "max_merge_invocations": 10,
        "val_subset_size": 25,
        "train_size": 75
    }
}

optimizer = dspy.GEPA(
    metric=recall_metric_with_feedback,
    max_metric_calls=500,
    reflection_lm=reflection_lm,
    reflection_minibatch_size=5,
    use_merge=True,
    max_merge_invocations=10,
    num_threads=5,
    use_wandb=True,
    wandb_api_key=os.getenv("WANDB_API_KEY"),
    wandb_init_kwargs=wandb_config
)

train_samples=trainset[:75]
val_samples=trainset[75:]

optimized_reranker = optimizer.compile(
    reranker,
    trainset=train_samples,
    valset=val_samples
)

2025/08/21 16:44:19 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 500 metric calls of the program. This amounts to 5.00 full evals on the train+val set.
2025/08/21 16:44:19 INFO dspy.teleprompt.gepa.gepa: Using 25 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
wandb: Appending key for api.wandb.ai to your netrc file: /Users/cshorten/.netrc
wandb: Currently logged in as: cshorten to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


  PydanticSerializationUnexpectedValue(Expected 9 fields but got 6: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## re...: None}, annotations=[]), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])
  return self.__pydantic_serializer__.to_python(
2025/08/21 16:44:33 INFO dspy.evaluate.evaluate: Average Metric: 8.0 / 25 

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:04<00:00,  1.23it/s]

2025/08/21 16:44:37 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 16:45:17 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Proposed new text for reranker.predict: Task: Given a query and a list of SearchResult items (each with id, content, dataset_id), identify the single most relevant passage to answer the query and return exactly one passage ID (SearchResult.id).

Input format:
- query: A natural-language question asking for a specific fact (who/what/where/when).
- search_results: A list of SearchResult objects with fields:
  - id: the passage ID you must return
  - content: a descripti

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.37it/s] 

2025/08/21 16:45:40 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 16:46:25 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for reranker.predict: Task: Select the single most relevant passage ID for a given query.

Input format:
- query: a single question string.
- search_results: a list of SearchResult objects with fields:
  - id: integer passage identifier to return.
  - content: a description/summary of what the passage says (may itself be meta-descriptive).
  - dataset_id: an opaque identifier (do not use as the answer).

Goal:
Return exactly one passage ID (a single integer) that best answers the query on its own.

Process:
1) Parse the query carefully. Extract all constraints and qualifiers, including:
   - Who the information is attributed to (e.g., “according to Margaret Huson’s email”).
   - The medium/source (e.g., “email,” “email signature”).
   - The specific information requested (e.g., due date, destination, phone number, legislative steps).
   - Temporal/modality cues (e.g., “will try to,” “is due,” “according 

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:04<00:00,  1.09it/s]

2025/08/21 16:46:34 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 16:48:38 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for reranker.predict: Task: Given a query and a list of passages (SearchResult objects with fields id, content, dataset_id), select the single most relevant passage for answering the query. Return only the passage id.

Input format:
- query: a natural-language question.
- search_results: a list of SearchResult items:
  - id: integer identifier to return
  - content: text summarizing or describing what the passage says
  - dataset_id: opaque identifier (do not return this)

Output format:
- Exactly one integer: the id of the single best passage. No explanations, no extra text, no JSON.

Core objective:
- Analyze all passages, compare them against each other, and pick the one passage that most directly, accurately, and completely answers the query (and matches all query constraints such as names, dates, file types, and requested granularity).

Process:
1) Parse the query and extract all required elements:


Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.39it/s] 

2025/08/21 16:48:45 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 16:49:33 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Proposed new text for reranker.predict: You are given:
- A query asking for specific information.
- A list of search_results. Each search result has:
  - id: the passage ID to return if it is the best match.
  - content: a meta-summary describing what the underlying passage says and how it relates to the query (often starting with phrases like “The passage directly addresses the query by...”).
  - dataset_id: ignore this; do not use it for selection.

Your task:
Selec

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.44it/s]

2025/08/21 16:49:40 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 16:51:17 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not return this)

Core instructions:
1) Read the query carefully and extract all constraints:
   - Who/what is the authoritative source? (e.g., “According to Sarah’s email”)
   - What exactly is being asked? (a specific entity, date/month, or an exact

Average Metric: 3.00 / 5 (60.0%): 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]

2025/08/21 16:51:38 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)





2025/08/21 16:52:59 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Proposed new text for reranker.predict: You are given:
- query: a natural-language question about some document/email content.
- search_results: a list of SearchResult objects, each with:
  - id: the passage identifier you must return
  - content: a summary/analysis describing what the underlying passage contains and how it relates to the query
  - (Ignore dataset_id; you do not return it.)

Your task:
Identify the single most relevant passage for answering the query and return exactly one passage ID (SearchResult.id). Output only the ID, nothing else.

How to decide “most relevant” (compare ALL passages against each other):

1) Parse the query carefully
   - Extract named entities, people, organizations, and exact constructs the query anchors on (e.g., person names, agencies, email sender, investigation body).
   - Identify the specific information requested (e.g., primary objective, omitted aspect, resort name, exact gr

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:02<00:00,  1.87it/s]

2025/08/21 16:53:05 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 16:54:26 INFO dspy.teleprompt.gepa.gepa: Iteration 7: Proposed new text for reranker.predict: Task: Given a user query and a list of passage summaries (search_results), select the single passage ID that best answers the query.

Input format:
- query: A natural-language question that may include constraints like source (e.g., “according to Sarah Novosel’s email”), subject lines, dates, and named entities.
- search_results: A list of items with fields:
  - id: integer passage identifier to return
  - content: a summary/description of what the passage states (not the raw passage)
  - dataset_id: ignore for selection purposes

Your goal:
- Analyze ALL passages and return exactly ONE id (the best match). No explanations or extra text—only the integer ID.

Core selection criteria (apply in order):
1) Source alignment (hard filter):
   - If the query specifies a source or context (e.g., “according to Sarah Novosel’s email,” “according to Dana’s notes from the FERC meeting,” “in the

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:02<00:00,  1.92it/s]

2025/08/21 16:54:33 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 16:55:56 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Proposed new text for reranker.predict: Task: Select the single most relevant passage (by ID) to answer a given query.

Input format:
- query: A natural-language question that may reference specific documents (e.g., emails) with metadata such as sender, subject, date/time.
- search_results: A list of SearchResult objects with:
  - id: numeric identifier you must return
  - content: a descriptive summary of what the passage contains or how well it answers the query
  - dataset_id: source identifier (do not return this)

Critical output rule:
- Return EXACTLY ONE item: the numeric id of the single best passage. No explanations, labels, or extra text.

How to choose the best passage:
1. Parse the query carefully and extract all constraints and requested elements:
   - Who/what is being cited (“according to [sender]”, “subject …”, “on [date/time]”).
   - What is being asked (e.g., a title, URL, authority/parties, location, ex

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:03<00:00,  1.65it/s]

2025/08/21 16:56:02 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 16:57:17 INFO dspy.teleprompt.gepa.gepa: Iteration 9: Proposed new text for reranker.predict: Task: Identify the single most relevant passage to the query.

Input format:
- query: A natural language question that often includes precise qualifiers (names, dates, organizations, locations, and source context like “according to [person]’s email” or “article quoted in the email”).
- search_results: A list of SearchResult objects with:
  - id: integer identifier to return
  - content: a descriptive summary of the underlying passage’s relevance and what it contains (treat this as authoritative; you will not see the full original passage)
  - dataset_id: auxiliary metadata (do not return this)

Your goal: Analyze ALL passages together and return exactly one id — the single best passage for answering the query.

Evaluation criteria (apply in this order):
1) Exact constraint match:
   - Extract all explicit constraints from the query (e.g., named entities, teams, people, roles, dates,

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.31it/s]

2025/08/21 16:57:41 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 16:58:34 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Proposed new text for reranker.predict: Task: Select the single most relevant passage to answer the query.

Input format:
- query: A natural-language question (often multi-part), sometimes with explicit source/time constraints (e.g., “according to Karen Denne’s email,” “in the article,” “on June 27, 2001,” “email exchange on September 18, 2000”).
- search_results: A list of SearchResult objects with:
  - id: integer passage identifier to return
  - content: textual summary of the passage’s contents/relevance
  - dataset_id: ignore for selection output (do not return this)

What to return:
- Return exactly one value: the single SearchResult.id (integer) of the best passage. Do not return text, reasoning, or dataset_id.

Process:
1) Parse the query carefully:
   - Extract the exact information requested (e.g., specific names, occupation, quoted phrase to include, action taken).
   - Note all constraints: source (email vs a

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:02<00:00,  1.88it/s]

2025/08/21 16:58:40 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 16:59:57 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Proposed new text for reranker.predict: Task
Select the single most relevant passage (by id) from a list of candidates that best answers the given query.

Input format
- query: A question to answer.
- search_results: A list of SearchResult objects, each with:
  - id: integer (the passage identifier to return)
  - content: a description/summary of what the passage contains
  - dataset_id: ignore this field

What to do
1) Read the query carefully and determine exactly what information is required (entities, number of items, scope).
2) Evaluate ALL passages against the query, comparing them to each other (not just individually).

Primary selection criteria
- Directness: Prefer passages that explicitly and unambiguously answer the question asked (not just “related to,” “context for,” or “could be used”).
- Completeness: The passage must contain all required elements the query asks for (e.g., both date AND location; exactly t

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:02<00:00,  2.01it/s]

2025/08/21 17:00:02 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 17:01:17 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer). Output only the integer ID.

Input format:
- query: a natural-language question that may include constraints such as:
  - the authoritative source and medium (e.g., “According to Hedy Govenar’s email on 07/02/2001 06:51 PM, subject ‘Language on bonds’”)
  - scope/context qualifiers (e.g., “the issue at hand,” “to Jeff Dasovich”)
  - the exact answer type (single entity/name, date/month, “two things,” “first after July,” etc.)
- search_results: a list of SearchResult objects:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (never return or use as a tie-breaker)

Core instructions:
1) Parse the query and e

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:03<00:00,  1.25it/s]

2025/08/21 17:01:25 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 17:02:46 INFO dspy.teleprompt.gepa.gepa: Iteration 13: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a natural-language question that may include constraints such as the authoritative source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), recipients (“to Jeff Dasovich”), or the exact type of answer (entity, date/month, “two things,” or an exact count like “six individuals”).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not use or return this)

Core instructions:
1) Parse the query and extract ALL constraints:
   - Attribution/source: Who is the authoritative source (e.g., “According

Average Metric: 3.00 / 5 (60.0%): 100%|██████████| 5/5 [00:05<00:00,  1.06s/it] 

2025/08/21 17:02:54 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)





2025/08/21 17:03:47 INFO dspy.teleprompt.gepa.gepa: Iteration 14: Proposed new text for reranker.predict: Task summary:
- You will be given a query and a list of SearchResult passages.
- Your job is to select the single passage that best answers the query and output exactly one integer: the SearchResult.id of that passage.
- Do not output any other text.

Inputs:
- query: A natural-language question that may include constraints such as the authoritative source (“According to [person]’s email”), subject line, date, recipients, scope/context (“the issue at hand,” “to Jeff Dasovich”), and the exact type of answer required (entity, date/month, or an exact list of two items).
- search_results: A list of SearchResult objects with fields:
  - id: integer identifier you must output
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: an internal source identifier (never output this)

Core decision process:
1) Parse the query carefully and extract all constraints:


Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:05<00:00,  1.03s/it]

2025/08/21 17:03:56 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 17:05:51 INFO dspy.teleprompt.gepa.gepa: Iteration 15: Proposed new text for reranker.predict: Task: Select the single passage that best answers a natural-language query and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a question that may include constraints such as an authoritative source (“According to [person]’s email”), scope (“the issue at hand”), timing (“after the July meetings”), or required answer type (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (never return this)

What to output:
- Return exactly 1 value: the SearchResult.id of the best passage.
- Output only the integer ID. Do not include any other text, labels, punctuation, or whitespace.

Core selection instructions:
1) Parse the query carefully. Extract all constrai

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:02<00:00,  2.03it/s] 

2025/08/21 17:06:10 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 17:07:29 INFO dspy.teleprompt.gepa.gepa: Iteration 16: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer). Output only the integer ID—no text, no reasoning, no dataset_id.

Input format:
- query: a natural-language question that may include constraints such as the authoritative source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not return or rely on this)

Core selection steps:
1) Parse the query and extract all constraints:
   - Source attribution: Who is the authoritative source and medium? e.g

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:02<00:00,  1.88it/s]

2025/08/21 17:07:36 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 17:09:34 INFO dspy.teleprompt.gepa.gepa: Iteration 17: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- s

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.28it/s] 

2025/08/21 17:09:57 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 17:11:05 INFO dspy.teleprompt.gepa.gepa: Iteration 18: Proposed new text for reranker.predict: Task: Select the single most relevant passage ID that best answers the query.

Input format:
- query: A question that often names a specific source (e.g., “According to Sarah’s email”, “According to Jeff Dasovich’s email”) and may specify quantity (e.g., “two things”, “two hurdles”), timeframe (e.g., “after the July meetings”), or a causation (“what prompted…”).
- search_results: A list of SearchResult objects with fields:
  - id: integer identifier to return
  - content: a short description of what the underlying passage contains
  - dataset_id: an identifier of the source collection

What to do:
1) Read the query carefully and extract all constraints:
   - Source attribution (e.g., “According to Anil’s email”, “According to Sarah’s email”, “According to Jeff Dasovich’s email”).
   - Count/scope (e.g., “two things”, “two hurdles”).
   - Timeframe/order (e.g., “first after July”).


Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:02<00:00,  2.39it/s] 

2025/08/21 17:11:12 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 17:12:49 INFO dspy.teleprompt.gepa.gepa: Iteration 19: Proposed new text for reranker.predict: You are given:
- A query (natural language question).
- A list of passages as SearchResult objects: each has an integer id and a content string that summarizes what the underlying passage says (treat content as the only text available).

Your task:
Select exactly one passage (by its id) that is the single best match to answer the query.

How to decide:
1) Understand the query precisely
   - Identify the exact information requested (entities, dates, locations, names, pairings).
   - Note qualifiers (e.g., “suggested” vs. “final,” “according to [person/email/message],” “this year,” multi-part asks).
   - If the query names a source (“according to Jeff Dasovich’s email,” “according to Jeremy Blachman’s message”), prioritize passages that explicitly attribute the information to that source or clearly quote/excerpt the relevant source content.

2) Evaluate each passage against the query

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:06<00:00,  1.33s/it]

2025/08/21 17:12:58 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 17:14:18 INFO dspy.teleprompt.gepa.gepa: Iteration 20: Proposed new text for reranker.predict: Task: Given a query and a list of passages (search_results), select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not return or use this)

Core decision process (follow in order):
1) Parse the query and extract all constraints:
   - Source attribution: person and medium (e.g., “According to Sarah’s email,” “Jeff Dasovich’s July 3, 2001 email”).
   - What is bein

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:02<00:00,  1.72it/s]

2025/08/21 17:14:25 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 17:15:47 INFO dspy.teleprompt.gepa.gepa: Iteration 21: Proposed new text for reranker.predict: Task: Select the single best passage that answers a natural-language query, and return exactly one passage ID (the SearchResult.id integer).

Input:
- query: a question that may include constraints such as the named source (e.g., “According to [person]’s email”), date/time, subject line, scope/issue, and the exact answer type (entity, date/month, “two things,” filename, etc.).
- search_results: a list of SearchResult objects with fields:
  - id: the integer you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (never return or use this directly)

Primary rules:
- Return exactly one value: the integer SearchResult.id of the single best passage.
- Do not include any reasoning, text, or dataset_id. Output only the integer ID.
- Never combine information across passages. The selected passage must independently satisfy the

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:03<00:00,  1.53it/s] 

2025/08/21 17:15:54 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 17:16:43 INFO dspy.teleprompt.gepa.gepa: Iteration 22: Proposed new text for reranker.predict: Task: Select the single most relevant passage to answer the query.

Input format:
- query: A natural-language question (often referencing specific emails).
- search_results: A list of SearchResult objects with fields:
  - id: integer identifier of the passage (this is the ONLY value you should return)
  - content: passage text or a concise description/synthesis of it
  - dataset_id: ignore this field

What to do:
1) Read the query carefully and extract the precise information need (who/what/when/how, the exact phrasing sought, and any source constraint like “According to the email/email chain”).
2) Evaluate ALL passages against the query simultaneously. For each passage, assess:
   - Directness: Does it explicitly and fully answer the question as asked?
   - Anchoring to the source: If the query says “According to the email/email chain,” prefer passages that explicitly present what

Average Metric: 3.00 / 5 (60.0%): 100%|██████████| 5/5 [00:03<00:00,  1.63it/s]

2025/08/21 17:17:03 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)





2025/08/21 17:18:21 INFO dspy.teleprompt.gepa.gepa: Iteration 23: Proposed new text for reranker.predict: Task: Select the single passage that best answers the query, and return exactly one passage ID (the SearchResult.id integer).

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), timing qualifiers, or the exact type of answer (entity/date/month/list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (never return this)

Absolute output rule:
- Return exactly one value: the integer SearchResult.id of the single best passage.
- Do not include any reasoning, extra text, or dataset_id. Output only the integer ID.

Core decision process (apply in order):

1) Parse the query and extract all c

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:01<00:00,  2.70it/s] 

2025/08/21 17:18:27 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 17:19:26 INFO dspy.teleprompt.gepa.gepa: Iteration 24: Proposed new text for reranker.predict: You are given:
- A query string.
- A list of SearchResult objects: each has an integer id and a content string that summarizes what a candidate passage says.

Your task:
Identify and return exactly one passage id: the single most relevant passage for answering the query.

General approach:
1) Read and understand the query. Determine precisely what is being asked (e.g., a specific phone number, email, resort name, action/requirement, or deadline date).
2) Evaluate ALL passages in context of the query. Compare them against each other, not just individually.

Relevance criteria (apply in this order):
- Exactness to the information need:
  - Prefer passages that explicitly and unambiguously state the exact answer the query seeks (e.g., the phone number itself, the exact email address “joe@joenation.com”, the resort name “Squaw/Squaw Valley”, the required action “meet and confer”, a spe

Average Metric: 4.00 / 5 (80.0%): 100%|██████████| 5/5 [00:03<00:00,  1.34it/s] 

2025/08/21 17:19:35 INFO dspy.evaluate.evaluate: Average Metric: 4.0 / 5 (80.0%)





2025/08/21 17:20:52 INFO dspy.teleprompt.gepa.gepa: Iteration 25: Proposed new text for reranker.predict: Task: Select the single passage that best answers the query and return exactly one passage ID (SearchResult.id as an integer).

Input format:
- query: A natural-language question that may include constraints such as:
  - Source attribution (e.g., “According to James Steffes’ email,” “Jeff Dasovich’s July 3, 2001 email,” subject line, date/time).
  - Exact answer type (single entity/name, date/month, specific time, “two things,” etc.).
  - Context qualifiers (issue or event, recipient/audience, timing qualifiers such as “first after July,” “following the rescheduled December 4 meeting,” etc.).
- search_results: A list of SearchResult objects with fields:
  - id: integer identifier (this is the only value you must output)
  - content: a summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (never use or return this)

Your job: Evaluate all passages against

Average Metric: 2.00 / 5 (40.0%): 100%|██████████| 5/5 [00:01<00:00,  2.91it/s] 

2025/08/21 17:20:57 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 5 (40.0%)





2025/08/21 17:23:27 INFO dspy.teleprompt.gepa.gepa: Iteration 26: Proposed new text for reranker.predict: Task: Given a query and a list of search_results (each with fields: id, content, dataset_id), identify and return exactly 1 passage ID (the id field) that is the single most relevant for answering the query.

Core principles:
- Read the query carefully and extract all key constraints (who, what, when, where, subject lines, organizations).
- Evaluate ALL passages against the query simultaneously.
- Base your judgment ONLY on the provided search_results content (which may be descriptive summaries of passages). Do not use outside knowledge.

Relevance criteria (apply in order):
1) Directness: Prefer passages that explicitly and unambiguously answer the specific question asked (e.g., listing the requested names, stating the requested task, or naming the city).
2) Constraint alignment: Prefer passages that match the query’s unique anchors, such as:
   - Named people (e.g., Jeremy Blachm

Average Metric: 3.00 / 5 (60.0%): 100%|██████████| 5/5 [00:03<00:00,  1.26it/s] 

2025/08/21 17:23:34 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)





2025/08/21 17:24:50 INFO dspy.teleprompt.gepa.gepa: Iteration 27: Proposed new text for reranker.predict: You are given a query and a list of passages (search_results). Your job is to select the single passage that best answers the query and return exactly one passage ID (the SearchResult.id integer). Do not include any explanation—output only the integer ID.

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not return this)

Core decision process (follow in order):
1) Parse the query and extract all constraints:
   - Source attribution: person and medium (e.g., “According to Sara

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:02<00:00,  1.92it/s] 

2025/08/21 17:25:10 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 17:26:58 INFO dspy.teleprompt.gepa.gepa: Iteration 28: Proposed new text for reranker.predict: Task: Identify and return the single most relevant passage ID that best answers the query.

Input format:
- query: A question to be answered using the passages.
- search_results: A list of SearchResult objects with fields:
  - id: integer (the passage ID you must return)
  - content: a concise description of what the passage says and how it relates to the query
  - dataset_id: string (metadata; do not use for selection)

What to return:
- Output exactly one integer: the id of the single best passage.
- Do not return explanations, labels, or multiple IDs.

How to select the best passage:
1) Parse the query to extract all critical constraints:
   - Who/what/where/when and any numbers or steps requested.
   - Named entities and qualifiers: people (e.g., Jeff Dasovich, Karen Denne, Mr. Glynn), organizations (e.g., Enron), legislative bodies and bill numbers (e.g., Assembly, SB 78, Edis

Average Metric: 1.00 / 5 (20.0%): 100%|██████████| 5/5 [00:04<00:00,  1.19it/s]

2025/08/21 17:27:05 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 5 (20.0%)





2025/08/21 17:28:46 INFO dspy.teleprompt.gepa.gepa: Iteration 29: Proposed new text for reranker.predict: Task
Given a query and a list of passages (SearchResult objects), analyze ALL passages simultaneously and select the single passage that best answers the query. Return exactly one passage ID (the integer id field), and nothing else.

Input format
- query: a natural-language question (often attributed to a specific person/email or tied to a specific event/date/case).
- search_results: a list of SearchResult objects with:
  - id: integer identifier to return
  - content: a short description of what the passage asserts in relation to the query (often meta-summaries like “The passage directly addresses the query by stating …”)
  - dataset_id: string identifier (do not return this)

Output format
- Exactly one integer: the id of the single best passage. No extra text, no reasoning.

General approach
1) Parse the query carefully and extract every constraint:
   - Who/attribution: e.g., “

Average Metric: 0.00 / 5 (0.0%): 100%|██████████| 5/5 [00:02<00:00,  2.15it/s]

2025/08/21 17:28:51 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)





2025/08/21 17:30:35 INFO dspy.teleprompt.gepa.gepa: Iteration 30: Proposed new text for reranker.predict: You are given a query and a list of passages (search_results). Your job is to select the single passage that best answers the query and return exactly one passage ID (the SearchResult.id integer). Do not include any explanation—output only the integer ID.

Input format:
- query: a natural-language question that may include constraints such as the source (e.g., “According to James Steffes’ email”), the scope (“the issue at hand”), or the exact type of answer (entity, date/month, list of two items).
- search_results: a list of SearchResult objects with fields:
  - id: the integer identifier you must return
  - content: a short summary/assessment of the passage’s relevance
  - dataset_id: a source identifier (do not return this)

Core decision process (follow in order):
1) Parse the query and extract all constraints:
   - Source attribution: person and medium (e.g., “According to Sara
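
The log above interleaves each proposed prompt with a minibatch score, which makes the trend hard to see at a glance. The following is a minimal sketch for tabulating those scores, assuming that the first `dspy.evaluate.evaluate` summary line after an `Iteration N: Proposed new text` line reports that proposal's score on its 5-example minibatch. That pairing is our reading of the log layout, not a documented guarantee. The helper name `minibatch_scores` and the abbreviated `sample_log` are illustrative.

```python
import re

# Summary lines written by the evaluator, e.g.
# "... INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)"
METRIC_RE = re.compile(r"dspy\.evaluate\.evaluate: Average Metric: ([\d.]+) / (\d+)")
# Start of a GEPA proposal, e.g. "... Iteration 13: Proposed new text for ..."
PROPOSAL_RE = re.compile(r"Iteration (\d+): Proposed new text")


def minibatch_scores(log_text):
    """Map each iteration number to the first evaluator score logged after it.

    Assumption: that score belongs to the prompt proposed in that iteration.
    """
    scores = {}
    current_iteration = None
    for line in log_text.splitlines():
        proposal = PROPOSAL_RE.search(line)
        if proposal:
            current_iteration = int(proposal.group(1))
            continue
        metric = METRIC_RE.search(line)
        if metric and current_iteration is not None:
            correct, total = float(metric.group(1)), int(metric.group(2))
            scores[current_iteration] = correct / total
            current_iteration = None  # ignore later scores until the next proposal
    return scores


# Abbreviated excerpt of the log above (prompt text shortened for the example).
sample_log = """\
2025/08/21 17:02:46 INFO dspy.teleprompt.gepa.gepa: Iteration 13: Proposed new text for reranker.predict: Task
2025/08/21 17:02:54 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 5 (60.0%)
2025/08/21 17:03:47 INFO dspy.teleprompt.gepa.gepa: Iteration 14: Proposed new text for reranker.predict: Task summary
2025/08/21 17:03:56 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 5 (0.0%)
"""

scores = minibatch_scores(sample_log)
for iteration, score in sorted(scores.items()):
    print(f"iteration {iteration}: {score:.0%}")
```

With only 5 examples per minibatch, each score moves in steps of 20 points, so single-iteration swings in the log are noisy and should not be over-interpreted.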

## 6. Save and Evaluate Optimized `BestMatchReranker`

In [25]:
# save optimized listwise reranker
optimized_reranker.save("gepa_optimized_best_match_reranker.json")
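
To read the optimized prompt without rerunning the notebook, the saved JSON file can be searched directly. The sketch below assumes only that the file is valid JSON and that the optimized prompt is stored as a string under a key named `instructions` somewhere in it. The exact layout is not shown in this notebook, so `find_instructions` walks the whole structure instead of relying on a fixed path, and `example_json` is a hypothetical stand-in for the real file.

```python
import json


def find_instructions(node, path=""):
    """Yield (path, text) for every string stored under an "instructions" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if key == "instructions" and isinstance(value, str):
                yield child_path, value
            else:
                yield from find_instructions(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_instructions(item, f"{path}[{index}]")


# Hypothetical stand-in for the saved state; the real file's layout may differ.
example_json = """
{
  "reranker.predict": {
    "demos": [],
    "signature": {"instructions": "Select the single best passage."}
  }
}
"""

found = list(find_instructions(json.loads(example_json)))
for location, text in found:
    print(location, "->", text[:60])

# To inspect the file written by the cell above:
# with open("gepa_optimized_best_match_reranker.json") as f:
#     found = list(find_instructions(json.load(f)))
```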

In [26]:
# evaluate optimized listwise reranker on the test set
dspy_evaluator_kwargs = {
    "num_threads": 5
}

evaluator(optimized_reranker, **dspy_evaluator_kwargs)

Average Metric: 17.00 / 38 (44.7%): 100%|██████████| 38/38 [00:21<00:00,  1.75it/s]

2025/08/21 17:36:19 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 38 (44.7%)





EvaluationResult(score=44.74, results=<list of 38 results>)
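
The reported score is the percentage of test queries answered correctly: 17 of 38, or 44.74%. As a minimal sketch of that arithmetic, assuming the metric awards 1 when the predicted `best_match_id` equals the gold passage id and 0 otherwise (the metric itself is defined earlier in the notebook), Recall @ 1 can be computed as follows. The function name and the toy ids are illustrative.

```python
def recall_at_1(predicted_ids, gold_ids):
    """Percentage of queries whose predicted best match equals the gold id."""
    if len(predicted_ids) != len(gold_ids):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_ids:
        raise ValueError("at least one example is required")
    hits = sum(int(p == g) for p, g in zip(predicted_ids, gold_ids))
    return 100.0 * hits / len(gold_ids)


# Toy, hypothetical ids: two of the four predictions match.
toy_score = recall_at_1([3, 7, 1, 4], [3, 2, 1, 9])

# The counts logged above: 17 correct out of 38 test examples.
reported_score = round(100.0 * 17 / 38, 2)
print(toy_score, reported_score)
```

With 38 test examples, one additional correct answer changes the score by roughly 2.6 points, which is worth keeping in mind when comparing runs.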



### Visualize Optimized Prompt

In [27]:
optimized_reranker.lm.inspect_history()





[2025-08-21T17:30:40.131122]

System message:

Your input fields are:
1. `query` (str): The user's question or information need
2. `search_results` (list[SearchResult]): List of passages to analyze. Each contains: id, text, initial_rank, and hybrid_score
Your output fields are:
1. `reasoning` (str): 
2. `best_match_id` (int): The ID of the single most relevant passage. Must match an ID from search_results.
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## query ## ]]
{query}

[[ ## search_results ## ]]
{search_results}

[[ ## reasoning ## ]]
{reasoning}

[[ ## best_match_id ## ]]
{best_match_id}        # note: the value you produce must be a single int value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are given a query and a list of passages (search_results). Your job is to select the single passage that best answers the query and return exactly one passage ID (the Sear