# Nova Customization

For those interested in learning more about Nova Customization, we recommend checking out [Nova Forge](https://aws.amazon.com/nova/forge/).

Nova Forge is a new service to build your own frontier models using Nova. Nova Forge customers can start their development from early model checkpoints, blend proprietary data with Amazon Nova-curated training data, and host their custom models securely on AWS.

# Nova Prompt Optimizer

This notebook illustrates how to use the open-source [Nova Prompt Optimizer](https://github.com/aws/nova-prompt-optimizer/tree/main) to optimize a RAG system.

The primary goal is to show how to use the `BedrockInferenceAdapter` with the `DatasetAdapter`, `MetricAdapter`, and `NovaPromptOptimizer`.

In this example, we tune the prompt so that the system responds with succinct answers. We are seeing many cases like this one, where prompt optimization guides the style of a system's answers to match a given application.

Our metric is the word count of the RAG system's response, which we will optimize with MIPROv2.
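
The metric cell later in this notebook scores each response as `1 / word_count`, so a shorter answer earns a higher score. A minimal, library-free sketch of that scoring idea (the function name `succinctness_score` and the two sample answers are illustrative assumptions):

```python
def succinctness_score(answer: str) -> float:
    """Return 1 / word_count, so shorter answers score higher."""
    # Single-space split, matching the metric defined later in the notebook.
    word_count = len(answer.split(" "))
    return 1 / float(word_count)


short_answer = "HyDE generates hypothetical documents."
long_answer = "HyDE is a method that generates hypothetical documents for retrieval."

print(succinctness_score(short_answer))
print(succinctness_score(long_answer))
```

The 4-word answer scores higher than the 10-word answer, which is the direction the optimizer is asked to push in.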

We start with the prompt:

```text
Assess the context and answer the question.
Answer the question as succinctly as you are able to.
```

With this prompt, answers have an average word count of 125.

We then optimize it into the following prompt, which answers the test questions with an average word count of 31:

```text
Task: Assess the context and answer the question succinctly.

Context:

- The user provides a context and a question.

Instructions:

- Answer the question based on the provided context.
- Ensure the response is as succinct as possible.

Any other section from Original Prompt:

- There are no other sections in the Original Prompt.

Response Format:

- The response MUST be succinct and directly address the question.
- DO NOT include unnecessary details or elaborate explanations.
```

# BedrockInferenceAdapter

In [29]:
from amzn_nova_prompt_optimizer.core.inference.adapter import BedrockInferenceAdapter

NOVA_MODEL_ID = "us.amazon.nova-pro-v1:0"

inference_adapter = BedrockInferenceAdapter(region_name="us-east-1")

response = inference_adapter.call_model(
    model_id=NOVA_MODEL_ID,
    system_prompt="You are a helpful assistant that can answer questions and help with tasks.",
    messages=[{"user": "What is the capital of the moon?"}],
    # "top_k" is set explicitly here; keep it under 128
    inf_config={
        "max_tokens": 1000,
        "temperature": 1,
        "top_p": 0.9,
        "top_k": 50
    }
)

print(response)

The Moon, Earth's natural satellite, does not have any form of government, cities, or a capital. It is an astronomical body without political or administrative structures. Therefore, the concept of a "capital" does not apply to the Moon.

However, if you're interested in the Moon from a scientific or exploratory perspective, various countries and space agencies, such as NASA, the European Space Agency (ESA), and the China National Space Administration (CNSA), have conducted missions to study and explore it. The Apollo missions by NASA, for example, were a series of human spaceflights that landed astronauts on the Moon between 1969 and 1972.
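
The inline comment in the cell above asks for `top_k` to stay under 128. A small guard could catch an out-of-range value before the request is sent. This is only a sketch: `validate_inf_config` is a hypothetical helper, not part of the Nova Prompt Optimizer, and treating the limit as strict (and requiring a positive value) is an assumption.

```python
TOP_K_LIMIT = 128  # taken from the notebook comment: "keep it under 128"


def validate_inf_config(inf_config: dict) -> dict:
    """Raise ValueError if top_k is present and outside 1..TOP_K_LIMIT - 1."""
    top_k = inf_config.get("top_k")
    # Assumption: the limit is exclusive and top_k must be positive.
    if top_k is not None and not (0 < top_k < TOP_K_LIMIT):
        raise ValueError(
            f"top_k must be between 1 and {TOP_K_LIMIT - 1}, got {top_k}"
        )
    return inf_config


inf_config = validate_inf_config(
    {"max_tokens": 1000, "temperature": 1, "top_p": 0.9, "top_k": 50}
)
print(inf_config["top_k"])
```

The validated dictionary can then be passed as `inf_config` to `call_model` unchanged.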


# Create Toy Dataset

In [5]:
import os

import weaviate
from weaviate.agents.query import QueryAgent

dataset = [
    {"question": "Can you provide a detailed comparison between HyDE and LameR, highlighting their respective strengths, weaknesses, and ideal use cases in modern information retrieval?"},
    {"question": "Please thoroughly explain the concept of Top-Down Partitioning Reranking, including its core algorithmic steps, advantages over other reranking strategies, and scenarios in which it is particularly effective."},
    {"question": "What are the most advanced and effective techniques currently employed in Agentic Search optimization using reinforcement learning, and how do they address existing challenges in this field?"},
    {"question": "To what extent has the Chain-of-Thought approach contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks? Please cite relevant studies or experiments."},
    {"question": "Could you survey the leading methods for LLM query expansion and discuss how the state-of-the-art approaches differ in methodology, effectiveness, and limits, especially for complex or open-ended queries?"},
]

def get_qa() -> QueryAgent:
    weaviate_client = weaviate.connect_to_weaviate_cloud(
        cluster_url=os.getenv("WEAVIATE_URL"),
        auth_credentials=weaviate.auth.AuthApiKey(api_key=os.getenv("WEAVIATE_API_KEY"))
    )
    return QueryAgent(
        client=weaviate_client,
        collections=["IRPapersText_Default"]
    )

def weaviate_search_mode_tool(
    qa: QueryAgent,
    question: str
) -> str:
    search_response = qa.search(
        question,
        limit=1
    )
    contexts = []
    for o in search_response.search_results.objects:
        contexts.append(o.properties["content"])

    return "\n".join(contexts)

qa = get_qa()

for row in dataset:
    row["context"] = weaviate_search_mode_tool(qa, row["question"])
    row["answer"] = ""
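
The loop above only needs a function that maps a question to a context string, so the same dataset-building step can be tried without a Weaviate cluster. A sketch with a hypothetical in-memory retriever follows; `toy_corpus`, `keyword_retriever`, and `attach_contexts` are illustrative assumptions and not part of Weaviate or the optimizer.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the indexed collection.
toy_corpus = {
    "hyde": "HyDE generates a hypothetical document and embeds it for retrieval.",
    "query expansion": "LLM query expansion adds generated terms to the original query.",
}


def keyword_retriever(question: str) -> str:
    """Return every passage whose key appears in the question (case-insensitive)."""
    lowered = question.lower()
    hits = [text for key, text in toy_corpus.items() if key in lowered]
    return "\n".join(hits)


def attach_contexts(
    rows: List[Dict[str, str]], retrieve: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Add a retrieved context and an empty answer to each row, in place."""
    for row in rows:
        row["context"] = retrieve(row["question"])
        row["answer"] = ""  # a length-only metric needs no reference answer
    return rows


rows = attach_contexts([{"question": "What is HyDE?"}], keyword_retriever)
print(rows[0]["context"])
```

Swapping `keyword_retriever` for a function that wraps `weaviate_search_mode_tool` would give the same row shape as the cell above.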


In [6]:
import pandas as pd

df = pd.DataFrame(dataset)
df.to_csv("dataset.csv", index=False)
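
Retrieved contexts can contain commas and newlines, so the CSV has to quote those fields for the rows to load back intact. pandas quotes such fields by default; the sketch below shows the same round trip with only the standard library, using an in-memory buffer and a made-up row as assumptions.

```python
import csv
import io

rows = [
    {
        "question": "What is HyDE?",
        "context": "Line one, with a comma.\nLine two.",
        "answer": "",
    }
]

# Write the rows to an in-memory CSV; fields with commas or newlines get quoted.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "context", "answer"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and confirm the multi-line context survived.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["context"] == rows[0]["context"])
```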


# CSVDatasetAdapter

In [7]:
from amzn_nova_prompt_optimizer.core.input_adapters.dataset_adapter import CSVDatasetAdapter

input_columns = {"question", "context"}
output_columns = {"answer"}

dataset_adapter = CSVDatasetAdapter(input_columns, output_columns)

# Adapt
dataset_adapter.adapt("dataset.csv")

train_set, test_set = dataset_adapter.split(0.5)
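
`split(0.5)` divides the five adapted rows into a training portion and a test portion. The evaluation log further down runs inference on 3 rows, which is consistent with 2 rows going to training. A sketch of such a fractional split, assuming a simple ordered cut (the library may shuffle or round differently, and `fractional_split` is a hypothetical helper):

```python
from typing import List, Sequence, Tuple


def fractional_split(rows: Sequence, train_fraction: float) -> Tuple[List, List]:
    """Cut the rows in order: the first part for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)  # truncates, so 5 * 0.5 becomes 2
    return list(rows[:cut]), list(rows[cut:])


train_rows, test_rows = fractional_split(list(range(5)), 0.5)
print(len(train_rows), len(test_rows))
```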

# TextPromptAdapter

In [8]:
from amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter import TextPromptAdapter

prompt_adapter = TextPromptAdapter()

prompt_adapter.set_user_prompt(
    content="""
    Assess the context and answer the question.
    Answer the question as succinctly as you are able to.
    Question: {question}
    Context: {context}
    """,
    variables={"question", "context"}
)

prompt_adapter.adapt()

2025/11/25 17:02:34 INFO amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter: System Prompt not set, initializing as empty string...


<amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter.TextPromptAdapter at 0x123fb3050>
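
The `{question}` and `{context}` placeholders in the user prompt are filled from each dataset row at inference time. How the adapter performs the substitution internally is not shown in this notebook; as an illustration of the idea, Python's `str.format` does the same kind of named substitution (the sample row is made up):

```python
template = (
    "Assess the context and answer the question.\n"
    "Answer the question as succinctly as you are able to.\n"
    "Question: {question}\n"
    "Context: {context}\n"
)

# Hypothetical dataset row with the two input columns.
row = {
    "question": "What is HyDE?",
    "context": "HyDE embeds a hypothetical document.",
}

filled = template.format(**row)
print(filled)
```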

# MetricAdapter

In [9]:
from amzn_nova_prompt_optimizer.core.input_adapters.metric_adapter import MetricAdapter
from typing import Any, Dict, List

class WordCountMetric(MetricAdapter):
    def _calculate_metrics(self, y_pred: Any, y_true: Any) -> Dict:
        """
        Calculates the number of words in the predicted answer (y_pred).
        Ignores y_true as it is not needed for a raw length count.
        """
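        # Split on single spaces only; words separated by newlines count as one.
        words = y_pred.split(" ")
        word_count = len(words)

        # Debug output: show each response as it is scored.
        print(y_pred)

        # Return the metrics.
        # 'score' is the value handed to the optimizer: shorter answers score higher.
        return {
            "word_count": word_count,
            "score": 1 / float(word_count)
        }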

    def apply(self, y_pred: Any, y_true: Any):
        return self._calculate_metrics(y_pred, y_true)

    def batch_apply(self, y_preds: List[Any], y_trues: List[Any]):
        # Calculate metrics for individual items
        evals = [self.apply(y_pred, y_true) for y_pred, y_true in zip(y_preds, y_trues)]
        
        # Calculate the average for all numeric metrics
        float_keys = [k for k, v in evals[0].items() if isinstance(v, (int, float, bool))]
        return {k: sum(e[k] for e in evals) / len(evals) for k in float_keys}

metric_adapter = WordCountMetric()
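
`str.split(" ")` separates only on single spaces, so two words separated by a newline are counted as one. The difference from the whitespace-aware `str.split()` shows up on multi-line answers. Which count is preferable is a judgment call; the results reported in this notebook use the single-space version. A short comparison on a made-up answer:

```python
answer = "HyDE is unsupervised.\nNo labels needed."

# Single-space split: "unsupervised.\nNo" stays together as one token.
single_space_count = len(answer.split(" "))

# Whitespace-aware split: the newline also separates words.
whitespace_count = len(answer.split())

print(single_space_count, whitespace_count)
```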

# Evaluator

In [11]:
from amzn_nova_prompt_optimizer.core.evaluation import Evaluator

evaluator = Evaluator(prompt_adapter, test_set, metric_adapter, inference_adapter)

In [12]:
original_prompt_score = evaluator.aggregate_score(model_id=NOVA_MODEL_ID)

print(f"Original Prompt Evaluation Score = {original_prompt_score}")

2025/11/25 17:02:34 INFO amzn_nova_prompt_optimizer.core.evaluation: Cache miss - Running new inference on Dataset
Running inference: 100%|██████████| 3/3 [00:04<00:00,  1.46s/it]
2025/11/25 17:02:38 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Batch Evaluation on Dataset, using `batch_apply` metric
2025/11/25 17:02:38 INFO amzn_nova_prompt_optimizer.core.evaluation: Using cached inference results
2025/11/25 17:02:38 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Evaluation on Dataset, using `apply` metric


LLM-based query expansion methods (e.g., Q2D, Q2E, CoT) generally outperform classical methods (e.g., BM25, Bo1) across datasets, with Chain-of-Thought (CoT) prompts showing particular promise. State-of-the-art approaches differ in their use of generated explanations and integration of Pseudo Relevance Feedback (PRF) documents, leading to improved performance metrics like Recall@1K. However, limitations include computational costs, focus on sparse retrieval, and specific prompt templates, suggesting areas for future research and optimization.
The most advanced and effective techniques currently employed in Agentic Search optimization using reinforcement learning, as demonstrated by the WebDancer project, include:

1. **Browsing Data Construction**: Acquiring high-quality, fine-grained browsing data that reflects diverse user intents and rich interaction contexts.
2. **Trajectories Sampling**: Constructing reliable trajectories that support long-horizon reasoning and task decomposition.

# NovaMetric

In [13]:
class WordCountNovaMetric(WordCountMetric):
    def apply(self, y_pred: Any, y_true: Any):
        # Calculate the metrics dictionary using the parent class logic
        metrics = self._calculate_metrics(y_pred, y_true)

        # Return only the scalar score (1 / word count), as this adapter interface expects
        return metrics["score"]

    def batch_apply(self, y_preds: List[Any], y_trues: List[Any]):
        # Left unimplemented: the optimizer appears to score examples one at a time via `apply`.
        pass

nova_metric_adapter = WordCountNovaMetric()

# NovaPromptOptimizer

In [14]:
from amzn_nova_prompt_optimizer.core.optimizers import NovaPromptOptimizer

nova_prompt_optimizer = NovaPromptOptimizer(
    prompt_adapter=prompt_adapter,
    inference_adapter=inference_adapter,
    dataset_adapter=train_set,
    metric_adapter=nova_metric_adapter,
)

optimized_prompt_adapter = nova_prompt_optimizer.optimize(mode="lite")

/opt/homebrew/lib/python3.11/site-packages/litellm/types/llms/anthropic.py:465: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  class AnthropicResponseContentBlockToolUse(BaseModel):
2025/11/25 17:02:40 INFO amzn_nova_prompt_optimizer.core.optimizers.nova_meta_prompter.nova_mp_optimizer: Optimizing prompt using Nova Meta Prompter with Model: us.amazon.nova-premier-v1:0
2025/11/25 17:02:42 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Using us.amazon.nova-lite-v1:0 for Evaluation
2025/11/25 17:02:42 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Using us.amazon.nova-premier-v1:0 for Prompting
2025/11/25 17:02:42 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.custom_adapters.custom_chat_adapter: Initializing CustomChatAdapter with enable_json_

Bootstrapping set 1/20
Bootstrapping set 2/20
Bootstrapping set 3/20


  0%|          | 0/1 [00:00<?, ?it/s]

**Comparison of HyDE and LameR:**

- **HyDE:**
  - **Strengths:** Unsupervised, no relevance labels needed, real-time generation, comparable performance at initial stages.
  - **Weaknesses:** High dependency on LLMs, potential latency issues, content bias.
  - **Ideal Use Cases:** Early stages of search system, novel queries, real-time relevance judgments.

- **LameR:**
  - **Strengths:** Not detailed in the context.
  - **Weaknesses:** Not detailed in the context.
  - **Ideal Use Cases:** Not detailed in the context.


  from .autonotebook import tqdm as notebook_tqdm
100%|██████████| 1/1 [00:01<00:00,  1.72s/it]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 4/20


100%|██████████| 1/1 [00:00<00:00, 2222.74it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/20


100%|██████████| 1/1 [00:00<00:00, 2565.32it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/20


100%|██████████| 1/1 [00:01<00:00,  1.40s/it]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/20


100%|██████████| 1/1 [00:00<00:00, 989.92it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/20


100%|██████████| 1/1 [00:01<00:00,  1.25s/it]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/20


100%|██████████| 1/1 [00:00<00:00, 3155.98it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 10/20


100%|██████████| 1/1 [00:00<00:00,  1.07it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 11/20


100%|██████████| 1/1 [00:00<00:00, 1161.54it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/20


100%|██████████| 1/1 [00:00<00:00,  3.88it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 13/20


100%|██████████| 1/1 [00:00<00:00,  2.62it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 14/20


100%|██████████| 1/1 [00:00<00:00,  3.05it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 15/20


100%|██████████| 1/1 [00:00<00:00,  3.58it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 16/20


100%|██████████| 1/1 [00:00<00:00,  3.69it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 17/20


100%|██████████| 1/1 [00:00<00:00,  3.31it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 18/20


100%|██████████| 1/1 [00:00<00:00, 2197.12it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 19/20


100%|██████████| 1/1 [00:00<00:00,  6.33it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 20/20


100%|██████████| 1/1 [00:00<00:00,  2.85it/s]
2025/11/25 17:02:49 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Entering patched_propose_instructions, patching GroundedProposer with NovaGroundedProposer
2025/11/25 17:02:49 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Patched GroundedProposer, current GroundedProposer class=<class 'amzn_nova_prompt_optimizer.core.optimizers.nova_prompt_optimizer.nova_grounded_proposer.NovaGroundedProposer'>
2025/11/25 17:02:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/11/25 17:02:49 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/11/25 17:02:49 INFO amzn_nova_prompt_optimizer.core.optimizers.nova_prompt_optimizer.nova_grounded_proposer: Initializing NovaGroundedPropose

Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.


2025/11/25 17:02:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=20 instructions...



[Nova] Selected tip: description
[Nova] Selected tip: simple
[Nova] Selected tip: rules_based
[Nova] Selected tip: format_control
[Nova] Selected tip: creative
[Nova] Selected tip: description
[Nova] Selected tip: format_control
[Nova] Selected tip: high_stakes
[Nova] Selected tip: format_control
[Nova] Selected tip: none
[Nova] Selected tip: none
[Nova] Selected tip: none
[Nova] Selected tip: structured_prompt
[Nova] Selected tip: description
[Nova] Selected tip: creative
[Nova] Selected tip: none
[Nova] Selected tip: rules_based
[Nova] Selected tip: description
[Nova] Selected tip: creative
[Nova] Selected tip: none


2025/11/25 17:06:44 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/11/25 17:06:44 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Task: Assess the context and answer the question succinctly.

Context:

- The user provides a context and a question.

Instructions:

- Answer the question based on the provided context.
- Ensure the response is as succinct as possible.

Any other section from Original Prompt:

- There are no other sections in the Original Prompt.

Response Format:

- The response MUST be succinct and directly address the question.
- DO NOT include unnecessary details or elaborate explanations.

2025/11/25 17:06:44 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given a context and a question, provide a succinct comparison of the methods mentioned.

2025/11/25 17:06:44 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Given the context and the question, generate a detailed comparison between HyDE and LameR, focusing on their strengths, weaknesses, an

  0%|          | 0/1 [00:00<?, ?it/s]The Chain-of-Thought approach has improved accuracy, reasoning capability, and generalization in Cross Encoder architectures, as evidenced by the performance gains in various benchmarks.
Average Metric: 0.04 / 1 (4.3%): 100%|██████████| 1/1 [00:00<00:00,  1.31it/s]

2025/11/25 17:06:45 INFO dspy.evaluate.evaluate: Average Metric: 0.043478260869565216 / 1 (4.3%)
2025/11/25 17:06:45 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 4.35

2025/11/25 17:06:45 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 37 - Minibatch ==



  0%|          | 0/1 [00:00<?, ?it/s]The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance across various prompting and aggregation methods, as evidenced by the NDCG@3 comparisons in Figure 2. For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in most cases.
  
- **Reasoning Capability**: The use of CoT helps large language models better understand and reason through complex queries, leading to more accurate and relevant responses. This is highlighted by the performance improvements in tasks like Conversational Query Rewriting.

- **Generalization**: The effectiveness of CoT is consistent across different datasets (CAsT-19, CAsT-20, CAsT-21), indicating its ability to generalize well. The relative improvements o

2025/11/25 17:06:47 INFO dspy.evaluate.evaluate: Average Metric: 0.006329113924050633 / 1 (0.6%)
2025/11/25 17:06:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.63 on minibatch of size 1 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 6'].
2025/11/25 17:06:47 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:06:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63]
2025/11/25 17:06:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
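
The scores in this log are reciprocals of word counts, and DSPy prints them as percentages (0.0435 is shown as 4.35). For an evaluation over a single example, inverting the raw value recovers the word count of the scored response. A small sketch, where `words_from_score` is an illustrative helper and not part of DSPy or the optimizer:

```python
def words_from_score(score: float) -> int:
    """Invert score = 1 / word_count for a single scored response."""
    return round(1 / score)


# Raw "Average Metric" values copied from the log lines above.
default_program_score = 0.043478260869565216
trial_2_score = 0.006329113924050633

print(words_from_score(default_program_score))
print(words_from_score(trial_2_score))
```

The default program's response was 23 words long, while the Trial 2 candidate produced a 158-word response, which is why its score is so much lower.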


2025/11/25 17:06:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 37 - Minibatch ==



  0%|          | 0/1 [00:00<?, ?it/s]The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance, as evidenced by the improvements in metrics like MRR (Mean Reciprocal Rank) and NDCG@3 (Normalized Discounted Cumulative Gain at 3) across various datasets (CAsT-19, CAsT-20, CAsT-21). For example, LLM4CS (RAR + Mean + CoT) significantly outperforms baselines in these metrics.

- **Reasoning Capability**: The CoT approach enhances the model's reasoning capability, leading to better performance in tasks requiring complex reasoning, such as query rewriting and relevance retrieval. This is shown in the improvements in REW (Rewrite) and RTR (Retrieve) metrics.

- **Generalization**: The effectiveness of CoT is consistent across different datasets and bench

2025/11/25 17:06:50 INFO dspy.evaluate.evaluate: Average Metric: 0.005952380952380952 / 1 (0.6%)
2025/11/25 17:06:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.6 on minibatch of size 1 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 4'].
2025/11/25 17:06:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:06:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6]
2025/11/25 17:06:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:06:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 37 - Minibatch ==



  0%|          | 0/1 [00:00<?, ?it/s]The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance, as evidenced by the improvements in NDCG@3 scores across different datasets and methods (CAsT-20 and CAsT-21 datasets in Figure 2).
  
- **Reasoning Capability**: The CoT approach helps large language models (LLMs) achieve a correct understanding of the task, leading to better performance in tasks like query rewriting and relevance retrieval.

- **Generalization**: The CoT method shows consistent improvements across various prompting and aggregation methods, indicating its effectiveness in generalizing across different benchmarks.

Relevant studies or experiments include the ablation results in Figure 2, which demonstrate the positive impact of CoT on 

2025/11/25 17:06:53 INFO dspy.evaluate.evaluate: Average Metric: 0.006578947368421052 / 1 (0.7%)
2025/11/25 17:06:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.66 on minibatch of size 1 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 13'].
2025/11/25 17:06:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:06:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66]
2025/11/25 17:06:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:06:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**:
   - The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics like MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

2. **Reasoning Capability**:
   - The CoT approach aids in guiding the large language model towards a correct understanding of the task, enhancing its reasoning capability. This is demonstrated by the performance improvements in tasks like query rewriting and relevance ranking.

3. **Generalization**:
   - The CoT-enhanced models show better generalization across different benchmarks, as seen in the consistent performance improvements across multiple datasets and metrics.

**Relevant Studies/Experiments**:
- 

2025/11/25 17:06:56 INFO dspy.evaluate.evaluate: Average Metric: 0.0055248618784530384 / 1 (0.6%)
2025/11/25 17:06:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.55 on minibatch of size 1 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 7'].
2025/11/25 17:06:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:06:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55]
2025/11/25 17:06:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:06:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics like MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21). For instance, LLM4CS, which utilizes CoT, significantly outperforms other baselines in most metrics.

2. **Reasoning Capability**: The CoT approach enhances the model's reasoning capability, enabling it to better understand and process complex queries. This is demonstrated by the relative improvements over human rewrites and the second-best results, particularly in the NDCG@3 metric.

3. **Generalization**: The effectiveness of CoT is consistent across different prompting and aggregation methods, as shown in Table 

2025/11/25 17:06:58 INFO dspy.evaluate.evaluate: Average Metric: 0.005208333333333333 / 1 (0.5%)
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.52 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 9'].
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52]
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 37 - Full Evaluation =====
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 0.66) from minibatch trials...



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance, as evidenced by the improvements in NDCG@3 scores across different datasets and methods (CAsT-20 and CAsT-21 datasets in Figure 2).
  
- **Reasoning Capability**: The CoT approach helps large language models (LLMs) achieve a correct understanding of the task, leading to better performance in tasks like query rewriting and relevance retrieval.

- **Generalization**: The CoT method shows consistent improvements across various prompting and aggregation methods, indicating its effectiveness in generalizing across different benchmarks.

Relevant studies or experiments include the ablation results in Figure 2, which demonstrate the positive impact of CoT on search performance. Additionally, the

2025/11/25 17:06:58 INFO dspy.evaluate.evaluate: Average Metric: 0.006578947368421052 / 1 (0.7%)
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66]
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:06:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance, as evidenced by the NDCG@3 comparisons in Figure 2. For example, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in various metrics, indicating enhanced accuracy.

- **Reasoning Capability**: The CoT approach aids the model in understanding and reasoning about queries more effectively, leading to better performance in tasks like query rewriting and relevance ranking. This is shown in the improvements in metrics like MRR and R@100.

- **Generalization**: The CoT method demonstrates consistent improvements across different datasets (CAsT-19, CAsT-20, CAsT-21), suggesting better generalization across benchmarks.

Relevant studies or ex

2025/11/25 17:07:00 INFO dspy.evaluate.evaluate: Average Metric: 0.006944444444444444 / 1 (0.7%)
2025/11/25 17:07:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.69 on minibatch of size 1 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 15'].
2025/11/25 17:07:00 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69]
2025/11/25 17:07:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:00 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

**Accuracy:** The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

**Reasoning Capability:** The CoT approach enhances the model's ability to understand and reason about complex queries, leading to better performance in tasks requiring nuanced comprehension, such as query rewriting and relevance ranking.

**Generalization:** The effectiveness of CoT is consistent across different prompting and aggregation methods, indicating its robustness and general applicability in various cross-encoder setups.

**Relevant Studies:** The study demonstrates that tailored CoT significantly outpe

2025/11/25 17:07:02 INFO dspy.evaluate.evaluate: Average Metric: 0.007142857142857143 / 1 (0.7%)
2025/11/25 17:07:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.71 on minibatch of size 1 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 17'].
2025/11/25 17:07:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71]
2025/11/25 17:07:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

**Accuracy:** The integration of CoT has led to notable performance improvements. For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in terms of MRR, NDCG@3, and R@100 across multiple datasets, as shown in Table 2. The relative improvements over human performance and the second-best results also highlight the effectiveness of CoT.

**Reasoning Capability:** The CoT approach enhances the model's ability to reason through complex queries. This is demonstrated in Table 3, where different prompting and aggregation methods incorporating CoT show better performance compared to methods without CoT.

**Generalization:** Across different datasets (CAsT-19, CAsT-20, CAsT-21), the CoT approach consistently impr

2025/11/25 17:07:05 INFO dspy.evaluate.evaluate: Average Metric: 0.005988023952095809 / 1 (0.6%)
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.6 on minibatch of size 1 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 9'].
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6]
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics like MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21). For instance, LLM4CS, which utilizes CoT, significantly outperforms other baselines in most metrics.

2. **Reasoning Capability**: The CoT approach enhances the model's reasoning capability, enabling it to better understand and process complex queries. This is demonstrated by the relative improvements over human rewrites and the second-best results, particularly in the NDCG@3 metric.

3. **Generalization**: The effectiveness of CoT is consistent across different prompting and aggregation methods, as shown in Table 3 and Figure 2. The improvements are 

2025/11/25 17:07:05 INFO dspy.evaluate.evaluate: Average Metric: 0.005208333333333333 / 1 (0.5%)
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.52 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 16'].
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52]
2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 37 - Minibatch ==



### Comparative Analysis: Chain-of-Thought Approach in Cross Encoder Architectures

**Strengths:**
1. **Improved Accuracy:** The Chain-of-Thought (CoT) approach has shown significant improvements in accuracy across various benchmarks. For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in terms of MRR, NDCG@3, and R@100 on CAsT datasets.
2. **Enhanced Reasoning Capability:** The CoT approach aids in guiding the model towards a more accurate understanding of the task, as evidenced by the improvements in search performance metrics.
3. **Better Generalization:** The CoT approach demonstrates consistent improvements across different datasets and evaluation metrics, indicating better generalization capabilities.

**Weaknesses:**
1. **Complexity:** Implementing CoT can increase the complexity of the model, potentially making it more resource-intensive.
2. **Dependency on Quality of Thought Process:** The effectiveness of CoT

2025/11/25 17:07:08 INFO dspy.evaluate.evaluate: Average Metric: 0.0034965034965034965 / 1 (0.3%)
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.35 on minibatch of size 1 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35]
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 37 - Full Evaluation =====
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 0.71) from minibatch trials...



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

**Accuracy:** The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

**Reasoning Capability:** The CoT approach enhances the model's ability to understand and reason about complex queries, leading to better performance in tasks requiring nuanced comprehension, such as query rewriting and relevance ranking.

**Generalization:** The effectiveness of CoT is consistent across different prompting and aggregation methods, indicating its robustness and general applicability in various cross-encoder setups.

**Relevant Studies:** The study demonstrates that tailored CoT significantly outperforms baselines, especially when com

2025/11/25 17:07:08 INFO dspy.evaluate.evaluate: Average Metric: 0.007142857142857143 / 1 (0.7%)
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71]
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics like MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21). For instance, LLM4CS, which utilizes CoT, significantly outperforms other baselines in most metrics.

2. **Reasoning Capability**: The CoT approach enhances the model's reasoning capability, enabling it to better understand and process complex queries. This is demonstrated by the relative improvements over human rewrites and the second-best results, particularly in the NDCG@3 metric.

3. **Generalization**: The effectiveness of CoT is consistent across different prompting and aggregation methods, as shown in Table 3 and Figure 2. The improvements are 

2025/11/25 17:07:08 INFO dspy.evaluate.evaluate: Average Metric: 0.005208333333333333 / 1 (0.5%)
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.52 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 17'].
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52]
2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 37 - Minibatch ==



### Chain-of-Thought Approach in Cross Encoder Architectures

**Contribution to Accuracy:**
The Chain-of-Thought (CoT) approach has significantly contributed to the accuracy of Cross Encoder architectures. Studies have shown that incorporating CoT leads to improved performance metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) across various benchmarks. For instance, the LLM4CS model, which utilizes CoT, achieved significant improvements in accuracy compared to baselines, especially in the CAsT-20 and CAsT-21 datasets.

**Contribution to Reasoning Capability:**
The CoT approach enhances the reasoning capability of Cross Encoders by enabling them to break down complex queries into simpler, more manageable steps. This step-by-step reasoning process allows the model to better understand and process the query, leading to more accurate and relevant responses. The results in Table 3 and Figure 2 demonstrate that m

2025/11/25 17:07:14 INFO dspy.evaluate.evaluate: Average Metric: 0.002824858757062147 / 1 (0.3%)
2025/11/25 17:07:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.28 on minibatch of size 1 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28]
2025/11/25 17:07:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in NDCG@3 scores across various datasets (CAsT-20 and CAsT-21). For example, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines, indicating that CoT enhances the model's ability to understand and respond correctly.

2. **Reasoning Capability**: The CoT approach aids in guiding the large language model towards a correct understanding, as demonstrated by the real example in Appendix B.1. This suggests that CoT enhances the model's reasoning capabilities, allowing it to better process and respond to queries.

3. **Generalization**: The improvements seen across different prompting and aggrega

2025/11/25 17:07:16 INFO dspy.evaluate.evaluate: Average Metric: 0.00558659217877095 / 1 (0.6%)
2025/11/25 17:07:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.56 on minibatch of size 1 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 17'].
2025/11/25 17:07:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56]
2025/11/25 17:07:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. Here are the key points:

1. **Accuracy**:
   - The incorporation of CoT generally improves search performance, as evidenced by the improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

2. **Reasoning Capability**:
   - CoT helps large language models (LLMs) achieve a more correct understanding of the task, as shown in the ablation results and the real example provided in the context. This is particularly evident in the performance improvements when using tailored CoT in conjunction with different prompting and aggregation methods.

3. **Generalization**:
   - The effectiveness of CoT is consistent across different datasets and benchmarks, indicating its ability to generalize we

2025/11/25 17:07:20 INFO dspy.evaluate.evaluate: Average Metric: 0.004524886877828055 / 1 (0.5%)
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.45 on minibatch of size 1 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 11'].
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45]
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

**Accuracy:** The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

**Reasoning Capability:** The CoT approach enhances the model's ability to understand and reason about complex queries, leading to better performance in tasks requiring nuanced comprehension, such as query rewriting and relevance ranking.

**Generalization:** The effectiveness of CoT is consistent across different prompting and aggregation methods, indicating its robustness and general applicability in various cross-encoder setups.

**Relevant Studies:** The study demonstrates that tailored CoT significantly outperforms baselines, especially when com

2025/11/25 17:07:20 INFO dspy.evaluate.evaluate: Average Metric: 0.007142857142857143 / 1 (0.7%)
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.71 on minibatch of size 1 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 17'].
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71]
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 37 - Full Evaluation =====
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 0.69) from minibatch trials...



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

- **Accuracy**: The incorporation of CoT generally improves search performance, as evidenced by the NDCG@3 comparisons in Figure 2. For example, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in various metrics, indicating enhanced accuracy.

- **Reasoning Capability**: The CoT approach aids the model in understanding and reasoning about queries more effectively, leading to better performance in tasks like query rewriting and relevance ranking. This is shown in the improvements in metrics like MRR and R@100.

- **Generalization**: The CoT method demonstrates consistent improvements across different datasets (CAsT-19, CAsT-20, CAsT-21), suggesting better generalization across benchmarks.

Relevant studies or experiments include the performance com

2025/11/25 17:07:20 INFO dspy.evaluate.evaluate: Average Metric: 0.006944444444444444 / 1 (0.7%)
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69]
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:07:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The incorporation of CoT has led to improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets. For example, LLM4CS, which utilizes CoT, shows significant improvements over baselines in the CAsT datasets.

2. **Reasoning Capability**: The CoT approach enhances the model's ability to reason through complex queries, leading to better performance in tasks requiring multi-step reasoning. This is evident in the improvements seen in the NDCG@3 scores when CoT is applied.

3. **Generalization**: The effectiveness of CoT is consistent across different datasets and benchmarks, indicating its generalizability. The relative improvements over human rewrites and the second-best results further underscore th

2025/11/25 17:07:24 INFO dspy.evaluate.evaluate: Average Metric: 0.005714285714285714 / 1 (0.6%)
2025/11/25 17:07:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.57 on minibatch of size 1 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 12'].
2025/11/25 17:07:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57]
2025/11/25 17:07:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 37 - Minibatch ==



### Chain-of-Thought Approach in Cross Encoder Architectures

The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. Below is a detailed comparison based on the provided context and relevant studies.

#### **Strengths of the Chain-of-Thought Approach**

1. **Improved Accuracy**:
   - **Evidence from Tables**: The incorporation of CoT has shown consistent improvements across various benchmarks. For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in terms of MRR, NDCG@3, and R@100 on the CAsT-19, CAsT-20, and CAsT-21 datasets.
   - **Relative Improvements**: The RI-H and RI-2nd-Best metrics indicate that CoT-enhanced models outperform human baselines and other state-of-the-art models by a considerable margin.

2. **Enhanced Reasoning Capability**:
   - **Figu

2025/11/25 17:07:30 INFO dspy.evaluate.evaluate: Average Metric: 0.002044989775051125 / 1 (0.2%)
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.2 on minibatch of size 1 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2]
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

**Accuracy:** The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics such as MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21).

**Reasoning Capability:** The CoT approach enhances the model's ability to understand and reason about complex queries, leading to better performance in tasks requiring nuanced comprehension, such as query rewriting and relevance ranking.

**Generalization:** The effectiveness of CoT is consistent across different prompting and aggregation methods, indicating its robustness and general applicability in various cross-encoder setups.

**Relevant Studies:** The study demonstrates that tailored CoT significantly outperforms baselines, especially when com

2025/11/25 17:07:30 INFO dspy.evaluate.evaluate: Average Metric: 0.007142857142857143 / 1 (0.7%)
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.71 on minibatch of size 1 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 6'].
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71]
2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]

2025/11/25 17:07:31 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 1.27 on minibatch of size 1 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 14'].
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27]
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:00<00:00, 2150.93it/s]

2025/11/25 17:07:31 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 1.27 on minibatch of size 1 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 11'].
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27]
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 37 - Full Evaluation =====
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 1.27) from minibatch trials...



The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:00<00:00, 2101.35it/s]

2025/11/25 17:07:31 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27]
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:07:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:00<00:00,  1.64it/s]

2025/11/25 17:07:32 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 1.27 on minibatch of size 1 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 14'].
2025/11/25 17:07:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27]
2025/11/25 17:07:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:00<00:00,  1.83it/s]

2025/11/25 17:07:33 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 1.27 on minibatch of size 1 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 11'].
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27]
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 28 / 37 - Minibatch ==



The Chain-of-Thought approach has improved accuracy, reasoning capability, and generalization in Cross Encoder architectures, as evidenced by the performance gains in various benchmarks.
Average Metric: 0.04 / 1 (4.3%): 100%|██████████| 1/1 [00:00<00:00,  1.96it/s]

2025/11/25 17:07:33 INFO dspy.evaluate.evaluate: Average Metric: 0.043478260869565216 / 1 (4.3%)
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 4.35 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35]
2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 29 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks.

**Accuracy:** The CoT approach has shown consistent improvements in various metrics such as MRR, NDCG@3, and R@100 across different datasets (CAsT-19, CAsT-20, CAsT-21). For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in most metrics, indicating enhanced accuracy.

**Reasoning Capability:** The CoT approach aids in guiding the large language model towards a correct understanding of the task, leading to better performance. This is evident from the ablation results showing that incorporating CoT generally improves search performance.

**Generalization:** The effectiveness of CoT is demonstrated across different datasets and benchmarks, indicating its ability to generalize well. The improvements seen

2025/11/25 17:07:36 INFO dspy.evaluate.evaluate: Average Metric: 0.00558659217877095 / 1 (0.6%)
2025/11/25 17:07:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.56 on minibatch of size 1 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 14'].
2025/11/25 17:07:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56]
2025/11/25 17:07:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:36 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 30 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks.

**Accuracy**: The integration of CoT has led to notable improvements in performance metrics such as Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG@3), and Recall@100 (R@100). For instance, the LLM4CS model, which employs CoT, achieved significant improvements in these metrics across various datasets (CAsT-19, CAsT-20, CAsT-21), outperforming other baselines.

**Reasoning Capability**: The CoT approach enhances the model's reasoning capability by guiding the large language model (LLM) to understand and process queries more effectively. This is evident in the performance gains observed when CoT is incorporated into different prompting and aggregation methods, as shown in Table 2 and Figure 2. The CoT method 

2025/11/25 17:07:40 INFO dspy.evaluate.evaluate: Average Metric: 0.0037313432835820895 / 1 (0.4%)
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.37 on minibatch of size 1 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37]
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 31 / 37 - Full Evaluation =====
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 4.35) from minibatch trials...



The Chain-of-Thought approach has improved accuracy, reasoning capability, and generalization in Cross Encoder architectures, as evidenced by the performance gains in various benchmarks.
Average Metric: 0.04 / 1 (4.3%): 100%|██████████| 1/1 [00:00<00:00, 2445.66it/s]

2025/11/25 17:07:40 INFO dspy.evaluate.evaluate: Average Metric: 0.043478260869565216 / 1 (4.3%)
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35]
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 32 / 37 - Minibatch ==



The Chain-of-Thought approach has improved accuracy, reasoning capability, and generalization in Cross Encoder architectures, as evidenced by the performance gains in various benchmarks.
Average Metric: 0.04 / 1 (4.3%): 100%|██████████| 1/1 [00:00<00:00, 1908.24it/s]

2025/11/25 17:07:40 INFO dspy.evaluate.evaluate: Average Metric: 0.043478260869565216 / 1 (4.3%)
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 4.35 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35]
2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 33 / 37 - Minibatch ==



### Chain-of-Thought Approach in Cross Encoder Architectures

The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. Below is a detailed comparison highlighting the strengths, weaknesses, and ideal use cases of the CoT approach.

#### Strengths

1. **Accuracy**:
   - **Improved Performance**: Studies have shown that incorporating CoT into Cross Encoder architectures leads to improved performance metrics such as Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Recall at 100 (R@100). For instance, LLM4CS (RAR + Mean + CoT) significantly outperforms other baselines in multiple datasets (p < 0.05 level).
   - **Benchmark Results**: The table comparisons (Table 2 and Table 3) indicate that models utilizing CoT achieve higher scores in various metrics, demonstrati

2025/11/25 17:07:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0020242914979757085 / 1 (0.2%)
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.2 on minibatch of size 1 with parameters ['Predictor 0: Instruction 19', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35, 0.2]
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 34 / 37 - Minibatch ==



The Chain-of-Thought approach has improved accuracy, reasoning capability, and generalization in Cross Encoder architectures, as evidenced by the performance gains in various benchmarks.
Average Metric: 0.04 / 1 (4.3%): 100%|██████████| 1/1 [00:00<00:00, 2486.25it/s]

2025/11/25 17:07:47 INFO dspy.evaluate.evaluate: Average Metric: 0.043478260869565216 / 1 (4.3%)
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 4.35 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35, 0.2, 4.35]
2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 35 / 37 - Minibatch ==



### Chain-of-Thought Approach in Cross Encoder Architectures

**Contribution to Measurable Improvements:**

- **Accuracy:** 
  - Studies show that the Chain-of-Thought (CoT) approach significantly enhances the accuracy of Cross Encoder architectures. For instance, the integration of CoT in models like LLM4CS has shown marked improvements in metrics such as MRR, NDCG@3, and R@100 across various benchmarks (CAsT-19, CAsT-20, CAsT-21).

- **Reasoning Capability:** 
  - CoT improves the reasoning capability of models by providing a structured way to break down complex problems into simpler, more manageable steps. This structured approach helps models to better understand and reason through the input data, leading to more accurate and contextually relevant outputs.

- **Generalization Across Benchmarks:**
  - The CoT approach has been shown to enhance generalization across different benchmarks. For example, the relative improvements over human baselines

2025/11/25 17:07:50 INFO dspy.evaluate.evaluate: Average Metric: 0.0030581039755351682 / 1 (0.3%)
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.31 on minibatch of size 1 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 0'].
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35, 0.2, 4.35, 0.31]
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 36 / 37 - Minibatch ==



The Chain-of-Thought (CoT) approach has significantly contributed to measurable improvements in Cross Encoder architectures, particularly in terms of accuracy, reasoning capability, and generalization across benchmarks. 

1. **Accuracy**: The study shows that incorporating CoT generally improves search performance, as evidenced by the improvements in metrics like MRR, NDCG@3, and R@100 across various datasets (CAsT-19, CAsT-20, CAsT-21). For instance, LLM4CS, which utilizes CoT, significantly outperforms other baselines in most metrics.

2. **Reasoning Capability**: The CoT approach enhances the model's reasoning capability, enabling it to better understand and process complex queries. This is demonstrated by the relative improvements over human rewrites and the second-best results, particularly in the NDCG@3 metric.

3. **Generalization**: The effectiveness of CoT is consistent across different prompting and aggregation methods, as shown in Table 3 and Figure 2. The improvements are 

2025/11/25 17:07:50 INFO dspy.evaluate.evaluate: Average Metric: 0.005208333333333333 / 1 (0.5%)
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 0.52 on minibatch of size 1 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 7'].
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35, 0.2, 4.35, 0.31, 0.52]
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35


2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 37 / 37 - Full Evaluation =====
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 1.27) from minibatch tri


The Chain-of-Thought (CoT) approach has significantly improved the accuracy, reasoning capability, and generalization of Cross Encoder architectures. Studies show that incorporating CoT generally enhances search performance across various prompting and aggregation methods, as evidenced by improvements in metrics like MRR, NDCG@3, and R@100. For instance, LLM4CS (RAR + Mean + CoT) outperforms other baselines, indicating a notable boost in performance. This approach aids in guiding the model towards better understanding and more accurate responses, particularly in complex reasoning tasks.
Average Metric: 0.01 / 1 (1.3%): 100%|██████████| 1/1 [00:00<00:00, 1953.56it/s]

2025/11/25 17:07:50 INFO dspy.evaluate.evaluate: Average Metric: 0.012658227848101266 / 1 (1.3%)
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [4.35, 0.63, 0.6, 0.66, 0.55, 0.52, 0.66, 0.69, 0.71, 0.6, 0.52, 0.35, 0.71, 0.52, 0.28, 0.56, 0.45, 0.71, 0.69, 0.57, 0.2, 0.71, 1.27, 1.27, 1.27, 1.27, 1.27, 4.35, 0.56, 0.37, 4.35, 4.35, 0.2, 4.35, 0.31, 0.52, 1.27]
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 4.35
2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/25 17:07:50 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 4.35!
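
The scores in the log above are consistent with a metric equal to the reciprocal of the response's word count, which the optimizer reports multiplied by 100: the 23-word answer scores 1/23 ≈ 0.0435 and is logged as 4.35. Below is a minimal sketch of such a metric, assuming whitespace tokenization. The function name `succinctness_score` is hypothetical, and this is not the notebook's actual `MetricAdapter` implementation.

```python
def succinctness_score(response: str) -> float:
    """Return 1 / word_count, so shorter responses score higher."""
    word_count = len(response.split())
    # Assumed behaviour: an empty response scores 0.0 instead of dividing by zero.
    return 1.0 / word_count if word_count else 0.0


# The best-scoring response from the optimization log (23 words).
best = (
    "The Chain-of-Thought approach has improved accuracy, reasoning capability, "
    "and generalization in Cross Encoder architectures, as evidenced by the "
    "performance gains in various benchmarks."
)
print(round(succinctness_score(best) * 100, 2))  # 4.35
```

A reciprocal keeps the metric in the range (0, 1] and rewards every word removed, which matches the goal of steering the prompt toward short answers.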





# Visualize, Evaluate, and Save Optimized Prompt

In [15]:
optimized_prompt_adapter.show()

2025/11/25 17:07:50 INFO amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter: 
Standardized Prompt:


{
  "user_prompt": {
    "variables": [
      "context",
      "question"
    ],
    "template": "Assess the context and answer the question succinctly.\nQuestion: [{{question}}]\nContext: [{{context}}]",
    "metadata": {
      "format": "text"
    }
  },
  "system_prompt": {
    "variables": [],
    "template": "Task: Assess the context and answer the question succinctly.\n\nContext:\n\n- The user provides a context and a question.\n\nInstructions:\n\n- Answer the question based on the provided context.\n- Ensure the response is as succinct as possible.\n\nAny other section from Original Prompt:\n\n- There are no other sections in the Original Prompt.\n\nResponse Format:\n\n- The response MUST be succinct and directly address the question.\n- DO NOT include unnecessary details or elaborate explanations.",
    "metadata": {
      "format": "text"
    }
  }
}


In [16]:
print(optimized_prompt_adapter.system_prompt)

Task: Assess the context and answer the question succinctly.

Context:

- The user provides a context and a question.

Instructions:

- Answer the question based on the provided context.
- Ensure the response is as succinct as possible.

Any other section from Original Prompt:

- There are no other sections in the Original Prompt.

Response Format:

- The response MUST be succinct and directly address the question.
- DO NOT include unnecessary details or elaborate explanations.


In [17]:
print(optimized_prompt_adapter.user_prompt)

Assess the context and answer the question succinctly.
Question: [{{question}}]
Context: [{{context}}]
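
The `{{question}}` and `{{context}}` markers above are placeholders that are filled in at inference time. Below is a minimal sketch of rendering such a template with plain string replacement, which leaves any other braces in the template untouched. `render_prompt` is a hypothetical helper for illustration, not part of the Nova Prompt Optimizer API.

```python
def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill each {{name}} placeholder in the template with its value."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace("{{" + name + "}}", value)
    return rendered


template = (
    "Assess the context and answer the question succinctly.\n"
    "Question: [{{question}}]\n"
    "Context: [{{context}}]"
)
filled = render_prompt(
    template,
    {"question": "What is TDPart?", "context": "TDPart is a reranking method."},
)
print(filled)
```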


In [18]:
from amzn_nova_prompt_optimizer.core.evaluation import Evaluator

evaluator = Evaluator(optimized_prompt_adapter, test_set, metric_adapter, inference_adapter)

nova_prompt_optimizer_evaluation_score = evaluator.aggregate_score(model_id=NOVA_MODEL_ID)

print(f"Nova Prompt Optimizer = {nova_prompt_optimizer_evaluation_score}")

2025/11/25 17:07:50 INFO amzn_nova_prompt_optimizer.core.evaluation: Cache miss - Running new inference on Dataset
Running inference: 100%|██████████| 3/3 [00:02<00:00,  1.44it/s]
2025/11/25 17:07:53 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Batch Evaluation on Dataset, using `batch_apply` metric
2025/11/25 17:07:53 INFO amzn_nova_prompt_optimizer.core.evaluation: Using cached inference results
2025/11/25 17:07:53 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Evaluation on Dataset, using `apply` metric


LLM-based QE methods, especially Chain-of-Thought prompts, show promise with improved Recall@1K over classical methods. However, they face limitations in computational cost and applicability to dense retrieval systems.
Advanced techniques include browsing data construction, trajectories sampling, supervised fine-tuning, and reinforcement learning. They address challenges by enhancing data quality, supporting long-horizon reasoning, and improving scalability and generalization.
Top-Down Partitioning Reranking (TDPart) is an efficient list-wise reranking approach that finds a pivot at cutoff k within a top-w window, then compares subsequent windows to this pivot, reducing computation via early stopping and no stride, and is particularly effective for precision-focused top-k document retrieval.
LLM-based QE methods, especially Chain-of-Thought prompts, show promise with improved Recall@1K over classical methods. However, they face limitations in computational cost and applicability to den

In [19]:
optimized_prompt_adapter.save("optimized_prompt/")

# Use Optimized RAG System

In [27]:
class RAGSystem:
    """Retrieve context with a Weaviate QueryAgent, then answer with a Nova model via Bedrock."""

    def __init__(
        self,
        qa: weaviate.agents.query.query_agent.QueryAgent,
        inference_adapter: BedrockInferenceAdapter,
        system_prompt: str,
        user_prompt: str,
    ):
        self.qa = qa
        self.inference_adapter = inference_adapter
        self.system_prompt = system_prompt  
        self.user_prompt = user_prompt

    def __call__(self, question: str) -> str:
        search_response = self.qa.search(
            question,
            limit=1
        )
        contexts = []
        for o in search_response.search_results.objects:
            contexts.append(o.properties["content"])

        context_str = "\n".join(contexts)
        # The saved template marks its variables with double braces ({{question}}, {{context}});
        # convert them to single braces so str.format can fill them in.
        prompt_to_use = self.user_prompt.replace("{{", "{").replace("}}", "}")
        messages = [
            {"user": prompt_to_use.format(question=question, context=context_str)}
        ]

        response = self.inference_adapter.call_model(
            model_id=NOVA_MODEL_ID,
            system_prompt=self.system_prompt,
            messages=messages,
            # Inference parameters; keep "top_k" under 128.
            inf_config={
                "max_tokens": 100,
                "temperature": 1,
                "top_p": 0.9,
                "top_k": 50,
            },
        )
        return response

rag_system = RAGSystem(
    qa=qa,
    inference_adapter=inference_adapter,
    system_prompt=optimized_prompt_adapter.system_prompt,
    user_prompt=optimized_prompt_adapter.user_prompt
)

rag_system("What is Top-Down Partitioning Reranking?")

'Top-Down Partitioning Reranking is a method that initially searches over a top-w window to find a pivot at cutoff k before searching to depth D, comparing each window to the pivot.'