### Set Up AWS Credentials

In [None]:
import os
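# Set your AWS access key and secret key as environment variables.
# Replace the placeholder strings below with your own values.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"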

In [1]:
# Set the Nova model ID used for evaluation
NOVA_MODEL_ID = "us.amazon.nova-pro-v1:0"

### Dataset Adapter

Initialize the Dataset Adapter with the `input_columns` and `output_columns`. We use the `JSONDatasetAdapter` to read a `.jsonl` file and adapt it to the standardized format, then use the same adapter to create the train and test sets for our use case.

In [2]:
from amzn_nova_prompt_optimizer.core.input_adapters.dataset_adapter import JSONDatasetAdapter

input_columns = {"input"}
output_columns = {"answer"}

dataset_adapter = JSONDatasetAdapter(input_columns, output_columns)

# Adapt
dataset_adapter.adapt("../data/FacilitySupportAnalyzer.jsonl")

train_set, test_set = dataset_adapter.split(0.5)
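To make the adapter's role concrete, the following is a minimal standalone sketch of the same idea: read records from a `.jsonl` file, keep only the declared columns, and split the result. It does not use or reproduce `JSONDatasetAdapter`. The helpers `load_jsonl` and `split_records` are hypothetical, and the sequential (unshuffled) split is an assumption made for illustration.

```python
import json
import os
import tempfile
from typing import Dict, List, Set, Tuple


def load_jsonl(path: str, columns: Set[str]) -> List[Dict]:
    """Hypothetical helper: read one JSON object per line, keep only `columns`."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            records.append({k: row[k] for k in columns})
    return records


def split_records(records: List[Dict], train_fraction: float) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical helper: sequential split (assumption: no shuffling)."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Build a tiny throwaway .jsonl file so the sketch runs on its own.
rows = [{"input": f"email {i}", "answer": "{}", "extra": i} for i in range(4)]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for r in rows:
        tmp.write(json.dumps(r) + "\n")
    demo_path = tmp.name

demo_records = load_jsonl(demo_path, {"input", "answer"})
os.remove(demo_path)

demo_train, demo_test = split_records(demo_records, 0.5)
print(len(demo_train), len(demo_test))
```

Only the `input` and `answer` keys survive loading, which mirrors how the column sets passed to the adapter decide what the later steps see.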

### Prompt Adapter

Initialize the Prompt Adapter for the Original Prompt. For this example, we use the FacilitySupportAnalyzer User Prompt in the `.txt` format. 

In [3]:
from amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter import TextPromptAdapter

prompt_variables = input_columns

prompt_adapter = TextPromptAdapter()

prompt_adapter.set_user_prompt(file_path="original_prompt/user_prompt_as_template.txt", variables=prompt_variables)

# Adapt
prompt_adapter.adapt()

2025/07/02 20:45:28 INFO amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter: System Prompt not set, initializing as empty string...


<amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter.TextPromptAdapter at 0x7fa285314dd0>

### Metric Adapter

Initialize the Metric Adapter, which scores the prompt's outputs and is also used by certain optimizers. For this example, we build a custom metric for the FacilitySupportAnalyzer dataset. A metric adapter must implement `apply` (single-row evaluation) or `batch_apply` (evaluation of the whole dataset together).

In [4]:
from amzn_nova_prompt_optimizer.core.input_adapters.metric_adapter import MetricAdapter
from typing import List, Any, Dict
import re
import json

class FacilitySupportAnalyzerMetric(MetricAdapter):
    def parse_json(self, input_string: str):
        """
        Attempts to parse the given string as JSON. If direct parsing fails,
        it tries to extract a JSON snippet from code blocks formatted as:
            ```json
            ... JSON content ...
            ```
        or any code block delimited by triple backticks and then parses that content.
        """
        try:
            return json.loads(input_string)
        except json.JSONDecodeError as err:
            error = err

        patterns = [
            re.compile(r"```json\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE),
            re.compile(r"```(.*?)```", re.DOTALL)
        ]

        for pattern in patterns:
            match = pattern.search(input_string)
            if match:
                json_candidate = match.group(1).strip()
                try:
                    return json.loads(json_candidate)
                except json.JSONDecodeError:
                    continue

        raise error

    def _calculate_metrics(self, y_pred: Any, y_true: Any) -> Dict:
        strict_json = False
        result = {
            "is_valid_json": False,
            "correct_categories": 0.0,
            "correct_sentiment": False,
            "correct_urgency": False,
        }

        try:
            y_true = y_true if isinstance(y_true, dict) else (json.loads(y_true) if strict_json else self.parse_json(y_true))
            y_pred = y_pred if isinstance(y_pred, dict) else (json.loads(y_pred) if strict_json else self.parse_json(y_pred))
        except json.JSONDecodeError:
            result["total"] = 0
            return result  # Return result with is_valid_json = False
        else:
            if isinstance(y_pred, str):
                result["total"] = 0
                return result  # Return result with is_valid_json = False
            result["is_valid_json"] = True

            categories_true = y_true.get("categories", {})
            categories_pred = y_pred.get("categories", {})

            if isinstance(categories_true, dict) and isinstance(categories_pred, dict):
                correct = sum(
                    categories_true.get(k, False) == categories_pred.get(k, False)
                    for k in categories_true
                )
                result["correct_categories"] = correct / len(categories_true) if categories_true else 0.0
            else:
                result["correct_categories"] = 0.0  # or raise an error if you prefer

            result["correct_sentiment"] = y_pred.get("sentiment", "") == y_true.get("sentiment", "")
            result["correct_urgency"] = y_pred.get("urgency", "") == y_true.get("urgency", "")

        # Compute overall metric score
        result["total"] = sum(
            float(result[k]) for k in ["correct_categories", "correct_sentiment", "correct_urgency"]
        ) / 3.0

        return result

    def apply(self, y_pred: Any, y_true: Any):
        return self._calculate_metrics(y_pred, y_true)

    def batch_apply(self, y_preds: List[Any], y_trues: List[Any]):
        evals = [self.apply(y_pred, y_true) for y_pred, y_true in zip(y_preds, y_trues)]
        float_keys = [k for k, v in evals[0].items() if isinstance(v, (int, float, bool))]
        return {k: sum(e[k] for e in evals) / len(evals) for k in float_keys}

metric_adapter = FacilitySupportAnalyzerMetric()
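As a worked example of the scoring rule in `_calculate_metrics`, the sketch below re-implements just the arithmetic on already-parsed dictionaries, without the library base class. The function name `score_row` and the category names `cat_a` to `cat_d` are hypothetical and stand in for the dataset's real category tags.

```python
from typing import Any, Dict


def score_row(y_pred: Dict[str, Any], y_true: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical standalone version of the per-row scoring arithmetic."""
    cats_true = y_true.get("categories", {})
    cats_pred = y_pred.get("categories", {})

    # Fraction of ground-truth category flags that the prediction matches.
    matches = sum(cats_true.get(k, False) == cats_pred.get(k, False) for k in cats_true)
    correct_categories = matches / len(cats_true) if cats_true else 0.0

    correct_sentiment = float(y_pred.get("sentiment", "") == y_true.get("sentiment", ""))
    correct_urgency = float(y_pred.get("urgency", "") == y_true.get("urgency", ""))

    # The overall score is the plain mean of the three components.
    total = (correct_categories + correct_sentiment + correct_urgency) / 3.0
    return {
        "correct_categories": correct_categories,
        "correct_sentiment": correct_sentiment,
        "correct_urgency": correct_urgency,
        "total": total,
    }


y_true = {
    "categories": {"cat_a": True, "cat_b": False, "cat_c": False, "cat_d": False},
    "sentiment": "negative",
    "urgency": "high",
}
y_pred = {
    "categories": {"cat_a": True, "cat_b": True, "cat_c": False, "cat_d": False},
    "sentiment": "negative",
    "urgency": "low",
}

result = score_row(y_pred, y_true)
print(result)
```

Here three of four category flags match (0.75), the sentiment matches (1.0), and the urgency does not (0.0), so `total` is 1.75 / 3. A prediction that cannot be parsed as JSON never reaches this arithmetic in the class above and scores 0.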

### Inference Adapter
Initialize the InferenceAdapter to choose the inference backend. Currently, only `BedrockInferenceAdapter` is supported.

In [5]:
from amzn_nova_prompt_optimizer.core.inference.adapter import BedrockInferenceAdapter

inference_adapter = BedrockInferenceAdapter(region_name="us-east-1")

### Evaluator

The Evaluator can use the metric_adapter, prompt_adapter, and dataset_adapter to evaluate the prompt given the `model_id` to produce an evaluation score. The Evaluator internally uses the `InferenceRunner` to first generate inference results and then evaluate the output.

#### Base Model Evaluation

In [7]:
from amzn_nova_prompt_optimizer.core.evaluation import Evaluator

evaluator = Evaluator(prompt_adapter, test_set, metric_adapter, inference_adapter)

In [8]:
original_prompt_score = evaluator.aggregate_score(model_id=NOVA_MODEL_ID)

print(f"Original Prompt Evaluation Score = {original_prompt_score}")

2025/07/02 04:35:03 INFO amzn_nova_prompt_optimizer.core.evaluation: Cache miss - Running new inference on Dataset
Running inference: 100%|██████████████████████| 100/100 [00:45<00:00,  2.21it/s]
2025/07/02 04:35:48 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Batch Evaluation on Dataset, using `batch_apply` metric
2025/07/02 04:35:48 INFO amzn_nova_prompt_optimizer.core.evaluation: Using cached inference results
2025/07/02 04:35:48 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Evaluation on Dataset, using `apply` metric


Original Prompt Evaluation Score = {'is_valid_json': 0.27, 'correct_categories': 0.251, 'correct_sentiment': 0.13, 'correct_urgency': 0.2, 'total': 0.19366666666666668}
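The reported `total` can be checked against the other fields. Because each row's `total` is the mean of its three component scores, the dataset-level `total` should equal the mean of the three averaged components. `is_valid_json` is reported but does not enter the total. A small sanity check, using only the numbers printed above:

```python
# Values copied from the evaluation output above.
reported = {
    "is_valid_json": 0.27,
    "correct_categories": 0.251,
    "correct_sentiment": 0.13,
    "correct_urgency": 0.2,
    "total": 0.19366666666666668,
}

# Mean of the three scored components; is_valid_json is deliberately excluded.
components = ["correct_categories", "correct_sentiment", "correct_urgency"]
recomputed_total = sum(reported[k] for k in components) / 3.0

print(f"recomputed total = {recomputed_total:.6f}")
print(f"reported total   = {reported['total']:.6f}")
```

The low `is_valid_json` rate (0.27) explains most of the low score: any row whose output fails to parse contributes 0 to every component.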


### Optimization Adapter

We can now define the optimization function. It takes a Prompt Adapter as input and, optionally, a Dataset Adapter, an Inference Adapter, and a Metric Adapter. It optimizes the prompt and returns a Prompt Adapter.

In [6]:
class FacilitySupportAnalyzerNovaMetric(FacilitySupportAnalyzerMetric):
    def apply(self, y_pred: Any, y_true: Any):
        # The optimizer needs a single numeric score per example, not a dict of metrics
        return self._calculate_metrics(y_pred, y_true)["total"]

    def batch_apply(self, y_preds: List[Any], y_trues: List[Any]):
        # Intentionally a no-op in this example
        pass

nova_metric_adapter = FacilitySupportAnalyzerNovaMetric()
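Returning a single number per example matters because the optimizer aggregates scores across examples. The optimizer logs in this notebook report lines such as `Average Metric: 35.166666666666664 / 50 (70.3%)`, which reads as the sum of per-example scores over the number of examples. The sketch below illustrates that aggregation under this assumption. `average_metric` is a hypothetical helper, not part of the library or of DSPy.

```python
from typing import List, Tuple


def average_metric(scores: List[float]) -> Tuple[float, float]:
    """Hypothetical helper: return (sum of scores, average as a percentage)."""
    total = sum(scores)
    return total, 100.0 * total / len(scores)


# Illustrative per-example scalar scores, as `apply` would return them.
example_scores = [1.0, 0.75, 0.5, 0.75]
score_sum, score_pct = average_metric(example_scores)
print(f"Average Metric: {score_sum:.2f} / {len(example_scores)} ({score_pct:.1f}%)")

# Consistency check against a figure from the optimizer logs in this notebook.
logged_sum, logged_n = 35.166666666666664, 50
logged_pct = round(100.0 * logged_sum / logged_n, 2)
print(logged_pct)
```

A dict-valued metric could not be summed this way, which is why the subclass narrows `apply` to the `total` field.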

#### NovaPromptOptimizer

NovaPromptOptimizer = Nova Meta Prompter + MIPROv2 with Nova Model Tips

In [7]:
from amzn_nova_prompt_optimizer.core.optimizers import NovaPromptOptimizer

nova_prompt_optimizer = NovaPromptOptimizer(prompt_adapter=prompt_adapter, inference_adapter=inference_adapter, dataset_adapter=train_set, metric_adapter=nova_metric_adapter)

optimized_prompt_adapter = nova_prompt_optimizer.optimize(mode="pro")

2025/07/02 20:45:38 INFO amzn_nova_prompt_optimizer.core.optimizers.nova_meta_prompter.nova_mp_optimizer: Optimizing prompt using Nova Meta Prompter
2025/07/02 20:45:44 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Using us.amazon.nova-pro-v1:0 for Evaluation
2025/07/02 20:45:44 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Using us.amazon.nova-premier-v1:0 for Prompting
2025/07/02 20:45:44 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.custom_adapters.custom_chat_adapter: Initializing CustomChatAdapter with enable_json_fallback=False
2025/07/02 20:45:44 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Using Nova tips for MIPROv2 optimization
2025/07/02 20:45:44 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/02 20:45:44 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instruc

Bootstrapping set 1/20
Bootstrapping set 2/20
Bootstrapping set 3/20


  8%|███████▋                                                                                        | 4/50 [00:06<01:09,  1.51s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/20


  8%|███████▋                                                                                        | 4/50 [00:05<01:04,  1.41s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 5/20


  6%|█████▊                                                                                          | 3/50 [00:04<01:06,  1.41s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 6/20


  4%|███▊                                                                                            | 2/50 [00:03<01:18,  1.63s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 7/20


  2%|█▉                                                                                              | 1/50 [00:01<01:05,  1.34s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/20


  6%|█████▊                                                                                          | 3/50 [00:04<01:04,  1.37s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 9/20


  8%|███████▋                                                                                        | 4/50 [00:05<01:02,  1.37s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 10/20


  6%|█████▊                                                                                          | 3/50 [00:03<01:02,  1.33s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/20


  2%|█▉                                                                                              | 1/50 [00:01<01:10,  1.45s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/20


  6%|█████▊                                                                                          | 3/50 [00:04<01:08,  1.45s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 13/20


  6%|█████▊                                                                                          | 3/50 [00:04<01:05,  1.39s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 14/20


  4%|███▊                                                                                            | 2/50 [00:02<01:04,  1.34s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 15/20


  4%|███▊                                                                                            | 2/50 [00:02<01:02,  1.30s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 16/20


  8%|███████▋                                                                                        | 4/50 [00:05<01:06,  1.45s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 17/20


  8%|███████▋                                                                                        | 4/50 [00:05<00:59,  1.30s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 18/20


  6%|█████▊                                                                                          | 3/50 [00:03<00:59,  1.27s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 19/20


  2%|█▉                                                                                              | 1/50 [00:01<01:07,  1.37s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 20/20


  2%|█▉                                                                                              | 1/50 [00:01<01:14,  1.51s/it]
2025/07/02 20:46:51 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Entering patched_propose_instructions, patching GroundedProposer with NovaGroundedProposer
2025/07/02 20:46:51 INFO amzn_nova_prompt_optimizer.core.optimizers.miprov2.miprov2_optimizer: Patched GroundedProposer, current GroundedProposer class=<class 'amzn_nova_prompt_optimizer.core.optimizers.nova_prompt_optimizer.nova_grounded_proposer.NovaGroundedProposer'>
2025/07/02 20:46:51 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/07/02 20:46:51 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/07/02 20:46:51 INFO amzn_nova_prompt_optimizer.core.op

Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


2025/07/02 20:47:10 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=20 instructions...



[Nova] Selected tip: persona
[Nova] Selected tip: persona
[Nova] Selected tip: multi_turn
[Nova] Selected tip: multi_turn
[Nova] Selected tip: multi_turn
[Nova] Selected tip: multi_turn
[Nova] Selected tip: simple
[Nova] Selected tip: high_stakes
[Nova] Selected tip: high_stakes
[Nova] Selected tip: format_control
[Nova] Selected tip: creative
[Nova] Selected tip: rules_based
[Nova] Selected tip: none
[Nova] Selected tip: simple
[Nova] Selected tip: multi_turn
[Nova] Selected tip: simple
[Nova] Selected tip: high_stakes
[Nova] Selected tip: structured_prompt
[Nova] Selected tip: simple
[Nova] Selected tip: multi_turn


2025/07/02 20:49:38 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/07/02 20:49:38 INFO dspy.teleprompt.mipro_optimizer_v2: 0: **Task:**
Extract and return a JSON with specified keys and values based on the input.

**Context:**
- The JSON must include "urgency", "sentiment", and "categories".
- "urgency" can be `high`, `medium`, or `low`.
- "sentiment" can be `negative`, `neutral`, or `positive`.
- "categories" is a dictionary with boolean values indicating if each category matches the input.

**Instructions:**
- MUST include all specified keys: "urgency", "sentiment", and "categories".
- "categories" MUST include all listed support category tags with boolean values.
- The JSON string MUST be valid and readable directly.
- DO NOT enclose the JSON in ```json...```.
- DO NOT include newlines or unnecessary whitespaces.

**Response Format:**
- The response MUST be a single-line JSON string.
- MUST adhere to the specified format and include all require

2025/07/02 20:49:38 INFO dspy.teleprompt.mipro_optimizer_v2: 13: Analyze the input email and classify it into the appropriate categories. Determine the sentiment and urgency levels. Return a JSON object with "categories" (boolean for each category), "sentiment" (positive, neutral, negative), and "urgency" (high, medium, low). Ensure all keys are present and the JSON is valid.

2025/07/02 20:49:38 INFO dspy.teleprompt.mipro_optimizer_v2: 14: Analyze the provided customer service inquiry and generate a JSON object containing "urgency" (high/medium/low), "sentiment" (positive/neutral/negative), and "categories" (boolean flags for each predefined category). Consider the email's tone, keywords, and context to accurately determine classifications. If ambiguity exists, default to neutral sentiment and medium urgency unless explicit urgency indicators are present. Ensure all category keys are included with appropriate boolean values.

2025/07/02 20:49:38 INFO dspy.teleprompt.mipro_optimizer_v2

Average Metric: 35.17 / 50 (70.3%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:36<00:00,  1.37it/s]

2025/07/02 20:50:15 INFO dspy.evaluate.evaluate: Average Metric: 35.166666666666664 / 50 (70.3%)
2025/07/02 20:50:15 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 70.33

2025/07/02 20:50:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 37 - Minibatch ==



Average Metric: 25.50 / 35 (72.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.55it/s]

2025/07/02 20:50:37 INFO dspy.evaluate.evaluate: Average Metric: 25.5 / 35 (72.9%)
2025/07/02 20:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 6'].
2025/07/02 20:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86]
2025/07/02 20:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33]
2025/07/02 20:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 70.33


2025/07/02 20:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 37 - Minibatch ==



Average Metric: 25.17 / 35 (71.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.54it/s]

2025/07/02 20:51:00 INFO dspy.evaluate.evaluate: Average Metric: 25.166666666666664 / 35 (71.9%)
2025/07/02 20:51:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.9 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 4'].
2025/07/02 20:51:00 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9]
2025/07/02 20:51:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33]
2025/07/02 20:51:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 70.33


2025/07/02 20:51:00 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 37 - Minibatch ==



Average Metric: 27.13 / 35 (77.5%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.57it/s]

2025/07/02 20:51:22 INFO dspy.evaluate.evaluate: Average Metric: 27.133333333333333 / 35 (77.5%)
2025/07/02 20:51:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.52 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 13'].
2025/07/02 20:51:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52]
2025/07/02 20:51:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33]
2025/07/02 20:51:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 70.33


2025/07/02 20:51:22 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 37 - Minibatch ==



Average Metric: 25.40 / 35 (72.6%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.49it/s]

2025/07/02 20:51:46 INFO dspy.evaluate.evaluate: Average Metric: 25.4 / 35 (72.6%)
2025/07/02 20:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 7'].
2025/07/02 20:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57]
2025/07/02 20:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33]
2025/07/02 20:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 70.33


2025/07/02 20:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 37 - Minibatch ==



Average Metric: 25.07 / 35 (71.6%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.46it/s]

2025/07/02 20:52:10 INFO dspy.evaluate.evaluate: Average Metric: 25.066666666666666 / 35 (71.6%)
2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.62 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 9'].
2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62]
2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33]
2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 70.33


2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 37 - Full Evaluation =====
2025/07/02 20:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 77.52) from minibatch trials...



Average Metric: 38.53 / 50 (77.1%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:11<00:00,  4.45it/s]

2025/07/02 20:52:21 INFO dspy.evaluate.evaluate: Average Metric: 38.53333333333333 / 50 (77.1%)
2025/07/02 20:52:21 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 77.07
2025/07/02 20:52:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:52:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07
2025/07/02 20:52:21 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 20:52:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 37 - Minibatch ==



Average Metric: 25.93 / 35 (74.1%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:24<00:00,  1.41it/s]

2025/07/02 20:52:46 INFO dspy.evaluate.evaluate: Average Metric: 25.933333333333334 / 35 (74.1%)
2025/07/02 20:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.1 on minibatch of size 35 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 15'].
2025/07/02 20:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1]
2025/07/02 20:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07


2025/07/02 20:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 37 - Minibatch ==



Average Metric: 26.70 / 35 (76.3%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:30<00:00,  1.14it/s]

2025/07/02 20:53:17 INFO dspy.evaluate.evaluate: Average Metric: 26.7 / 35 (76.3%)
2025/07/02 20:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 17'].
2025/07/02 20:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29]
2025/07/02 20:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07


2025/07/02 20:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 37 - Minibatch ==



Average Metric: 24.30 / 35 (69.4%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.52it/s]

2025/07/02 20:53:40 INFO dspy.evaluate.evaluate: Average Metric: 24.3 / 35 (69.4%)
2025/07/02 20:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 69.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 9'].
2025/07/02 20:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43]
2025/07/02 20:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07


2025/07/02 20:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 37 - Minibatch ==



Average Metric: 25.20 / 35 (72.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.46it/s]

2025/07/02 20:54:04 INFO dspy.evaluate.evaluate: Average Metric: 25.2 / 35 (72.0%)
2025/07/02 20:54:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 13'].
2025/07/02 20:54:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0]
2025/07/02 20:54:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:54:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07


2025/07/02 20:54:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 37 - Minibatch ==



Average Metric: 26.53 / 35 (75.8%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.46it/s]

2025/07/02 20:54:28 INFO dspy.evaluate.evaluate: Average Metric: 26.53333333333333 / 35 (75.8%)
2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.81 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81]
2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07]
2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.07


2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 37 - Full Evaluation =====
2025/07/02 20:54:28 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 76.29) from minibatch trials...



Average Metric: 40.03 / 50 (80.1%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:11<00:00,  4.33it/s]

2025/07/02 20:54:39 INFO dspy.evaluate.evaluate: Average Metric: 40.03333333333333 / 50 (80.1%)
2025/07/02 20:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 80.07
2025/07/02 20:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07
2025/07/02 20:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 20:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 37 - Minibatch ==



Average Metric: 27.03 / 35 (77.2%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.50it/s]

2025/07/02 20:55:03 INFO dspy.evaluate.evaluate: Average Metric: 27.03333333333333 / 35 (77.2%)
2025/07/02 20:55:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.24 on minibatch of size 35 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 13'].
2025/07/02 20:55:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24]
2025/07/02 20:55:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:55:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:55:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 37 - Minibatch ==



Average Metric: 26.90 / 35 (76.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.50it/s]

2025/07/02 20:55:26 INFO dspy.evaluate.evaluate: Average Metric: 26.9 / 35 (76.9%)
2025/07/02 20:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 20:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86]
2025/07/02 20:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 37 - Minibatch ==



Average Metric: 26.70 / 35 (76.3%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:28<00:00,  1.22it/s]

2025/07/02 20:55:55 INFO dspy.evaluate.evaluate: Average Metric: 26.7 / 35 (76.3%)
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6'].
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29]
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 37 - Minibatch ==



Average Metric: 27.73 / 35 (79.2%): 100%|█████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 1377.01it/s]

2025/07/02 20:55:55 INFO dspy.evaluate.evaluate: Average Metric: 27.733333333333334 / 35 (79.2%)
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.24 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 13'].
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24]
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 37 - Minibatch ==



Average Metric: 25.57 / 35 (73.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:25<00:00,  1.39it/s]

2025/07/02 20:56:20 INFO dspy.evaluate.evaluate: Average Metric: 25.566666666666666 / 35 (73.0%)
2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 73.05 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 6'].
2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05]
2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07]
2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 37 - Full Evaluation =====
2025/07/02 20:56:20 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 77.24) from minibatch trials...



Average Metric: 38.57 / 50 (77.1%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.97it/s]

2025/07/02 20:56:30 INFO dspy.evaluate.evaluate: Average Metric: 38.56666666666666 / 50 (77.1%)
2025/07/02 20:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07
2025/07/02 20:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 20:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 37 - Minibatch ==



Average Metric: 27.60 / 35 (78.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.47it/s]

2025/07/02 20:56:54 INFO dspy.evaluate.evaluate: Average Metric: 27.599999999999998 / 35 (78.9%)
2025/07/02 20:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 20:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86]
2025/07/02 20:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 37 - Minibatch ==



Average Metric: 25.57 / 35 (73.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.49it/s]

2025/07/02 20:57:17 INFO dspy.evaluate.evaluate: Average Metric: 25.566666666666666 / 35 (73.0%)
2025/07/02 20:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 73.05 on minibatch of size 35 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 18'].
2025/07/02 20:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05]
2025/07/02 20:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 37 - Minibatch ==



Average Metric: 27.70 / 35 (79.1%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.59it/s]

2025/07/02 20:57:39 INFO dspy.evaluate.evaluate: Average Metric: 27.7 / 35 (79.1%)
2025/07/02 20:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 20:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14]
2025/07/02 20:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 37 - Minibatch ==



Average Metric: 25.20 / 35 (72.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:21<00:00,  1.61it/s]

2025/07/02 20:58:01 INFO dspy.evaluate.evaluate: Average Metric: 25.2 / 35 (72.0%)
2025/07/02 20:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 4'].
2025/07/02 20:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0]
2025/07/02 20:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 37 - Minibatch ==



Average Metric: 25.50 / 35 (72.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:24<00:00,  1.45it/s]

2025/07/02 20:58:25 INFO dspy.evaluate.evaluate: Average Metric: 25.5 / 35 (72.9%)
2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 16'].
2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86]
2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13]
2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 37 - Full Evaluation =====
2025/07/02 20:58:25 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 79.14) from minibatch trials...



Average Metric: 39.20 / 50 (78.4%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.15it/s]

2025/07/02 20:58:35 INFO dspy.evaluate.evaluate: Average Metric: 39.199999999999996 / 50 (78.4%)
2025/07/02 20:58:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 20:58:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07
2025/07/02 20:58:35 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 20:58:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 37 - Minibatch ==



Average Metric: 22.73 / 35 (65.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:28<00:00,  1.24it/s]

2025/07/02 20:59:03 INFO dspy.evaluate.evaluate: Average Metric: 22.733333333333334 / 35 (65.0%)
2025/07/02 20:59:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.95 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 12'].
2025/07/02 20:59:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95]
2025/07/02 20:59:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 20:59:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:59:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 37 - Minibatch ==



Average Metric: 26.30 / 35 (75.1%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.54it/s]

2025/07/02 20:59:26 INFO dspy.evaluate.evaluate: Average Metric: 26.3 / 35 (75.1%)
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 19', 'Predictor 0: Few-Shot Set 10'].
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14]
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 28 / 37 - Minibatch ==



Average Metric: 26.43 / 35 (75.5%): 100%|█████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 1232.02it/s]

2025/07/02 20:59:26 INFO dspy.evaluate.evaluate: Average Metric: 26.433333333333334 / 35 (75.5%)
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.52 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52]
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:59:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 29 / 37 - Minibatch ==



Average Metric: 24.60 / 35 (70.3%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.56it/s]

2025/07/02 20:59:49 INFO dspy.evaluate.evaluate: Average Metric: 24.599999999999998 / 35 (70.3%)
2025/07/02 20:59:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 70.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 11'].
2025/07/02 20:59:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29]
2025/07/02 20:59:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 20:59:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 20:59:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 30 / 37 - Minibatch ==



Average Metric: 24.13 / 35 (69.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.50it/s]

2025/07/02 21:00:12 INFO dspy.evaluate.evaluate: Average Metric: 24.133333333333333 / 35 (69.0%)
2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.95 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 11'].
2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95]
2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4]
2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 31 / 37 - Full Evaluation =====
2025/07/02 21:00:12 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 78.86) from minibatch trials.


Average Metric: 39.30 / 50 (78.6%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.63it/s]

2025/07/02 21:00:23 INFO dspy.evaluate.evaluate: Average Metric: 39.3 / 50 (78.6%)
2025/07/02 21:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07
2025/07/02 21:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 21:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 32 / 37 - Minibatch ==



Average Metric: 26.27 / 35 (75.0%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.49it/s]

2025/07/02 21:00:46 INFO dspy.evaluate.evaluate: Average Metric: 26.266666666666666 / 35 (75.0%)
2025/07/02 21:00:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.05 on minibatch of size 35 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 8'].
2025/07/02 21:00:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95, 75.05]
2025/07/02 21:00:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:00:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:00:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 33 / 37 - Minibatch ==



Average Metric: 26.20 / 35 (74.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.54it/s]

2025/07/02 21:01:09 INFO dspy.evaluate.evaluate: Average Metric: 26.2 / 35 (74.9%)
2025/07/02 21:01:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 21:01:09 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95, 75.05, 74.86]
2025/07/02 21:01:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:01:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:01:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 34 / 37 - Minibatch ==



Average Metric: 25.87 / 35 (73.9%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:25<00:00,  1.39it/s]

2025/07/02 21:01:34 INFO dspy.evaluate.evaluate: Average Metric: 25.866666666666667 / 35 (73.9%)
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 73.9 on minibatch of size 35 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95, 75.05, 74.86, 73.9]
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 35 / 37 - Minibatch ==



Average Metric: 27.07 / 35 (77.3%): 100%|█████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 1187.20it/s]

2025/07/02 21:01:34 INFO dspy.evaluate.evaluate: Average Metric: 27.066666666666666 / 35 (77.3%)
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.33 on minibatch of size 35 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 1'].
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95, 75.05, 74.86, 73.9, 77.33]
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:01:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 36 / 37 - Minibatch ==



Average Metric: 28.23 / 35 (80.7%): 100%|███████████████████████████████████████████████████████████| 35/35 [00:22<00:00,  1.53it/s]

2025/07/02 21:01:57 INFO dspy.evaluate.evaluate: Average Metric: 28.233333333333334 / 35 (80.7%)
2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.67 on minibatch of size 35 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 14'].
2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [72.86, 71.9, 77.52, 72.57, 71.62, 74.1, 76.29, 69.43, 72.0, 75.81, 77.24, 76.86, 76.29, 79.24, 73.05, 78.86, 73.05, 79.14, 72.0, 72.86, 64.95, 75.14, 75.52, 70.29, 68.95, 75.05, 74.86, 73.9, 77.33, 80.67]
2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6]
2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07


2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 37 / 37 - Full Evaluation =====
2025/07/02 21:01:57 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program 


Average Metric: 38.43 / 50 (76.9%): 100%|███████████████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.19it/s]

2025/07/02 21:02:07 INFO dspy.evaluate.evaluate: Average Metric: 38.43333333333333 / 50 (76.9%)
2025/07/02 21:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [70.33, 77.07, 80.07, 77.13, 78.4, 78.6, 76.87]
2025/07/02 21:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.07
2025/07/02 21:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/02 21:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 80.07!
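
The numbers in the log above can be cross-checked by hand. The following is a minimal sketch, not part of the optimizer: the values are copied from the log, and `to_score` is a hypothetical helper that assumes the reported score is the summed metric divided by the number of examples, expressed as a percentage and rounded to two decimals.

```python
# Full evaluation scores copied from the final "Full eval scores so far" log line.
full_eval_scores = [70.33, 77.07, 80.07, 77.13, 78.4, 78.6, 76.87]


def to_score(metric_sum: float, n_examples: int) -> float:
    """Hypothetical helper: convert 'Average Metric: <sum> / <n>' into a percentage score."""
    return round(100 * metric_sum / n_examples, 2)


# The best full-evaluation score and its position in the list.
best_score = max(full_eval_scores)
best_position = full_eval_scores.index(best_score)
print(f"Best full eval score: {best_score} (entry {best_position + 1} of {len(full_eval_scores)})")

# Trial 36 (minibatch of 35) and Trial 37 (full evaluation on 50 examples).
print(to_score(28.233333333333334, 35))
print(to_score(38.43333333333333, 50))
```

Under that assumption, the two converted values match the scores logged for Trials 36 and 37, and the best full-evaluation score matches the one reported in the final "Returning best identified program" line.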





In [8]:
optimized_prompt_adapter.show()

2025/07/02 21:02:07 INFO amzn_nova_prompt_optimizer.core.input_adapters.prompt_adapter: 
Standardized Prompt:


{
  "user_prompt": {
    "variables": [
      "input"
    ],
    "template": "Extract and return a json with the following keys and values from the input provided: [{{ input }}]\n- \"urgency\" as one of `high`, `medium`, `low`\n- \"sentiment\" as one of `negative`, `neutral`, `positive`\n- \"categories\" as a dictionary with categories as keys and boolean values indicating if the category matches the input. Categories are: `emergency_repair_services`, `routine_maintenance_requests`, `quality_and_safety_concerns`, `specialized_cleaning_services`, `general_inquiries`, `sustainability_and_environmental_practices`, `training_and_support_requests`, `cleaning_services_scheduling`, `customer_feedback_and_complaints`, `facility_management_issues`",
    "metadata": {
      "format": "text"
    }
  },
  "system_prompt": {
    "variables": [],
    "template": "Analyze the input text and classify it into a JSON object with \"urgency\", \"sentiment\", and \"categories\" keys. Ensure all categories 

### Optimized System Prompt

In [9]:
print(optimized_prompt_adapter.system_prompt)

Analyze the input text and classify it into a JSON object with "urgency", "sentiment", and "categories" keys. Ensure all categories are covered with boolean values, and the output is a valid JSON string without extra formatting.


### Optimized User Prompt

In [10]:
print(optimized_prompt_adapter.user_prompt)

Extract and return a json with the following keys and values from the input provided: [{{ input }}]
- "urgency" as one of `high`, `medium`, `low`
- "sentiment" as one of `negative`, `neutral`, `positive`
- "categories" as a dictionary with categories as keys and boolean values indicating if the category matches the input. Categories are: `emergency_repair_services`, `routine_maintenance_requests`, `quality_and_safety_concerns`, `specialized_cleaning_services`, `general_inquiries`, `sustainability_and_environmental_practices`, `training_and_support_requests`, `cleaning_services_scheduling`, `customer_feedback_and_complaints`, `facility_management_issues`


### Few-Shot Examples

In [11]:
print(f"Number of Few-Shot Examples = {len(optimized_prompt_adapter.few_shot_examples)}")

Number of Few-Shot Examples = 4


In [12]:
# Print only the first example
print(optimized_prompt_adapter.few_shot_examples[0])

{'input': 'Extract and return a json with the following keys and values from the input provided: [Subject: Request for Post-Renovation Cleaning Services\n\nDear ProCare Support Team,\n\nI hope this message finds you well. My name is [Sender], and I have been utilizing ProCare Facility Solutions for the maintenance and management of my commercial property for the past few years. Your services have always been reliable and efficient, which is why I am reaching out to you today.\n\nI am writing to request assistance with a specialized cleaning service for our office building. We have recently undergone some renovations, and there is a significant amount of dust and debris that needs to be addressed. Additionally, the carpets and windows require a thorough cleaning to restore them to their original condition.\n\nI have already scheduled a routine cleaning, but I believe this situation requires more specialized attention. Given the nature of the work needed, I would appreciate it if you cou

### Evaluator

Now we evaluate the prompt produced by the Nova Prompt Optimizer on the test set.

In [13]:
from amzn_nova_prompt_optimizer.core.evaluation import Evaluator

evaluator = Evaluator(optimized_prompt_adapter, test_set, metric_adapter, inference_adapter)

In [14]:
evaluation_score = evaluator.aggregate_score(model_id=NOVA_MODEL_ID)
print(f"Nova Prompt Optimization = {evaluation_score}")

2025/07/02 21:02:07 INFO amzn_nova_prompt_optimizer.core.evaluation: Cache miss - Running new inference on Dataset
Running inference: 100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:44<00:00,  2.27it/s]
2025/07/02 21:02:51 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Batch Evaluation on Dataset, using `batch_apply` metric
2025/07/02 21:02:51 INFO amzn_nova_prompt_optimizer.core.evaluation: Using cached inference results
2025/07/02 21:02:51 INFO amzn_nova_prompt_optimizer.core.evaluation: Running Evaluation on Dataset, using `apply` metric


Nova Prompt Optimization = {'is_valid_json': 1.0, 'correct_categories': 0.9309999999999999, 'correct_sentiment': 0.66, 'correct_urgency': 0.84, 'total': 0.8103333333333333}
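
As a quick sanity check on the aggregate above, the sketch below recomputes `total` from the other values. It assumes that `total` is the unweighted mean of the three `correct_*` components and that `is_valid_json` is not part of it. That assumption is inferred from the printed numbers only; the authoritative definition is the custom metric in the Metric Adapter cell.

```python
import math

# Scores copied from the evaluation output above.
scores = {
    "is_valid_json": 1.0,
    "correct_categories": 0.9309999999999999,
    "correct_sentiment": 0.66,
    "correct_urgency": 0.84,
    "total": 0.8103333333333333,
}

# Assumption: total is the plain average of the three correctness components.
components = ["correct_categories", "correct_sentiment", "correct_urgency"]
mean_of_components = sum(scores[name] for name in components) / len(components)

matches_total = math.isclose(mean_of_components, scores["total"], rel_tol=1e-9)
print(f"Mean of components = {mean_of_components:.4f}, matches total: {matches_total}")
```

Read this way, sentiment (0.66) is the weakest component and is the one that pulls the aggregate down the most.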


In [15]:
optimized_prompt_adapter.save("optimized_prompt/")