# Automated Prompt Engineering using MIPRO

This notebook automates prompt optimization using the [Multi-prompt Instruction PRoposal Optimizer (MIPRO)](https://arxiv.org/abs/2406.11695v1).
It is designed for TensorZero users who want to optimize their system prompts based on collected inference and feedback data. It currently supports only applications with a single system prompt.

Support for applications with multiple system prompts is planned. If this use case interests you, see our [LLM Gym Example](https://github.com/tensorzero/llmgym/tree/main/examples/mipro) for a full implementation.

By following this guide, you can systematically refine your prompts to improve model performance in specific tasks.


## Overview

The optimization process involves the following steps:

1. **Generate Candidate Instructions and Demonstrations**
    - Candidate instructions are generated using OpenAI's o1 model based on a system template and an optional schema.
        - This is configurable in the `config/tensorzero.toml` file if you want to use a different model.
    - Candidate demonstrations are sets of few-shot examples sampled from the training dataset.
2. **Evaluate Instruction-Demonstration Pairs**
    - Sample an instruction and demonstration pair and score it using a Large Language Model (LLM) judge.
    - The judge (a TensorZero function utilizing OpenAI's GPT-4o-mini model) scores the quality of the instruction-demonstration pair.
    - Scores are aggregated over the evaluation set to produce a final evaluation score.
3. **Optimization via Search Algorithms**
    - Utilize a random search or a Tree-structured Parzen Estimator (TPE) to determine the next instruction and demonstration pair for evaluation.
4. **Iterate the Optimization Process**
    - Repeat the optimization process for a fixed number of iterations.
5. **Select the Best Performing Prompts**
    - The instruction and demonstration pairs corresponding to the highest-performing prompts are formatted to yield optimized system templates.
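
The steps above can be sketched as a small search loop. This is only an illustration using the standard library: `toy_score` is a hypothetical stand-in for the LLM judge, and `random_search` is not part of TensorZero or Optuna. The notebook itself uses Optuna samplers and real inference calls.

```python
import random


def toy_score(instruction: str, demonstrations: str) -> float:
    """Hypothetical stand-in for the LLM judge: simply rewards longer prompts."""
    return float(len(instruction) + len(demonstrations))


def random_search(instructions, demonstrations, score_fn, max_iterations, seed=0):
    """Score randomly sampled (instruction, demonstration) pairs and keep the best."""
    rng = random.Random(seed)
    best_pair, best_score = None, float("-inf")
    for _ in range(max_iterations):
        i = rng.randrange(len(instructions))
        d = rng.randrange(len(demonstrations))
        score = score_fn(instructions[i], demonstrations[d])
        if score > best_score:
            best_pair, best_score = (i, d), score
    return best_pair, best_score


instructions = ["Extract entities.", "Extract all named entities as JSON."]
demonstrations = ["", "DEMONSTRATION 1: ..."]
best_pair, best_score = random_search(
    instructions, demonstrations, toy_score, max_iterations=20
)
print(best_pair, best_score)
```

Replacing the random sampling step with a model-based sampler such as TPE changes only how the next pair is chosen; the evaluate-and-keep-best structure stays the same.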



## Step 1: Define Function Configuration Parameters

Specify the TensorZero function you want to optimize. The example below optimizes the system prompt for Named Entity Recognition (NER):

- **Function Configuration Directory:** Location of the function’s configuration files.

- **Function Name:** The TensorZero function being optimized.

- **Model Variant:** The specific function variant to use as an example for the system template.

In [1]:
# Configuration arguments for the function you want to optimize the prompt for
CONFIG_DIR = "../../examples/data-extraction-ner/config"

# The name of the function you want to optimize the prompt for
FUNCTION_NAME = "extract_entities"

# The name of the variant to use
TEMPLATE_VARIANT_NAME = "gpt_4o_mini"

## Step 2: Configure the LLM Judge for Metric Optimization

The LLM judge guides the optimization process by evaluating prompt effectiveness. You must define:

- **Task Description:** A summary of the task being optimized.
- **Optimization Metric:** The metric used for evaluating prompt effectiveness (e.g. Jaccard similarity between predicted and ground truth entities).

In [2]:
# Description of the task whose prompt you are optimizing (passed to the optimizer judge)
TASK_DESCRIPTION = "The task is to extract named entities from the input text."

# Metric definition for scoring generated prompts
METRIC_PROPERTIES = "The metric is the Jaccard similarity between the predicted and ground truth entities."
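
For reference, the Jaccard similarity named in `METRIC_PROPERTIES` can be computed directly as below. This is a minimal sketch for intuition only: in this notebook the judge is an LLM that receives the metric description as text and does not call this function. The name `jaccard_similarity` and the convention for two empty sets are assumptions.

```python
def jaccard_similarity(predicted, ground_truth):
    """Size of the intersection divided by the size of the union of two entity sets."""
    pred, truth = set(predicted), set(ground_truth)
    if not pred and not truth:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(pred & truth) / len(pred | truth)


# One shared entity out of two distinct entities overall
score = jaccard_similarity(["Dirk Dier", "Chuck Adams"], ["Dirk Dier"])
print(score)  # 0.5
```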

## Step 3: Define Optimization Parameters

The following parameters control the optimization process. Experimenting with different values can help refine results:

- **Search Space**
    - `NUM_CANDIDATE_INSTRUCTIONS`: Number of candidate instructions to generate.
    - `NUM_CANDIDATE_DEMONSTRATIONS`: Number of candidate demonstrations to sample.
- **Optimization Control**
    - `MAX_ITERATIONS`: Number of optimization steps.
    - `MAX_EXAMPLES_PER_DEMONSTRATION`: Maximum few-shot examples per demonstration.
    - `OPTIMIZER_DIRECTION`: Whether the metric should be maximized or minimized.
- **Evaluation Control**
    - `EVAL_FRACTION`: Fraction of the dataset used for scoring generated prompts.
    - `MAX_SAMPLES`: Maximum number of samples retrieved from ClickHouse.
- **Reproducibility**
    - `SEED`: Random seed for consistent results.

In [3]:
# Number of candidate instructions to generate and search over
NUM_CANDIDATE_INSTRUCTIONS = 10

# Number of candidate demonstrations to sample and search over
NUM_CANDIDATE_DEMONSTRATIONS = 10

# Maximum number of demonstrations in each candidate demonstration set
MAX_EXAMPLES_PER_DEMONSTRATION = 10

# Maximum number of search steps taken by the optimization algorithm for evaluating instruction-demonstration pairs
MAX_ITERATIONS = 5

# Set optimization direction ('maximize' or 'minimize') based on the metric properties you described above.
OPTIMIZER_DIRECTION = "maximize"

# Fraction of the dataset used by the judge to score the quality of the generated prompt
EVAL_FRACTION = 0.2

# Limit on the number of samples for demonstration selection
MAX_SAMPLES = 100_000

# Random seed for reproducibility
SEED = 0


## Import Dependencies

In [4]:
import asyncio
import json
import os
from copy import deepcopy
from typing import Any, Dict, List, Optional

import numpy as np
import optuna
import pandas as pd
from clickhouse_connect import get_client
from minijinja import Environment
from optuna.samplers import TPESampler
from tensorzero import (
    AsyncTensorZeroGateway,
    InferenceResponse,
    JsonInferenceResponse,
    RawText,
    Text,
)
from tqdm.asyncio import tqdm_asyncio
from utils.client_calls import candidate_inference, get_instructions, judge_answer
from utils.configs.reader import load_config



## Initialize the MIPRO TensorZero Client

This client is used to generate candidate instructions and score the quality of responses given the candidate instructions and demonstrations.

In [5]:
MAX_CONCURRENT_REQUESTS = 50

In [6]:
mipro_client = await AsyncTensorZeroGateway.build_embedded(
    config_file="config/tensorzero.toml",
    clickhouse_url=os.environ["TENSORZERO_CLICKHOUSE_URL"],
)
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

## Load Data

Load the TensorZero configuration for the function you want to optimize the prompt for.

In [7]:
base_config = load_config(CONFIG_DIR)

Retrieve the configuration for the variant with the templates we'll use for prompt optimization.

In [8]:
assert FUNCTION_NAME in base_config.functions.keys(), (
    f"No function named `{FUNCTION_NAME}` found in config"
)
assert TEMPLATE_VARIANT_NAME in base_config.functions[FUNCTION_NAME].variants.keys(), (
    f"No variant named `{TEMPLATE_VARIANT_NAME}` found in function `{FUNCTION_NAME}`"
)

base_function = base_config.functions[FUNCTION_NAME]
base_variant = deepcopy(base_function.variants[TEMPLATE_VARIANT_NAME])

Initialize the ClickHouse client.

In [9]:
assert "TENSORZERO_CLICKHOUSE_URL" in os.environ, (
    "TENSORZERO_CLICKHOUSE_URL environment variable not set"
)

clickhouse_client = get_client(dsn=os.environ["TENSORZERO_CLICKHOUSE_URL"])

Determine the inference table name based on the function type.

In [10]:
inference_table_name = {"chat": "ChatInference", "json": "JsonInference"}.get(
    base_function.type
)

if inference_table_name is None:
    raise ValueError(f"Unsupported function type: {base_function.type}")

Query the inferences and demonstration feedback from ClickHouse.

You can use one of the metrics listed below, or choose `FILTER_METRIC_NAME = "demonstration"` to use ground truth demonstrations.

In [11]:
print(base_config.metrics.keys())

dict_keys(['exact_match', 'jaccard_similarity', 'valid_output'])


In [12]:
FILTER_METRIC_NAME = "demonstration"
FILTER_METRIC_THRESHOLD = 0.9

# Look up the metric configuration unless ground truth demonstrations are used
if FILTER_METRIC_NAME != "demonstration":
    filter_metric = base_config.metrics[FILTER_METRIC_NAME]

In [13]:
if (
    FILTER_METRIC_NAME == "demonstration"
):  # Assume demonstration feedback is available and used.
    query = f"""
    SELECT 
        i.input, 
        i.output, 
        f.value,
        i.episode_id
    FROM 
        {inference_table_name} i
    JOIN 
        (SELECT
            inference_id,
            value,
            ROW_NUMBER() OVER (PARTITION BY inference_id ORDER BY timestamp DESC) as rn
        FROM 
            DemonstrationFeedback
        ) f ON i.id = f.inference_id AND f.rn = 1
    WHERE 
        i.function_name = %(function_name)s
    LIMIT %(max_samples)s
    """

    params = {
        "function_name": FUNCTION_NAME,
        "max_samples": MAX_SAMPLES,
    }
else:
    feedback_table_name = {
        "float": "FloatMetricFeedback",
        "boolean": "BooleanMetricFeedback",
    }.get(filter_metric.type)

    inference_join_key = {
        "episode": "episode_id",
        "inference": "id",
    }.get(filter_metric.level)

    if inference_join_key is None:
        raise ValueError(f"Unsupported metric level: {filter_metric.level}")

    threshold = FILTER_METRIC_THRESHOLD if filter_metric.type == "float" else 0.5
    comparison_operator = ">=" if filter_metric.optimize == "maximize" else "<="

    query = f"""
    SELECT 
        i.input, 
        i.output, 
        i.episode_id,
        i.function_name,
        f.value
    FROM 
        {inference_table_name} i
    JOIN 
        (SELECT
            target_id,
            value,
            ROW_NUMBER() OVER (PARTITION BY target_id ORDER BY timestamp DESC) as rn
        FROM 
            {feedback_table_name}
        WHERE
            metric_name = %(metric_name)s
            AND value {comparison_operator} %(threshold)s
        ) f ON i.{inference_join_key} = f.target_id and f.rn = 1
    WHERE 
        i.function_name = %(function_name)s
    LIMIT %(max_samples)s
    """

    params = {
        "function_name": FUNCTION_NAME,
        "max_samples": MAX_SAMPLES,
        "metric_name": FILTER_METRIC_NAME,
        # Use the threshold computed above (0.5 for boolean metrics)
        "threshold": threshold,
    }

df = clickhouse_client.query_df(query, params)

if FILTER_METRIC_NAME != "demonstration":
    df.value = df.output

df.head()

Unnamed: 0,input,output,value,episode_id
0,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Andrea Collinelli\"",\""F...","{""raw"":""{\""person\"":[\""Andrea Collinelli\"",\""F...",01956208-4532-7941-89f6-fc9305183931
1,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...","{""raw"":""{\""person\"":[\""SAO PAULO\""],\""organiza...",01956207-741a-7a62-8bd2-78e0d66609a2
2,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...",01956207-bdeb-7292-94c0-7b33d02a2f7c
3,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Dirk Dier\"",\""Chuck Ada...","{""raw"":""{\""person\"":[\""Dirk Dier\"",\""Chuck Ada...",01956208-7540-7341-96c8-db05abce8fcd
4,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Federico Colonna\""],\""o...","{""raw"":""{\""person\"":[\""Federico Colonna\""],\""o...",01956208-7283-7592-975e-c092b4528585
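
The `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY timestamp DESC)` subquery with `rn = 1` keeps only the most recent feedback row for each inference or target. A minimal pure-Python sketch of the same idea, where the function name and the toy rows are hypothetical:

```python
def latest_feedback(rows):
    """Keep the most recent row per target_id, mirroring the rn = 1 filter."""
    latest = {}
    for row in rows:
        current = latest.get(row["target_id"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["target_id"]] = row
    return latest


# Hypothetical feedback rows; target "a" received feedback twice
rows = [
    {"target_id": "a", "value": 0.4, "timestamp": 1},
    {"target_id": "a", "value": 0.95, "timestamp": 2},
    {"target_id": "b", "value": 0.7, "timestamp": 1},
]
latest = latest_feedback(rows)
print({target: row["value"] for target, row in latest.items()})
```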


Retrieve the system, user, and assistant templates in the variant (if any), and initialize a minijinja environment with them.


In [14]:
templates = {}

if base_variant.assistant_template is not None:
    templates["assistant"] = base_variant.assistant_template

if base_variant.system_template is not None:
    templates["system"] = base_variant.system_template

if base_variant.user_template is not None:
    templates["user"] = base_variant.user_template

env = Environment(templates=templates)

Render the messages in the input and demonstration columns.

In [15]:
def render_message(content: List[Dict[str, Any]], role: str) -> str:
    assert role in ["user", "assistant"], f"Invalid role: {role}"

    if len(content) != 1:
        raise ValueError(f"Message must have exactly one content block: {content}")

    if role == "user":
        output = "ENVIRONMENT:\n"
    else:
        output = "AGENT:\n"

    if content[0]["type"] == "text":
        value = content[0]["value"]
        if isinstance(value, str):
            output += value
        else:
            value = env.render_template(role, **value)  # type: ignore
            assert isinstance(value, str)
            output += value
    elif content[0]["type"] == "tool_call":
        del content[0]["id"]
        del content[0]["type"]
        output += f"Tool call: {json.dumps(content[0])}"
    elif content[0]["type"] == "tool_result":
        output += f"Tool result: {content[0]['result']}"
    else:
        raise ValueError(
            f"Content block must be of type text, tool_call, or tool_result: {content}"
        )

    return output


def format_input(sample):
    function_input = json.loads(sample["input"])
    rendered_message = ""
    for message in function_input["messages"]:
        rendered_message += render_message(message["content"], message["role"])
        rendered_message += "\n"
    return rendered_message


def format_output(sample):
    output = json.loads(sample["value"])
    if base_function.type == "chat":
        if len(output) != 1:
            raise ValueError(f"Output {output} must have exactly one content block.")
        if output[0]["type"] == "text":
            return output[0]["text"]
        elif output[0]["type"] == "tool_call":
            del output[0]["raw_arguments"]
            del output[0]["raw_name"]
            del output[0]["type"]
            return f"Tool call: {json.dumps(output[0])}"
        elif output[0]["type"] == "tool_result":
            return json.dumps(output[0])
        else:
            raise ValueError(
                f"Output {output} must be a text, tool_call, or tool_result block."
            )
    elif base_function.type == "json":
        return output["raw"]
    else:
        raise ValueError(f"Unsupported function type: {base_function.type}")


def format_system_args(sample):
    function_input = json.loads(sample["input"])
    if "system" in function_input:
        return function_input["system"]
    else:
        return ""


df["input_str"] = df.apply(format_input, axis=1)
df["value_str"] = df.apply(format_output, axis=1)
df["system_args"] = df.apply(format_system_args, axis=1)
df.head()

Unnamed: 0,input,output,value,episode_id,input_str,value_str,system_args
0,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Andrea Collinelli\"",\""F...","{""raw"":""{\""person\"":[\""Andrea Collinelli\"",\""F...",01956208-4532-7941-89f6-fc9305183931,ENVIRONMENT:\nAndrea Collinelli ( Italy ) 4:16...,"{""person"":[""Andrea Collinelli"",""Francis Moreau...",
1,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...","{""raw"":""{\""person\"":[\""SAO PAULO\""],\""organiza...",01956207-741a-7a62-8bd2-78e0d66609a2,ENVIRONMENT:\nSAO PAULO 1996-08-27\n,"{""person"":[""SAO PAULO""],""organization"":[],""loc...",
2,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...","{""raw"":""{\""person\"":[],\""organization\"":[],\""l...",01956207-bdeb-7292-94c0-7b33d02a2f7c,ENVIRONMENT:\nKey stock and currency market mo...,"{""person"":[],""organization"":[],""location"":[],""...",
3,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Dirk Dier\"",\""Chuck Ada...","{""raw"":""{\""person\"":[\""Dirk Dier\"",\""Chuck Ada...",01956208-7540-7341-96c8-db05abce8fcd,ENVIRONMENT:\nDirk Dier ( Germany ) beat Chuck...,"{""person"":[""Dirk Dier"",""Chuck Adams""],""organiz...",
4,"{""messages"":[{""role"":""user"",""content"":[{""type""...","{""raw"":""{\""person\"":[\""Federico Colonna\""],\""o...","{""raw"":""{\""person\"":[\""Federico Colonna\""],\""o...",01956208-7283-7592-975e-c092b4528585,ENVIRONMENT:\n1. Federico Colonna ( Italy ) Ma...,"{""person"":[""Federico Colonna""],""organization"":...",


Split the data into training and evaluation sets.
The training set is used to generate candidate demonstrations.
The evaluation set is used by the judge to score the quality of the generated prompt.

In [16]:
# Get unique episode_ids
unique_episode_ids = df["episode_id"].unique()

# Shuffle the unique episode_ids
np.random.seed(42)
np.random.shuffle(unique_episode_ids)

# Calculate the split index for episode_ids
split_index = int(len(unique_episode_ids) * (1 - EVAL_FRACTION))

# Split the episode_ids into training and evaluation sets
train_episode_ids = unique_episode_ids[:split_index]
val_episode_ids = unique_episode_ids[split_index:]

# Create training and evaluation DataFrames based on episode_ids
train_df = df[df["episode_id"].isin(train_episode_ids)]
eval_df = df[df["episode_id"].isin(val_episode_ids)]

print(f"Training set size: {len(train_df)}")
print(f"Evaluation set size: {len(eval_df)}")
print(f"Actual evaluation fraction: {len(eval_df) / len(df):.2f}")

Training set size: 400
Evaluation set size: 100
Actual evaluation fraction: 0.20
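
Splitting on `episode_id` rather than on individual rows keeps every inference from one episode on the same side of the split, so the evaluation set does not share episodes with the demonstrations. A minimal standard-library sketch of that idea, where `split_by_group` and the toy IDs are hypothetical:

```python
import random


def split_by_group(group_ids, eval_fraction, seed=0):
    """Split unique group IDs so that no group appears in both sets."""
    unique_ids = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    split_index = int(len(unique_ids) * (1 - eval_fraction))
    return set(unique_ids[:split_index]), set(unique_ids[split_index:])


# Hypothetical episode IDs; some episodes contain more than one inference
episode_ids = ["e1", "e1", "e2", "e3", "e3", "e4", "e5"]
train_ids, eval_ids = split_by_group(episode_ids, eval_fraction=0.2)
train_rows = [e for e in episode_ids if e in train_ids]
eval_rows = [e for e in episode_ids if e in eval_ids]
print(len(train_rows), len(eval_rows))
```

Because the split is made over episodes, the fraction of rows in the evaluation set can differ from `eval_fraction` when episodes have different sizes, which is why the notebook prints the actual fraction.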


## Generate Candidate Instructions

Given the function's system template as an example, generate a set of candidate instructions to optimize the prompt over.

In [17]:
example_instructions = base_variant.system_template

if base_function.system_schema is not None:
    example_schema = base_function.system_schema.model_json_schema()
else:
    example_schema = None

responses = await tqdm_asyncio.gather(
    *[
        get_instructions(
            client=mipro_client,
            example_instructions=example_instructions,
            example_schema=example_schema,
            semaphore=semaphore,
        )
        for _ in range(NUM_CANDIDATE_INSTRUCTIONS)
    ]
)

candidate_instructions = [example_instructions]
for response in responses:
    if response is None:
        continue
    candidate_instructions.append(response.output.parsed["instructions"])

100%|██████████| 10/10 [01:10<00:00,  7.02s/it]


## Generate Candidate Demonstrations

Given the training set, generate a set of candidate demonstrations to optimize the prompt over.

In [18]:
def generate_demonstrations(
    df: pd.DataFrame,
    max_examples_per_demonstration: int,
    input_col: str,
    output_col: str,
    system_col: str,
    seed: int = 42,
) -> str:
    sample = df.sample(
        n=max_examples_per_demonstration, replace=False, random_state=seed
    )
    demonstrations = ""
    demonstration_number = 1
    for _, row in sample.iterrows():  # type: ignore
        demonstrations += f"DEMONSTRATION {demonstration_number}:\n"
        demonstration_number += 1
        if row[system_col] is not None and row[system_col] != "":
            demonstrations += f"SYSTEM:\n{row[system_col]}\n"
        demonstrations += f"{row[input_col]}AGENT:\n{row[output_col]}\n\n"
    return demonstrations

In [19]:
candidate_demonstrations = [
    generate_demonstrations(
        df=train_df,
        max_examples_per_demonstration=MAX_EXAMPLES_PER_DEMONSTRATION,
        input_col="input_str",
        output_col="value_str",
        system_col="system_args",
        seed=seed,
    )
    for seed in range(NUM_CANDIDATE_DEMONSTRATIONS)
]
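
To make the demonstration format concrete, here is a pandas-free sketch that builds the same kind of string from a list of `(input, output)` pairs. The helper name and the toy examples are hypothetical. Unlike `df.sample` with `replace=False`, which raises an error when asked for more rows than exist, this sketch caps the sample size at the number of available examples.

```python
import random


def format_demonstrations(examples, max_examples, seed=0):
    """Sample up to max_examples pairs and render them as numbered demonstrations."""
    rng = random.Random(seed)
    sample = rng.sample(examples, k=min(max_examples, len(examples)))
    blocks = []
    for number, (input_str, output_str) in enumerate(sample, start=1):
        blocks.append(
            f"DEMONSTRATION {number}:\n{input_str}AGENT:\n{output_str}\n\n"
        )
    return "".join(blocks)


# Hypothetical rendered inputs and ground truth outputs
examples = [
    ("ENVIRONMENT:\nAlice visited Paris .\n", '{"person":["Alice"],"location":["Paris"]}'),
    ("ENVIRONMENT:\nNo entities here .\n", '{"person":[],"location":[]}'),
    ("ENVIRONMENT:\nBob joined Acme .\n", '{"person":["Bob"],"organization":["Acme"]}'),
]
demo_text = format_demonstrations(examples, max_examples=2)
print(demo_text)
```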

## Optimize the Prompt

### Define the optimization objective

In [20]:
# Size of the search space along each dimension
num_instructions = len(candidate_instructions)
num_demonstrations = len(candidate_demonstrations)

In [21]:
def format_system_template(instructions: str, demonstrations: str) -> str:
    return f"# Instructions:\n\n{instructions}\n\n# Demonstrations:\n\n{demonstrations}"


def format_response(response: Optional[InferenceResponse]) -> str:
    if response is None:
        return ""
    if isinstance(response, JsonInferenceResponse):
        return str(response.output.parsed)
    else:
        content = response.content
        assert len(content) == 1  # TODO: Handle multiple content blocks
        if isinstance(content[0], Text):
            return content[0].text
        elif isinstance(content[0], RawText):
            return content[0].value
        else:
            raise ValueError(f"Unsupported content type: {type(content[0])}")


async def objective(trial: optuna.Trial):
    # Sample an instruction and a demonstration set
    instruction_index = trial.suggest_categorical(
        "instruction_index", range(num_instructions)
    )
    demonstration_index = trial.suggest_categorical(
        "demonstration_index", range(num_demonstrations)
    )
    # Format the candidate prompt
    candidate_prompt = format_system_template(
        candidate_instructions[instruction_index],
        candidate_demonstrations[demonstration_index],
    )
    # Create a new variant with the candidate prompt
    candidate_variant_name = f"{instruction_index}_{demonstration_index}"
    candidate_config = deepcopy(base_config)
    candidate_config.functions[FUNCTION_NAME].variants[candidate_variant_name] = (
        deepcopy(base_variant)
    )
    candidate_config.functions[FUNCTION_NAME].variants[
        candidate_variant_name
    ].system_template = candidate_prompt
    candidate_config.functions[FUNCTION_NAME].variants[
        candidate_variant_name
    ].name = candidate_variant_name
    # Write the new config to a temporary directory
    tmp_config_dir = candidate_config.write()
    # Build a new client with the new config
    target_client = await AsyncTensorZeroGateway.build_embedded(
        config_file=str(tmp_config_dir / "tensorzero.toml"),
        clickhouse_url=os.environ["TENSORZERO_CLICKHOUSE_URL"],
    )
    # Asynchronously generate answers for each query in the evaluation set
    responses = await tqdm_asyncio.gather(
        *[
            candidate_inference(
                client=target_client,
                function_name=FUNCTION_NAME,
                input=json.loads(input_args),
                variant_name=candidate_variant_name,
                semaphore=semaphore,
            )
            for input_args in eval_df["input"]
        ]
    )

    # Score the responses using the judge
    judge_responses = await tqdm_asyncio.gather(
        *[
            judge_answer(
                client=mipro_client,
                task_description=TASK_DESCRIPTION,
                metric_properties=METRIC_PROPERTIES,
                prediction=format_response(response) if response is not None else "",
                ground_truth=str(ground_truth),
                semaphore=semaphore,
            )
            for response, ground_truth in zip(responses, eval_df["value_str"])
        ]
    )

    # Aggregate the scores
    scores = []
    for response in judge_responses:
        if response is not None:
            if response.output.parsed is not None:
                scores.append(response.output.parsed["score"])

    # Return the mean score
    return np.mean(scores)
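
If every judge call fails, `scores` is empty and `np.mean` returns `nan` with a warning. One way to make that case explicit is sketched below using only the standard library. The helper `aggregate_scores` is hypothetical, and returning `None` for "no valid scores" is an assumption about how the caller wants to handle it.

```python
from statistics import fmean


def aggregate_scores(scores):
    """Average the valid judge scores; return None when there are none."""
    valid = [score for score in scores if score is not None]
    if not valid:
        return None
    return fmean(valid)


print(aggregate_scores([0.5, None, 1.0]))  # 0.75
print(aggregate_scores([None, None]))  # None
```

A caller could then skip or fail the trial when the result is `None` instead of passing `nan` to the optimizer.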

### Random Search

We start by sampling a random instruction and demonstration at each iteration in the optimization loop.

In [22]:
study_random = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_random.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_random.tell(trial, value)
    study_random._log_completed_trial(frozen_trial)

[I 2025-04-07 14:43:51,425] A new study created in memory with name: no-name-c359a151-09ef-4de8-a03e-51a1496b13de
100%|██████████| 100/100 [00:02<00:00, 39.97it/s]
100%|██████████| 100/100 [00:16<00:00,  6.24it/s]
[I 2025-04-07 14:44:10,252] Trial 0 finished with value: 0.5984036796536796 and parameters: {'instruction_index': 8, 'demonstration_index': 9}. Best is trial 0 with value: 0.5984036796536796.

Iteration 1: 0.5984036796536796

100%|██████████| 100/100 [00:07<00:00, 12.56it/s]
100%|██████████| 100/100 [00:15<00:00,  6.60it/s]
[I 2025-04-07 14:44:33,683] Trial 1 finished with value: 0.6602142857142858 and parameters: {'instruction_index': 6, 'demonstration_index': 6}. Best is trial 1 with value: 0.6602142857142858.

Iteration 2: 0.6602142857142858

100%|██████████| 100/100 [00:07<00:00, 13.00it/s]

2025-04-07T18:44:41.687149Z ERROR tensorzero_internal::error: JSON Schema validation failed for Function:

Additional properties are not allowed ('rbi' was unexpected)
Data: {"person":["Bonilla"],"organization":[],"location":[],"miscellaneous":[],"rbi":21,"runs":15,"games":20}
Schema: {"$schema":"http://json-schema.org/draft-07/schema#","type":"object","properties":{"person":{"type":"array","items":{"type":"string"}},"organization":{"type":"array","items":{"type":"string"}},"location":{"type":"array","items":{"type":"string"}},"miscellaneous":{"type":"array","items":{"type":"string"}}},"required":["person","organization","location","miscellaneous"],"additionalProperties":false}

100%|██████████| 100/100 [00:14<00:00,  7.12it/s]
[I 2025-04-07 14:44:55,747] Trial 2 finished with value: 0.6045831168831168 and parameters: {'instruction_index': 10, 'demonstration_index': 9}. Best is trial 1 with value: 0.6602142857142858.

Iteration 3: 0.6045831168831168

100%|██████████| 100/100 [00:02<00:00, 38.83it/s]
100%|██████████| 100/100 [00:24<00:00,  4.11it/s]
[I 2025-04-07 14:45:22,962] Trial 3 finished with value: 0.6324126984126984 and parameters: {'instruction_index': 9, 'demonstration_index': 0}. Best is trial 1 with value: 0.6602142857142858.

Iteration 4: 0.6324126984126984

100%|██████████| 100/100 [00:07<00:00, 13.04it/s]
100%|██████████| 100/100 [00:13<00:00,  7.35it/s]
[I 2025-04-07 14:45:44,608] Trial 4 finished with value: 0.5763681318681318 and parameters: {'instruction_index': 5, 'demonstration_index': 8}. Best is trial 1 with value: 0.6602142857142858.

Iteration 5: 0.5763681318681318


### Tree-structured Parzen Estimator
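
Following the MIPRO paper, we use a Tree-structured Parzen Estimator (TPE) to sample the next instruction and demonstration pair to evaluate.
Note that Optuna's `TPESampler` samples randomly during its first `n_startup_trials` trials (10 by default), so with `MAX_ITERATIONS = 5` this run still behaves like random search. Increase `MAX_ITERATIONS` to benefit from TPE.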

In [23]:
study_tpe = optuna.create_study(
    sampler=TPESampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_tpe.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_tpe.tell(trial, value)
    study_tpe._log_completed_trial(frozen_trial)

[I 2025-04-07 14:45:44,616] A new study created in memory with name: no-name-8f7b6c2a-710b-44b5-b374-c1afe385969e

2025-04-07T18:45:45.732495Z ERROR tensorzero_internal::error: JSON Schema validation failed for Function:

Additional properties are not allowed ('RBI' was unexpected)
Data: {"person":["Bonilla"],"organization":[],"location":[],"miscellaneous":[],"RBI":21,"runs":15,"games":20}
Schema: {"$schema":"http://json-schema.org/draft-07/schema#","type":"object","properties":{"person":{"type":"array","items":{"type":"string"}},"organization":{"type":"array","items":{"type":"string"}},"location":{"type":"array","items":{"type":"string"}},"miscellaneous":{"type":"array","items":{"type":"string"}}},"required":["person","organization","location","miscellaneous"],"additionalProperties":false}

100%|██████████| 100/100 [00:02<00:00, 43.47it/s]
100%|██████████| 100/100 [00:13<00:00,  7.23it/s]
[I 2025-04-07 14:46:01,058] Trial 0 finished with value: 0.6106904761904761 and parameters: {'instruction_index': 8, 'demonstration_index': 9}. Best is trial 0 with value: 0.6106904761904761.

Iteration 1: 0.6106904761904761

100%|██████████| 100/100 [00:02<00:00, 34.33it/s]
100%|██████████| 100/100 [00:12<00:00,  7.70it/s]
[I 2025-04-07 14:46:17,291] Trial 1 finished with value: 0.6321634920634921 and parameters: {'instruction_index': 6, 'demonstration_index': 6}. Best is trial 1 with value: 0.6321634920634921.

Iteration 2: 0.6321634920634921

100%|██████████| 100/100 [00:03<00:00, 26.47it/s]
100%|██████████| 100/100 [00:13<00:00,  7.16it/s]
[I 2025-04-07 14:46:35,357] Trial 2 finished with value: 0.5609722222222223 and parameters: {'instruction_index': 10, 'demonstration_index': 9}. Best is trial 1 with value: 0.6321634920634921.

Iteration 3: 0.5609722222222223

100%|██████████| 100/100 [00:04<00:00, 21.85it/s]
100%|██████████| 100/100 [00:14<00:00,  7.00it/s]
[I 2025-04-07 14:46:54,549] Trial 3 finished with value: 0.6758681318681319 and parameters: {'instruction_index': 9, 'demonstration_index': 0}. Best is trial 3 with value: 0.6758681318681319.

Iteration 4: 0.6758681318681319

100%|██████████| 100/100 [00:07<00:00, 12.68it/s]
100%|██████████| 100/100 [00:17<00:00,  5.87it/s]
[I 2025-04-07 14:47:19,775] Trial 4 finished with value: 0.5750952380952381 and parameters: {'instruction_index': 5, 'demonstration_index': 8}. Best is trial 3 with value: 0.6758681318681319.

Iteration 5: 0.5750952380952381


## Save the Optimized Candidate

The search yields an estimate of the best instruction and demonstration pair.
We can now format that pair into an optimized system template.

In [24]:
optimized_system_template = format_system_template(
    instructions=candidate_instructions[study_tpe.best_params["instruction_index"]],
    demonstrations=candidate_demonstrations[
        study_tpe.best_params["demonstration_index"]
    ],
)
print(optimized_system_template)

# Instructions:

You are an assistant tasked with performing named entity recognition (NER). Identify and extract any mentions of people, organizations, locations, or other miscellaneous entities from the provided text. Then, organize your findings in JSON format using the structure below:

{
    "person": ["Person1", "Person2", ...],
    "organization": ["Organization1", "Organization2", ...],
    "location": ["Location1", "Location2", ...],
    "miscellaneous": ["MiscEntity1", "MiscEntity2", ...]
}

If no entities of a given category are found, list an empty array for that category.

# Demonstrations:

DEMONSTRATION 1:
ENVIRONMENT:
We 're seeing some new crop coming in now but it 's slow going , " the dealer said .
AGENT:
{"person":[],"organization":[],"location":[],"miscellaneous":[]}

DEMONSTRATION 2:
ENVIRONMENT:
" We 're always concerned about aflatoxin but we 're on top of it , " Glickman told reporters after addressing a USDA-sponsored farmers ' market .
AGENT:
{"person":["Glic

You can save the optimized configuration file tree.

In [25]:
OUTPUT_DIR = None  # Set to a local path to save the optimized config

optimized_variant_name = "mipro_optimized"
optimized_config = deepcopy(base_config)
optimized_config.functions[FUNCTION_NAME].variants[optimized_variant_name] = deepcopy(
    base_variant
)
optimized_config.functions[FUNCTION_NAME].variants[
    optimized_variant_name
].system_template = optimized_system_template
optimized_config.functions[FUNCTION_NAME].variants[
    optimized_variant_name
].name = optimized_variant_name
# Write the new config to OUTPUT_DIR (a temporary directory is assumed when OUTPUT_DIR is None)
optimized_config_dir = optimized_config.write(base_dir=OUTPUT_DIR)

## Conclusion

By following this notebook, you can systematically refine prompts for better performance.
The optimized prompt can be saved and used in production by updating the function's system template configuration.
