
---
# Automated Prompt Engineering using MIPRO

This notebook provides an automated approach to optimizing prompt engineering using the [Multi-prompt Instruction PRoposal Optimizer (MIPRO)](https://arxiv.org/abs/2406.11695v1).
It is designed for TensorZero users who want to optimize their system prompts based on collected inference and feedback data. Currently, only demonstration feedback is supported, but additional feedback types will be incorporated in future updates.

By following this guide, you can systematically refine your prompts to improve model performance on specific tasks.

---

## Overview

The optimization process involves the following steps:

1. **Generate Candidate Instructions and Demonstrations**
    - Candidate instructions are generated using OpenAI's o1 model based on a system template and an optional schema.
        - This is configurable in the `config/tensorzero.toml` file if you want to use a different model.
    - Candidate demonstrations are sets of few-shot examples sampled from the training dataset.
2. **Evaluate Instruction-Demonstration Pairs**
    - Sample an instruction and demonstration pair and score it using a Large Language Model (LLM) judge.
    - The judge (a TensorZero function utilizing OpenAI's GPT-4o-mini model) scores the quality of the instruction-demonstration pair.
    - Scores are aggregated over the evaluation set to produce a final evaluation score.
3. **Optimization via Search Algorithms**
    - Use either random search or a Tree-structured Parzen Estimator (TPE) to determine the next instruction and demonstration pair for evaluation.
4. **Iterate the Optimization Process**
    - Repeat the optimization process for a fixed number of iterations.
5. **Select the Best Performing Pair**
    - The highest-scoring instruction and demonstration pair is formatted to produce an optimized system template.
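
The steps above can be sketched as a small search loop. This is a toy illustration only: `toy_score` is a hypothetical stand-in for the LLM judge, the candidate strings are made up, and the pair is picked uniformly at random rather than with TPE.

```python
import random


def toy_score(instruction: str, demonstrations: str) -> float:
    # Hypothetical stand-in for the LLM judge: longer prompts score higher.
    return float(len(instruction) + len(demonstrations))


def random_search(instructions, demonstration_sets, max_iterations, seed=0):
    rng = random.Random(seed)
    best_pair, best_score = None, float("-inf")
    for _ in range(max_iterations):
        # Steps 3-4: choose the next pair to evaluate (uniformly at random here).
        i = rng.randrange(len(instructions))
        d = rng.randrange(len(demonstration_sets))
        # Step 2: score the chosen pair.
        score = toy_score(instructions[i], demonstration_sets[d])
        if score > best_score:
            best_pair, best_score = (i, d), score
    return best_pair, best_score


# Step 1: made-up candidates for illustration.
instructions = ["Extract entities.", "List every named entity in the text."]
demonstration_sets = [
    "Input: a\nOutput: b\n\n",
    "Input: a\nOutput: b\n\nInput: c\nOutput: d\n\n",
]

best_pair, best_score = random_search(instructions, demonstration_sets, max_iterations=20)

# Step 5: format the best pair into a system template.
i, d = best_pair
optimized = f"{instructions[i]}\n\nDemonstrations:\n\n{demonstration_sets[d]}"
```

The notebook follows the same shape, with Optuna choosing the pairs and TensorZero functions doing the generation and judging.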

---

## Prerequisites

### 1. Environment Setup

Before running the optimization, ensure that:

- The `OPENAI_API_KEY` environment variable is set.

- Required dependencies for TensorZero and MIPRO are installed.

- The ClickHouse database containing the demonstration feedback is running and reachable through `TENSORZERO_CLICKHOUSE_URL`.
    - For example, [by running the Docker container for the function whose prompt you want to optimize](https://www.tensorzero.com/docs/gateway).

### 2. Configuration Parameters

To tailor the optimization to your function, update the following parameters accordingly.

---

## Step 1: Define Function Configuration Parameters

Specify the TensorZero function you want to optimize. The example below optimizes the system prompt for Named Entity Recognition (NER):

- **Function Configuration Directory:** Location of the function’s configuration files.

- **Function Name:** The TensorZero function being optimized.

- **Model Variant:** The specific function variant to use as an example for the system template.

In [None]:
# Configuration directory for the function whose prompt you want to optimize
CONFIG_DIR = "../../examples/data-extraction-ner/config"

# The name of the function you want to optimize the prompt for
FUNCTION_NAME = "extract_entities"

# The name of the variant to use
TEMPLATE_VARIANT_NAME = "gpt_4o_mini"


---
## Step 2: Configure the LLM Judge for Metric Optimization

The LLM judge guides the optimization process by evaluating prompt effectiveness. You must define:

- **Task Description:** A summary of the task being optimized.

- **Optimization Metric:** The metric used for evaluating prompt effectiveness (e.g., Jaccard similarity between predicted and ground truth entities).

In [None]:
# Description of the task whose prompt you are optimizing, passed to the optimizer's LLM judge
TASK_DESCRIPTION = "The task is to extract named entities from the input text."

# Metric definition for scoring generated prompts
METRIC_PROPERTIES = "The metric is the Jaccard similarity between the predicted and ground truth entities."
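
For reference, the Jaccard similarity named in `METRIC_PROPERTIES` can be computed directly as below. This is a minimal sketch, not part of the notebook: the judge in this workflow receives the metric only as a text description, and the function name, the entity lists, and the empty-set convention are assumptions made for illustration.

```python
def jaccard_similarity(predicted, ground_truth) -> float:
    """Size of the intersection divided by the size of the union."""
    pred_set, truth_set = set(predicted), set(ground_truth)
    union = pred_set | truth_set
    if not union:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(pred_set & truth_set) / len(union)


# Two of the four distinct entities appear in both lists.
score = jaccard_similarity(["Paris", "Alice", "ACME"], ["Paris", "Alice", "Bob"])
```

Because higher values mean closer agreement, a metric like this pairs with an optimization direction of `"maximize"`.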


---
## Step 3: Define Optimization Parameters

The following parameters control the optimization process. Experimenting with different values can help refine results:

- **Search Space**

    - `NUM_CANDIDATE_INSTRUCTIONS`: Number of candidate instructions to generate.

    - `NUM_CANDIDATE_DEMONSTRATIONS`: Number of candidate demonstrations to sample.

- **Optimization Control**

    - `MAX_ITERATIONS`: Number of optimization steps.

    - `MAX_DEMONSTRATIONS`: Maximum few-shot examples per demonstration.

In [None]:
# Number of candidate instructions to generate and search over
NUM_CANDIDATE_INSTRUCTIONS = 10

# Number of candidate demonstrations to sample and search over
NUM_CANDIDATE_DEMONSTRATIONS = 10

# Maximum number of few-shot examples in each candidate demonstration set
MAX_DEMONSTRATIONS = 10

# Maximum number of search steps taken by the optimization algorithm for evaluating instruction-demonstration pairs
MAX_ITERATIONS = 5

# Set optimization direction ('maximize' or 'minimize') based on the metric properties you described above.
OPTIMIZER_DIRECTION = "maximize"

# Fraction of the dataset used by the judge to score the quality of the generated prompt
EVAL_FRACTION = 0.2

# Limit on the number of samples for demonstration selection
MAX_SAMPLES = 100_000

# Random seed for reproducibility
SEED = 0

- **Evaluation Control**

    - `EVAL_FRACTION`: Fraction of the dataset used for scoring generated prompts.

    - `MAX_SAMPLES`: Limit on the number of demonstration samples.

- **Reproducibility**

    - `SEED`: Random seed for consistent results.
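
To get a feel for how these values interact, the arithmetic below estimates the size of the search space and the number of LLM calls. It is a rough sketch: `EVAL_SET_SIZE` is a hypothetical figure, and the instruction count is an upper bound because the notebook keeps the original system template as one extra candidate and skips failed generations.

```python
NUM_CANDIDATE_INSTRUCTIONS = 10
NUM_CANDIDATE_DEMONSTRATIONS = 10
MAX_ITERATIONS = 5
EVAL_SET_SIZE = 200  # hypothetical; depends on your data and EVAL_FRACTION

# The original system template is kept as an additional candidate instruction.
num_instructions = NUM_CANDIDATE_INSTRUCTIONS + 1
num_pairs = num_instructions * NUM_CANDIDATE_DEMONSTRATIONS

# Fraction of instruction-demonstration pairs that the search can evaluate.
coverage = MAX_ITERATIONS / num_pairs

# Each iteration runs one candidate inference and one judge call per evaluation sample.
total_calls = 2 * EVAL_SET_SIZE * MAX_ITERATIONS

print(f"{num_pairs} pairs, {coverage:.1%} explored, {total_calls} LLM calls")
```

With the defaults, five iterations cover only a small fraction of the possible pairs, so raising `MAX_ITERATIONS` is usually the first knob to try if the call budget allows.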

---

## Import Dependencies

In [None]:
import asyncio
import json
import os
from copy import deepcopy
from typing import Any, Dict, List, Optional

import numpy as np
import optuna
import pandas as pd
from clickhouse_connect import get_client
from minijinja import Environment
from optuna.samplers import TPESampler
from tensorzero import (
    AsyncTensorZeroGateway,
    InferenceResponse,
    JsonInferenceResponse,
    RawText,
    Text,
)
from tqdm.asyncio import tqdm_asyncio
from utils.client_calls import candidate_inference, get_instructions, judge_answer
from utils.configs.reader import load_config


---

## Initialize the MIPRO TensorZero Client

This client is used to generate candidate instructions and score the quality of responses given the candidate instructions and demonstrations.

In [None]:
MAX_CONCURRENT_REQUESTS = 50

In [None]:
mipro_client = await AsyncTensorZeroGateway.build_embedded(
    config_file="config/tensorzero.toml",
    clickhouse_url=os.environ["TENSORZERO_CLICKHOUSE_URL"],
)
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
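
The semaphore caps how many requests are in flight at once. The sketch below shows the general pattern with a simulated request; it assumes the helper functions in `utils.client_calls` wrap their calls in `async with semaphore`, which is not shown in this notebook. `limited_call` and `run_all` are hypothetical names.

```python
import asyncio


async def limited_call(semaphore: asyncio.Semaphore, tracker: dict) -> None:
    # Only `max_concurrent` coroutines can be inside this block at the same time.
    async with semaphore:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # stand-in for a network request
        tracker["active"] -= 1


async def run_all(num_requests: int, max_concurrent: int) -> int:
    semaphore = asyncio.Semaphore(max_concurrent)
    tracker = {"active": 0, "peak": 0}
    await asyncio.gather(
        *[limited_call(semaphore, tracker) for _ in range(num_requests)]
    )
    return tracker["peak"]


# In a notebook, use `peak = await run_all(...)` instead of asyncio.run.
peak = asyncio.run(run_all(num_requests=20, max_concurrent=5))
```

Here `peak` never exceeds `max_concurrent`, even though all 20 coroutines are scheduled at once.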


---
## Load Data
Load the TensorZero configuration for the function you want to optimize the prompt for.

In [None]:
base_config = load_config(CONFIG_DIR)

Retrieve the configuration for the variant with the templates we'll use for prompt optimization.

In [None]:
assert FUNCTION_NAME in base_config.functions.keys(), (
    f"No function named `{FUNCTION_NAME}` found in config"
)
assert TEMPLATE_VARIANT_NAME in base_config.functions[FUNCTION_NAME].variants.keys(), (
    f"No variant named `{TEMPLATE_VARIANT_NAME}` found in function `{FUNCTION_NAME}`"
)

base_function = base_config.functions[FUNCTION_NAME]
base_variant = deepcopy(base_function.variants[TEMPLATE_VARIANT_NAME])

Retrieve the system, user, and assistant templates in the variant (if any), and initialize a minijinja environment with them.


In [None]:
templates = {}

if base_variant.assistant_template is not None:
    templates["assistant"] = base_variant.assistant_template

if base_variant.system_template is not None:
    templates["system"] = base_variant.system_template

if base_variant.user_template is not None:
    templates["user"] = base_variant.user_template

env = Environment(templates=templates)

Initialize the ClickHouse client.

In [None]:
assert "TENSORZERO_CLICKHOUSE_URL" in os.environ, (
    "TENSORZERO_CLICKHOUSE_URL environment variable not set"
)

clickhouse_client = get_client(dsn=os.environ["TENSORZERO_CLICKHOUSE_URL"])

Determine the inference table name based on the function type.

In [None]:
inference_table_name = {"chat": "ChatInference", "json": "JsonInference"}.get(
    base_function.type
)

if inference_table_name is None:
    raise ValueError(f"Unsupported function type: {base_function.type}")

Query the inferences and demonstration feedback from ClickHouse.

In [None]:
query = f"""
SELECT 
    i.input, 
    i.output, 
    f.value,
    i.episode_id
FROM 
    {inference_table_name} i
JOIN 
    (SELECT
        inference_id,
        value,
        ROW_NUMBER() OVER (PARTITION BY inference_id ORDER BY timestamp DESC) as rn
    FROM 
        DemonstrationFeedback
    ) f ON i.id = f.inference_id AND f.rn = 1
WHERE 
    i.function_name = %(function_name)s
LIMIT %(max_samples)s
"""

params = {
    "function_name": FUNCTION_NAME,
    "max_samples": MAX_SAMPLES,
}

df = clickhouse_client.query_df(query, params)

df.head()
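
The subquery keeps only the most recent demonstration for each inference (`rn = 1` after ordering by `timestamp DESC`). The same selection can be sketched in plain Python as follows; the rows are made up, and `latest_per_inference` is a hypothetical helper used only for illustration.

```python
def latest_per_inference(feedback_rows):
    """Keep the row with the largest timestamp for each inference_id."""
    latest = {}
    for row in feedback_rows:
        current = latest.get(row["inference_id"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["inference_id"]] = row
    return latest


# Made-up feedback rows: inference "a" received two demonstrations.
rows = [
    {"inference_id": "a", "timestamp": 1, "value": "old"},
    {"inference_id": "a", "timestamp": 2, "value": "new"},
    {"inference_id": "b", "timestamp": 1, "value": "only"},
]

latest = latest_per_inference(rows)
```

This way an inference whose demonstration was corrected later contributes only its newest value to the dataset.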

Render the messages in the input and demonstration columns.

In [None]:
def render_message(content: List[Dict[str, Any]], role: str) -> str:
    assert role in ["user", "assistant"], f"Invalid role: {role}"

    if len(content) != 1:
        raise ValueError(f"Message must have exactly one content block: {content}")

    if content[0]["type"] != "text":
        raise ValueError(f"Content block must be of type text: {content}")

    content = content[0]["value"]

    if isinstance(content, str):
        return content
    else:
        return env.render_template(role, **content)


def format_input(sample):
    function_input = json.loads(sample["input"])
    rendered_message = ""
    for message in function_input["messages"]:
        rendered_message += render_message(message["content"], message["role"])
    return rendered_message


def format_output(sample):
    output = json.loads(sample["value"])
    if base_function.type == "chat":
        if len(output) != 1:
            raise ValueError(f"Output {output} must have exactly one content block.")

        if output[0]["type"] != "text":
            raise ValueError(f"Output {output} must be a text block.")
        return output[0]["text"]
    elif base_function.type == "json":
        return output["raw"]
    else:
        raise ValueError(f"Unsupported function type: {base_function.type}")


def format_system_args(sample):
    function_input = json.loads(sample["input"])
    if "system" in function_input:
        return function_input["system"]
    else:
        return None


df["input_str"] = df.apply(format_input, axis=1)
df["value_str"] = df.apply(format_output, axis=1)
df["system_args"] = df.apply(format_system_args, axis=1)
df.head()

Split the data into training and evaluation sets.
The training set is used to generate candidate demonstrations.
The evaluation set is used by the judge to score the quality of the generated prompt.

In [None]:
# Get unique episode_ids
unique_episode_ids = df["episode_id"].unique()

# Shuffle the unique episode_ids
np.random.seed(SEED)  # Use the configured seed so the split is reproducible
np.random.shuffle(unique_episode_ids)

# Calculate the split index for episode_ids
split_index = int(len(unique_episode_ids) * (1 - EVAL_FRACTION))

# Split the episode_ids into training and validation sets
train_episode_ids = unique_episode_ids[:split_index]
val_episode_ids = unique_episode_ids[split_index:]

# Create training and validation DataFrames based on episode_ids
train_df = df[df["episode_id"].isin(train_episode_ids)]
eval_df = df[df["episode_id"].isin(val_episode_ids)]

print(f"Training set size: {len(train_df)}")
print(f"Evaluation set size: {len(eval_df)}")
print(f"Actual evaluation fraction: {len(eval_df) / len(df):.2f}")
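
Splitting on `episode_id` rather than on individual rows keeps every inference from one episode on the same side of the split, so the evaluation set does not share episodes with the demonstrations. A minimal standard-library sketch of the same idea, using made-up rows and a hypothetical `split_by_episode` helper:

```python
import random


def split_by_episode(rows, eval_fraction, seed=0):
    # Shuffle the distinct episode IDs, then assign whole episodes to each side.
    episode_ids = sorted({row["episode_id"] for row in rows})
    random.Random(seed).shuffle(episode_ids)
    split_index = int(len(episode_ids) * (1 - eval_fraction))
    train_ids = set(episode_ids[:split_index])
    train = [row for row in rows if row["episode_id"] in train_ids]
    evaluation = [row for row in rows if row["episode_id"] not in train_ids]
    return train, evaluation


# Made-up data: 10 episodes with 2 inferences each.
rows = [{"episode_id": f"ep{i // 2}", "input": f"text {i}"} for i in range(20)]
train_rows, eval_rows = split_by_episode(rows, eval_fraction=0.2)

train_episodes = {row["episode_id"] for row in train_rows}
eval_episodes = {row["episode_id"] for row in eval_rows}
```

Because episodes can contain different numbers of inferences, the actual row-level evaluation fraction may differ from `EVAL_FRACTION`, which is why the notebook prints it.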


---
## Generate Candidate Instructions
Given the function's system template as an example, generate a set of candidate instructions to optimize the prompt over.

In [None]:
example_instructions = base_variant.system_template

if base_function.system_schema is not None:
    example_schema = base_function.system_schema.model_json_schema()
else:
    example_schema = None

responses = await tqdm_asyncio.gather(
    *[
        get_instructions(
            client=mipro_client,
            example_instructions=example_instructions,
            example_schema=example_schema,
            semaphore=semaphore,
        )
        for _ in range(NUM_CANDIDATE_INSTRUCTIONS)
    ]
)

candidate_instructions = [example_instructions]
for response in responses:
    if response is None:
        continue
    candidate_instructions.append(response.output.parsed["instructions"])


---
## Generate Candidate Demonstrations
Given the training set, generate a set of candidate demonstrations to optimize the prompt over.

In [None]:
def generate_demonstrations(
    df: pd.DataFrame, input_col: str, output_col: str, system_col: str, seed: int = 42
) -> str:
    # Sample up to MAX_DEMONSTRATIONS distinct rows (without replacement) from the DataFrame.
    n_samples = min(MAX_DEMONSTRATIONS, len(df))
    sample = df.sample(n=n_samples, replace=False, random_state=seed)
    demonstrations = ""
    for _, row in sample.iterrows():
        if row[system_col] is not None:
            demonstrations += f"System Info: {row[system_col]}\n"
        demonstrations += f"Input: {row[input_col]}\nOutput: {row[output_col]}\n\n"

    return demonstrations

In [None]:
candidate_demonstrations = [
    generate_demonstrations(
        df=train_df,
        input_col="input_str",
        output_col="value_str",
        system_col="system_args",
        seed=seed,
    )
    for seed in range(NUM_CANDIDATE_DEMONSTRATIONS)
]
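
Each candidate is a single string of few-shot examples, and different seeds give different subsets. The sketch below mirrors that format without pandas; the examples and the `build_demonstrations` name are invented for illustration, and the optional "System Info" line is omitted.

```python
import random


def build_demonstrations(examples, max_demonstrations, seed):
    # Sample distinct examples; never request more than are available.
    rng = random.Random(seed)
    k = min(max_demonstrations, len(examples))
    sample = rng.sample(examples, k)
    return "".join(f"Input: {inp}\nOutput: {out}\n\n" for inp, out in sample)


# Made-up (input, output) pairs standing in for rows of the training set.
examples = [
    ("Alice met Bob.", '{"person": ["Alice", "Bob"]}'),
    ("ACME opened in Paris.", '{"organization": ["ACME"], "location": ["Paris"]}'),
    ("Nothing to see here.", "{}"),
]

candidates = [build_demonstrations(examples, 2, seed) for seed in range(3)]
```

Each entry in `candidates` contains two `Input:`/`Output:` blocks, and calling the function again with the same seed reproduces the same string.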


---
## Optimize the Prompt

### Define the optimization objective

In [None]:
# Size of the search space along each dimension
num_instructions = len(candidate_instructions)
num_demonstrations = len(candidate_demonstrations)

In [None]:
def format_system_template(instructions: str, demonstrations: str) -> str:
    return f"{instructions}\n\nDemonstrations:\n\n{demonstrations}"


def format_response(response: Optional[InferenceResponse]) -> str:
    if response is None:
        return ""
    if isinstance(response, JsonInferenceResponse):
        return str(response.output.parsed)
    else:
        content = response.content
        assert len(content) == 1  # TODO: Handle multiple content blocks
        if isinstance(content[0], Text):
            return content[0].text
        elif isinstance(content[0], RawText):
            return content[0].value
        else:
            raise ValueError(f"Unsupported content type: {type(content[0])}")


async def objective(trial: optuna.Trial):
    # Sample an instruction and a demonstration set.
    instruction_index = trial.suggest_categorical(
        "instruction_index", range(num_instructions)
    )
    demonstration_index = trial.suggest_categorical(
        "demonstration_index", range(num_demonstrations)
    )
    # format the candidate prompt
    candidate_prompt = format_system_template(
        candidate_instructions[instruction_index],
        candidate_demonstrations[demonstration_index],
    )
    # create a new variant with the candidate prompt
    candidate_variant_name = f"{instruction_index}_{demonstration_index}"
    candidate_config = deepcopy(base_config)
    candidate_config.functions[FUNCTION_NAME].variants[candidate_variant_name] = (
        deepcopy(base_variant)
    )
    candidate_config.functions[FUNCTION_NAME].variants[
        candidate_variant_name
    ].system_template = candidate_prompt
    candidate_config.functions[FUNCTION_NAME].variants[
        candidate_variant_name
    ].name = candidate_variant_name
    # write the new config to a temporary directory
    tmp_config_dir = candidate_config.write()
    # build a new client with the new config
    target_client = await AsyncTensorZeroGateway.build_embedded(
        config_file=str(tmp_config_dir / "tensorzero.toml"),
        clickhouse_url=os.environ["TENSORZERO_CLICKHOUSE_URL"],
    )
    # Asynchronously generate answers for each query in the evaluation set.
    responses = await tqdm_asyncio.gather(
        *[
            candidate_inference(
                client=target_client,
                function_name=FUNCTION_NAME,
                input=json.loads(input_args),
                variant_name=candidate_variant_name,
                semaphore=semaphore,
            )
            for input_args in eval_df["input"]
        ]
    )

    # Score the responses using the judge.
    judge_responses = await tqdm_asyncio.gather(
        *[
            judge_answer(
                client=mipro_client,
                task_description=TASK_DESCRIPTION,
                metric_properties=METRIC_PROPERTIES,
                prediction=format_response(response) if response is not None else "",
                truth=str(ground_truth),
                semaphore=semaphore,
            )
            for response, ground_truth in zip(responses, eval_df["value_str"])
        ]
    )

    # Aggregate the scores.
    scores = []
    for response in judge_responses:
        if response is not None:
            if response.output.parsed is not None:
                scores.append(response.output.parsed["score"])
    # Return the mean score.
    return np.mean(scores)

### Random Search

We start by sampling a random instruction and demonstration at each iteration in the optimization loop.

In [None]:
study_random = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_random.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_random.tell(trial, value)
    study_random._log_completed_trial(frozen_trial)

### Tree-structured Parzen Estimator
Following the MIPRO paper, we use a Tree-structured Parzen Estimator (TPE) to sample the next instruction and demonstration pair to evaluate.

In [None]:
study_tpe = optuna.create_study(
    sampler=TPESampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_tpe.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_tpe.tell(trial, value)
    study_tpe._log_completed_trial(frozen_trial)
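
To build intuition for what TPE does with categorical choices, the sketch below implements a heavily simplified version of the idea: split past trials into a "good" and a "bad" group by score, estimate how often each choice appears in each group, and prefer the choice with the highest good-to-bad ratio. This is not Optuna's implementation. It picks the best ratio deterministically instead of sampling, and `suggest_index`, `gamma`, and `n_startup` are hypothetical names and defaults chosen for illustration.

```python
import random
from collections import Counter


def suggest_index(history, num_choices, param, rng, gamma=0.25, n_startup=4):
    """Pick the next index for one categorical parameter.

    history: list of (params_dict, score) pairs; higher scores are better.
    """
    # Fall back to random sampling until enough trials have been observed.
    if len(history) < n_startup:
        return rng.randrange(num_choices)

    # Split trials into the top `gamma` fraction ("good") and the rest ("bad").
    ranked = sorted(history, key=lambda trial: trial[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good = Counter(params[param] for params, _ in ranked[:n_good])
    bad = Counter(params[param] for params, _ in ranked[n_good:])

    def density(counts, choice):
        # Add-one smoothing so unseen choices keep a nonzero probability.
        return (counts[choice] + 1) / (sum(counts.values()) + num_choices)

    ratios = [density(good, c) / density(bad, c) for c in range(num_choices)]
    return max(range(num_choices), key=ratios.__getitem__)


# Made-up history: instruction 1 scored best so far.
history = [
    ({"instruction_index": 0}, 0.2),
    ({"instruction_index": 1}, 0.9),
    ({"instruction_index": 2}, 0.3),
    ({"instruction_index": 0}, 0.1),
]
next_index = suggest_index(history, 3, "instruction_index", random.Random(0))
```

In this toy history the sketch proposes instruction 1 again, since it is the only choice seen in the good group. A real sampler balances this kind of exploitation against exploration.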


---
## Save the optimized candidate

We now have an estimate of the best instruction and demonstration pair.
The cell below formats the best pair found by the TPE study into an optimized system template.

In [None]:
optimized_system_template = format_system_template(
    instructions=candidate_instructions[study_tpe.best_params["instruction_index"]],
    demonstrations=candidate_demonstrations[
        study_tpe.best_params["demonstration_index"]
    ],
)
print(optimized_system_template)

You can save the optimized configuration file tree.

In [None]:
OUTPUT_DIR = None  # Set to a local path to save the optimized config

optimized_variant_name = "mipro_optimized"
optimized_config = deepcopy(base_config)
optimized_config.functions[FUNCTION_NAME].variants[optimized_variant_name] = deepcopy(
    base_variant
)
optimized_config.functions[FUNCTION_NAME].variants[
    optimized_variant_name
].system_template = optimized_system_template
optimized_config.functions[FUNCTION_NAME].variants[
    optimized_variant_name
].name = optimized_variant_name
# Write the new config to OUTPUT_DIR (or to a temporary directory if it is None, as `write()` does above)
optimized_config_dir = optimized_config.write(base_dir=OUTPUT_DIR)


---
## Conclusion

By following this notebook, you can systematically refine prompts for better performance. The optimized prompt can be saved and used in production by updating the function's system template configuration.

Future updates will extend support to additional feedback types. In the meantime, we encourage you to explore different optimization strategies.

---