# Automated Prompt Engineering using MIPRO

This notebook provides an automated approach to optimizing prompt engineering using the [Multi-prompt Instruction PRoposal Optimizer (MIPRO)](https://arxiv.org/abs/2406.11695v1).
It is designed for TensorZero users who want to optimize their system prompts based on collected inference and feedback data. We currently support only demonstration feedback, but we will add support for other feedback types in the future.

By following this guide, you can systematically refine your prompts to improve model performance in specific tasks.

## Overview

1. Generate candidate instructions and candidate demonstrations. 
    - Candidate instructions are generated using OpenAI's o1 model based on a system template and a schema (if provided). 
    - Each candidate demonstration is a set of few-shot examples sampled from the training set.
2. For each optimization step, sample an instruction and demonstration pair and score it using an LLM judge.
    - The judge is a TensorZero function that uses OpenAI's gpt-4o-mini model to score the quality of the instruction and demonstration pair.
    - The judge provides a score on each prediction from your function on the evaluation set.
    - The scores are aggregated to produce a single score for the instruction and demonstration pair.
3. The optimizer uses random search or a tree-structured Parzen estimator (TPE) to sample the next instruction and demonstration pair to evaluate.
4. The process repeats for a fixed number of iterations.
5. The highest-scoring instruction and demonstration pair is formatted to produce an optimized system template.
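
The loop described above can be sketched in plain Python. This is a minimal illustration of the random-search variant only, not the notebook's implementation: `random_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the LLM judge.

```python
import random
from typing import Callable, List, Tuple


def random_search(
    instructions: List[str],
    demonstrations: List[str],
    score_pair: Callable[[str, str], float],
    max_iterations: int,
    seed: int = 0,
) -> Tuple[Tuple[int, int], float]:
    """Score randomly sampled (instruction, demonstration) pairs and return the best."""
    rng = random.Random(seed)
    best_pair, best_score = (0, 0), float("-inf")
    for _ in range(max_iterations):
        i = rng.randrange(len(instructions))
        d = rng.randrange(len(demonstrations))
        score = score_pair(instructions[i], demonstrations[d])
        if score > best_score:
            best_pair, best_score = (i, d), score
    return best_pair, best_score


# Hypothetical stand-in for the judge: longer prompts receive higher scores.
def toy_score(instruction: str, demonstration: str) -> float:
    return float(len(instruction) + len(demonstration))


INSTRUCTIONS = ["Extract entities.", "Extract all named entities as JSON."]
DEMONSTRATIONS = ["Input: a\nOutput: b\n\n", ""]

best_pair, best_score = random_search(
    INSTRUCTIONS, DEMONSTRATIONS, toy_score, max_iterations=20
)
print(best_pair, best_score)
```

A TPE sampler replaces the uniform draws with draws informed by the scores of earlier trials; the rest of the loop stays the same.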

## Prerequisites

### 1. Environment Setup

Before running the optimization, ensure that:

- The OPENAI_API_KEY environment variable is set.

- The necessary dependencies for TensorZero and MIPRO are installed.

- The [TensorZero Gateway](https://www.tensorzero.com/docs/gateway) Docker container is running for the function whose prompt you want to optimize.
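
A quick check of the environment before running the notebook might look like the sketch below. The helper `missing_env_vars` is a hypothetical name; `TENSORZERO_CLICKHOUSE_URL` is included because the ClickHouse client later in this notebook reads it.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "TENSORZERO_CLICKHOUSE_URL"]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```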

### 2. Update Configuration Parameters

To customize the optimization for your specific function, modify the following parameters.

## Step 1: Define Function Configuration Parameters

These parameters specify the TensorZero function you aim to optimize. 
In this example, we optimize the system prompt for the Named Entity Recognition (NER) task. Adjust these values based on your use case.

In [None]:
# Configuration arguments for the function whose prompt you want to optimize
CONFIG_DIR = "../../examples/data-extraction-ner/config"

FUNCTION_NAME = "extract_entities"

# The name of the variant to use
TEMPLATE_VARIANT_NAME = "gpt_4o_mini"

# The name of the variant for the search template
GENERATE_ANSWER_VARIANT_NAME = "gpt_4o_mini"

Candidate instructions and demonstrations are passed as arguments to the `generate_answer` system template.
If your function uses a different model than GPT-4o-mini, you can create a new variant with your desired model in `config/functions/tensorzero.toml` and update the `GENERATE_ANSWER_VARIANT_NAME` appropriately.
We currently have support for "chat" and "json" function types.
If you are optimizing a chat function using Anthropic's Claude 3 Haiku, you can add a new variant with the following configuration:

```toml
[functions.generate_answer_chat.variants.claude_3_haiku]
type = "chat_completion"
weight = 1.0
model = "anthropic::claude-3-haiku-20240307"
retries = { num_retries = 3, max_delay_s = 10 }
system_template = "functions/generate_answer_chat/search_template/system_template.minijinja"
user_template = "functions/generate_answer_chat/search_template/user_template.minijinja"
```

and change the `GENERATE_ANSWER_VARIANT_NAME` to `claude_3_haiku`.

## Step 2: Configure the LLM Judge for Metric Optimization

To guide the optimizer in evaluating prompt effectiveness, specify the task description and optimization metric.

In [None]:
# Description of the task whose prompt you are optimizing; it is passed to the judge
TASK_DESCRIPTION = "The task is to extract named entities from the input text."

# Metric definition for scoring generated prompts
METRIC_PROPERTIES = "The metric is the Jaccard similarity between the predicted and ground truth entities."

## Step 3: Define Optimization Parameters

These settings control how the optimizer selects and evaluates candidate prompts and demonstrations.

You may want to experiment with different values of the following key parameters to find the best configuration for your use case.
Increasing their values may improve the quality of the optimized prompt, but will also increase the cost of the optimization.
- The `NUM_CANDIDATE_INSTRUCTIONS` and `NUM_CANDIDATE_DEMONSTRATIONS` parameters control the size of the search space.
- The `MAX_ITERATIONS` parameter controls the number of search steps taken by the optimization algorithm for evaluating instruction-demonstration pairs.
- The `MAX_DEMONSTRATIONS` parameter controls the number of few-shot examples included in each candidate demonstration.
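
To get a feel for the trade-off, the sketch below estimates how much of the search space a run can cover. It assumes that every instruction-generation call succeeds and that the original system template is kept as one extra candidate instruction; `search_space_coverage` is a hypothetical helper, not part of the notebook.

```python
def search_space_coverage(
    num_candidate_instructions: int,
    num_candidate_demonstrations: int,
    max_iterations: int,
) -> float:
    """Upper bound on the fraction of instruction-demonstration pairs evaluated."""
    # One extra instruction accounts for the original system template.
    num_pairs = (num_candidate_instructions + 1) * num_candidate_demonstrations
    # The bound is reached only if no pair is sampled twice.
    return min(max_iterations, num_pairs) / num_pairs


coverage = search_space_coverage(10, 10, 5)
print(f"{coverage:.1%}")  # 5 of 110 pairs
```

With the defaults in this notebook, at most 5 of 110 pairs are evaluated, so raising `MAX_ITERATIONS` is usually the first parameter to consider when the budget allows.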

In [None]:
# Number of candidate instructions to generate and search over
NUM_CANDIDATE_INSTRUCTIONS = 10

# Number of candidate demonstrations to sample and search over
NUM_CANDIDATE_DEMONSTRATIONS = 10

# Maximum number of few-shot examples in each candidate demonstration
MAX_DEMONSTRATIONS = 10

# Maximum number of search steps taken by the optimization algorithm for evaluating instruction-demonstration pairs
MAX_ITERATIONS = 5

# Set optimization direction ('maximize' or 'minimize') based on the metric properties you described above.
OPTIMIZER_DIRECTION = "maximize"

# Fraction of the dataset used by the judge to score the quality of the generated prompt
EVAL_FRACTION = 0.2

# Limit on the number of samples for demonstration selection
MAX_SAMPLES = 100_000

# Random seed for reproducibility
SEED = 0

In [None]:
import asyncio
import json
import os
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import optuna
import pandas as pd
import toml
from clickhouse_connect import get_client
from minijinja import Environment
from optuna.samplers import TPESampler
from tensorzero import (
    AsyncTensorZeroGateway,
    ChatInferenceResponse,
    InferenceResponse,
    JsonInferenceResponse,
    RawText,
    Text,
)
from tqdm.asyncio import tqdm_asyncio
from utils import generate_answer, get_instructions, judge_answer

In [None]:
TENSORZERO_GATEWAY_URL = "http://localhost:3001"
MAX_CONCURRENT_REQUESTS = 50

In [None]:
tensorzero_client = AsyncTensorZeroGateway(TENSORZERO_GATEWAY_URL, timeout=30)
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

## Load Data
Load the TensorZero configuration file for the function you want to optimize the prompt for.

In [None]:
config_path = Path(CONFIG_DIR) / "tensorzero.toml"

assert config_path.exists(), f"{config_path} does not exist"
assert config_path.is_file(), f"{config_path} is not a file"

with config_path.open("r") as f:
    config = toml.load(f)

Retrieve the configuration for the variant with the templates we'll use for prompt optimization.

In [None]:
assert "functions" in config, "No `[functions]` section found in config"
assert FUNCTION_NAME in config["functions"], (
    f"No function named `{FUNCTION_NAME}` found in config"
)
assert "variants" in config["functions"][FUNCTION_NAME], (
    f"No variants section found for function `{FUNCTION_NAME}`"
)
assert TEMPLATE_VARIANT_NAME in config["functions"][FUNCTION_NAME]["variants"], (
    f"No variant named `{TEMPLATE_VARIANT_NAME}` found in function `{FUNCTION_NAME}`"
)

function_type = config["functions"][FUNCTION_NAME]["type"]
variant = config["functions"][FUNCTION_NAME]["variants"][TEMPLATE_VARIANT_NAME]

Retrieve the output schema if the function is a JSON function.

In [None]:
if function_type == "json":
    output_schema_path = (
        Path(CONFIG_DIR) / config["functions"][FUNCTION_NAME]["output_schema"]
    )
    # Use a context manager so the file handle is closed after reading.
    with output_schema_path.open("r") as f:
        user_output_schema = json.load(f)
    GENERATE_ANSWER_FUNCTION_NAME = "generate_answer_json"
else:
    GENERATE_ANSWER_FUNCTION_NAME = "generate_answer_chat"
    user_output_schema = None

Retrieve the system, user, and assistant templates in the variant (if any), and initialize a minijinja environment with them.


In [None]:
templates = {}

if "assistant_template" in variant:
    assistant_template_path = config_path.parent / variant["assistant_template"]
    with assistant_template_path.open("r") as f:
        templates["assistant"] = f.read()

if "system_template" in variant:
    system_template_path = config_path.parent / variant["system_template"]
    with system_template_path.open("r") as f:
        templates["system"] = f.read()

if "user_template" in variant:
    user_template_path = config_path.parent / variant["user_template"]
    with user_template_path.open("r") as f:
        templates["user"] = f.read()

env = Environment(templates=templates)

Initialize the ClickHouse client.

In [None]:
assert "TENSORZERO_CLICKHOUSE_URL" in os.environ, (
    "TENSORZERO_CLICKHOUSE_URL environment variable not set"
)

clickhouse_client = get_client(dsn=os.environ["TENSORZERO_CLICKHOUSE_URL"])

Determine the inference table name based on the function type.

In [None]:
inference_table_name = {"chat": "ChatInference", "json": "JsonInference"}.get(
    function_type
)

if inference_table_name is None:
    raise ValueError(f"Unsupported function type: {function_type}")

Query the inferences and demonstration feedback from ClickHouse.

In [None]:
query = f"""
SELECT 
    i.variant_name, 
    i.input, 
    i.output, 
    f.value,
    i.episode_id
FROM 
    {inference_table_name} i
JOIN 
    (SELECT
        inference_id,
        value,
        ROW_NUMBER() OVER (PARTITION BY inference_id ORDER BY timestamp DESC) as rn
    FROM 
        DemonstrationFeedback
    ) f ON i.id = f.inference_id AND f.rn = 1
WHERE 
    i.function_name = %(function_name)s
    AND i.variant_name = %(variant_name)s
LIMIT %(max_samples)s
"""

params = {
    "function_name": FUNCTION_NAME,
    "max_samples": MAX_SAMPLES,
    "variant_name": TEMPLATE_VARIANT_NAME,
}

df = clickhouse_client.query_df(query, params)

df.head()
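
The subquery in the query above uses `ROW_NUMBER()` to keep only the most recent demonstration for each inference. The same rule can be sketched in plain Python; the `Feedback` type, the `latest_feedback` helper, and the integer timestamps are hypothetical and only illustrate the deduplication logic.

```python
from typing import Dict, List, TypedDict


class Feedback(TypedDict):
    inference_id: str
    value: str
    timestamp: int


def latest_feedback(rows: List[Feedback]) -> Dict[str, Feedback]:
    """Keep only the most recent feedback row for each inference_id."""
    latest: Dict[str, Feedback] = {}
    for row in rows:
        current = latest.get(row["inference_id"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["inference_id"]] = row
    return latest


rows: List[Feedback] = [
    {"inference_id": "a", "value": "first draft", "timestamp": 1},
    {"inference_id": "a", "value": "corrected", "timestamp": 2},
    {"inference_id": "b", "value": "only one", "timestamp": 1},
]
deduplicated = latest_feedback(rows)
print({key: row["value"] for key, row in deduplicated.items()})
```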

Render the messages in the input and demonstration columns.

In [None]:
def render_message(content: List[Dict[str, Any]], role: str) -> str:
    assert role in ["user", "assistant"], f"Invalid role: {role}"

    if len(content) != 1:
        raise ValueError(f"Message must have exactly one content block: {content}")

    if content[0]["type"] != "text":
        raise ValueError(f"Content block must be of type text: {content}")

    content = content[0]["value"]

    if isinstance(content, str):
        return content
    else:
        return env.render_template(role, **content)


def format_input(sample):
    function_input = json.loads(sample["input"])
    rendered_message = ""
    for message in function_input["messages"]:
        rendered_message += render_message(message["content"], message["role"])
    return rendered_message


def format_output(sample):
    output = json.loads(sample["value"])
    if function_type == "chat":
        if len(output) != 1:
            raise ValueError(f"Output {output} must have exactly one content block.")

        if output[0]["type"] != "text":
            raise ValueError(f"Output {output} must be a text block.")
        return output[0]["text"]
    elif function_type == "json":
        return output["raw"]
    else:
        raise ValueError(f"Unsupported function type: {function_type}")


def format_system_args(sample):
    function_input = json.loads(sample["input"])
    if "system_args" in function_input:
        return function_input["system_args"]
    else:
        return None


df["input_str"] = df.apply(format_input, axis=1)
df["value_str"] = df.apply(format_output, axis=1)
df["system_args"] = df.apply(format_system_args, axis=1)
df.head()

Split the data into training and evaluation sets.
The training set is used to generate candidate demonstrations.
The evaluation set is used by the judge to score the quality of the generated prompt.

In [None]:
# Get unique episode_ids
unique_episode_ids = df["episode_id"].unique()

# Shuffle the unique episode_ids
np.random.seed(42)
np.random.shuffle(unique_episode_ids)

# Calculate the split index for episode_ids
split_index = int(len(unique_episode_ids) * (1 - EVAL_FRACTION))

# Split the episode_ids into training and evaluation sets
train_episode_ids = unique_episode_ids[:split_index]
eval_episode_ids = unique_episode_ids[split_index:]

# Create training and evaluation DataFrames based on episode_ids
train_df = df[df["episode_id"].isin(train_episode_ids)]
eval_df = df[df["episode_id"].isin(eval_episode_ids)]

print(f"Training set size: {len(train_df)}")
print(f"Evaluation set size: {len(eval_df)}")
print(f"Actual evaluation fraction: {len(eval_df) / len(df):.2f}")

## Generate Candidate Instructions
Given the function's system template as an example, generate a set of candidate instructions to optimize the prompt over.

In [None]:
with open(Path(CONFIG_DIR) / variant["system_template"], "r", encoding="utf-8") as file:
    example_instructions = file.read()

if "system_schema" in config["functions"][FUNCTION_NAME]:
    system_schema_path = (
        Path(CONFIG_DIR) / config["functions"][FUNCTION_NAME]["system_schema"]
    )
    with open(system_schema_path, "r", encoding="utf-8") as f:
        example_schema = f.read()
else:
    example_schema = None

responses = await tqdm_asyncio.gather(
    *[
        get_instructions(
            client=tensorzero_client,
            example_instructions=example_instructions,
            example_schema=example_schema,
            semaphore=semaphore,
        )
        for _ in range(NUM_CANDIDATE_INSTRUCTIONS)
    ]
)

candidate_instructions = [example_instructions]
for response in responses:
    if response is None:
        continue
    candidate_instructions.append(response.output.parsed["instructions"])

## Generate Candidate Demonstrations
Given the training set, generate a set of candidate demonstrations to optimize the prompt over.

In [None]:
def generate_demonstrations(
    df: pd.DataFrame, input_col: str, output_col: str, seed: int = 42
) -> str:
    # Sample demonstrations without replacement so that no example is repeated.
    # Cap the sample size at the number of available rows to avoid a ValueError.
    sample = df.sample(
        n=min(MAX_DEMONSTRATIONS, len(df)), replace=False, random_state=seed
    )
    demonstrations = ""
    for _, row in sample.iterrows():
        demonstrations += f"Input: {row[input_col]}\nOutput: {row[output_col]}\n\n"

    return demonstrations

In [None]:
candidate_demonstrations = [
    generate_demonstrations(
        df=train_df, input_col="input_str", output_col="value_str", seed=seed
    )
    for seed in range(NUM_CANDIDATE_DEMONSTRATIONS)
]

## Optimize the Prompt

### Define the optimization objective

In [None]:
# Number of candidates along each dimension of the search space
num_instructions = len(candidate_instructions)
num_demonstrations = len(candidate_demonstrations)

In [None]:
def format_response(response: InferenceResponse) -> str:
    if response is None:
        return ""
    if isinstance(response, JsonInferenceResponse):
        return str(response.output.parsed)
    elif isinstance(response, ChatInferenceResponse):
        content = response.content
        assert len(content) == 1  # TODO: Handle multiple content blocks
        if isinstance(content[0], Text):
            return content[0].text
        elif isinstance(content[0], RawText):
            return content[0].value
        else:
            raise ValueError(f"Unsupported content type: {type(content[0])}")
    else:
        raise ValueError(f"Unsupported response type: {type(response)}")


async def objective(trial: optuna.Trial):
    # Sample an instruction and a demonstration set.
    instruction_index = trial.suggest_categorical(
        "instruction_index", range(num_instructions)
    )
    demonstration_index = trial.suggest_categorical(
        "demonstration_index", range(num_demonstrations)
    )
    # Asynchronously generate answers for each query in the evaluation set.
    responses = await tqdm_asyncio.gather(
        *[
            generate_answer(
                client=tensorzero_client,
                function_name=GENERATE_ANSWER_FUNCTION_NAME,
                variant_name=GENERATE_ANSWER_VARIANT_NAME,
                instruction=candidate_instructions[instruction_index],
                demonstrations=candidate_demonstrations[demonstration_index],
                query=query,
                output_schema=user_output_schema,
                system_args=system_args,
                semaphore=semaphore,
            )
            for query, system_args in zip(eval_df["input_str"], eval_df["system_args"])
        ]
    )

    # Score the responses using the judge.
    judge_responses = await tqdm_asyncio.gather(
        *[
            judge_answer(
                client=tensorzero_client,
                task_description=TASK_DESCRIPTION,
                metric_properties=METRIC_PROPERTIES,
                prediction=format_response(response) if response is not None else "",
                truth=str(ground_truth),
                semaphore=semaphore,
            )
            for response, ground_truth in zip(responses, eval_df["value_str"])
        ]
    )

    # Aggregate the scores.
    scores = []
    for response in judge_responses:
        if response is not None:
            scores.append(response.output.parsed["score"])
    # Return the mean score.
    return np.mean(scores)
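
If every judge call in a trial fails, `scores` is empty and `np.mean` returns `nan`, which is not a meaningful objective value. One possible guard, sketched below with a hypothetical `aggregate_scores` helper, returns an explicit fallback instead; the choice of fallback is an assumption and should match the optimization direction.

```python
import math
from typing import List, Optional


def aggregate_scores(scores: List[Optional[float]], fallback: float) -> float:
    """Average the available scores, or return `fallback` if there are none."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return fallback
    return sum(valid) / len(valid)


# When maximizing, a trial with no usable scores should look as bad as possible.
print(aggregate_scores([0.5, None, 1.0], fallback=0.0))  # 0.75
print(aggregate_scores([None, None], fallback=0.0))  # 0.0
```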

### Random Search

We start by sampling a random instruction and demonstration at each iteration in the optimization loop.

In [None]:
study_random = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_random.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_random.tell(trial, value)
    study_random._log_completed_trial(frozen_trial)

### Tree-structured Parzen Estimator
Following the MIPRO paper, we use a tree-structured Parzen estimator (TPE) to sample the next instruction and demonstration pair to evaluate.

In [None]:
study_tpe = optuna.create_study(
    sampler=TPESampler(seed=SEED), direction=OPTIMIZER_DIRECTION
)

for iteration in range(MAX_ITERATIONS):
    trial = study_tpe.ask()

    value = await objective(trial)
    print(f"Iteration {iteration + 1}: {value}")

    frozen_trial = study_tpe.tell(trial, value)
    study_tpe._log_completed_trial(frozen_trial)

We now have an estimate of the best instruction and demonstration pair, which we use to generate an optimized system template.
You can save this template to a `system_template.minijinja` file and create a new variant of your function.

In [None]:
from pprint import pprint

templates_optimized = {}

# Use the search template that matches the function type (chat or JSON).
system_template_path = Path(
    f"config/functions/{GENERATE_ANSWER_FUNCTION_NAME}/search_template/system_template.minijinja"
)
with system_template_path.open("r") as f:
    templates_optimized["system"] = f.read()

env_optimized = Environment(templates=templates_optimized)

optimized_system_template = env_optimized.render_template(
    "system",
    instructions=candidate_instructions[study_tpe.best_params["instruction_index"]],
    demonstrations=candidate_demonstrations[
        study_tpe.best_params["demonstration_index"]
    ],
)
pprint(optimized_system_template)
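
Saving the rendered template could look like the sketch below. The helper `save_system_template` and the variant directory name `mipro_optimized` are hypothetical; adapt the path to your own configuration tree.

```python
import tempfile
from pathlib import Path


def save_system_template(template: str, variant_dir: Path) -> Path:
    """Write `template` to `variant_dir/system_template.minijinja` and return the path."""
    variant_dir.mkdir(parents=True, exist_ok=True)
    path = variant_dir / "system_template.minijinja"
    path.write_text(template, encoding="utf-8")
    return path


# Demonstrated in a temporary directory; in practice, point `variant_dir`
# at the directory of the new variant inside your config folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_system_template(
        "You are an assistant that extracts named entities.\n",
        Path(tmp) / "functions" / "extract_entities" / "mipro_optimized",
    )
    saved_text = saved.read_text(encoding="utf-8")
    print(saved.name)
```

If the selected demonstrations contain template syntax such as `{{`, the saved file may need escaping before it is used as a template again.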

## Evaluation

For demonstration purposes, we evaluate the optimized prompts on a validation set using the Jaccard similarity metric rather than the judge.

In [None]:
NUM_VAL_DATAPOINTS = 500

In [None]:
def flatten_dict(d: Dict[str, List[str]]) -> List[str]:
    res = []
    for k, v in d.items():
        assert isinstance(v, list)
        for elt in v:
            res.append(f"__{k.upper()}__::{elt}")
    return res


def compute_jaccard_similarity(
    predicted: Dict[str, List[str]], ground_truth: Dict[str, List[str]]
) -> float:
    target_entities = flatten_dict(ground_truth)
    pred_entities = flatten_dict(predicted)
    target_count = Counter(target_entities)
    pred_count = Counter(pred_entities)
    num = 0
    den = 0
    all_keys = set(target_entities).union(set(pred_entities))
    for key in all_keys:
        num += min(target_count.get(key, 0), pred_count.get(key, 0))
        den += max(target_count.get(key, 0), pred_count.get(key, 0))
    if den == 0:
        return 1
    return num / den


def evaluate_response(
    response: Optional[InferenceResponse], ground_truth_data: Dict[str, List[str]]
):
    predicted = response.output.parsed if response else None

    # `predicted` is None if the model failed to return a valid JSON that complies with the output schema
    valid_output = predicted is not None

    jaccard_similarity = (
        compute_jaccard_similarity(predicted, ground_truth_data) if predicted else 0
    )

    return valid_output, jaccard_similarity
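
As a worked example of the metric, the sketch below computes the same multiset Jaccard similarity in a compact form using `Counter` intersection and union. The names `flatten` and `multiset_jaccard` and the sample entities are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def flatten(entities: Dict[str, List[str]]) -> Counter:
    """Count each (entity type, entity text) pair."""
    return Counter(
        (entity_type, text)
        for entity_type, texts in entities.items()
        for text in texts
    )


def multiset_jaccard(
    predicted: Dict[str, List[str]], truth: Dict[str, List[str]]
) -> float:
    pred, target = flatten(predicted), flatten(truth)
    union = sum((pred | target).values())  # element-wise max of counts
    if union == 0:
        return 1.0  # both sides are empty, so they agree
    return sum((pred & target).values()) / union  # element-wise min of counts


truth = {"person": ["Alice", "Bob"], "location": ["Paris"]}
predicted = {"person": ["Alice"], "location": ["Paris", "London"]}
print(multiset_jaccard(predicted, truth))  # 2 shared / 4 total = 0.5
```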

In [None]:
def load_val_dataset(path: str) -> pd.DataFrame:
    # Load the dataset and parse the JSON-encoded outputs
    df = pd.read_csv(path)
    df.output = df.output.apply(json.loads)

    # Keep only the validation split
    val_df = df[df["split"] == 1]

    # Shuffle the validation set and keep the first NUM_VAL_DATAPOINTS rows
    val_df = val_df.sample(frac=1, random_state=SEED).reset_index(drop=True)
    val_df = val_df.iloc[:NUM_VAL_DATAPOINTS]

    return val_df


val_df = load_val_dataset("../../examples/data-extraction-ner/data/conllpp.csv")

print(f"Validation data shape: {val_df.shape}")

In [None]:
studies = {"Random": study_random, "TPE": study_tpe}
scores = {}
for study_name, study in studies.items():
    responses = await tqdm_asyncio.gather(
        *[
            generate_answer(
                client=tensorzero_client,
                function_name=GENERATE_ANSWER_FUNCTION_NAME,
                variant_name=GENERATE_ANSWER_VARIANT_NAME,
                instruction=candidate_instructions[
                    study.best_params["instruction_index"]
                ],
                demonstrations=candidate_demonstrations[
                    study.best_params["demonstration_index"]
                ],
                output_schema=user_output_schema,
                query=query,
                semaphore=semaphore,
            )
            for query in val_df["input"]
        ]
    )
    valid_output_scores = []
    jaccard_similarity_scores = []
    for response, ground_truth in zip(responses, val_df["output"]):
        valid_output, jaccard_similarity = evaluate_response(response, ground_truth)
        valid_output_scores.append(valid_output)
        jaccard_similarity_scores.append(jaccard_similarity)

    # Average the Jaccard similarity over the responses with valid output.
    # The small constant guards against division by zero when no output is valid.
    score = np.sum(jaccard_similarity_scores) / (np.sum(valid_output_scores) + 1e-6)
    scores[study_name] = score
    print(f"{study_name} score: {score}")