# Optimizing Agent Prompts with GEPA on Tau-bench

This demo notebook walks you through optimizing an AI agent's prompt using the
**Genetic-Pareto (GEPA)** algorithm. We'll use the Google Agent Development
Kit (ADK) to build and run our agent in **Tau-bench**, a benchmark designed to
test agents in realistic, conversational scenarios involving tool use and
adherence to policies.

**Goal:** To take a simple, underperforming prompt and automatically
improve it using GEPA, increasing the agent's reliability on a customer
support task.

**Note:** You can find more options to run GEPA with an ADK agent in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).

## Prerequisites

*   **Google Cloud Project:** You'll need access to a Google Cloud Project with
    Vertex AI enabled to run the language models.
*   **Installation:** Ensure `google-adk`, `tau-bench`, and
    `google-cloud-aiplatform` are installed. The first cell below clones and
    installs `tau-bench`, and also installs `gepa` and `retry`.


In [None]:
# @title Install Tau-bench and GEPA
!git clone https://github.com/google/adk-python.git
!git clone https://github.com/sierra-research/tau-bench.git
%cd tau-bench/
!pip install -e . --quiet

%cd ..
!pip install gepa --quiet

!pip install retry --quiet

In [None]:
# @title Configure python dependencies
import sys

sys.path.append('/content/tau-bench')
sys.path.append('/content/adk-python/contributing/samples/gepa')

In [None]:
# @title Authentication
from google.colab import auth

auth.authenticate_user()

In [None]:
# @title Setup
from datetime import datetime
import json
import logging
import os

import experiment as experiment_lib
from google.genai import types
import utils


# @markdown ### ☁️ Configure Vertex AI Access
# @markdown Enter your Google Cloud Project ID and Location.

GCP_PROJECT = ''  # @param {type: 'string'}
GCP_LOCATION = 'us-central1'  # @param {type: 'string'}

# @markdown ---
# @markdown ### üß† Configure LLM Models
# @markdown We recommend starting with Flash models for speed and cost-efficiency
# @markdown during optimization, but larger models like `gemini-1.5-pro` can also
# @markdown be used, especially for the reflection model.
AGENT_MODEL_NAME = 'gemini-2.5-flash'  # @param {type: 'string'}
USER_MODEL_NAME = 'gemini-2.5-flash'  # @param {type: 'string'}
REFLECTION_MODEL_NAME = 'gemini-2.5-pro'  # @param {type: 'string'}

# @markdown ---
# @markdown ### ⚙️ Configure Experiment Parameters
# @markdown Number of trajectories sampled from rollouts to be used by the reflection model in each GEPA step:
MINI_BATCH_SIZE = 8  # @param {type: 'integer'}
# @markdown Size of the pareto and feedback datasets (small setting for demo purposes):
MAX_DATASET_SIZE = 10  # @param {type: 'integer'}
# @markdown Number of times each task is run during evaluation:
NUM_EVAL_TRIALS = 4  # @param {type: 'integer'}
# @markdown Total budget for GEPA prompt evaluations:
MAX_METRIC_CALLS = 100  # @param {type: 'integer'}
# @markdown Maximum number of parallel agent-environment interactions:
MAX_CONCURRENCY = 4  # @param {type: 'integer'}

# @markdown **Note:** You can find more information on how to configure GEPA in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).

# The ADK uses these environment variables to connect to Vertex AI via the
# Google GenAI SDK.
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'true'
os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT
os.environ['GOOGLE_CLOUD_LOCATION'] = GCP_LOCATION

# Set a logging verbosity suited for this experiment. See
# https://github.com/google/adk-python/issues/1852 for context
types.logger.addFilter(utils.FilterInferenceWarnings())

# Initial Inference: A First Look at Our Agent

Before we start optimizing, let's see how our agent performs with a very basic
prompt. This will help us understand the task and see what a failure case looks
like.

**The Task:** We're using the **'retail'** environment from Tau-bench. In this
environment, our agent acts as a customer support agent for an online store. It
needs to use a set of tools (like `get_user_details`, `get_order_details`, etc.)
to help a simulated user resolve their issues, while following specific support
policies (e.g., only refunding orders less than 30 days old).

**Our Agent:** The agent is built with ADK using a standard tool-calling
strategy. It receives the conversation history and a list of available tools,
and it must decide whether to respond to the user or call a tool.

**The Initial Prompt:** We'll start with a simple, one-line instruction. As
we'll see, this is often not enough for an agent to perform reliably in complex
scenarios.
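
To make the agent's turn-by-turn behavior concrete, here is a minimal sketch of how a tool-calling conversation can be represented and inspected. It assumes each turn is a dict with a `role` and a list of `parts`, where a part holds `text`, a `function_call`, or a `function_response`; this is the layout the trajectory display helper in this notebook reads. The order ID and messages are made up for illustration, and `count_tool_calls` is a hypothetical helper, not part of ADK or Tau-bench.

```python
from collections import Counter

# Hypothetical trajectory in the role/parts layout (illustrative data only).
example_trajectory = [
    {'role': 'user', 'parts': [{'text': 'Where is my order #W0000001?'}]},
    {
        'role': 'model',
        'parts': [{
            'function_call': {
                'name': 'get_order_details',
                'args': {'order_id': '#W0000001'},
            }
        }],
    },
    {
        'role': 'user',
        'parts': [{
            'function_response': {
                'name': 'get_order_details',
                'response': {'result': '{"status": "delivered"}'},
            }
        }],
    },
    {'role': 'model', 'parts': [{'text': 'Your order was delivered.'}]},
]


def count_tool_calls(trajectory):
  """Counts how often each tool is called in a trajectory."""
  calls = Counter()
  for turn in trajectory:
    for part in turn.get('parts', []):
      fc = part.get('function_call')
      if fc:
        calls[fc['name']] += 1
  return calls


print(count_tool_calls(example_trajectory))
```

Counting calls per tool is a quick way to spot an agent that never looks up the order before answering, or one that calls the same tool repeatedly.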

In [None]:
# @title Define an initial instruction

# @markdown This is our starting "seed" prompt. It's very generic and doesn't give the agent much guidance on how to behave or use tools.
BASE_SYSTEM_INSTRUCTION = 'you are a customer support agent helping customers resolve their issues by using the right tools'  # @param {type: 'string'}

print(BASE_SYSTEM_INSTRUCTION)

In [None]:
# @title Initial Inference: A First Look at Our Agent

from tau_bench.types import EnvRunResult, RunConfig

# We will run our ADK agent on two tasks from the Tau-bench 'dev' set.
# The `run_tau_bench_rollouts` function handles the interaction between the
# agent and the simulated user environment.
print('Running initial inference for tasks 1 and 2...')
inference_results = experiment_lib.run_tau_bench_rollouts(
    config=RunConfig(
        env='retail',
        model=AGENT_MODEL_NAME,
        model_provider='vertex_ai',
        user_model=USER_MODEL_NAME,
        user_model_provider='vertex_ai',
        agent_strategy='tool-calling',
        user_strategy='llm',  # The user is simulated by an LLM
        max_concurrency=MAX_CONCURRENCY,
        task_ids=[
            1,
            2,
        ],  # We'll just run two specific tasks for this initial look
        task_split='dev',
    ),
    system_instruction=BASE_SYSTEM_INSTRUCTION,
)

Loading user with strategy: llm
Running tasks [1, 2, 9, 12] (checkpoint path: results/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104135627.json)




🏆 Average reward: 0.25
📈 Pass^k
  k=1: 0.25

📄 Results saved to results/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104135627.json



In [None]:
# @title Let's visualize one of the sampled trajectories


def display_trajectory(trajectory):
  """Formats and prints a trajectory for display in Colab."""
  print('--- Trajectory Example ---')
  for turn in trajectory:
    role = turn['role']
    parts = turn['parts']
    for part in parts:
      if txt := part.get('text'):
        print(f'**{role.upper()}**: {txt}')
      elif fc := part.get('function_call'):
        args_str = ', '.join(f'{k}={v!r}' for k, v in fc['args'].items())
        print(f'**{role.upper()}**: 📞 Tool Call: `{fc["name"]}({args_str})`')
      elif fr := part.get('function_response'):
        # The tool output is stored under fr['response']['result'].
        raw_result = fr['response']['result']
        try:
          # The result is often a JSON string; pretty-print it for readability.
          result = json.dumps(json.loads(raw_result), indent=2)
          print(
              f'**{role.upper()}**: ↪️ Tool Response from'
              f' `{fr["name"]}`:\n```json\n{result}\n```'
          )
        except (TypeError, ValueError):
          # Not a JSON string: fall back to printing the raw result.
          print(
              f'**{role.upper()}**: ↪️ Tool Response from'
              f' `{fr["name"]}`: `{raw_result}`'
          )
    print()  # new line after each turn


# Let's inspect the "trajectory" of the first run. A trajectory is the full
# log of the conversation, including user messages, agent thoughts, tool calls,
# and tool outputs. Analyzing trajectories is key to understanding why an agent
# fails or succeeds.
print('\nDisplaying trajectory for Task 1:')
display_trajectory(inference_results[0].traj)

--- Trajectory Example ---
**SYSTEM**: you are a customer support agent helping customers resolve their issues by using the right tools

**USER**: Hello. I need to make an exchange regarding a recent order I received.

**MODEL**: I can help you with that. What is the order ID?

**USER**: The order ID is #W7273336.

**MODEL**: Okay, you want to exchange items from order #W7273336.

Could you please tell me:
1. What are the item IDs of the products you wish to exchange?
2. What are the item IDs of the new products you would like to receive?
3. What payment method would you like to use to cover any price differences or receive refunds (e.g., gift card, credit card)?

**USER**: I don't have the specific item IDs at hand, but I can describe the products. I received a black laser gaming mouse and a 4-foot metal bookshelf. I need to exchange both of these.

**MODEL**: I understand. Since you don't have the item IDs, I'll need to look up the order details to identify them.


**MODEL**: üìû To

# Evaluate the Initial Prompt: Getting a Baseline

Running a couple of examples gives us a qualitative feel, but to systematically
improve our prompt, we need quantitative metrics. Let's evaluate our basic
prompt on a small dataset to get a baseline performance score.

The primary metric in Tau-bench is **reward**, which is 1 if the agent
successfully completes the task according to the environment's goals (e.g.,
user issue resolved, correct tool calls made) and 0 otherwise. Our goal is to
maximize the average reward.
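
The evaluation output also reports **Pass^k** for each `k` up to the number of trials. As we understand the Tau-bench metric, Pass^k estimates the chance that `k` independent trials of the same task *all* succeed, averaged over tasks, so it penalizes prompts that only succeed some of the time. The sketch below computes both metrics from per-task rewards; the function names and the reward data are hypothetical, and this is not Tau-bench's own implementation.

```python
from math import comb


def average_reward(rewards_by_task):
  """Mean reward over every trial of every task."""
  all_rewards = [r for trials in rewards_by_task.values() for r in trials]
  return sum(all_rewards) / len(all_rewards)


def pass_hat_k(rewards_by_task, k):
  """Estimated chance that k trials of one task all succeed, averaged over tasks."""
  total = 0.0
  for trials in rewards_by_task.values():
    n = len(trials)  # trials run for this task
    c = sum(1 for r in trials if r >= 1.0)  # successful trials
    # Fraction of size-k subsets of the n trials that contain only successes.
    total += comb(c, k) / comb(n, k)
  return total / len(rewards_by_task)


# Hypothetical rewards: 3 tasks, 4 trials each.
rewards = {
    0: [1, 1, 1, 1],  # always solved
    1: [1, 0, 1, 0],  # solved half the time
    2: [0, 0, 0, 0],  # never solved
}

print('average reward:', average_reward(rewards))
for k in range(1, 5):
  print(f'pass^{k}:', pass_hat_k(rewards, k))
```

Pass^1 equals the average reward, while higher `k` values drop quickly for tasks that are solved only intermittently. This is why running each task several times (`NUM_EVAL_TRIALS`) gives a better picture of reliability than a single run.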

In [None]:
# For this demo, we'll use a small dataset. In a real-world scenario, you
# would use larger, distinct datasets for training, validation, and testing.
demo_dataset = experiment_lib.Dataset(split='dev', max_size=MAX_DATASET_SIZE)

# We configure the experiment parameters, including the models, dataset,
# evaluation settings, and GEPA budget.
demo_config = experiment_lib.ExperimentConfig(
    tau_bench_env='retail',
    agent_model=AGENT_MODEL_NAME,
    agent_model_provider='vertex_ai',
    user_model=USER_MODEL_NAME,
    user_model_provider='vertex_ai',
    max_concurrency=MAX_CONCURRENCY,
    num_eval_trials=NUM_EVAL_TRIALS,  # We run each task multiple times for consistency
    rnd_seed=42,
    max_metric_calls=MAX_METRIC_CALLS,  # GEPA budget: max prompt evaluations
    reflection_model=REFLECTION_MODEL_NAME,  # Model for GEPA's reflection step
    # Number of trajectories sampled from failed rollouts to be used by the
    # reflection model in each GEPA step to generate prompt improvements.
    reflection_minibatch_size=MINI_BATCH_SIZE,
    use_rater=False,  # Optional: LLM rater for nuanced feedback
    # For this demo, we use the same small dataset for all splits.
    # In a real optimization run, you would use separate datasets:
    # - feedback_dataset: For generating trajectories for reflection.
    # - pareto_dataset: For evaluating candidate prompts.
    # - eval_dataset: A final, held-out set to test the optimized prompt.
    feedback_dataset=demo_dataset,
    pareto_dataset=demo_dataset,
    eval_dataset=demo_dataset,
)

# We'll save the results of our runs in a temporary directory.
eval_output_dir = os.path.join(
    'eval_results', datetime.now().strftime('%Y%m%d%H%M%S%f')
)
os.makedirs(eval_output_dir)
logging.info('Writing to output_dir=%s', eval_output_dir)


# The `run_eval` function runs the agent with the given prompt on the evaluation
# dataset and prints the average reward.
print(f'--- Evaluating BASELINE prompt on {MAX_DATASET_SIZE} tasks ---')
eval_results = experiment_lib.run_eval(
    output_dir=eval_output_dir,
    config=demo_config,
    instructions=BASE_SYSTEM_INSTRUCTION,
)

# This will show the detailed results of the evaluation run.
# The most important number is the final "average reward".
print('\nBaseline evaluation results:')
print(eval_results)

Loading user with strategy: llm
Running tasks [9, 8, 4, 2, 5, 3, 1, 0, 7, 6] (checkpoint path: temp_results/20251104150054446083/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104150054.json)




🏆 Average reward: 0.525
📈 Pass^k
  k=1: 0.525
  k=2: 0.31666666666666665
  k=3: 0.175
  k=4: 0.1

📄 Results saved to temp_results/20251104150054446083/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104150054.json

average reward (total=40): 0.525


# Run Prompt Optimization with GEPA

Now we'll use **GEPA** to automatically improve our prompt.

## What is GEPA?

**GEPA (Genetic-Pareto)** is a prompt optimization algorithm that learns from
trial and error, using LLM-based reflection to understand failures and guide
prompt evolution. Here's a simplified view of how it works:

1.  **Run & Collect:** It runs the agent with a candidate prompt on a
    few training examples (the `feedback_dataset`) to collect interaction
    trajectories.
2.  **Reflect:** It gives the trajectories to a "reflection" model,
    which analyzes what went wrong and generates high-level
    insights or "rules" for improvement. For example, it might notice *"The
    agent should always confirm the order number before issuing a refund."*
3.  **Evolve:** It uses these insights to propose new candidate prompts by
    editing existing prompts or combining ideas from different successful ones,
    inspired by genetic algorithms.
4.  **Evaluate & Select:** It evaluates these new prompts on a validation set
    (the `pareto_dataset`) and keeps only the best-performing, diverse set of
    prompts (the "Pareto frontier").
5.  **Repeat:** It repeats this loop (collect, reflect, evolve, evaluate) until it
    reaches its budget (`max_metric_calls`).

The result is a detailed and robust prompt that has learned from its mistakes,
often capturing nuances that are difficult to discover through manual prompt
engineering.
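
Two pieces of bookkeeping from the steps above can be sketched in a few lines: the per-task Pareto front, and the minibatch check that decides whether a proposed prompt earns a full evaluation. This is a simplified illustration under our own assumptions, not the `gepa` package's implementation; `pareto_front`, `accept_on_minibatch`, and the scores are hypothetical. The strict "better than" comparison mirrors the "is better than" / "is not better than ... skipping" messages that GEPA logs.

```python
def pareto_front(scores_by_candidate):
  """For each task, returns the best score and the candidates that reach it."""
  num_tasks = len(next(iter(scores_by_candidate.values())))
  best_scores, best_candidates = [], []
  for task in range(num_tasks):
    best = max(scores[task] for scores in scores_by_candidate.values())
    best_scores.append(best)
    best_candidates.append({
        cand
        for cand, scores in scores_by_candidate.items()
        if scores[task] == best
    })
  return best_scores, best_candidates


def accept_on_minibatch(parent_scores, child_scores):
  """A proposed prompt is kept only if it beats its parent on the same minibatch."""
  return sum(child_scores) > sum(parent_scores)


# Hypothetical per-task rewards for two candidate prompts on four tasks.
scores = {
    0: [1.0, 0.0, 1.0, 0.0],  # seed prompt
    1: [1.0, 1.0, 0.0, 0.0],  # evolved prompt
}
front_scores, front_candidates = pareto_front(scores)
print(front_scores)
print(front_candidates)
print(accept_on_minibatch([1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0]))
```

In this toy data each candidate averages 0.5, but the front averages 0.75 because the two prompts solve different tasks. Keeping both on the front gives later iterations a chance to build on either one.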

In [None]:
# @title Run GEPA (this might take ~10 minutes)
# This process can take around 10 minutes for the demo settings, as it
# involves multiple rounds of running the agent and calling the reflection model.
# A real run with more metric calls will take longer.

# Create a new directory for the GEPA run artifacts.
gepa_output_dir = os.path.join(
    'gepa_results', datetime.now().strftime('%Y%m%d%H%M%S%f')
)
os.makedirs(gepa_output_dir)
logging.info('Writing to output_dir=%s', gepa_output_dir)

# The `run_gepa` function kicks off the optimization loop.
print(f'--- Running GEPA for {MAX_METRIC_CALLS} metric calls ---')
gepa_results = experiment_lib.run_gepa(
    output_dir=gepa_output_dir,
    config=demo_config,
    seed_instructions=BASE_SYSTEM_INSTRUCTION,
)

# The `val_aggregate_scores` attribute lists the aggregate validation score of
# each candidate prompt that GEPA added to its pool, indexed by candidate
# (index 0 is the seed prompt). Scores do not have to increase from one
# candidate to the next; the best candidate is the one with the highest score.
print('\n--- GEPA Validation Score per Candidate (Reward) ---')
print(list(enumerate(gepa_results.val_aggregate_scores)))

Loading user with strategy: llm
Running tasks [3, 5, 2, 4, 1, 8, 7, 0, 6, 9] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104153507.json)




🏆 Average reward: 0.7
📈 Pass^k
  k=1: 0.7

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104153507.json

Iteration 0: Base program full valset score: 0.7
Iteration 1: Selected program 0 score: 0.7
Loading user with strategy: llm
Running tasks [0, 1, 3, 2] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104153806.json)




🏆 Average reward: 0.5
📈 Pass^k
  k=1: 0.5

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104153806.json

Iteration 1: Proposed new text for system_instruction: You are a customer support agent whose primary goal is to resolve customer issues efficiently and empathetically by utilizing the provided tools. Maintain a polite, helpful, and professional tone at all times.

**Here's a breakdown of your responsibilities and guidelines:**

1.  **Initial Interaction & Information Gathering:**
    *   Always greet the customer warmly and acknowledge their issue.
    *   Prioritize obtaining the customer's order ID first.
    *   If the order ID is unavailable, attempt to find the user via `find_user_id_by_email`.
    *   If `find_user_id_by_email` returns an error, prompt the user for their first name, last name, and zip code to use `find_user_id_by_name_zip`.
    *   Once a `user_id` is successfully i



🏆 Average reward: 0.25
📈 Pass^k
  k=1: 0.25

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104153920.json

Iteration 1: New subsample score 1.0 is not better than old score 2.0, skipping
Iteration 2: Selected program 0 score: 0.7
Loading user with strategy: llm
Running tasks [6, 8, 4, 5] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154009.json)




🏆 Average reward: 0.5
📈 Pass^k
  k=1: 0.5

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154009.json

Iteration 2: Proposed new text for system_instruction: you are a customer support agent helping customers resolve their issues by using the right tools.

Here's how you should operate:

1.  **Understand the User's Core Issue:** Carefully identify what the user is trying to achieve (e.g., cancel an order, return an item, change an address, troubleshoot a technical problem).

2.  **Information Gathering - Order & User Details:**
    *   Always try to obtain the `order_id` first, as many tools require it and it's the most direct way to identify an order. Remember order IDs start with `#W`.
    *   If the user doesn't know the `order_id`, ask for their email address to use `find_user_id_by_email`.
    *   If the user cannot provide an email or if `find_user_id_by_email` fails to find a user, t



🏆 Average reward: 0.75
📈 Pass^k
  k=1: 0.75

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154113.json

Iteration 2: New subsample score 3.0 is better than old score 2.0. Continue to full eval and add to candidate pool.
Loading user with strategy: llm
Running tasks [3, 5, 2, 4, 1, 8, 7, 0, 6, 9] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154203.json)




🏆 Average reward: 0.8
📈 Pass^k
  k=1: 0.8

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154203.json

Iteration 2: New program is on the linear pareto front
Iteration 2: Full valset score for new program: 0.8
Iteration 2: Full train_val score for new program: 0.8
Iteration 2: Individual valset scores for new program: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 2: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 2: Full valset pareto front score: 0.9
Iteration 2: Updated valset pareto front programs: [{0, 1}, {0, 1}, {1}, {0}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {1}]
Iteration 2: Best valset aggregate score so far: 0.8
Iteration 2: Best program as per aggregate score on train_val: 1
Iteration 2: Best program as per aggregate score on valset: 1
Iteration 2: Best score on valset: 0.8
Iteration 2: Best score on train_val: 0.8
Ite



🏆 Average reward: 0.5
📈 Pass^k
  k=1: 1.0

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154520.json

Iteration 3: Proposed new text for system_instruction: You are a customer support agent helping customers resolve their issues by using the right tools. Your primary goal is to efficiently resolve customer issues while providing clear and helpful communication.

**General Principles:**

1.  **Be Proactive in Information Gathering**:
    *   Always try to identify the customer's order by asking for the `order_id` first.
    *   If the `order_id` is unknown, attempt to find the `user_id` using their `email` with `find_user_id_by_email`.
    *   If the email is not available or the user cannot remember it, use `find_user_id_by_name_zip` with their `first_name`, `last_name`, and `zip` code.
    *   Once a `user_id` is obtained, use `get_user_details` to retrieve all associated `orders` and `pa



🏆 Average reward: 0.75
📈 Pass^k
  k=1: 1.5

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154646.json

Iteration 3: New subsample score 3.0 is better than old score 2.0. Continue to full eval and add to candidate pool.
Loading user with strategy: llm
Running tasks [3, 5, 2, 4, 1, 8, 7, 0, 6, 9] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154739.json)




🏆 Average reward: 0.6
📈 Pass^k
  k=1: 0.6

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154739.json

Iteration 3: Full valset score for new program: 0.6
Iteration 3: Full train_val score for new program: 0.6
Iteration 3: Individual valset scores for new program: [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
Iteration 3: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 3: Full valset pareto front score: 0.9
Iteration 3: Updated valset pareto front programs: [{0, 1, 2}, {0, 1, 2}, {1}, {0, 2}, {0, 1, 2}, {0, 1}, {0, 1}, {0, 1, 2}, {0, 1, 2}, {1, 2}]
Iteration 3: Best valset aggregate score so far: 0.8
Iteration 3: Best program as per aggregate score on train_val: 1
Iteration 3: Best program as per aggregate score on valset: 1
Iteration 3: Best score on valset: 0.8
Iteration 3: Best score on train_val: 0.8
Iteration 3: Linear pareto front prog



🏆 Average reward: 1.0
📈 Pass^k
  k=1: 1.0

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154902.json

Iteration 4: All subsample scores perfect. Skipping.
Iteration 4: Reflective mutation did not propose a new candidate
Iteration 5: Selected program 1 score: 0.8
Loading user with strategy: llm
Running tasks [0, 7, 9, 1] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154939.json)




🏆 Average reward: 0.75
📈 Pass^k
  k=1: 0.75

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104154939.json

Iteration 5: Proposed new text for system_instruction: you are a customer support agent helping customers resolve their issues by using the right tools.

Here's how you should operate:

1.  **Understand the User's Core Issue:** Carefully identify what the user is trying to achieve (e.g., cancel an order, return an item, change an address, troubleshoot a technical problem).

2.  **Information Gathering - Order & User Details:**
    *   Always try to obtain the `order_id` first, as many tools require it and it's the most direct way to identify an order. Remember order IDs start with `#W`.
    *   If the user doesn't know the `order_id`, ask for their email address to use `find_user_id_by_email`.
    *   If the user cannot provide an email or if `find_user_id_by_email` fails to find a user,



🏆 Average reward: 0.75
📈 Pass^k
  k=1: 0.75

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155047.json

Iteration 5: New subsample score 3.0 is not better than old score 3.0, skipping
Iteration 6: Selected program 0 score: 0.7
Loading user with strategy: llm
Running tasks [5, 2, 5, 4] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155134.json)




🏆 Average reward: 0.25
📈 Pass^k
  k=1: 0.3333333333333333

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155134.json

Iteration 6: Proposed new text for system_instruction: You are a customer support agent. Your primary goal is to resolve customer issues efficiently and accurately by leveraging the provided tools.

**General Guidelines:**

1.  **Prioritize Information Gathering:**
    *   Always begin by requesting the **order ID**.
    *   If the order ID is unavailable, ask for the **email address** associated with the customer's account.
    *   If the email is also unavailable or forgotten, then request their **first name, last name, and zip code**.
    *   Once a user ID is found (using `find_user_id_by_email` or `find_user_id_by_name_zip`), use `get_user_details` to retrieve all associated orders for that user.
    *   For each potential order, use `get_order_details` to inspect its 



🏆 Average reward: 0.5
📈 Pass^k
  k=1: 0.6666666666666666

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155249.json

Iteration 6: New subsample score 2.0 is better than old score 1.0. Continue to full eval and add to candidate pool.
Loading user with strategy: llm
Running tasks [3, 5, 2, 4, 1, 8, 7, 0, 6, 9] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155321.json)




🏆 Average reward: 0.8
📈 Pass^k
  k=1: 0.8

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155321.json

Iteration 6: Full valset score for new program: 0.8
Iteration 6: Full train_val score for new program: 0.8
Iteration 6: Individual valset scores for new program: [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 6: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 6: Full valset pareto front score: 0.9
Iteration 6: Updated valset pareto front programs: [{0, 1, 2, 3}, {0, 1, 2, 3}, {1}, {0, 2, 3}, {0, 1, 2, 3}, {0, 1, 3}, {0, 1, 3}, {0, 1, 2, 3}, {0, 1, 2, 3}, {1, 2, 3}]
Iteration 6: Best valset aggregate score so far: 0.8
Iteration 6: Best program as per aggregate score on train_val: 1
Iteration 6: Best program as per aggregate score on valset: 1
Iteration 6: Best score on valset: 0.8
Iteration 6: Best score on train_val: 0.8
Iteration 



Running tasks [7, 1, 5, 0] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155438.json)




🏆 Average reward: 0.75
📈 Pass^k
  k=1: 0.75

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155438.json

Iteration 7: Proposed new text for system_instruction: you are a customer support agent helping customers resolve their issues by using the right tools.

Here's how you should operate:

1.  **Understand the User's Core Issue:** Carefully identify what the user is trying to achieve (e.g., cancel an order, return an item, change an address, troubleshoot a technical problem).

2.  **Information Gathering - Order & User Details:**
    *   Always try to obtain the `order_id` first, as many tools require it and it's the most direct way to identify an order. Remember order IDs start with `#W`.
    *   If the user doesn't know the `order_id`, ask for their email address to use `find_user_id_by_email`.
    *   If `find_user_id_by_email` fails to find a user, or if the user cannot provide an email



🏆 Average reward: 0.5
📈 Pass^k
  k=1: 0.5

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155551.json

Iteration 7: New subsample score 2.0 is not better than old score 3.0, skipping
Iteration 8: Selected program 3 score: 0.8
Loading user with strategy: llm
Running tasks [9, 8, 2, 3] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155634.json)




🏆 Average reward: 0.25
📈 Pass^k
  k=1: 0.25

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155634.json

Iteration 8: Proposed new text for system_instruction: You are a customer support agent. Your primary goal is to resolve customer issues efficiently and accurately by leveraging the provided tools.

**General Guidelines for Interaction and Information Gathering:**

1.  **Prioritize Information Gathering to Identify the User and Order:**
    *   Always begin by requesting the **order ID**.
    *   If the order ID is unavailable, ask for the **email address** associated with the customer's account.
    *   If the email is also unavailable or forgotten, then request their **first name, last name, and zip code**.
    *   Once a user ID is found (using `find_user_id_by_email` or `find_user_id_by_name_zip`), use `get_user_details` to retrieve all associated orders for that user.
    *   For ea



Running tasks [9, 8, 2, 3] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155758.json)




🏆 Average reward: 0.5
📈 Pass^k
  k=1: 0.5

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155758.json

Iteration 8: New subsample score 2.0 is better than old score 1.0. Continue to full eval and add to candidate pool.
Loading user with strategy: llm
Running tasks [3, 5, 2, 4, 1, 8, 7, 0, 6, 9] (checkpoint path: temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155842.json)




🏆 Average reward: 0.7
📈 Pass^k
  k=1: 0.7

📄 Results saved to temp_results/20251104153507410436/traces/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104155842.json

Iteration 8: Full valset score for new program: 0.7
Iteration 8: Full train_val score for new program: 0.7
Iteration 8: Individual valset scores for new program: [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
Iteration 8: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 8: Full valset pareto front score: 0.9
Iteration 8: Updated valset pareto front programs: [{0, 1, 2, 3, 4}, {0, 1, 2, 3, 4}, {1}, {0, 2, 3, 4}, {0, 1, 2, 3, 4}, {0, 1, 3, 4}, {0, 1, 3, 4}, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4}, {1, 2, 3}]
Iteration 8: Best valset aggregate score so far: 0.8
Iteration 8: Best program as per aggregate score on train_val: 1
Iteration 8: Best program as per aggregate score on valset: 1
Iteration 8: Best score on valset: 0.8
Iteration 8: Best score on t

In [None]:
# @title Visualize the optimized prompt
# Now, let's look at the final, optimized prompt that GEPA produced.
# It should be much more detailed than our initial one-line prompt!
print('\n--- Optimized Prompt from GEPA ---')
print(gepa_results.best_candidate['system_instruction'])

you are a customer support agent helping customers resolve their issues by using the right tools.

Here's how you should operate:

1.  **Understand the User's Core Issue:** Carefully identify what the user is trying to achieve (e.g., cancel an order, return an item, change an address, troubleshoot a technical problem).

2.  **Information Gathering - Order & User Details:**
    *   Always try to obtain the `order_id` first, as many tools require it and it's the most direct way to identify an order. Remember order IDs start with `#W`.
    *   If the user doesn't know the `order_id`, ask for their email address to use `find_user_id_by_email`.
    *   If the user cannot provide an email or if `find_user_id_by_email` fails to find a user, then ask for their first name, last name, and zip code to use `find_user_id_by_name_zip`.
    *   Once a `user_id` is obtained, use `get_user_details` to retrieve all associated `order_id`s, `payment_method`s, and addresses.
    *   For each relevant `orde

# Evaluate the Optimized Prompt

GEPA has given us a new, improved prompt. But how much better is it?

To find out, we'll run the same evaluation as before, this time using the
`best_candidate` prompt from GEPA, and compare its average reward directly with
the baseline. Ideally, this final evaluation runs on a held-out test set
(`eval_dataset`) that was never seen during optimization, since only that gives
an unbiased measure of success. This demo reuses the same dataset for all splits
for simplicity, so the gain reported here is likely optimistic.
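
For a real run, one simple way to keep the evaluation set unseen is to split the task IDs into three disjoint groups before optimization starts. The sketch below does this in plain Python; `split_task_ids` and its fractions are hypothetical, and how the resulting ID lists are passed to `experiment_lib.Dataset` depends on options that this notebook does not show.

```python
import random


def split_task_ids(task_ids, feedback_frac=0.4, pareto_frac=0.3, seed=42):
  """Shuffles task IDs and cuts them into three non-overlapping lists."""
  ids = list(task_ids)
  random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
  n_feedback = int(len(ids) * feedback_frac)
  n_pareto = int(len(ids) * pareto_frac)
  feedback = ids[:n_feedback]
  pareto = ids[n_feedback:n_feedback + n_pareto]
  held_out = ids[n_feedback + n_pareto:]  # everything left is for final eval
  return feedback, pareto, held_out


# Hypothetical pool of 20 task IDs.
feedback_ids, pareto_ids, eval_ids = split_task_ids(range(20))
print('feedback:', sorted(feedback_ids))
print('pareto:  ', sorted(pareto_ids))
print('eval:    ', sorted(eval_ids))
```

Fixing the seed keeps the split stable across notebook restarts, so a task used for reflection in one session does not leak into the held-out set in the next.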

In [None]:
# @title Run evaluation

# Let's create a new directory for this final evaluation run.
final_eval_dir = os.path.join(
    'temp_results', 'final_eval', datetime.now().strftime('%Y%m%d%H%M%S%f')
)
os.makedirs(final_eval_dir)

print(f'\n--- Evaluating OPTIMIZED prompt on {MAX_DATASET_SIZE} tasks ---')
final_eval_results = experiment_lib.run_eval(
    output_dir=final_eval_dir,
    instructions=gepa_results.best_candidate['system_instruction'],
    config=demo_config,
)

print('\nOptimized prompt evaluation results:')
print(final_eval_results)

Loading user with strategy: llm
Running tasks [5, 2, 8, 3, 1, 9, 4, 7, 6, 0] (checkpoint path: temp_results/20251104153507410436/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104160221.json)




🏆 Average reward: 0.75
📈 Pass^k
  k=1: 0.75
  k=2: 0.6
  k=3: 0.525
  k=4: 0.5

📄 Results saved to temp_results/20251104153507410436/tool-calling-gemini-2.5-flash-0.0_range_0--1_user-gemini-2.5-flash-llm_1104160221.json

average reward (total=40): 0.75


## Conclusion

In the run recorded above, the average reward rose from 0.525 with the baseline
prompt to 0.75 with the optimized prompt, and Pass^4 rose from 0.1 to 0.5. Your
numbers will vary from run to run. This shows how automated prompt optimization
with GEPA can improve agent reliability without manual tuning.
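
To put the size of the gain in context, the sketch below compares the two average rewards reported in the evaluation outputs (0.525 and 0.75, each over 40 rollouts) and attaches a rough standard error. `compare_rewards` is a hypothetical helper. The error estimate assumes every rollout is an independent pass/fail draw, which is not strictly true here because each task is run four times and the same tasks were used during optimization, so treat it as a rough guide only.

```python
from math import sqrt


def compare_rewards(baseline, optimized, n_baseline, n_optimized):
  """Returns absolute gain, relative gain, and a rough standard error of the gain."""
  gain = optimized - baseline
  relative_gain = gain / baseline
  # Binomial approximation that treats every rollout as independent.
  std_err = sqrt(
      baseline * (1 - baseline) / n_baseline
      + optimized * (1 - optimized) / n_optimized
  )
  return gain, relative_gain, std_err


# Average rewards and rollout counts from the evaluation outputs above.
gain, relative_gain, std_err = compare_rewards(0.525, 0.75, 40, 40)
print(f'gain={gain:.3f} relative={relative_gain:.1%} std_err={std_err:.3f}')
```

With only 40 rollouts per prompt the uncertainty is large relative to the gain, which is another reason to rerun the comparison on a larger, held-out dataset before drawing firm conclusions.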