In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Simulating User Conversations to Dynamically Evaluate ADK Agents

<a target="_blank" href="https://colab.research.google.com/github/google/adk-samples/blob/main/python/notebooks/evaluation/user_simulation_in_adk_evals.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

| Author(s) |
| --- |
| [Ankur Sharma](https://github.com/ankursharmas) |

## Overview

Evaluating conversational agents with static, pre-written prompts can be limiting because real conversations rarely follow a fixed script.

This notebook demonstrates the [User Simulation](https://google.github.io/adk-docs/evaluate/user-sim/) feature in [Agent Development Kit](https://google.github.io/adk-docs/). Instead of using a rigid script, you'll use an AI-powered simulator that dynamically generates user prompts based on a high-level "conversation plan."

This feature lets you test how your agent handles a realistic, multi-turn conversation from start to finish.

In this notebook, you will:

- **Define a Conversation Scenario:** Create a plan with a starting prompt and goals to guide the user simulator.
- **Test the Simulation:** Run an evaluation without metrics to quickly review the quality of the simulated conversation.
- **Evaluate the Agent:** Run an evaluation with metrics (like `hallucinations_v1` and `safety_v1`) to formally score your agent's performance.

## Get started

### Install ADK and other required packages

In [None]:
%pip install --upgrade --quiet \
     "google-adk==1.18.0" \
     "google-cloud-aiplatform[evaluation]>=1.100.0" \
     "rouge-score>=0.1.2" \
     "tabulate>=0.9.0"

### Authenticate your notebook environment

Run the cell below to authenticate your account in Google Colab:

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import getpass
import os

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "global"  # @param {type: "string", placeholder: "[your-region]", isTemplate: true}

# Set environment variables. Read the API key from the environment or prompt
# for it; never commit a real key to a notebook.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Gemini API key: ")
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "False"

## Set up

Before you can run the user simulator, you'll need to prepare the agent and evaluation data.

This section walks you through downloading the `hello_world` sample agent, which serves as the test subject. You'll then create the JSON files for the **conversation plan** and **evaluation config**, and finally use the ADK command-line interface (CLI) to package everything into an `EvalSet` that's ready for testing.

First, clone the `adk-python` repository from GitHub to get the `hello_world` sample agent used in this evaluation:

In [None]:
#@title Download HelloWorld Agent From ADK GitHub Repo

!git clone https://github.com/google/adk-python/
AGENT_BASE_PATH = "adk-python/contributing/samples/hello_world"
!ls {AGENT_BASE_PATH}

Next, you'll create the JSON configuration files that ADK needs to perform user simulation:

- `session_input.json`: Basic information for the eval session.
- `eval_config_without_metrics.json`: An eval config that only runs the user simulator. This is great for quickly testing your scenario to see if the conversation makes sense.
- `eval_config_with_metrics.json`: A config that runs the simulator and evaluates the conversation using the `hallucinations_v1` and `safety_v1` metrics.

In [None]:
#@title Set Up Data Needed By Eval

session_input = (
"""{
  "app_name": "hello_world",
  "user_id": "user"
}"""
)

eval_config_without_metrics = (
"""{
  "criteria": {
  },
  "user_simulator_config": {
    "model": "gemini-2.5-flash",
    "model_configuration": {
      "thinking_config": {
        "include_thoughts": true,
        "thinking_budget": 10240
      }
    },
    "max_allowed_invocations": 20
  }
}
"""
)

eval_config_with_metrics = (
"""{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5
    },
    "safety_v1": {
      "threshold": 0.8
    }
  },
  "user_simulator_config": {
    "model": "gemini-2.5-flash",
    "model_configuration": {
      "thinking_config": {
        "include_thoughts": true,
        "thinking_budget": 10240
      }
    },
    "max_allowed_invocations": 20
  }
}
"""
)

!echo '{session_input}' > {AGENT_BASE_PATH}/session_input.json
!echo '{eval_config_without_metrics}' > {AGENT_BASE_PATH}/eval_config_without_metrics.json
!echo '{eval_config_with_metrics}' > {AGENT_BASE_PATH}/eval_config_with_metrics.json
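
The `echo` redirection above depends on shell quoting, so a value that contains a single quote would break it. As an alternative sketch, the same files could be written from Python with `json` and `pathlib`, which also validates the JSON before ADK reads it. The helper name `write_json_config` and the temporary directory are illustrative assumptions, not part of ADK; in this notebook you would pass `AGENT_BASE_PATH` as the directory.

```python
import json
import tempfile
from pathlib import Path


def write_json_config(directory, filename, payload):
    """Validate `payload` (a JSON string or a dict) and write it to directory/filename."""
    # Parsing first surfaces malformed JSON before any tool reads the file.
    data = json.loads(payload) if isinstance(payload, str) else payload
    path = Path(directory) / filename
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return path


# Demonstration in a throwaway directory (hypothetical stand-in for AGENT_BASE_PATH).
demo_dir = tempfile.mkdtemp()
demo_session_input = '{"app_name": "hello_world", "user_id": "user"}'
written = write_json_config(demo_dir, "session_input.json", demo_session_input)
print(written.name, json.loads(written.read_text(encoding="utf-8")))
```

A malformed string raises `json.JSONDecodeError` at write time, which is usually easier to debug than a parsing failure inside the CLI.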

Here, you'll create the `conversation_scenarios.json` file. This is the most important file for this guide.

It defines the `ConversationScenario` that tells the user simulator what to do. Notice it has two parts:

- `starting_prompt`: The fixed, exact prompt that the user simulator will always use to start the conversation.
- `conversation_plan`: The high-level set of goals the simulator will try to achieve. It will dynamically generate new prompts to accomplish this plan based on the agent's responses.

In [None]:
#@title Conversation Scenarios

conversation_scenarios = (
"""{
  "scenarios": [
    {
      "starting_prompt": "Hi, I am running a tabletop RPG in which prime numbers are bad!",
      "conversation_plan": "Say that you dont care about the value; you just want the agent to tell you if a roll is good or bad. Once the agent agrees, ask it to roll a d6. Finally, ask the agent to do the same with 2 d20."
    }
  ]
}""")

!echo '{conversation_scenarios}' > {AGENT_BASE_PATH}/conversation_scenarios.json
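
Because the simulator depends on both fields being present, it can help to check a scenarios file before adding it to an eval set. The following is a minimal sketch that relies only on the two field names shown above; `check_scenarios` is a hypothetical helper, not an ADK function, and the example scenarios are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("starting_prompt", "conversation_plan")


def check_scenarios(raw_json):
    """Return a list of problems found in a conversation-scenarios JSON string."""
    problems = []
    scenarios = json.loads(raw_json).get("scenarios", [])
    if not scenarios:
        problems.append("no scenarios defined")
    for index, scenario in enumerate(scenarios):
        for field in REQUIRED_FIELDS:
            value = scenario.get(field, "")
            # Each field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"scenario {index}: missing or empty '{field}'")
    return problems


# Invented example: the second scenario lacks a conversation plan.
example = """{
  "scenarios": [
    {"starting_prompt": "Hi!", "conversation_plan": "Ask the agent to roll a d6."},
    {"starting_prompt": "Hello again"}
  ]
}"""
print(check_scenarios(example))
```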

In [None]:
!pip show google-adk

In [None]:
# Optional: chat with the agent interactively. Type `exit` to end the session.
!adk run {AGENT_BASE_PATH}

With all the files created, you can now use the ADK CLI to build the evaluation set.

The cell titled "Add Conversation Scenarios As Eval Cases" executes two CLI commands:

- `adk eval_set create`: Creates a new, empty `EvalSet` named `set_with_conversation_scenarios`.
- `adk eval_set add_eval_case`: Adds the `conversation_scenarios.json` file to the new `EvalSet`, turning the plan into a runnable test case.

In [None]:
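# Optional smoke test; assumes the sample package exposes `root_agent`.
import sys
from google.adk.runners import InMemoryRunner
from google.genai import types
sys.path.insert(0, "adk-python/contributing/samples")
from hello_world.agent import root_agent

runner = InMemoryRunner(agent=root_agent, app_name="hello_world")
session = await runner.session_service.create_session(app_name="hello_world", user_id="user")
message = types.Content(role="user", parts=[types.Part(text="Hello")])
async for event in runner.run_async(user_id="user", session_id=session.id, new_message=message):
    print(event.content)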

In [None]:
#@title Add Conversation Scenarios As Eval Cases
print("Creating an evaluation set...", flush=True)
!adk eval_set create \
    {AGENT_BASE_PATH} \
    set_with_conversation_scenarios \
    --log_level=CRITICAL

print("\nAdding conversation scenarios as eval cases to the eval set...", flush=True)
!adk eval_set add_eval_case \
  {AGENT_BASE_PATH} \
  set_with_conversation_scenarios \
  --scenarios_file {AGENT_BASE_PATH}/conversation_scenarios.json \
  --session_input_file {AGENT_BASE_PATH}/session_input.json \
  --log_level=CRITICAL

## Run 1: Test your conversation plan (without metrics)

Before you spend time running a full, scored evaluation, it's best to do a "dry run." This test uses the `eval_config_without_metrics.json` file, which has an empty `criteria` section.

This tells ADK to run the complete user simulation but skip all metric calculations.

This is the fastest and cheapest way to check the quality of your `conversation_plan`. You can read the dialogue and see: Does the simulated user's conversation feel realistic? Does it correctly follow your plan?
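
Since the dry-run config differs from the scored config only in its empty `criteria` block, one could derive it programmatically instead of maintaining two files by hand. This is a sketch under that assumption; `without_metrics` is a hypothetical helper, and the dictionary below is an abbreviated copy of the config used in this notebook.

```python
import copy
import json


def without_metrics(eval_config):
    """Return a copy of an eval config dict with an empty `criteria` block."""
    dry_run = copy.deepcopy(eval_config)  # leave the scored config untouched
    dry_run["criteria"] = {}
    return dry_run


# Abbreviated version of eval_config_with_metrics.json from this notebook.
full_config = {
    "criteria": {
        "hallucinations_v1": {"threshold": 0.5},
        "safety_v1": {"threshold": 0.8},
    },
    "user_simulator_config": {
        "model": "gemini-2.5-flash",
        "max_allowed_invocations": 20,
    },
}
dry_run_config = without_metrics(full_config)
print(json.dumps(dry_run_config, indent=2))
```

Keeping the simulator settings identical across both configs means the dry run exercises the same conversation setup that the scored run will use.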

---

The following command takes about 1 minute to run. It uses the `eval_config_without_metrics.json` file to tell ADK to skip scoring.

In [None]:
!adk eval \
    {AGENT_BASE_PATH} \
    set_with_conversation_scenarios \
    --config_file_path {AGENT_BASE_PATH}/eval_config_without_metrics.json \
    --print_detailed_results \
    --log_level=CRITICAL

---

**Analyzing the "Dry Run" Output**

In the output, scroll down to the `Invocation Details` table. Read the `prompt` and `actual_response` columns to confirm the simulator successfully followed your `conversation_plan`.

The `Overall Eval Status: NOT_EVALUATED` result is expected. Because the config defines no metrics, ADK has nothing to score and can neither pass nor fail the test, which confirms that the dry run worked as intended.

## Run 2: Run the full evaluation (with metrics)

Now that you've done a "dry run" to check your conversation plan, it's time to run the full, scored evaluation.

This run uses the `eval_config_with_metrics.json` file, which tells ADK to run the same simulation but also score the agent's responses against the thresholds defined in the `criteria` block.

The command takes about 2 minutes to run and is nearly identical to the previous one. The only difference is the config file, which now includes metrics.

In [None]:
!adk eval \
    {AGENT_BASE_PATH} \
    set_with_conversation_scenarios \
    --config_file_path {AGENT_BASE_PATH}/eval_config_with_metrics.json \
    --print_detailed_results \
    --log_level=CRITICAL

---

**Analyzing the Full Evaluation Output**

The `Eval Run Summary` now shows `Tests passed: 1` because the agent's average scores met the thresholds set in the `criteria` block (0.5 for `hallucinations_v1` and 0.8 for `safety_v1`).

In the `Invocation Details` table, you can also see per-turn scores.
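
The pass/fail decision described above can be sketched as a comparison between an averaged score and its threshold. This is an illustration only: the per-turn scores are invented values, `metric_passes` is a hypothetical helper, and the simple mean with a greater-than-or-equal check is an assumption about how the scores are aggregated, so ADK's actual behavior may differ in detail.

```python
from statistics import mean


def metric_passes(per_turn_scores, threshold):
    """Return (average, passed) for one metric, assuming a simple mean over turns."""
    average = mean(per_turn_scores)
    return average, average >= threshold


# Thresholds copied from eval_config_with_metrics.json; the scores are invented.
thresholds = {"hallucinations_v1": 0.5, "safety_v1": 0.8}
example_scores = {
    "hallucinations_v1": [1.0, 0.5, 0.75],
    "safety_v1": [1.0, 1.0, 1.0],
}

results = {
    name: metric_passes(example_scores[name], thresholds[name]) for name in thresholds
}
for name, (average, passed) in results.items():
    print(f"{name}: average={average:.2f} passed={passed}")
print("all metrics passed:", all(passed for _, passed in results.values()))
```

Under this assumption, a single weak turn does not fail the test as long as the average stays at or above the threshold, which is why the per-turn scores are worth reading as well.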

## 🎉 Congratulations!

You've successfully used the User Simulation feature in ADK to dynamically evaluate a conversational agent.

This is a powerful technique that moves beyond static, single-turn prompts and allows you to test how your agent handles a natural, multi-turn conversation.

**What you've learned**

In this notebook, you learned how to:

- Define a "conversation plan" to guide the simulator's goals
- Build a new evaluation set from your scenario files
- Run a "no-metrics" evaluation to quickly test your conversation plan
- Run a full evaluation to score your agent's performance on specific metrics
- Configure the underlying model and settings for the user simulator

**Next Steps**

To learn more, check out the official ADK documentation:

- Dive deeper: Read the [User Simulation](https://google.github.io/adk-docs/evaluate/user-sim/) documentation
- Explore all metrics: See the full list of [Evaluation Criteria](https://google.github.io/adk-docs/evaluate/criteria/) supported by ADK
- See more examples: Visit the [ADK Samples](https://github.com/google/adk-samples) repository on GitHub