In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluation Criteria In ADK

<a target="_blank" href="https://colab.research.google.com/github/google/adk-samples/blob/main/python/notebooks/evaluation/evaluation_criteria_in_adk.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

| Author(s) |
| --- |
| [Ankur Sharma](https://github.com/ankursharmas) |

## Overview

Agent Development Kit [(ADK)](https://google.github.io/adk-docs/) is a flexible and modular framework that applies software development principles to AI agent creation. It is designed to simplify building, deploying, and orchestrating agent workflows, from simple tasks to complex systems.

Unlike traditional software with clear pass/fail unit tests, LLM agents are probabilistic and require a more nuanced approach to evaluation and testing. ADK helps you manage this probabilistic behavior and enables you to [evaluate/test](https://google.github.io/adk-docs/evaluate/) agents not only on their final output, but also on the path they took to reach it.

This Colab notebook demonstrates how to evaluate and test your agents against various criteria using the ADK CLI. It covers the following criteria:

  *   `tool_trajectory_avg_score`: Compares the agent's tool calls with the expected trajectory.
  *   `response_match_score`: Evaluates final response similarity using ROUGE-1.
  *   `final_response_match_v2`: Assesses semantic equivalence of responses using an LLM judge.
  *   `rubric_based_final_response_quality_v1`: Evaluates response quality against user-defined rubrics with an LLM judge.
  *   `rubric_based_tool_use_quality_v1`: Assesses tool usage quality against user-defined rubrics with an LLM judge.
  *   `hallucinations_v1`: Checks for false, contradictory, or unsupported claims in agent responses.
  *   `safety_v1`: Evaluates the harmlessness of agent responses using Vertex AI General AI Eval SDK.

By the end of this notebook, you will understand how to set up and run comprehensive evaluations for your ADK Agents.

## Get started

### Install ADK and other required packages

In [None]:
%pip install --upgrade --quiet \
     "google-adk==1.18.0" \
     "google-cloud-aiplatform[evaluation]>=1.100.0" \
     "rouge-score>=0.1.2" \
     "tabulate>=0.9.0"


### Authenticate your notebook environment

Run the cell below to authenticate your account in Google Colab:

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import os

PROJECT_ID = ""  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "global"  # @param {type: "string", placeholder: "[your-region]", isTemplate: true}

# Set environment vars
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "1"

## Set up

Before you can try an evaluation criterion, you'll need to prepare the agent and evaluation data.

This section walks you through downloading the `hello_world` sample agent, which you'll use as the test subject, and then creating an eval dataset.

First, clone the `adk-python` repository from GitHub. It contains the `hello_world` sample agent used throughout this notebook:

In [None]:
# @title Download HelloWorld Agent From ADK Github Repo
AGENT_BASE_PATH = "adk-python/contributing/samples/hello_world"

!git clone https://github.com/google/adk-python/
!ls {AGENT_BASE_PATH}

Cloning into 'adk-python'...
remote: Enumerating objects: 17279, done.
remote: Counting objects: 100% (2954/2954), done.
remote: Compressing objects: 100% (822/822), done.
remote: Total 17279 (delta 2458), reused 2149 (delta 2122), pack-reused 14325 (from 4)
Receiving objects: 100% (17279/17279), 31.20 MiB | 33.38 MiB/s, done.
Resolving deltas: 100% (10331/10331), done.
agent.py  __init__.py  main.py


In [None]:
# @title Set Up Eval Data Needed By Eval
serialized_eval_set = """
{
  "eval_set_id": "sample_eval_set_01",
  "name": "sample_eval_set_01",
  "eval_cases": [
    {
      "eval_id": "roll_dice_9_and_check_prime_10_19",
      "conversation": [
        {
          "invocation_id": "e-df832358-8669-4153-acb6-55fef0f139d2",
          "user_content": {
            "parts": [
              {
                "text": "What can you do?"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "I can roll a die of a specified number of sides and check if a list of numbers are prime."
              }
            ],
            "role": "model"
          },
          "intermediate_data": {},
          "creation_timestamp": 1758846836.067581
        },
        {
          "invocation_id": "e-377f3392-0587-4741-9474-439eafd45592",
          "user_content": {
            "parts": [
              {
                "text": "Roll a 9 sided dice"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "I rolled a 9 sided die and got a 6."
              }
            ],
            "role": "model"
          },
          "intermediate_data": {
            "invocation_events": [
              {
                "author": "hello_world_agent",
                "content": {
                  "parts": [
                    {
                      "function_call": {
                        "id": "adk-85ed5aa0-baf0-43f6-b55d-85b518120645",
                        "args": {
                          "sides": 9
                        },
                        "name": "roll_die"
                      }
                    }
                  ],
                  "role": "model"
                }
              },
              {
                "author": "hello_world_agent",
                "content": {
                  "parts": [
                    {
                      "function_response": {
                        "id": "adk-85ed5aa0-baf0-43f6-b55d-85b518120645",
                        "name": "roll_die",
                        "response": {
                          "result": 6
                        }
                      }
                    }
                  ],
                  "role": "user"
                }
              }
            ]
          },
          "creation_timestamp": 1758846843.514974
        },
        {
          "invocation_id": "e-599ddefd-1588-4cca-82a1-8e6461acaf52",
          "user_content": {
            "parts": [
              {
                "text": "Are 10 and 19 prime numbers?"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "19 is a prime number, while 10 is not."
              }
            ],
            "role": "model"
          },
          "intermediate_data": {
            "invocation_events": [
              {
                "author": "hello_world_agent",
                "content": {
                  "parts": [
                    {
                      "function_call": {
                        "id": "adk-ae456e0f-4b02-4a44-981e-68528ae8fc2f",
                        "args": {
                          "nums": [
                            10,
                            19
                          ]
                        },
                        "name": "check_prime"
                      }
                    }
                  ],
                  "role": "model"
                }
              },
              {
                "author": "hello_world_agent",
                "content": {
                  "parts": [
                    {
                      "function_response": {
                        "id": "adk-ae456e0f-4b02-4a44-981e-68528ae8fc2f",
                        "name": "check_prime",
                        "response": {
                          "result": "19 are prime numbers."
                        }
                      }
                    }
                  ],
                  "role": "user"
                }
              }
            ]
          },
          "creation_timestamp": 1758846851.372041
        }
      ],
      "session_input": {
        "app_name": "hello_world",
        "user_id": "user"
      },
      "creation_timestamp": 1758846897.1953406
    }
  ],
  "creation_timestamp": 1758846869.1735425
}
"""

!echo '{serialized_eval_set}' > {AGENT_BASE_PATH}/sample_eval_set_01.evalset.json
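
Before running any criterion, it can help to see which tool calls the eval set treats as expected. The sketch below is only an illustration: `expected_tool_calls` is a hypothetical helper (not part of ADK), and the tiny `sample` eval set with its shortened IDs is made up to mirror the JSON structure shown above.

```python
import json


def expected_tool_calls(eval_set: dict) -> dict:
    """Map each invocation_id to the (name, args) pairs of its function calls.

    Hypothetical helper for inspection only; it relies on the JSON layout
    used by the eval set in this notebook.
    """
    calls = {}
    for case in eval_set["eval_cases"]:
        for invocation in case["conversation"]:
            events = invocation.get("intermediate_data", {}).get(
                "invocation_events", []
            )
            calls[invocation["invocation_id"]] = [
                (part["function_call"]["name"], part["function_call"]["args"])
                for event in events
                for part in event["content"]["parts"]
                if "function_call" in part
            ]
    return calls


# Made-up, minimal eval set with the same shape as the one above.
sample = json.loads("""
{
  "eval_cases": [
    {
      "conversation": [
        {"invocation_id": "inv-1", "intermediate_data": {}},
        {
          "invocation_id": "inv-2",
          "intermediate_data": {
            "invocation_events": [
              {"content": {"parts": [
                {"function_call": {"name": "roll_die", "args": {"sides": 9}}}
              ]}},
              {"content": {"parts": [
                {"function_response": {"name": "roll_die", "response": {"result": 6}}}
              ]}}
            ]
          }
        }
      ]
    }
  ]
}
""")

calls = expected_tool_calls(sample)
print(calls)
```

In the notebook itself, the same helper could be applied to the real data with `expected_tool_calls(json.loads(serialized_eval_set))`.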

## Evaluation Criteria


### Criterion - `tool_trajectory_avg_score`

This criterion compares the tool call trajectory produced by the agent with an
expected trajectory and computes an average score based on exact match.

#### Details

For each invocation that is being evaluated, this criterion compares the list of
tool calls produced by the agent against the list of expected tool calls. The
comparison is done by performing an exact match on the tool name and tool
arguments for each tool call in the list. If all tool calls in an invocation
match exactly in content and order, a score of 1.0 is awarded for that
invocation, otherwise the score is 0.0. The final value is the average of these
scores across all invocations in the eval case.
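
The scoring rule described above can be sketched in a few lines. This is a simplified illustration, not ADK's implementation: the `ToolCall` alias, the function names, and the `actual_bad` trajectory are all assumptions made for the example.

```python
from typing import Any

# Assumed representation of one tool call: (tool name, tool arguments).
ToolCall = tuple[str, dict[str, Any]]


def trajectory_invocation_score(
    actual: list[ToolCall], expected: list[ToolCall]
) -> float:
    """Return 1.0 only if names and arguments match exactly, in the same order."""
    return 1.0 if actual == expected else 0.0


def tool_trajectory_avg(
    actual_by_invocation: list[list[ToolCall]],
    expected_by_invocation: list[list[ToolCall]],
) -> float:
    """Average the per-invocation exact-match scores across the eval case."""
    if len(actual_by_invocation) != len(expected_by_invocation):
        raise ValueError("actual and expected must cover the same invocations")
    scores = [
        trajectory_invocation_score(actual, expected)
        for actual, expected in zip(actual_by_invocation, expected_by_invocation)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Expected trajectory for the three invocations in the eval set above.
expected = [
    [],  # "What can you do?" needs no tool call
    [("roll_die", {"sides": 9})],
    [("check_prime", {"nums": [10, 19]})],
]
# Made-up deviation: the agent passes the wrong number of sides.
actual_bad = [
    [],
    [("roll_die", {"sides": 6})],
    [("check_prime", {"nums": [10, 19]})],
]

print(tool_trajectory_avg(expected, expected))
print(tool_trajectory_avg(actual_bad, expected))
```

Because the match is all-or-nothing per invocation, a single wrong argument drops that invocation to 0.0, so with a threshold of 1.0 any deviation fails the eval case.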

#### Output And How To Interpret

The output is a score between 0.0 and 1.0, where 1.0 indicates a perfect match
between actual and expected tool trajectories for all invocations, and 0.0
indicates a complete mismatch for all invocations. Higher scores are better. A
score below 1.0 means that for at least one invocation, the agent's tool call
trajectory deviated from the expected one.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#tool_trajectory_avg_score).

The cell below takes about 30 seconds to run.

In [None]:
eval_config = """
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'tool_trajectory_avg_score': 1.0} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
*********************************************************************
Eval Run Summary
sample_eval_set_01:
  Tests passed: 1
  Tests failed: 0
********************************************************************
Eval Set Id: sample_eval_set_01
Eval Id: roll_dice_9_and_check_prime_10_19
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 1.0
---------------------------------------------------------------------
Invocation Details:
+----+---------------------+----------

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `tool_trajectory_avg_score` metric.

**Metric Details:**
*   **Metric:** `tool_trajectory_avg_score`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 1.0

This means the agent achieved a perfect score of 1.0, which meets the threshold of 1.0. The detailed invocation results show that for each turn in the conversation ("What can you do?", "Roll a 9 sided dice", "Are 10 and 19 prime numbers?"), the agent's actual tool calls matched the expected ones exactly, resulting in a score of 1.0 for each invocation. The agent's tool usage trajectory therefore aligns with the golden trajectory for this eval set.

### Criterion - `response_match_score`

This criterion evaluates whether the agent's final response matches a
golden/expected final response using ROUGE-1.

#### Details

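ROUGE-1 measures the overlap of unigrams (single words) between the agent's
response and the reference response. To learn more, see the details on
[ROUGE-1](https://github.com/google-research/google-research/tree/master/rouge).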
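
To make the idea of lexical overlap concrete, here is a rough, dependency-free sketch of a ROUGE-1 style F-measure. It is an approximation under stated assumptions: the tokenizer is a simple lowercase alphanumeric split with no stemming, and the choice of the F-measure (rather than precision or recall alone) is assumed, so its numbers may differ from what `adk eval` reports. The `rouge1_f1` name and the short candidate sentence are made up for the example.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F-measure between a reference and a candidate text."""
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    cand_tokens = re.findall(r"[a-z0-9]+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "I rolled a 9 sided die and got a 6."
identical_score = rouge1_f1(reference, reference)
shorter_score = rouge1_f1(reference, "You rolled a 6.")

print(identical_score)  # 1.0
print(round(shorter_score, 3))
```

The second candidate conveys the same result but shares only three of the reference's ten tokens, which is why a lexical metric can score a correct answer well below the threshold.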

#### Output And How To Interpret

The value range for this criterion is [0, 1]; values closer to 1 are better.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#response_match_score).

The cell below takes about 30 seconds to run.

In [None]:
eval_config = """
{
  "criteria": {
    "response_match_score": 0.8
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'response_match_score': 0.8} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
*********************************************************************
Eval Run Summary
sample_eval_set_01:
  Tests passed: 0
  Tests failed: 1
********************************************************************
Eval Set Id: sample_eval_set_01
Eval Id: roll_dice_9_and_check_prime_10_19
Overall Eval Status: FAILED
---------------------------------------------------------------------
Metric: response_match_score, Status: FAILED, Score: 0.7883597883597884, Threshold: 0.8
---------------------------------------------------------------------
Invocation Details:
+----+---------------------+-----

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: FAILED** for the `response_match_score` metric.

**Metric Details:**
*   **Metric:** `response_match_score`
*   **Status:** FAILED
*   **Score:** 0.7883597883597884
*   **Threshold:** 0.8

This means the agent's overall score (0.788) fell below the set threshold of 0.8, leading to a FAILED status. The `response_match_score` uses ROUGE-1 to compare the agent's final response with the expected golden response. While some invocations (like 'Roll a 9 sided dice') achieved a perfect score, the first invocation ('What can you do?') scored significantly lower (0.476), pulling down the overall average and causing the evaluation to fail. This indicates that the agent's responses, particularly for the initial prompt, did not match the expected responses closely enough based on lexical overlap.

### Criterion - `final_response_match_v2`

This criterion evaluates whether the agent's final response matches a
golden/expected final response using an LLM as a judge.

#### Details

This criterion uses a Large Language Model (LLM) as a judge to determine if the
agent's final response is semantically equivalent to the provided reference
response. It is designed to be more flexible than lexical matching metrics (like
`response_match_score`), as it focuses on whether the agent's response contains
the correct information, while tolerating differences in formatting, phrasing,
or the inclusion of additional correct details.

For each invocation, the criterion prompts a judge LLM to rate the agent's
response as "valid" or "invalid" compared to the reference. This is repeated
multiple times for robustness (configurable via `num_samples`), and a majority
vote determines if the invocation receives a score of 1.0 (valid) or 0.0
(invalid). The final criterion score is the fraction of invocations deemed valid
across the entire eval case.
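
The aggregation described above (majority vote per invocation, then the fraction of valid invocations) can be sketched as follows. The judge calls themselves are left out; the sampled labels are made-up inputs, the function names are hypothetical, and treating a tie as "invalid" is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def majority_is_valid(samples: list[str]) -> bool:
    """True if strictly more judge samples say "valid" than "invalid"."""
    counts = Counter(samples)
    # Assumption: a tie counts as invalid (cannot occur with an odd num_samples).
    return counts["valid"] > counts["invalid"]


def fraction_valid(samples_per_invocation: list[list[str]]) -> float:
    """Fraction of invocations whose majority vote is "valid"."""
    if not samples_per_invocation:
        return 0.0
    votes = [majority_is_valid(samples) for samples in samples_per_invocation]
    return sum(votes) / len(votes)


# Made-up judge outputs: three invocations, num_samples = 5 each.
judge_samples = [
    ["valid", "valid", "valid", "valid", "valid"],
    ["valid", "valid", "invalid", "valid", "invalid"],
    ["invalid", "invalid", "invalid", "valid", "valid"],
]

score = fraction_valid(judge_samples)
print(score)
```

With these inputs the first two invocations pass the vote and the third does not, so the score is 2/3, which would fall below a threshold of 0.8.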

#### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. A score of 1.0 means the LLM
judge considered the agent's final response to be valid for all invocations,
while a score closer to 0.0 indicates that many responses were judged as invalid
when compared to the reference responses. Higher values are better.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#final_response_match_v2).

The cell below takes about 1 minute to run.

In [None]:
eval_config = """
{
  "criteria": {
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    }
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'final_response_match_v2': BaseCriterion(threshold=0.8, judge_model_options={'judge_model': 'gemini-2.5-flash', 'num_samples': 5})} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
  return self._registry[eval_metric.metric_name][0](eval_metric=eval_metric)
  super().__init__(
*********************************************************************
Eval Run Summary
sample_eval_set_01:
  Tests passed: 1
  Tests failed: 0
********************************************************************
Eval Set Id: sample_eval_set_01
Eval Id: roll_dice_9_and_check_prime_10_19
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: fin

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `final_response_match_v2` metric.

**Metric Details:**
*   **Metric:** `final_response_match_v2`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 0.8

This means the agent achieved a perfect score of 1.0, meeting or exceeding the set threshold of 0.8. The `final_response_match_v2` criterion uses an LLM as a judge to assess if the agent's final response is semantically equivalent to the expected reference response. The detailed invocation results show that for all turns in the conversation, the LLM judge considered the agent's actual responses to be valid compared to the expected ones, resulting in a score of 1.0 for each invocation. This demonstrates that the agent's responses convey the correct information, even if there are slight variations in phrasing or additional correct details, as evaluated by the LLM judge.

### Criterion - `rubric_based_final_response_quality_v1`

This criterion assesses the quality of an agent's final response against a
user-defined set of rubrics using an LLM as a judge.

#### Details

This criterion provides a flexible way to evaluate response quality based on
specific criteria that you define as rubrics. For example, you could define
rubrics to check if a response is concise, if it correctly infers user intent,
or if it avoids jargon.

The criterion uses an LLM-as-a-judge to evaluate the agent's final response
against each rubric, producing a `yes` (1.0) or `no` (0.0) verdict for each.
Like other LLM-based metrics, it samples the judge model multiple times per
invocation and uses a majority vote to determine the score for each rubric in
that invocation. The overall score for an invocation is the average of its
rubric scores. The final criterion score for the eval case is the average of
these overall scores across all invocations.
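
The three levels of aggregation above (majority vote per rubric, average over rubrics, average over invocations) can be sketched like this. The judge verdicts are made-up inputs and the function names are hypothetical; this only illustrates the arithmetic, not ADK's implementation.

```python
def rubric_vote(samples: list[bool]) -> float:
    """Majority vote over judge samples: 1.0 if most say yes, else 0.0."""
    return 1.0 if sum(samples) > len(samples) / 2 else 0.0


def rubric_invocation_score(rubric_samples: dict[str, list[bool]]) -> float:
    """Average of the per-rubric majority-vote scores for one invocation."""
    scores = [rubric_vote(samples) for samples in rubric_samples.values()]
    return sum(scores) / len(scores)


def rubric_eval_case_score(invocations: list[dict[str, list[bool]]]) -> float:
    """Average of the per-invocation scores across the eval case."""
    scores = [rubric_invocation_score(inv) for inv in invocations]
    return sum(scores) / len(scores)


# Made-up verdicts (True = yes) for two invocations, five samples per rubric.
verdicts = [
    {
        "conciseness": [True, True, True, False, True],
        "intent_inference": [False, False, True, False, True],
    },
    {
        "conciseness": [True] * 5,
        "intent_inference": [True] * 5,
    },
]

overall = rubric_eval_case_score(verdicts)
print(overall)
```

Here the first invocation satisfies only one of two rubrics (0.5) and the second satisfies both (1.0), giving an overall 0.75, which would fail a threshold of 0.8.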

#### Output And How To Interpret

The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates
that the agent's responses satisfied all rubrics across all invocations, and 0.0
indicates that no rubrics were satisfied. The results also include detailed
per-rubric scores for each invocation. Higher values are better.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#rubric_based_final_response_quality_v1).

The cell below takes about 1-2 minutes to run.

In [None]:
eval_config = """
{
  "criteria": {
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      },
      "rubrics": [
        {
          "rubric_id": "conciseness",
          "rubric_content": {
            "text_property": "The response from the agent is direct and to the point."
          }
        },
        {
          "rubric_id": "intent_inference",
          "rubric_content": {
            "text_property": "The response from the agent accurately infers the underlying goal from ambiguous queries."
          }
        }
      ]
    }
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'rubric_based_final_response_quality_v1': BaseCriterion(threshold=0.8, judge_model_options={'judge_model': 'gemini-2.5-flash', 'num_samples': 5}, rubrics=[{'rubric_id': 'conciseness', 'rubric_content': {'text_property': 'The response from the agent is direct and to the point.'}}, {'rubric_id': 'intent_inference', 'rubric_content': {'text_property': 'The response from the agent accurately infers the underlying goal from ambiguous queries.'}}])} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
  return self._registry[eval_metric.metric_name][0](eval_metric=eval_metric)
  super().__init__(
  super().__init__(
************************************************************

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `rubric_based_final_response_quality_v1` metric.

**Metric Details:**
*   **Metric:** `rubric_based_final_response_quality_v1`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 0.8

This means the agent achieved a perfect score of 1.0, meeting or exceeding the set threshold of 0.8. The `rubric_based_final_response_quality_v1` criterion uses an LLM as a judge to assess the quality of the agent's final response against user-defined rubrics. In this case, both rubrics — "The response from the agent is direct and to the point." and "The response from the agent accurately infers the underlying goal from ambiguous queries." — received a perfect score of 1.0 across all invocations. This demonstrates that the agent's responses successfully met both conciseness and intent inference criteria as evaluated by the LLM judge. For detailed reasoning and scores for each rubric per invocation, please refer to the individual per-rubric columns in the invocation details table above.

### Criterion - `rubric_based_tool_use_quality_v1`

This criterion assesses the quality of an agent's tool usage against a
user-defined set of rubrics using an LLM as a judge.

#### Details

This criterion provides a flexible way to evaluate tool usage based on specific rules that you define as rubrics. For example, you could define rubrics to check if a specific tool was called, if its parameters were correct, or if tools were called in a particular order.

The criterion uses an LLM-as-a-judge to evaluate the agent's tool calls and responses against each rubric, producing a `yes` (1.0) or `no` (0.0) verdict for each. Like other LLM-based metrics, it samples the judge model multiple times per invocation and uses a majority vote to determine the score for each rubric in that invocation. The overall score for an invocation is the average of its rubric scores. The final criterion score for the eval case is the average of these overall scores across all invocations.

#### Output And How To Interpret

The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates that the agent's tool usage satisfied all rubrics across all invocations, and 0.0 indicates that no rubrics were satisfied. The results also include detailed per-rubric scores for each invocation. Higher values are better.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#rubric_based_tool_use_quality_v1).

The cell below takes about 1-2 minutes to run.

In [None]:
eval_config = """
{
  "criteria": {
    "rubric_based_tool_use_quality_v1": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      },
      "rubrics": [
        {
          "rubric_id": "tool_use_1",
          "rubric_content": {
            "text_property": "roll_dice tool is only called when user prompt asks for a dice roll."
          }
        },
        {
          "rubric_id": "tool_use_2",
          "rubric_content": {
            "text_property": "check_prime is only called when user prompt asks for a prime number."
          }
        }
      ]
    }
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'rubric_based_tool_use_quality_v1': BaseCriterion(threshold=0.8, judge_model_options={'judge_model': 'gemini-2.5-flash', 'num_samples': 5}, rubrics=[{'rubric_id': 'tool_use_1', 'rubric_content': {'text_property': 'roll_dice tool is only called when user prompt asks for a dice roll.'}}, {'rubric_id': 'tool_use_2', 'rubric_content': {'text_property': 'check_prime is only called when user prompt asks for a prime number.'}}])} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
  return self._registry[eval_metric.metric_name][0](eval_metric=eval_metric)
  super().__init__(
  super().__init__(
*********************************************************************
Eval Run Su

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `rubric_based_tool_use_quality_v1` metric.

**Metric Details:**
*   **Metric:** `rubric_based_tool_use_quality_v1`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 0.8

This means the agent achieved a perfect score of 1.0, meeting or exceeding the set threshold of 0.8. The `rubric_based_tool_use_quality_v1` criterion uses an LLM as a judge to assess the quality of the agent's tool usage against user-defined rubrics. In this case, both rubrics — "roll_dice tool is only called when user prompt asks for a dice roll." and "check_prime is only called when user prompt asks for a prime number." — received a perfect score of 1.0 across all invocations. This demonstrates that the agent's tool usage successfully adhered to the defined rules as evaluated by the LLM judge. For detailed reasoning and scores for each rubric per invocation, please refer to the individual per-rubric columns in the invocation details table above.

### Criterion - `hallucinations_v1`

This criterion assesses whether a model response contains any false,
contradictory, or unsupported claims.

#### Details

This criterion assesses whether a model response contains any false,
contradictory, or unsupported claims based on context that includes developer
instructions, user prompt, tool definitions, and tool invocations and their
results. It uses LLM-as-a-judge and follows a two-step process:

1.  **Segmenter**: Segments the agent response into individual sentences.
2.  **Sentence Validator**: Evaluates each segmented sentence against the
    provided context for grounding. Each sentence is labeled as `supported`,
    `unsupported`, `contradictory`, `disputed` or `not_applicable`.

The metric computes an accuracy score: the fraction of sentences labeled
`supported` or `not_applicable`. By default, only the final response is
evaluated. If `evaluate_intermediate_nl_responses` is set to true in the
criterion, intermediate natural language responses from agents are also
evaluated.
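
Once the sentence validator has labeled each sentence, the score is a simple ratio. The sketch below shows only that final step; the labels are made-up inputs (in the real criterion they come from the LLM judge), and `grounding_accuracy` is a hypothetical name.

```python
# Labels that count as grounded, per the description above.
GROUNDED_LABELS = {"supported", "not_applicable"}


def grounding_accuracy(sentence_labels: list[str]) -> float:
    """Fraction of sentences labeled `supported` or `not_applicable`."""
    if not sentence_labels:
        raise ValueError("at least one labeled sentence is required")
    grounded = sum(label in GROUNDED_LABELS for label in sentence_labels)
    return grounded / len(sentence_labels)


# Made-up validator output for a four-sentence response.
labels = ["supported", "supported", "unsupported", "not_applicable"]

accuracy = grounding_accuracy(labels)
print(accuracy)
```

One unsupported sentence out of four yields 0.75, which still passes the 0.5 threshold used in the cell below; a stricter threshold would be needed to flag it.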

#### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. A score of 1.0 means all
sentences in the agent's response are grounded in the context, while a score
closer to 0.0 indicates that many sentences are false, contradictory, or
unsupported. Higher values are better.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#hallucinations_v1).

The cell below takes about 1-2 minutes to run.

In [None]:
eval_config = """
{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5,
      "evaluate_intermediate_nl_responses": false
    }
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'hallucinations_v1': BaseCriterion(threshold=0.5, evaluate_intermediate_nl_responses=False)} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
  return self._registry[eval_metric.metric_name][0](eval_metric=eval_metric)
*********************************************************************
Eval Run Summary
sample_eval_set_01:
  Tests passed: 1
  Tests failed: 0
********************************************************************
Eval Set Id: sample_eval_set_01
Eval Id: roll_dice_9_and_check_prime_10_19
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: hallucinations_v1, Status: PASSED, Score: 1.0, Threshold: 0.5


---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `hallucinations_v1` metric.

**Metric Details:**
*   **Metric:** `hallucinations_v1`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 0.5

This means the agent achieved a perfect score of 1.0, which clears the configured threshold of 0.5. The `hallucinations_v1` criterion assesses whether the model's response contains any false, contradictory, or unsupported claims based on the provided context. The detailed invocation results show that for all turns in the conversation, the agent's responses were fully grounded in the context, resulting in a score of 1.0 for each invocation. This indicates that, for this eval case, the LLM judge found no hallucinated content in the agent's responses.
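
If you capture the CLI output in a script, the per-metric line can be checked programmatically. The sketch below is an illustration only: it assumes the line keeps the exact `Metric: ..., Status: ..., Score: ..., Threshold: ...` shape printed above, which is console output and not a documented interface, and the helper name `parse_metric_line` is hypothetical.

```python
import re

# Pattern for the per-metric line as printed by the run above, for example:
# "Metric: hallucinations_v1, Status: PASSED, Score: 1.0, Threshold: 0.5"
METRIC_LINE = re.compile(
    r"Metric: (?P<name>\w+), Status: (?P<status>\w+), "
    r"Score: (?P<score>[\d.]+), Threshold: (?P<threshold>[\d.]+)"
)


def parse_metric_line(line):
    """Return the metric fields as a dict, or None if the line does not match."""
    match = METRIC_LINE.search(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "status": match["status"],
        "score": float(match["score"]),
        "threshold": float(match["threshold"]),
    }


# The metric line copied from the output above.
line = "Metric: hallucinations_v1, Status: PASSED, Score: 1.0, Threshold: 0.5"
result = parse_metric_line(line)

# A metric passes when its score meets or exceeds the threshold.
meets_threshold = result["score"] >= result["threshold"]
print(result["name"], result["status"], meets_threshold)
# hallucinations_v1 PASSED True
```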

### Criterion - `safety_v1`

This criterion evaluates the safety (harmlessness) of an agent's response.

#### Details

This criterion assesses whether the agent's response contains any harmful
content, such as hate speech, harassment, or dangerous information. Unlike other
metrics implemented natively within ADK, `safety_v1` delegates the evaluation to
the Vertex AI General AI Eval SDK.

#### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. Scores closer to 1.0 indicate
that the response is safe, while scores closer to 0.0 indicate potential safety
issues.

More details can be found [here](https://google.github.io/adk-docs/evaluate/criteria/#safety_v1).

The cell below takes about 1-2 minutes to run.

In [None]:
eval_config = """
{
  "criteria": {
    "safety_v1": 0.8
  }
}
"""

!echo '{eval_config}' > {AGENT_BASE_PATH}/eval_config.json
!adk eval \
    {AGENT_BASE_PATH} \
    --config_file_path {AGENT_BASE_PATH}/eval_config.json \
    sample_eval_set_01 \
    --print_detailed_results \
    --log_level=CRITICAL

  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'safety_v1': 0.8} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
Computing Metrics for Evaluation Dataset: 100% 1/1 [00:03<00:00,  3.14s/it]
Computing Metrics for Evaluation Dataset: 100% 1/1 [00:03<00:00,  3.78s/it]
Computing Metrics for Evaluation Dataset: 100% 1/1 [00:03<00:00,  3.32s/it]
*********************************************************************
Eval Run Summary
sample_eval_set_01:
  Tests passed: 1
  Tests failed: 0
********************************************************************
Eval Set Id: sample_eval_set_01
Eval Id: roll_dice_9_and_check_prime_10_19
Overall Eval Status: PASSED
----------------------------------------------------------------

---

**Analyzing the Full Evaluation Output**

The evaluation results for `sample_eval_set_01` and `eval_id: roll_dice_9_and_check_prime_10_19` indicate an **Overall Eval Status: PASSED** for the `safety_v1` metric.

**Metric Details:**
*   **Metric:** `safety_v1`
*   **Status:** PASSED
*   **Score:** 1.0
*   **Threshold:** 0.8

This means the agent achieved a perfect score of 1.0, which clears the configured threshold of 0.8. The `safety_v1` criterion evaluates the safety (harmlessness) of an agent's response. The detailed invocation results show that for all turns in the conversation, the agent's responses were deemed safe, resulting in a score of 1.0 for each invocation. This indicates that, for this eval case, the evaluator found no harmful content in the agent's responses.
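
The two config cells in this part of the notebook write a criterion in two shapes: `hallucinations_v1` maps to an object with a `threshold` key plus an option, while `safety_v1` maps directly to a number. The sketch below reads the threshold from either shape. It is only a reading aid: the helper name `criterion_thresholds` is hypothetical, and the code says nothing about how ADK itself parses the file.

```python
import json
from typing import Dict


def criterion_thresholds(config_text: str) -> Dict[str, float]:
    """Map each criterion name in an eval config to its threshold."""
    criteria = json.loads(config_text)["criteria"]
    thresholds = {}
    for name, value in criteria.items():
        if isinstance(value, dict):
            # Object form: the threshold sits next to criterion options.
            thresholds[name] = float(value["threshold"])
        else:
            # Shorthand form: the value is the threshold itself.
            thresholds[name] = float(value)
    return thresholds


# Config text copied from the hallucinations_v1 cell above.
hallucinations_config = """
{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5,
      "evaluate_intermediate_nl_responses": false
    }
  }
}
"""

# Config text copied from the safety_v1 cell above.
safety_config = """
{
  "criteria": {
    "safety_v1": 0.8
  }
}
"""

print(criterion_thresholds(hallucinations_config))  # {'hallucinations_v1': 0.5}
print(criterion_thresholds(safety_config))  # {'safety_v1': 0.8}
```

Parsing the text with `json.loads` before writing it to `eval_config.json` also catches malformed JSON before the 1-2 minute eval run starts.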

## 🎉 Congratulations!

You've worked through this Colab notebook and seen how to evaluate your agents using the evaluation criteria provided by ADK.

**What you've learned**

In this notebook, you've learned how to:

*   Prepare an agent and evaluation data using the `hello_world` sample.
*   Apply and interpret various ADK evaluation criteria.

**Next Steps**

To learn more, check out the official ADK documentation:

*   Dive deeper: Read the [Evaluation](https://google.github.io/adk-docs/evaluate/) documentation.
*   Explore all metrics: See the full list of [Evaluation Criteria](https://google.github.io/adk-docs/evaluate/criteria/) supported by ADK.
*   See more examples: Visit the [ADK Samples](https://github.com/google/adk-samples) repository on GitHub.