# Evaluate AI agents (Azure AI Agent Service) in Azure AI Foundry

## Objective


This sample demonstrates how to evaluate an AI agent (Azure AI Agent Service) on these important aspects of your agentic workflow:

- Intent Resolution: Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
- Tool Call Accuracy: Evaluates the agent's ability to select the appropriate tools, and process correct parameters from previous steps.
- Task Adherence: Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.

For AI agents outside of Azure AI Agent Service, you can still provide th agent data in the two formats (either simple data or agent messages) specified in the individual evaluator samples:
- [Intent resolution](https://aka.ms/intentresolution-sample)
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample)
- [Task adherence](https://aka.ms/taskadherence-sample)
- [Response Completeness](https://aka.ms/rescompleteness-sample)



## Time 

You should expect to spend about 20 minutes running this notebook. 

## Before you begin
Creating an agent using Azure AI agent service requires an Azure AI Foundry project and a deployed, supported model. See more details in [Create a new agent](https://learn.microsoft.com/azure/ai-services/agents/quickstart?pivots=ai-foundry-portal).

For quality evaluation, you need to deploy a `gpt` model supporting JSON mode. We recommend a model `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.    

Important: Make sure to authenticate to Azure using `az login` in your terminal before running this notebook.

### Prerequisite

Before running the sample:
```bash
pip install azure-ai-projects azure-identity azure-ai-evaluation
```
Set these environment variables with your own values:
1) **PROJECT_CONNECTION_STRING** - The project connection string, as found in the overview page of your Azure AI Foundry project.
2) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for AI-assisted evaluators, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
3) **AZURE_OPENAI_ENDPOINT** - Azure Open AI Endpoint to be used for evaluation.
4) **AZURE_OPENAI_API_KEY** - Azure Open AI Key to be used for evaluation.
5) **AZURE_OPENAI_API_VERSION** - Azure Open AI Api version to be used for evaluation.
6) **AZURE_SUBSCRIPTION_ID** - Azure Subscription Id of Azure AI Project
7) **PROJECT_NAME** - Azure AI Project Name
8) **RESOURCE_GROUP_NAME** - Azure AI Project Resource Group Name
9) **AGENT_MODEL_DEPLOYMENT_NAME** - The deployment name of the model for your Azure AI agent, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.

### Initializing Project Client

In [1]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv(".credentials.env")

True

In [3]:
import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.projects.models import FunctionTool, ToolSet

# Import your custom functions to be used as Tools for the Agent
from user_functions import user_functions

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

AGENT_NAME = "London Tourist Assistant"

# Add Tools to be used by Agent
functions = FunctionTool(user_functions)

toolset = ToolSet()
toolset.add(functions)


### Create an AI agent (Azure AI Agent Service)

In [4]:
agent = project_client.agents.create_agent(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    name=AGENT_NAME,
    instructions="You are a helpful assistant",
    toolset=toolset,
)

print(f"Created agent, ID: {agent.id}")

Created agent, ID: asst_RzvvHwmpzGAbqpLtrwOs9oVl


### Create Thread

In [5]:
thread = project_client.agents.create_thread()
print(f"Created thread, ID: {thread.id}")

Created thread, ID: thread_4mtq6nGHV51YCMZEwGdrFOaV


## Conversation with Agent
Use below cells to have conversation with the agent
- `Create Message[1]`
- `Execute[2]`

### Create Message[1]

In [6]:
# Create message to thread

MESSAGE = "Can you email me weather info for London ?"

message = project_client.agents.create_message(
    thread_id=thread.id,
    role="user",
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

Created message, ID: msg_kGkDARPoiycDWUP9XQUO5WRZ


### Execute[2]

In [7]:
run = project_client.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)

print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

Run finished with status: completed
Run ID: run_2WpoeoO2yZk8DBf0QbLTtCW4


### List Messages

In [8]:
for message in project_client.agents.list_messages(thread.id, order="asc").data:
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)

Role: user
Content: Can you email me weather info for London ?
----------------------------------------
Role: assistant
Content: To send you the weather information for London, could you please provide me with your email address?
----------------------------------------


# Evaluate

### Get data from agent

In [9]:
from azure.ai.evaluation import AIAgentConverter

# Initialize the converter that will be backed by the project.
converter = AIAgentConverter(project_client)

thread_id = thread.id
run_id = run.id
file_name = "evaluation_data.jsonl"

# Get a single agent run data
evaluation_data_single_run = converter.convert(thread_id=thread_id, run_id=run_id)

# Run this to save thread data to a JSONL file for evaluation
# Save the agent thread data to a JSONL file
# evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=<>)
# print(json.dumps(evaluation_data, indent=4))



### Setting up evaluator

We will select the following evaluators to assess the different aspects relevant for agent quality: 

- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent of which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools, and process correct parameters from previous steps. Scale: float 0-1. Higher is better.
- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent of which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.


In [10]:
from azure.ai.evaluation import (
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
)
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)
# Needed to use content safety evaluators
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "project_name": os.environ["PROJECT_NAME"],
    "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)





### Run Evaluator

In [11]:
from azure.ai.evaluation import evaluate

response = evaluate(
    data=file_name,
    evaluators={
        "tool_call_accuracy": tool_call_accuracy,
        "intent_resolution": intent_resolution,
        "task_adherence": task_adherence,
    },
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    },
)
pprint(f'AI Foundary URL: {response.get("studio_url")}')

[2025-06-29 20:53:20 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_tool_call_accuracy_20250629_205318_634930, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_tool_call_accuracy_20250629_205318_634930\logs.txt
[2025-06-29 20:53:20 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_task_adherence_20250629_205318_634930, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_task_adherence_20250629_205318_634930\logs.txt
[2025-06-29 20:53:20 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_intent_resolution_20250629_205318_634930, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_intent_resolution_20250629_205318_634930\logs.txt


2025-06-29 20:53:20 +0100   41148 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Finished 1 / 5 lines.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 3.05 seconds. Estimated time for incomplete lines: 12.2 seconds.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Finished 4 / 5 lines.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 0.87 seconds. Estimated time for incomplete lines: 0.87 seconds.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Finished 5 / 5 lines.
2025-06-29 20:53:23 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 0.75 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_task_adherence_20250629_205318_634930"
Run stat

 Please check out C:/Users/sumohammed/.promptflow/.runs/azure_ai_evaluation_evaluators_tool_call_accuracy_20250629_205318_634930 for more details.


2025-06-29 20:53:20 +0100   41148 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-29 20:53:20 +0100   41148 execution.bulk     INFO     Finished 1 / 5 lines.
2025-06-29 20:53:20 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 0.51 seconds. Estimated time for incomplete lines: 2.04 seconds.
2025-06-29 20:53:24 +0100   41148 execution.bulk     INFO     Finished 2 / 5 lines.
2025-06-29 20:53:24 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 1.95 seconds. Estimated time for incomplete lines: 5.85 seconds.
2025-06-29 20:53:24 +0100   41148 execution.bulk     INFO     Finished 3 / 5 lines.
2025-06-29 20:53:24 +0100   41148 execution.bulk     INFO     Average execution time for completed lines: 1.36 seconds. Estimated time for incomplete lines: 2.72 seconds.
2025-06-29 20:53:25 +0100   41148 execution.bulk     INFO     Finished 4 / 5 lines.
2025-

  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", math.nan, inplace=True)



{
    "tool_call_accuracy": {
        "status": "Completed with Errors",
        "duration": "0:00:08.701135",
        "completed_lines": 4,
        "failed_lines": 1,
        "log_path": "C:\\Users\\sumohammed\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_tool_call_accuracy_20250629_205318_634930"
    },
    "intent_resolution": {
        "status": "Completed",
        "duration": "0:00:07.634738",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": "C:\\Users\\sumohammed\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_intent_resolution_20250629_205318_634930"
    },
    "task_adherence": {
        "status": "Completed",
        "duration": "0:00:05.676154",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": "C:\\Users\\sumohammed\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_task_adherence_20250629_205318_634930"
    }
}


('AI Foundary URL: '
 'https://ai.azure.com/build/evaluation/a9c33cf2-99d8-41cb-8330-bb0cf26de2

{'intent_resolution.intent_resolution': 4.4,
 'intent_resolution.intent_resolution_threshold': 3.0,
 'task_adherence.task_adherence': 3.4,
 'task_adherence.task_adherence_threshold': 3.0,
 'tool_call_accuracy.tool_call_accuracy': 1.0,
 'tool_call_accuracy.tool_call_accuracy_threshold': 0.8}


## Inspect results on Azure AI Foundry

Go to AI Foundry URL for rich Azure AI Foundry data visualization to inspect the evaluation scores and reasoning to quickly identify bugs and issues of your agent to fix and improve.

In [12]:
# alternatively, you can use the following to get the evaluation results in memory

# average scores across all runs
pprint(response["metrics"])