# Evaluate AI agents (Azure AI Agent Service) in Azure AI Foundry

## Objective


This sample demonstrates how to evaluate an AI agent (Azure AI Agent Service) on these important aspects of your agentic workflow:

- Intent Resolution: Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
- Tool Call Accuracy: Evaluates the agent's ability to select the appropriate tools, and process correct parameters from previous steps.
- Task Adherence: Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.

For AI agents outside of Azure AI Agent Service, you can still provide the agent data in one of the two formats (simple data or agent messages) specified in the individual evaluator samples:
- [Intent resolution](https://aka.ms/intentresolution-sample)
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample)
- [Task adherence](https://aka.ms/taskadherence-sample)
- [Response Completeness](https://aka.ms/rescompleteness-sample)



## Time 

You should expect to spend about 20 minutes running this notebook. 

## Before you begin
Creating an agent using Azure AI Agent Service requires an Azure AI Foundry project and a deployed, supported model. See more details in [Create a new agent](https://learn.microsoft.com/azure/ai-services/agents/quickstart?pivots=ai-foundry-portal).

For quality evaluation, you need to deploy a `gpt` model that supports JSON mode. We recommend `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.

Important: Make sure to authenticate to Azure using `az login` in your terminal before running this notebook.

### Prerequisite

Before running the sample:
```bash
pip install azure-ai-projects azure-identity azure-ai-evaluation
```
Set these environment variables with your own values:
1) **PROJECT_CONNECTION_STRING** - The project connection string, as found in the overview page of your Azure AI Foundry project.
2) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for AI-assisted evaluators, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
3) **AZURE_OPENAI_ENDPOINT** - The Azure OpenAI endpoint to be used for evaluation.
4) **AZURE_OPENAI_API_KEY** - The Azure OpenAI key to be used for evaluation.
5) **AZURE_OPENAI_API_VERSION** - The Azure OpenAI API version to be used for evaluation.
6) **AZURE_SUBSCRIPTION_ID** - Azure Subscription Id of Azure AI Project
7) **PROJECT_NAME** - Azure AI Project Name
8) **RESOURCE_GROUP_NAME** - Azure AI Project Resource Group Name
9) **AGENT_MODEL_DEPLOYMENT_NAME** - The deployment name of the model for your Azure AI agent, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
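
Before going further, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch using only the standard library; the helper name `find_missing_env_vars` is hypothetical and not part of any Azure SDK, and it assumes the variables live in the process environment (for example after loading a `.env` file).

```python
import os

# Variable names copied from the list above.
REQUIRED_ENV_VARS = [
    "PROJECT_CONNECTION_STRING",
    "MODEL_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SUBSCRIPTION_ID",
    "PROJECT_NAME",
    "RESOURCE_GROUP_NAME",
    "AGENT_MODEL_DEPLOYMENT_NAME",
]


def find_missing_env_vars(names, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.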

### Set up Azure credentials and project
1. Use the Azure CLI (`az login`) to sign in to the tenant with your credentials.

<!-- initializing Project Client -->

In [1]:
from dotenv import load_dotenv

# load environment variables from .env file
load_dotenv(dotenv_path=".env", override=True)

from utils.fdyauth import AuthHelper
settings = AuthHelper.load_settings()
credential = AuthHelper.test_credential()

if credential:
    print('Environment and authentication OK')
else:
    print("please login first")

Environment and authentication OK


In [2]:
import os
import azure.ai.agents as agentslib
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    AgentEvaluationRequest,
    InputDataset,
    EvaluatorIds,
    EvaluatorConfiguration,
    AgentEvaluationSamplingConfiguration,
    AgentEvaluationRedactionConfiguration,
    Evaluation,
    DatasetVersion,
    FileDatasetVersion,
)
from azure.ai.agents.models import (
    FunctionTool,
    ToolSet,
    MessageRole,
)

# Import your custom functions to be used as Tools for the Agent
from utils.user_functions import user_functions

# Initialize project client with proper authentication
project_client = AIProjectClient(
    credential=credential,  # Use the credential from earlier setup
    endpoint=settings.project_endpoint
)
print(f"project client api version: {project_client._config.api_version}")
print(f"azure-ai-agents version: {agentslib.__version__}")

AGENT_NAME = "Seattle Tourist Assistant"
AGENT_INSTRUCTIONS = """You are a helpful tourist assistant"""

# Add Tools to be used by Agent
functions = FunctionTool(user_functions)

toolset = ToolSet()
toolset.add(functions)

# To enable tool calls executed automatically
project_client.agents.enable_auto_function_calls(tools=toolset, max_retry=4)

project client api version: 2025-05-15-preview
azure-ai-agents version: 1.0.1


### Create an AI agent (Azure AI Agent Service)

In [3]:
found_agent = None
all_agents_list = project_client.agents.list_agents()
for a in all_agents_list:
    if a.name == AGENT_NAME:
        found_agent = a
        break

if found_agent:
    agent = project_client.agents.update_agent(
        agent_id=found_agent.id,
        model=settings.model_deployment_name,
        instructions=AGENT_INSTRUCTIONS,
        toolset=toolset,
    )
    project_client.agents.enable_auto_function_calls(tools=toolset, max_retry=4) 
    print(f"reusing agent > {agent.name} (id: {agent.id})")
else:
    agent = project_client.agents.create_agent(
        model=settings.model_deployment_name,
        name=AGENT_NAME,
        instructions=AGENT_INSTRUCTIONS,
        toolset=toolset,
    )
    print(f"Created agent '{AGENT_NAME}' with {len(functions._functions)} tools\nID: {agent.id}")

reusing agent > Seattle Tourist Assistant (id: asst_MSuuqYfXRGp34r33IF6DO0D0)


### Create Thread

In [4]:
thread = project_client.agents.threads.create()
print(f"Created thread, ID: {thread.id}")

Created thread, ID: thread_N8cK5BNhIOXzJmm5N9Ol6FgT


## Conversation with Agent
Use the cells below to have a conversation with the agent:
- `Create Message[1]`
- `Execute[2]`

### Create Message[1]

In [5]:
# Create message to thread

MESSAGE = "Can you email me weather info for Seattle ?"

message = project_client.agents.messages.create(
    thread_id=thread.id,
    role=MessageRole.USER,
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

Created message, ID: msg_pPuYLfy6VYaOuNS5L9ysQn3V


### Execute[2]

In [6]:
run = project_client.agents.runs.create_and_process(thread_id=thread.id, agent_id=agent.id)

print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

Run finished with status: RunStatus.COMPLETED
Run ID: run_nptDN2Q3yToS9vT3dfPodM9q


### List Messages

In [7]:
for message in project_client.agents.messages.list(thread.id, order="asc"):
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)

Role: MessageRole.USER
Content: Can you email me weather info for Seattle ?
----------------------------------------
Role: MessageRole.AGENT
Content: Could you please provide me with your email address so that I can send you the weather information for Seattle?
----------------------------------------


# Evaluate

### Evaluation in the cloud

Reference:
* https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation
* https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation#prerequisite-set-up-steps-for-azure-ai-foundry-projects

In the storage account's access control (IAM), assign the "Storage Blob Data Owner" role to both:
* `User, group, or service principal`: your Microsoft Entra ID user, and
* `Managed Identity`: the managed identities of both the `Azure AI Foundry Resource` and its `Azure AI Foundry Project`.

Also make sure to choose "Share to all project" when adding the storage account connection to the Azure AI Foundry Project v2.

Additionally, assign the `Azure AI Administrator` role on the `Azure AI Foundry Resource` to the Microsoft Entra ID user.

### Prepare the data from agent response for evaluation

Reference:
* https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/agent-evaluate-sdk#evaluate-azure-ai-agents

In [None]:
from azure.ai.evaluation import AIAgentConverter
import json 

# Initialize the converter that will be backed by the project.
converter = AIAgentConverter(project_client)

thread_id = thread.id
run_id = run.id

# Get a single agent run data for evaluation
single_agent_eval_input_data = converter.convert(thread_id=thread_id, run_id=run_id)

# Create the output folder if it does not exist yet
dir_path = os.path.join(os.getcwd(), "data")
os.makedirs(dir_path, exist_ok=True)

# Path of the JSONL file that will hold the agent run data (written in a later cell)
eval_file_name = f"single_agent_eval_{thread_id}_{run_id}.jsonl"
eval_file_path = os.path.join(dir_path, eval_file_name)

print(single_agent_eval_input_data)



{'query': [{'role': 'system', 'content': 'You are a helpful tourist assistant'}, {'createdAt': '2025-06-27T14:25:53Z', 'role': 'user', 'content': [{'type': 'text', 'text': 'Can you email me weather info for Seattle ?'}]}], 'response': [{'createdAt': '2025-06-27T14:26:00Z', 'run_id': 'run_nptDN2Q3yToS9vT3dfPodM9q', 'role': 'assistant', 'content': [{'type': 'text', 'text': 'Could you please provide me with your email address so that I can send you the weather information for Seattle?'}]}], 'tool_definitions': [{'name': 'longest_word_in_sentences', 'type': 'function', 'description': 'Finds the longest word in each sentence.', 'parameters': {'type': 'object', 'properties': {'sentences': {'type': 'array', 'items': {'type': 'string'}, 'description': 'A list of sentences.'}}}}, {'name': 'toggle_flag', 'type': 'function', 'description': 'Toggles a boolean flag.', 'parameters': {'type': 'object', 'properties': {'flag': {'type': 'boolean', 'description': 'The flag to toggle.'}}}}, {'name': 'fetc

In [27]:
def extract_agent_data(agent_data, ground_truth="", latency=None):
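    """
    Flatten the converter output into a single evaluation record.

    :param agent_data: The agent data structure containing messages and responses.
    :param ground_truth: Optional expected answer; defaults to an empty string.
    :param latency: Optional latency value; if not provided, a placeholder of 8.5 is used.
    :return: A dictionary with the keys query, ground_truth, response, context, latency, and response_length.
    """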
    query = ""
    response = ""
    context = ""
    
    # Navigate through the agent data structure to extract relevant information
    if isinstance(agent_data, dict):
        # Handle the specific structure: 'query': [{'role': 'system', 'content': '...'}, {'role': 'user', 'content': [{'type': 'text', 'text': '...'}]}]
        if 'query' in agent_data and isinstance(agent_data['query'], list):
            context_parts = []
            for message in agent_data['query']:
                if isinstance(message, dict):
                    role = message.get('role', '')
                    content = message.get('content', '')
                    
                    if role == 'system':
                        # System message becomes the primary context
                        context_parts.append(content)
                    elif role == 'user':
                        # Extract user query
                        if isinstance(content, list):
                            # Handle nested content structure: [{'type': 'text', 'text': '...'}]
                            for content_item in content:
                                if isinstance(content_item, dict) and content_item.get('type') == 'text':
                                    query = content_item.get('text', '')
                                    break
                        elif isinstance(content, str):
                            # Handle simple string content
                            query = content
            
            # Use system message content directly as context (without "System:" prefix)
            context = " | ".join(context_parts)
        
        # Handle the response structure: 'response': [{'createdAt': '...', 'run_id': '...', 'role': 'assistant', 'content': [{'type': 'text', 'text': '...'}]}]
        if 'response' in agent_data:
            response_data = agent_data['response']
            if isinstance(response_data, list):
                # Find the assistant message in the response array
                for message in response_data:
                    if isinstance(message, dict) and message.get('role') == 'assistant':
                        content = message.get('content', [])
                        if isinstance(content, list):
                            # Extract text from nested content structure
                            for content_item in content:
                                if isinstance(content_item, dict) and content_item.get('type') == 'text':
                                    response = content_item.get('text', '')
                                    break
                        elif isinstance(content, str):
                            response = content
                        break
            elif isinstance(response_data, str):
                # Handle simple string response
                response = response_data
        
        # Look for query in other possible locations if not found above
        if not query and 'messages' in agent_data:
            for message in agent_data['messages']:
                if message.get('role') == 'user':
                    content = message.get('content', '')
                    if isinstance(content, list):
                        for content_item in content:
                            if isinstance(content_item, dict) and content_item.get('type') == 'text':
                                query = content_item.get('text', '')
                                break
                    else:
                        query = content
                    break
        elif not query and 'conversation' in agent_data:
            # Handle conversation structure
            conversation = agent_data['conversation']
            if isinstance(conversation, list) and len(conversation) > 0:
                first_message = conversation[0]
                if isinstance(first_message, dict):
                    query = first_message.get('content', '') or first_message.get('message', '')
        
        # Look for response in other locations if not found above
        if not response and 'messages' in agent_data:
            for message in agent_data['messages']:
                if message.get('role') == 'assistant':
                    content = message.get('content', '')
                    if isinstance(content, list):
                        # Handle nested content structure
                        for content_item in content:
                            if isinstance(content_item, dict):
                                if content_item.get('type') == 'text':
                                    response = content_item.get('text', '')
                                    break
                    else:
                        response = content
                    break
        elif not response and 'conversation' in agent_data:
            # Get the last assistant message
            conversation = agent_data['conversation']
            if isinstance(conversation, list):
                for message in reversed(conversation):
                    if isinstance(message, dict) and message.get('role') == 'assistant':
                        response = message.get('content', '')
                        break
        
        # Add additional context from other fields (append to existing context)
        additional_context = []
        if 'tools' in agent_data:
            additional_context.append(f"Tools: {agent_data['tools']}")
        
        if 'metadata' in agent_data:
            additional_context.append(f"Metadata: {agent_data['metadata']}")
        
        # Add timestamps and run_id from response if available
        if 'response' in agent_data and isinstance(agent_data['response'], list):
            for message in agent_data['response']:
                if isinstance(message, dict) and message.get('role') == 'assistant':
                    created_at = message.get('createdAt', '')
                    run_id_from_response = message.get('run_id', '')
                    if created_at:
                        additional_context.append(f"Response created: {created_at}")
                    if run_id_from_response:
                        additional_context.append(f"Run ID: {run_id_from_response}")
                    break
        
        # Combine context with additional information
        # if additional_context:
        #     if context:
        #         context = f"{context} | {' | '.join(additional_context)}"
        #     else:
        #         context = " | ".join(additional_context)
    
    # Generate ground truth based on the specific use case
    # ground_truth = ""
    
    # Fall back to a placeholder latency if none was provided
    if latency is None:
        latency = 8.5  # Placeholder only; measure and pass the actual latency
    
    # Calculate response length
    response_length = len(response) if response else 0
    
    return {
        "query": query,
        "ground_truth": ground_truth,
        "response": response,
        "context": context,
        "latency": latency,
        "response_length": response_length
    }

In [28]:
eval_input = extract_agent_data(single_agent_eval_input_data)
eval_input

{'query': 'Can you email me weather info for Seattle ?',
 'ground_truth': '',
 'response': 'Could you please provide me with your email address so that I can send you the weather information for Seattle?',
 'context': 'You are a helpful tourist assistant',
 'latency': 8.5,
 'response_length': 111}
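
The `latency` of 8.5 in the record above is the placeholder default from `extract_agent_data`, not a measurement. One way to obtain a real value is to time the call that runs the agent and pass the result in through the `latency` argument. The sketch below uses only the standard library; `timed_call` and `fake_agent_run` are hypothetical names, and `fake_agent_run` merely stands in for the real run call (`project_client.agents.runs.create_and_process` in this notebook).

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_agent_run():
    """Stand-in for the real agent run; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return "completed"


status, latency_seconds = timed_call(fake_agent_run)
print(f"status={status}, latency={latency_seconds:.3f}s")
```

With a measured value in hand, the record could then be built with `extract_agent_data(single_agent_eval_input_data, latency=latency_seconds)`.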

In [26]:
# Save the extracted eval_input record to eval_file_path as a single JSONL entry
with open(eval_file_path, "w", encoding="utf-8") as f:
    f.write(json.dumps(eval_input) + "\n")

#### Prepare Multi Agent Run Thread for Evaluation

In [12]:
# Multi Agent Run Thread

# Specify a file path to save agent output (which is the evaluation input data)
# https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/agent-evaluate-sdk#evaluate-multiple-agent-runs-or-threads
eval_file_name = os.path.join(os.getcwd(), "data", "eval_input_data.jsonl")
print(f"Saving multi agent run thread as evaluation input data to: {eval_file_name}")

# Run this to save multi-agent run thread data to a JSONL file for evaluation
evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=eval_file_name)

print(f"Evaluation data saved to {eval_file_name}")

# verbose output of evaluation input data
# print(json.dumps(evaluation_data, indent=4))

Saving multi agent run thread as evaluation input data to: c:\Users\yingdingwang\Documents\VCS\democollections\AgentEvalExamle\fdy\data\eval_input_data.jsonl
Evaluation data saved to c:\Users\yingdingwang\Documents\VCS\democollections\AgentEvalExamle\fdy\data\eval_input_data.jsonl


### Setting up evaluator

We will select the following evaluators to assess the different aspects relevant for agent quality: 

- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools, and process correct parameters from previous steps. Scale: float 0-1. Higher is better.
- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.
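
Because the three evaluators report on different scales, it can be convenient to map their scores onto a common 0-1 range before comparing them side by side. The sketch below applies plain min-max scaling; it is an illustration only and not something the evaluators do themselves. The helper `normalize_score` and the example scores are hypothetical.

```python
def normalize_score(score, low, high):
    """Map a score from the range [low, high] onto [0, 1] with min-max scaling."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# Scales as listed above: (low, high) per evaluator.
SCALES = {
    "intent_resolution": (1, 5),
    "tool_call_accuracy": (0, 1),
    "task_adherence": (1, 5),
}

# Made-up example scores, not results from a real evaluation run.
example_scores = {"intent_resolution": 4, "tool_call_accuracy": 0.5, "task_adherence": 3}

normalized = {
    name: normalize_score(value, *SCALES[name])
    for name, value in example_scores.items()
}
print(normalized)
```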


In [None]:
from azure.ai.evaluation import (
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
)
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=settings.azure_openai_endpoint,
    api_key=settings.azure_openai_api_key,
    api_version=settings.azure_openai_api_version,
    azure_deployment=settings.model_deployment_name,
)

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

Class IntentResolutionEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Class ToolCallAccuracyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class TaskAdherenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [27]:
# agent_client = project_client.agents
# agent_client._config.endpoint

# project_client._config

### Run Cloud Evaluator on evaluation dataset

You can upload the evaluation dataset, which contains the query, ground truth, and response, to have it evaluated on the cloud endpoint and to view the results in the Azure AI Foundry evaluation dashboard.

The dataset is in `jsonl` format, with one JSON record per line:
```json
{"query":"What is the importance of choosing the right provider in getting the most value out of your health insurance plan?","ground_truth":"Choosing the right provider is an important part of getting the most value out of your health insurance plan...","response":"Choosing the right provider is important ...","context":"Northwind_Health_Plus_Benefits_Deta ... ","latency":8.733296,"response_length":2160}
```
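
A quick local check of the file before uploading can catch malformed lines early. The following sketch uses only the standard library; the field names come from the sample record above, and treating all of them as required is an assumption rather than a documented rule of the service. The helper `validate_jsonl_lines` is hypothetical.

```python
import json

# Field names taken from the sample record above; requiring all of them is an assumption.
EXPECTED_FIELDS = ("query", "ground_truth", "response", "context", "latency", "response_length")


def validate_jsonl_lines(lines, expected_fields=EXPECTED_FIELDS):
    """Return (line_number, problem) tuples; an empty list means every line looks fine."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        missing = [field for field in expected_fields if field not in record]
        if missing:
            problems.append((line_number, "missing fields: " + ", ".join(missing)))
    return problems


sample_lines = [
    json.dumps({"query": "q", "ground_truth": "", "response": "r",
                "context": "c", "latency": 8.5, "response_length": 1}),
    '{"query": "only a query"}',
    "not json",
]
for line_number, problem in validate_jsonl_lines(sample_lines):
    print(f"line {line_number}: {problem}")
```

To check a file on disk, pass the open file object instead of a list, for example `validate_jsonl_lines(open(path, encoding="utf-8"))`.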

In [13]:
print("Upload a single file and create a new Dataset to reference the file.")
dataset_name = "tourist-test-dataset-2"
dataset_version = "1.0"
# dataset: DatasetVersion = project_client.datasets.upload_file(
#     name=dataset_name,
#     version=dataset_version,
#     file_path=data_file,
# )

existing_datasets = project_client.datasets.list()
for dataset in existing_datasets:
    print(f"Dataset: {dataset.name}, Version: {dataset.version}, ID: {dataset.id}")

# "stacctaievalywuno" is the storage account connection; the Azure AI Foundry project and the Entra ID principal need the "Storage Blob Data Owner" IAM role on it
dataset: DatasetVersion = project_client.datasets.upload_file(
    name=dataset_name,
    version=dataset_version,
    file_path=eval_file_name,
    connection_name="stacctaievalywuno"
)
print(dataset)

Upload a single file and create a new Dataset to reference the file.
Dataset: eval-data-2025-06-25_215535_UTC, Version: 1, ID: azureai://accounts/foundry-proj-yw-uno-resource/projects/foundry-proj-yw-uno/data/eval-data-2025-06-25_215535_UTC/versions/1
Dataset: tourist-test-dataset, Version: 1.0, ID: azureai://accounts/foundry-proj-yw-uno-resource/projects/foundry-proj-yw-uno/data/tourist-test-dataset/versions/1.0
Dataset: eval-data-2025-06-25_213617_UTC, Version: 1, ID: azureai://accounts/foundry-proj-yw-uno-resource/projects/foundry-proj-yw-uno/data/eval-data-2025-06-25_213617_UTC/versions/1


HttpResponseError: This request is not authorized to perform this operation.
RequestId:9cab82c8-201e-0007-5599-e6becc000000
Time:2025-06-26T12:52:13.9077199Z
ErrorCode:AuthorizationFailure
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.
RequestId:9cab82c8-201e-0007-5599-e6becc000000
Time:2025-06-26T12:52:13.9077199Z</Message></Error>

### Agent evaluation

* https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluation/sample_agent_evaluations.py

In [14]:
agent_evaluation_request = AgentEvaluationRequest(
    run_id=run_id,
    thread_id=thread_id,
    evaluators={
        "violence": EvaluatorConfiguration(id=EvaluatorIds.VIOLENCE),
    },
    sampling_configuration=AgentEvaluationSamplingConfiguration(
        name="test",
        sampling_percent=100,
        max_request_rate=100,
    ),
    redaction_configuration=AgentEvaluationRedactionConfiguration(
        redact_score_properties=False,
    ),
    app_insights_connection_string=project_client.telemetry.get_connection_string(),
)

agent_evaluation_response = project_client.evaluations.create_agent_evaluation(
    evaluation=agent_evaluation_request,
)

In [None]:
run_id = agent_evaluation_response.id
run_id

{'id': 'thread_FNkXPCgz9EYxZtH5hx7ZfAzL;run_4X4I37NSzdFJZk0JaJWj5KzG', 'status': 'Running', 'result': None, 'error': None}
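
The response above reports `'status': 'Running'`, so the evaluation has not finished when the call returns. A generic way to wait for it is to poll a status source until it stops reporting `Running`. The sketch below is deliberately SDK-agnostic: `wait_while_running` is a hypothetical helper, `get_status` is whatever callable you supply to fetch the current status, and the `"Completed"` value in the stand-in data is an assumption, not a documented status string.

```python
import time


def wait_while_running(get_status, running_status="Running",
                       interval_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it returns something other than running_status.

    Raises TimeoutError if the status is still running_status once the timeout has passed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status != running_status:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {running_status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Stand-in status source: reports "Running" twice, then "Completed" (assumed value).
_fake_statuses = iter(["Running", "Running", "Completed"])
final_status = wait_while_running(lambda: next(_fake_statuses), interval_seconds=0.01)
print(f"final status: {final_status}")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` is fine.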


In [34]:
eval_list = project_client.evaluations.list()
for eval_item in eval_list:
    print(f"Evaluation ID: {eval_item.display_name}")
    # print(type(eval_item))
    # print(f"eval_item: {eval_item}")

Evaluation ID: evaluation_test2
Evaluation ID: evaluation_yw_first_test_eval


### Run PromptFlow Evaluator

In [None]:
# from azure.ai.evaluation import evaluate

# response = evaluate(
#     data=eval_file_name,
#     evaluators={
#         "tool_call_accuracy": tool_call_accuracy,
#         "intent_resolution": intent_resolution,
#         "task_adherence": task_adherence,
#     },
    
#     azure_ai_project={
#         "subscription_id": settings.azure_subscription_id,
#         "project_name": settings.project_name,
#         "resource_group_name": settings.resource_group_name,
#     },
# )
# pprint(f'AI Foundry URL: {response.get("studio_url")}')

## Inspect results on Azure AI Foundry

Open the AI Foundry URL to use the rich data visualizations in Azure AI Foundry: inspect the evaluation scores and reasoning to quickly identify issues with your agent, then fix and improve it.

In [None]:
# alternatively, you can use the following to get the evaluation results in memory

# average scores across all runs
# pprint(response["metrics"])