# Evaluating an Azure AI Agent with Azure AI Evaluation SDK

GBB learning session example that demonstrates how to evaluate an **Azure AI Agent** using three quality metrics provided by the Azure AI Evaluation SDK (preview):
1. **Intent Resolution** – Did the agent understand and address the user’s request?
2. **Tool Call Accuracy** – Did the agent choose and invoke the correct tool(s) with the right parameters?
3. **Task Adherence** – Did the agent follow its instructions and complete the assigned task?

The notebook creates mock `fetch_weather` and `fetch_funfact` tools, simulates agent responses in several scenarios (correct, incorrect, and unspecific responses; right and wrong tool choices), and evaluates them with the SDK.

In [1]:
%pip install --quiet --upgrade azure-ai-projects azure-ai-evaluation azure-identity python-dotenv

Note: you may need to restart the kernel to use updated packages.





## Set up Azure credentials and project

In [2]:
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

load_dotenv()

# REQUIRED environment variables (replace with your values or a .env file)
REQUIRED_KEYS = [
    'AZURE_OPENAI_ENDPOINT',
    'AZURE_OPENAI_API_KEY',
    'AZURE_OPENAI_API_VERSION',
    'MODEL_DEPLOYMENT_NAME',
    'PROJECT_CONNECTION_STRING',
]
missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
if missing:
    raise EnvironmentError(f'Missing required env keys: {missing}')

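# Authenticate (interactive fallback).
# DefaultAzureCredential() is lazy and rarely fails at construction time, so request a
# token to confirm it works before deciding whether to fall back to a browser login.
try:
    credential = DefaultAzureCredential()
    credential.get_token('https://management.azure.com/.default')
except Exception:
    credential = InteractiveBrowserCredential()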

print('Environment and authentication OK')

Environment and authentication OK


## Create a sample agent and `fetch_weather` tool

In [3]:
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FunctionTool, ToolSet
import json

def fetch_weather(location: str) -> str:
    """Mock weather service"""
    mock_weather_data = {
        'Seattle': 'Sunny, 25°C',
        'London': 'Cloudy, 18°C',
        'Tokyo': 'Rainy, 22°C'
    }
    return json.dumps({'weather': mock_weather_data.get(location, 'N/A')})

def fetch_funfact(location: str) -> str:
    """Mock fun-fact service"""
    mock_funfact_data = {
        'Seattle': 'There are whales',
        'London': 'Is the capital of England',
        'Tokyo': 'Has the highest population density in the world'
    }
    return json.dumps({'fun_fact': mock_funfact_data.get(location, 'N/A')})

# Initialize project client with proper authentication
project_client = AIProjectClient.from_connection_string(
    credential=credential,  # Use the credential from earlier setup
    conn_str=os.environ["PROJECT_CONNECTION_STRING"]
)
    
# Register functions as tools
functions = FunctionTool({fetch_weather, fetch_funfact})
toolset = ToolSet()
toolset.add(functions)
    
# Create the agent with the registered toolset
AGENT_NAME = "Weather Assistant"
agent = project_client.agents.create_agent(
    model=os.environ["MODEL_DEPLOYMENT_NAME"],
    name=AGENT_NAME,
    instructions="""You are a helpful weather assistant. When asked about the weather in a location:
        1. Use fetch_weather to get current conditions
        2. Provide clear, concise responses
        3. Stay focused on weather information
        Always use tools when available and verify data before responding.""",
    toolset=toolset,
)
print(f"Created agent '{AGENT_NAME}' with {len(functions._functions)} tools")

Created agent 'Weather Assistant' with 2 tools


## Simulate a user query and agent responses

In [4]:
user_question = "What's the weather in Seattle?"
user_question_unspecific = "How much does it rain in Spain?"

# Correct response: built from the mock tool output (json was imported in the previous cell)
weather_seattle = json.loads(fetch_weather('Seattle'))['weather']
agent_response_correct = (
    f'The current weather in Seattle is {weather_seattle}.'
)

# Incorrect response: wrong Seattle values plus unrequested London weather
agent_response_incorrect = (
    "The weather in Seattle is Rainy, 15°C. In London, it's Sunny, 28°C."
)

print('User:', user_question)
print('Agent (correct):', agent_response_correct)
print('Agent (incorrect):', agent_response_incorrect)

print('\nUser:', user_question_unspecific)
print('Agent without information, providing irrelevant information')

User: What's the weather in Seattle?
Agent (correct): The current weather in Seattle is Sunny, 25°C.
Agent (incorrect): The weather in Seattle is Rainy, 15°C. In London, it's Sunny, 28°C.

User: How much does it rain in Spain?
Agent without information, providing irrelevant information
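The three scenarios used in the rest of the notebook can also be gathered in one structure, which makes it easier to loop over them later. The sketch below is only an illustration: the `Scenario` dataclass and the `scenarios` list are hypothetical helpers, not part of the SDK, and the strings are copied from the cell above. Note that the "unspecific" scenario pairs the Spain question with the same Seattle/London response used in the "incorrect" scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One query/response pair to evaluate (hypothetical helper, not an SDK type)."""
    name: str
    query: str
    response: str


# Same strings as in the notebook cell above.
incorrect_response = "The weather in Seattle is Rainy, 15°C. In London, it's Sunny, 28°C."

scenarios = [
    Scenario("correct", "What's the weather in Seattle?",
             "The current weather in Seattle is Sunny, 25°C."),
    Scenario("incorrect", "What's the weather in Seattle?", incorrect_response),
    # The off-topic case reuses the incorrect response for a different query.
    Scenario("unspecific", "How much does it rain in Spain?", incorrect_response),
]

for s in scenarios:
    print(f"{s.name}: {s.query!r} -> {s.response!r}")
```

Each entry can then be passed to an evaluator as `query=s.query, response=s.response`.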


## Initialize evaluation metrics

In [5]:
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    ToolCallAccuracyEvaluator,
    TaskAdherenceEvaluator,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_key=os.environ['AZURE_OPENAI_API_KEY'],
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],
    azure_deployment=os.environ['MODEL_DEPLOYMENT_NAME'],
)

intent_eval = IntentResolutionEvaluator(model_config=model_config)
tool_eval = ToolCallAccuracyEvaluator(model_config=model_config)
task_eval = TaskAdherenceEvaluator(model_config=model_config)
print('Evaluators setup')

Class IntentResolutionEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ToolCallAccuracyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class TaskAdherenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Evaluators setup


### Intent Resolution

Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.


In [9]:
from pprint import pprint
res_intent_correct = intent_eval(query=user_question, response=agent_response_correct)
res_intent_incorrect = intent_eval(query=user_question, response=agent_response_incorrect)
# "Unspecific" scenario: a different query answered with the same Seattle/London response
res_intent_unspecific = intent_eval(query=user_question_unspecific, response=agent_response_incorrect)
print('Correct:')
pprint(res_intent_correct, width=120, compact=True)
print('\nIncorrect:')
pprint(res_intent_incorrect, width=120, compact=True)
print('\nUnspecific:')
pprint(res_intent_unspecific, width=120, compact=True)

Correct:
{'additional_details': {'actual_user_intent': 'get the current weather in Seattle',
                        'agent_perceived_intent': 'provide current weather information for Seattle',
                        'conversation_has_intent': True,
                        'correct_intent_detected': True,
                        'intent_resolved': True},
 'intent_resolution': 5.0,
 'intent_resolution_reason': 'The response accurately provides the current weather in Seattle, including the condition '
                             "and temperature, which directly addresses the user's query about the weather in Seattle.",
 'intent_resolution_result': 'pass',
 'intent_resolution_threshold': 3}

Incorrect:
{'additional_details': {'actual_user_intent': 'get the weather in Seattle',
                        'agent_perceived_intent': 'provide weather information for Seattle',
                        'conversation_has_intent': True,
                        'correct_intent_detected': True,
      

### Tool Call Accuracy

Evaluates the agent’s ability to select the appropriate tools and to pass them correct parameters drawn from previous steps.
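Before sending tool calls to the LLM-based evaluator, a cheap structural pre-check can catch calls to unknown tools or with unexpected arguments. The sketch below is not how `ToolCallAccuracyEvaluator` works internally; `check_tool_call` is a hypothetical helper, and it assumes the same dictionary shapes for tool definitions and tool calls that the next cell uses. The optional `required` list is also an assumption, since the definitions in this notebook do not declare one.

```python
from typing import Any


def check_tool_call(tool_call: dict[str, Any],
                    tool_definitions: list[dict[str, Any]]) -> list[str]:
    """Return structural problems found in a tool call; an empty list means none."""
    definitions = {d["name"]: d for d in tool_definitions}
    name = tool_call.get("name")
    if name not in definitions:
        return [f"unknown tool: {name!r}"]

    schema = definitions[name].get("parameters", {})
    properties = schema.get("properties", {})
    arguments = tool_call.get("arguments", {})

    problems = []
    for arg in arguments:
        if arg not in properties:
            problems.append(f"unexpected argument: {arg!r}")
    # 'required' is optional in the schema; treat it as empty when absent.
    for required in schema.get("required", []):
        if required not in arguments:
            problems.append(f"missing required argument: {required!r}")
    return problems


sample_definitions = [
    {"name": "fetch_weather",
     "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}},
    {"name": "fetch_funfact",
     "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}},
]

good_call = {"type": "tool_call", "tool_call_id": "call_1",
             "name": "fetch_weather", "arguments": {"location": "Seattle"}}
wrong_tool_call = {"type": "tool_call", "tool_call_id": "bad_call",
                   "name": "fetch_funfact", "arguments": {"location": "Tokyo"}}
unknown_call = {"type": "tool_call", "tool_call_id": "call_x",
                "name": "fetch_news", "arguments": {"city": "Seattle"}}

print(check_tool_call(good_call, sample_definitions))
print(check_tool_call(wrong_tool_call, sample_definitions))
print(check_tool_call(unknown_call, sample_definitions))
```

The `fetch_funfact` call for Tokyo passes this check because it is structurally valid. Judging whether a call is relevant to the user's query is exactly what the LLM-based evaluator adds.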

In [7]:
tool_definitions = [
    {
        'name': 'fetch_weather',
        'description': 'Fetches weather information for a location.',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            }
        }
    },
    {
        'name': 'fetch_funfact',
        'description': 'Fetches a fun fact about a location.',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            }
        }
    }
]


tool_calls_correct = [
    {'type': 'tool_call', 'tool_call_id': 'call_1', 'name': 'fetch_weather', 'arguments': {'location': 'Seattle'}},
]
tool_calls_incorrect = [
    {'type': 'tool_call', 'tool_call_id': 'bad_call', 'name': 'fetch_funfact', 'arguments': {'location': 'Tokyo'}},
]

res_tool_correct = tool_eval(query=user_question, tool_calls=tool_calls_correct, tool_definitions=tool_definitions)
res_tool_incorrect = tool_eval(query=user_question, tool_calls=tool_calls_incorrect, tool_definitions=tool_definitions)


print('Correct:')
pprint(res_tool_correct, width=120, compact=True)
print('\nIncorrect:')
pprint(res_tool_incorrect, width=120, compact=True)

Correct:
{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': "The TOOL CALL is directly relevant to the user's query about "
                                                         'the weather in Seattle, uses the correct parameter from the '
                                                         'TOOL DEFINITION, and includes the correct parameter value '
                                                         'from the CONVERSATION.',
                            'tool_call_id': 'call_1'}],
 'tool_call_accuracy': 1.0,
 'tool_call_accuracy_result': 'pass',
 'tool_call_accuracy_threshold': 0.8}

Incorrect:
{'per_tool_call_details': [{'tool_call_accurate': False,
                            'tool_call_accurate_reason': "The TOOL CALL is irrelevant to the user's query about the "
                                                         'weather in Seattle, uses a parameter value not present or '
                            

### Task Adherence

Measures how well the agent’s final response adheres to its assigned tasks, according to its system message and prior steps.


In [8]:
res_task_correct = task_eval(query=user_question, response=agent_response_correct, tool_calls=tool_calls_correct)
res_task_incorrect = task_eval(query=user_question, response=agent_response_incorrect, tool_calls=tool_calls_incorrect)
res_task_unspecific = task_eval(query=user_question_unspecific, response=agent_response_incorrect, tool_calls=tool_calls_incorrect)
print('Correct:')
pprint(res_task_correct, width=120, compact=True)
print('\nIncorrect:')
pprint(res_task_incorrect, width=120, compact=True)
print('\nUnspecific:')
pprint(res_task_unspecific, width=120, compact=True)

Correct:
{'task_adherence': 5.0,
 'task_adherence_reason': 'The response accurately provides the current weather in Seattle, including both the '
                          "condition and temperature, which fully adheres to the query's request.",
 'task_adherence_result': 'pass',
 'task_adherence_threshold': 3}

Incorrect:
{'task_adherence': 3.0,
 'task_adherence_reason': "The response meets the core requirement by providing Seattle's weather but lacks precision "
                          "due to the inclusion of irrelevant information about London's weather.",
 'task_adherence_result': 'pass',
 'task_adherence_threshold': 3}

Unspecific:
{'task_adherence': 1.0,
 'task_adherence_reason': "The response does not address the query about Spain's rainfall at all, instead providing "
                          'unrelated weather information for Seattle and London.',
 'task_adherence_result': 'fail',
 'task_adherence_threshold': 3}
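The result dictionaries above share a naming pattern: `<metric>`, `<metric>_result`, and `<metric>_threshold`. A small helper can flatten them into comparable rows and expose how far each score sits from its threshold. This is a sketch under that assumption; `summarize` and the `margin` field are hypothetical, and the sample values are copied from the task adherence outputs above with the reason text omitted.

```python
def summarize(result: dict, metric: str) -> dict:
    """Flatten one evaluator result into a row (hypothetical helper, not an SDK call)."""
    score = result[metric]
    threshold = result[f"{metric}_threshold"]
    return {
        "metric": metric,
        "score": score,
        "threshold": threshold,
        "result": result[f"{metric}_result"],
        # A margin of 0 means the score sits exactly on the threshold.
        "margin": score - threshold,
    }


# Values copied from the task adherence outputs shown above.
task_results = {
    "correct": {"task_adherence": 5.0, "task_adherence_result": "pass",
                "task_adherence_threshold": 3},
    "incorrect": {"task_adherence": 3.0, "task_adherence_result": "pass",
                  "task_adherence_threshold": 3},
    "unspecific": {"task_adherence": 1.0, "task_adherence_result": "fail",
                   "task_adherence_threshold": 3},
}

for name, res in task_results.items():
    row = summarize(res, "task_adherence")
    print(f"{name:<11} score={row['score']} threshold={row['threshold']} "
          f"result={row['result']} margin={row['margin']:+.1f}")
```

The zero margin for the incorrect scenario makes it visible that this response passed only because its score sits exactly on the threshold.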


## Summary
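- **Intent Resolution** scored the correct Seattle response 5.0 (pass). In the incorrect Seattle scenario the evaluator still reported `correct_intent_detected: True`, so the intent was recognized even though the reported values did not match the mock data.
- **Tool Call Accuracy** passed the `fetch_weather` call for Seattle (1.0 against a threshold of 0.8) and marked the `fetch_funfact` call for Tokyo as inaccurate.
- **Task Adherence** scored the correct response 5.0, the incorrect response 3.0 (a pass only because it sits exactly on the threshold of 3), and the off-topic answer to the Spain question 1.0 (fail). The 3.0 was attributed to the irrelevant London information, not to the wrong Seattle values.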

Use these insights to improve your agent’s tool selection logic and answer verification. For production, combine these metrics with others (e.g., factuality, safety) for a complete quality picture.
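One simple way to combine several metrics is to gate on all of them and report which ones failed. The sketch below assumes each evaluator result exposes a `<metric>_result` key with the value `'pass'` or `'fail'`, as in the outputs shown in this notebook. `failing_metrics` is a hypothetical helper, and the sample values are copied from those outputs.

```python
def failing_metrics(results: dict[str, dict]) -> list[str]:
    """Return the names of metrics whose '<metric>_result' is not 'pass'."""
    return [
        metric
        for metric, result in results.items()
        if result.get(f"{metric}_result") != "pass"
    ]


# Correct scenario: scores and results as reported in the outputs above.
correct_scenario = {
    "intent_resolution": {"intent_resolution": 5.0, "intent_resolution_result": "pass"},
    "tool_call_accuracy": {"tool_call_accuracy": 1.0, "tool_call_accuracy_result": "pass"},
    "task_adherence": {"task_adherence": 5.0, "task_adherence_result": "pass"},
}

# Unspecific scenario: only the task adherence output is shown in full above.
unspecific_scenario = {
    "task_adherence": {"task_adherence": 1.0, "task_adherence_result": "fail"},
}

print(failing_metrics(correct_scenario))     # []
print(failing_metrics(unspecific_scenario))  # ['task_adherence']
```

A missing `<metric>_result` key is treated as a failure here, which is a conservative design choice rather than SDK behavior.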