# Agent Evaluators (Preview)

This notebook demonstrates how to evaluate Azure AI agents using agent-specific evaluators. Azure AI Foundry provides three agent-specific evaluators for agentic workflows.

## Agent-Specific Evaluators

1. **Intent Resolution**: Measures how well the system identifies and understands user intent
2. **Tool Call Accuracy**: Evaluates the accuracy and efficiency of tool calls made by an agent
3. **Task Adherence**: Assesses whether the agent stays on track to complete tasks

## Additional Evaluators Available

Besides agent-specific evaluators, you can also assess other quality and safety aspects:

- **Quality**: Relevance, Coherence, Fluency
- **Safety**: CodeVulnerabilities, Violence, Self-harm, Sexual, HateUnfairness, IndirectAttack, ProtectedMaterials

## Workflow Overview

1. Setup environment and Azure AI Project client
2. Create and configure an agent with tools
3. Run agent to generate test data
4. Convert agent messages for evaluation
5. Configure and run evaluators
6. Analyze evaluation results


## Table of Contents

1. [Part 1: Environment Setup](#part-1-environment-setup)
2. [Part 2: Create Azure AI Agent](#part-2-create-azure-ai-agent)
3. [Part 3: Run Agent to Generate Test Data](#part-3-run-agent-to-generate-test-data)
   - 3.1: Create Thread
   - 3.2: Send User Message
   - 3.3: Execute Agent Run
   - 3.4: View Conversation Messages
4. [Part 4: Convert Agent Messages for Evaluation](#part-4-convert-agent-messages-for-evaluation)
   - 4.1: Inspect Converted Data
5. [Part 5: Model Configuration for AI-Assisted Evaluators](#part-5-model-configuration-for-ai-assisted-evaluators)
6. [Part 6: Run Batch Evaluation](#part-6-run-batch-evaluation)
7. [Part 7: Individual Evaluator Examples](#part-7-individual-evaluator-examples)
   - 7.1: Intent Resolution Evaluator
   - 7.2: Tool Call Accuracy Evaluator
   - 7.3: Task Adherence Evaluator
8. [Summary](#summary)


In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now locate the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

In [None]:
import sys
from pathlib import Path

from dotenv import load_dotenv

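# Add the parent directory's utils folder to sys.path for the agent_utils import.
# __file__ is not defined inside a notebook kernel, so fall back to the parent
# of the current working directory.
parent_dir = (
    Path(__file__).parent.parent if "__file__" in globals() else Path.cwd().parent
)
sys.path.insert(0, str(parent_dir / "utils"))

# Load environment variables from the .env file in the parent directory
# when the notebook runs from 05_evaluation, otherwise from the working directory.
cwd = Path.cwd()
agent_ops_dir = cwd.parent if cwd.name == "05_evaluation" else cwd
env_path = agent_ops_dir / ".env"
load_dotenv(env_path)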

## Part 1: Environment Setup

Configure the notebook environment and load necessary dependencies.
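
The code cells in this notebook read several environment variables. A minimal sketch of a pre-flight check follows, assuming the two variable names listed are the ones your project needs; `missing_env_vars` is a hypothetical helper, not part of any Azure SDK.

```python
import os

# Variable names as they appear in this notebook's code cells.
REQUIRED_ENV_VARS = [
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running a check like this right after `load_dotenv` can surface a misconfigured `.env` file before any Azure call fails with a less obvious error.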


In [None]:
import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.agents.models import FunctionTool, ToolSet

# Import your custom functions to be used as Tools for the Agent
from user_functions import user_functions

project_client = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

AGENT_NAME = "Seattle Tourist Assistant"

# Add Tools to be used by Agent
functions = FunctionTool(user_functions)

toolset = ToolSet()
toolset.add(functions)

# Enable automatic execution of function tool calls
project_client.agents.enable_auto_function_calls(toolset)
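
The `user_functions` module imported above is not included in this notebook. As a rough sketch of what such a module might contain, assuming the agent needs a weather lookup and an email sender, the two functions below are hypothetical stand-ins that return canned data.

```python
import json
from typing import Any, Callable, Set


def fetch_weather(location: str) -> str:
    """
    Fetch weather information for the specified location.

    :param location: The location to fetch weather for.
    :return: Weather information as a JSON string.
    """
    # Canned data for illustration only; a real tool would call a weather API.
    mock_weather = {"Seattle": "Rainy, 14C", "New York": "Sunny, 25C"}
    weather = mock_weather.get(location, "Weather data not available for this location.")
    return json.dumps({"weather": weather})


def send_email(recipient: str, subject: str, body: str) -> str:
    """
    Pretend to send an email and report the outcome.

    :param recipient: Email address of the recipient.
    :param subject: Subject line of the email.
    :param body: Body text of the email.
    :return: Confirmation message as a JSON string.
    """
    # No email is actually sent; this only simulates the side effect.
    print(f"Sending email to {recipient} with subject '{subject}'")
    return json.dumps({"message": f"Email successfully sent to {recipient}."})


# The set of callables that the notebook passes to FunctionTool.
user_functions: Set[Callable[..., Any]] = {fetch_weather, send_email}
```

Type hints and docstrings are kept explicit here because a tool wrapper typically relies on them to describe each function to the model.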

## Part 2: Create Azure AI Agent

Create an agent with the project client and the function toolset configured in Part 1.


In [None]:
agent = project_client.agents.create_agent(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    name=AGENT_NAME,
    instructions="You are a helpful assistant",
    toolset=toolset,
)

print(f"Created agent, ID: {agent.id}")

## Part 3: Run Agent to Generate Test Data

Execute the agent with a sample query to generate conversation data for evaluation.


### 3.1: Create Thread

Create a conversation thread for the agent.


In [None]:
thread = project_client.agents.threads.create()
print(f"Created thread, ID: {thread.id}")

### 3.2: Send User Message

Add a user query to the thread.


In [None]:
# Add a user message to the thread
MESSAGE = "Can you email me weather info for Seattle?"

message = project_client.agents.messages.create(
    thread_id=thread.id,
    role="user",
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

### 3.3: Execute Agent Run

Process the user message and generate the agent's response.


In [None]:
run = project_client.agents.runs.create_and_process(
    thread_id=thread.id, agent_id=agent.id)

print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

### 3.4: View Conversation Messages

Display the complete conversation between user and agent.


In [None]:
for message in project_client.agents.messages.list(thread.id, order="asc"):
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)
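
The loop above assumes every message has at least one content item and that the first item is text. A more defensive variant is sketched below with plain stand-in objects rather than real SDK message types; `message_text` is a hypothetical helper that skips any item without a `text.value` attribute.

```python
from types import SimpleNamespace


def message_text(message) -> str:
    """Join the text of all text-like content items; return "" if there are none."""
    parts = []
    for item in getattr(message, "content", None) or []:
        text = getattr(item, "text", None)
        value = getattr(text, "value", None)
        if isinstance(value, str):
            parts.append(value)
    return "\n".join(parts)


# Stand-in objects that mimic only the attributes the helper reads.
text_item = SimpleNamespace(text=SimpleNamespace(value="Hello"))
image_item = SimpleNamespace(image_file=SimpleNamespace(file_id="file-123"))
demo_message = SimpleNamespace(role="assistant", content=[image_item, text_item])

print(message_text(demo_message))
```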

## Part 4: Convert Agent Messages for Evaluation

Azure AI Foundry provides native integration for evaluating agent messages. The `AIAgentConverter` transforms agent runs into the format required by evaluators.


In [None]:
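from azure.ai.evaluation import AIAgentConverter

# Initialize the converter, backed by the project client.
converter = AIAgentConverter(project_client)

thread_id = thread.id
run_id = run.id
file_name = "./data/evaluation_agent_data.jsonl"

# Convert a single agent run into evaluator input.
evaluation_data_single_run = converter.convert(
    thread_id=thread_id, run_id=run_id)

# Write the thread's runs to file_name so that evaluate() in Part 6 can read them.
# Assumes the ./data directory already exists.
evaluation_data = converter.prepare_evaluation_data(
    thread_ids=thread_id, filename=file_name)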

### 4.1: Inspect Converted Data

View the evaluation data structure that will be used by evaluators.


In [None]:
import json
from pprint import pprint

print("=" * 80)
print("CONVERTED EVALUATION DATA")
print("=" * 80)

# Print formatted JSON for better readability
print(json.dumps(evaluation_data_single_run, indent=2, default=str))

print("\n" + "=" * 80)
print("DATA STRUCTURE OVERVIEW")
print("=" * 80)
print(f"Keys in evaluation data: {list(evaluation_data_single_run.keys())}")

if 'query' in evaluation_data_single_run:
    print(f"\nQuery: {evaluation_data_single_run['query']}")

if 'response' in evaluation_data_single_run:
    print(
        f"\nResponse preview: {str(evaluation_data_single_run['response'])[:200]}...")

if 'tool_calls' in evaluation_data_single_run:
    tool_calls = evaluation_data_single_run['tool_calls']
    print(
        f"\nNumber of tool calls: {len(tool_calls) if isinstance(tool_calls, list) else 'N/A'}")

if 'conversation' in evaluation_data_single_run:
    conv = evaluation_data_single_run['conversation']
    if isinstance(conv, dict) and 'messages' in conv:
        print(f"\nNumber of messages in conversation: {len(conv['messages'])}")

print("=" * 80)

## Part 5: Model Configuration for AI-Assisted Evaluators

Configure the model that will act as the LLM-judge for evaluation. Azure AI supports both reasoning models (o-series) and non-reasoning models (GPT-4/GPT-4o) as judges.

### Supported Models

- **Reasoning Models** (e.g., o1, o3-mini): Set `is_reasoning_model=True` when initializing evaluators
- **Non-Reasoning Models** (e.g., GPT-4.1, GPT-4o): Default configuration

For complex evaluations that require refined reasoning, we recommend a capable model such as o3-mini (a reasoning model) or GPT-4.1-mini (a non-reasoning model) for a balance of performance and cost.
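
A minimal sketch of toggling `is_reasoning_model` based on the judge deployment name. The prefix check is a naming heuristic assumed purely for illustration (deployment names are user-chosen and may not follow it), and `judge_kwargs` is a hypothetical helper.

```python
def looks_like_reasoning_model(deployment_name: str) -> bool:
    """Assumed heuristic: o-series names start with "o" followed by a digit."""
    name = deployment_name.strip().lower()
    return len(name) >= 2 and name[0] == "o" and name[1].isdigit()


def judge_kwargs(deployment_name: str) -> dict:
    """Extra keyword arguments to pass when constructing an evaluator."""
    if looks_like_reasoning_model(deployment_name):
        return {"is_reasoning_model": True}
    return {}


print(judge_kwargs("o3-mini"))
print(judge_kwargs("gpt-4o"))
```

The result could then be unpacked into an evaluator constructor, for example `IntentResolutionEvaluator(model_config=model_config, **judge_kwargs(deployment_name))`.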


In [None]:
from azure.ai.evaluation import (
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
)
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
    api_key=os.environ["AZURE_OPENAI_API_KEY_GPT_4o"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION_GPT_4o"],
    azure_deployment=os.environ["AZURE_OPENAI_MODEl_GPT_4o"],
)
# Project endpoint: used to log batch results to Azure AI Foundry;
# also needed to use content safety evaluators
azure_ai_project = os.environ["AZURE_AI_PROJECT_ENDPOINT"]

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

## Part 6: Run Batch Evaluation

Run all configured evaluators on the evaluation dataset and upload results to Azure AI Foundry.
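
`evaluate` reads its input from a JSONL file, with one JSON object per line. A small sanity check before launching a batch run is sketched below. It assumes each row should carry `query` and `response` fields (an assumption based on the evaluator inputs used in this notebook), and `check_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def check_jsonl(path, required_keys=("query", "response")):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    file_path = Path(path)
    if not file_path.exists():
        return [(0, f"file not found: {file_path}")]
    problems = []
    with file_path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Ignore blank lines.
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_number, "row is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in row]
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
    return problems


# Demo on a temporary file: line 1 is valid, line 2 lacks "response".
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_path.write_text(
        '{"query": "q1", "response": "r1"}\n{"query": "q2"}\n',
        encoding="utf-8",
    )
    demo_problems = check_jsonl(demo_path)

print(demo_problems)
```

In this notebook the same check could be pointed at `file_name` before calling `evaluate`.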


In [None]:
from azure.ai.evaluation import evaluate

response = evaluate(
    data=file_name,
    evaluators={
        "tool_call_accuracy": tool_call_accuracy,
        "intent_resolution": intent_resolution,
        "task_adherence": task_adherence,
    },
    azure_ai_project=azure_ai_project,
)
print(f'AI Foundry URL: {response.get("studio_url")}')
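
A sketch of summarizing the batch result locally. It assumes the object returned by `evaluate` is dict-like with a `rows` list whose column names follow an `outputs.<evaluator>.<metric>` pattern. The column name used below is an assumption, so inspect the keys of an actual row before relying on it. The sample rows are made up for illustration and are not real results.

```python
from statistics import mean


def mean_score(rows, column):
    """Average a numeric column across rows, skipping missing or non-numeric values."""
    values = []
    for row in rows:
        value = row.get(column)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(value)
    return mean(values) if values else None


# Illustrative rows only; real rows would come from response["rows"].
sample_rows = [
    {"outputs.intent_resolution.intent_resolution": 5},
    {"outputs.intent_resolution.intent_resolution": 3},
    {"outputs.intent_resolution.intent_resolution": None},
]

print(mean_score(sample_rows, "outputs.intent_resolution.intent_resolution"))
```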

## Part 7: Individual Evaluator Examples

Test each evaluator individually with specific examples to understand their behavior and output format.

### 7.1: Intent Resolution Evaluator

**Purpose**: Measures how well the system identifies and understands user intent, including:

- How well it scopes the user's intent
- Whether it asks clarifying questions
- If it reminds users of capability scope

**Output**: Likert scale score (1-5, higher is better)

- Score >= threshold → pass
- Score < threshold → fail

**Use Case**: Evaluate if your agent correctly identifies what users want to accomplish.
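
The pass/fail rule described above can be sketched in a few lines. `to_pass_fail` is a hypothetical helper written for illustration; the evaluator applies its own threshold internally.

```python
def to_pass_fail(score: float, threshold: float = 3) -> str:
    """Map a 1-5 Likert score to "pass" or "fail" using score >= threshold."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return "pass" if score >= threshold else "fail"


for demo_score in (2, 3, 5):
    print(demo_score, to_pass_fail(demo_score))
```

Raising the threshold makes the check stricter: with `threshold=4`, a score of 3 would be reported as a fail.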


In [None]:
from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(
    model_config=model_config, threshold=3)
intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM."
)

### 7.2: Tool Call Accuracy Evaluator

**Purpose**: Measures the accuracy and efficiency of tool calls made by an agent, including:

- Relevance and helpfulness of tools invoked
- Correctness of parameters used
- Counts of missing or excessive calls

**Supported Tools**:

- File Search
- Azure AI Search
- Bing Grounding, Bing Custom Search
- SharePoint Grounding
- Code Interpreter
- Fabric Data Agent
- OpenAPI
- Function Tool (user-defined tools)

**Output**: Likert scale score (1-5, higher is better) plus detailed breakdown:

- `tool_calls_made_by_agent`: Total calls made
- `correct_tool_calls_made_by_agent`: Correct calls
- `per_tool_call_details`: Per-tool analysis
- `excess_tool_calls`: Unnecessary calls
- `missing_tool_calls`: Required calls not made

**Use Case**: Evaluate if your agent selects the right tools and uses them correctly.
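
The evaluator relies on an LLM judge, but the idea behind missing and excess calls can be illustrated with a deterministic comparison against a hand-written list of expected tool names. This is a simplified sketch and not how `ToolCallAccuracyEvaluator` computes its output; `compare_tool_calls` and the expected list are hypothetical.

```python
from collections import Counter


def compare_tool_calls(expected_names, actual_names):
    """Count tool calls that are missing from, or in excess of, the expected multiset."""
    expected = Counter(expected_names)
    actual = Counter(actual_names)
    return {
        # Counter subtraction keeps only positive counts.
        "missing_tool_calls": sum((expected - actual).values()),
        "excess_tool_calls": sum((actual - expected).values()),
    }


demo_counts = compare_tool_calls(
    expected_names=["fetch_weather", "send_email"],
    actual_names=["fetch_weather", "fetch_weather"],
)
print(demo_counts)
```

In this example the agent never called `send_email` (one missing call) and called `fetch_weather` twice (one excess call).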


In [None]:
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(
    model_config=model_config, threshold=3)

# Provide the full agent response, including tool calls
tool_call_accuracy(
    query="What timezone corresponds to 41.8781,-87.6298?",
    response=[
        {
            "createdAt": "2025-04-25T23:55:52Z",
            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
            "role": "assistant",
            "content": [
                {
                    "type": "tool_call",
                    "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
                    "name": "azure_maps_timezone",
                    "arguments": {
                        "lat": 41.878100000000003,
                        "lon": -87.629800000000003
                    }
                }
            ]
        },
        {
            "createdAt": "2025-04-25T23:55:54Z",
            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
            "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
            "role": "tool",
            "content": [
                {
                    "type": "tool_result",
                    "tool_result": {
                        "ianaId": "America/Chicago",
                        "utcOffset": None,
                        "abbreviation": None,
                        "isDaylightSavingTime": None
                    }
                }
            ]
        },
        {
            "createdAt": "2025-04-25T23:55:55Z",
            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
                }
            ]
        }
    ],
    tool_definitions=[
        {
            "name": "azure_maps_timezone",
            "description": "local time zone information for a given latitude and longitude.",
            "parameters": {
                "type": "object",
                "properties": {
                    "lat": {
                        "type": "float",
                        "description": "The latitude of the location."
                    },
                    "lon": {
                        "type": "float",
                        "description": "The longitude of the location."
                    }
                }
            }
        }
    ]
)

# Alternatively, provide the tool calls directly without the full agent response
tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[{
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {
            "location": "Seattle"
        }
    }],
    tool_definitions=[{
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
    ]
)
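
The two call styles above are related: the `tool_calls` list is the set of `tool_call` content items found in the assistant messages of a full response. A minimal sketch of that extraction follows, assuming the message schema shown in the first example; `extract_tool_calls` is a hypothetical helper and the demo IDs are shortened placeholders.

```python
def extract_tool_calls(messages):
    """Collect every content item of type "tool_call" from assistant messages."""
    calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                calls.append(item)
    return calls


demo_response = [
    {
        "role": "assistant",
        "content": [{
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "azure_maps_timezone",
            "arguments": {"lat": 41.8781, "lon": -87.6298},
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": [{"type": "tool_result", "tool_result": {"ianaId": "America/Chicago"}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The timezone is America/Chicago."}],
    },
]

print([call["name"] for call in extract_tool_calls(demo_response)])
```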

### 7.3: Task Adherence Evaluator

**Purpose**: Assesses whether the agent stays on track to complete tasks instead of making inefficient or out-of-scope steps. Measures how well an agent's response adheres to:

- Their assigned tasks
- Task instructions (extracted from system message and user query)
- Available tools

**Output**: Likert scale score (1-5, higher is better)

- Score >= threshold → pass (good adherence)
- Score < threshold → fail (agent went off-track)

**Use Case**: Evaluate if your agent completes tasks efficiently without unnecessary or irrelevant actions.

**Input Format**: Accepts either:

- `conversation`: Dict with messages array
- `query` and `response`: Individual strings (shown below)


In [None]:
from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config, threshold=3)


In [None]:
# Failure example: the response only vaguely adheres to the task
result = task_adherence(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="Make sure to water your roses regularly and trim them occasionally.",
)
pprint(result)

In [None]:
# Success example: the response fully adheres to the task
result = task_adherence(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="For optimal summer care of your rose garden, start by watering deeply early in the morning to ensure the roots are well-hydrated without encouraging fungal growth. Apply a 2-3 inch layer of organic mulch around the base of the plants to conserve moisture and regulate soil temperature. Fertilize with a balanced rose fertilizer every 4 to 6 weeks to support healthy growth. Prune away any dead or diseased wood to promote good air circulation, and inspect regularly for pests such as aphids or spider mites, treating them promptly with an appropriate organic insecticidal soap. Finally, ensure that your roses receive at least 6 hours of direct sunlight daily for robust flowering.",
)
pprint(result)

## Summary

This notebook demonstrated the complete workflow for evaluating Azure AI agents.

### Key Takeaways

1. **Agent Creation**: Built an agent with function tools for real-world scenarios
2. **Data Generation**: Executed agent runs to create evaluation data
3. **Native Integration**: Used AIAgentConverter for seamless evaluation
4. **Three Agent Evaluators**:
   - **Intent Resolution**: Validates user intent understanding
   - **Tool Call Accuracy**: Ensures correct tool selection and usage
   - **Task Adherence**: Confirms agents stay on task
5. **Flexible Evaluation**: Both batch evaluation and individual testing are supported

### Evaluation Results Interpretation

All three evaluators use a **Likert scale (1-5)**:

- **5**: Excellent - Agent performed optimally
- **4**: Good - Minor issues, acceptable performance
- **3**: Fair - Threshold for pass/fail (default)
- **2**: Poor - Significant problems detected
- **1**: Very Poor - Critical failures

### Best Practices

1. **Set Appropriate Thresholds**: The default is 3; adjust it based on your quality requirements
2. **Use Reasoning Models**: For complex scenarios, set `is_reasoning_model=True` with o-series models
3. **Analyze Details**: Review the `reason` and `additional_details` fields to understand scores
4. **Track Over Time**: Run evaluations regularly to monitor agent improvements
5. **Combine Evaluators**: Use all three together for comprehensive assessment

### Additional Resources

- [Azure AI Evaluation Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-sdk)
- [Agent Evaluators Reference](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-agent)
- [Azure AI Foundry Studio](https://ai.azure.com)
