# Tool Call Accuracy Evaluator

## Objective
This sample demonstrates how to use the Tool Call Accuracy evaluator on agent data. The supported input formats include:
- simple data such as strings and `dict` objects describing tool calls;
- user-agent conversations in the form of a list of agent messages.

## Time

You should expect to spend about 20 minutes running this notebook. 

## Before you begin
For quality evaluation, you need to deploy a `gpt` model that supports JSON mode. We recommend `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.

### Prerequisite
```bash
pip install azure-ai-projects azure-identity azure-ai-evaluation
```
Set these environment variables with your own values:
1) **PROJECT_CONNECTION_STRING** - The project connection string, as found in the overview page of your Azure AI Foundry project.
2) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
3) **AZURE_OPENAI_ENDPOINT** - The Azure OpenAI endpoint to be used for evaluation.
4) **AZURE_OPENAI_API_KEY** - The Azure OpenAI key to be used for evaluation.
5) **AZURE_OPENAI_API_VERSION** - The Azure OpenAI API version to be used for evaluation.
6) **AZURE_SUBSCRIPTION_ID** - The Azure subscription ID of the Azure AI project.
7) **PROJECT_NAME** - The Azure AI project name.
8) **RESOURCE_GROUP_NAME** - The resource group name of the Azure AI project.
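
Before running the cells below, it can help to confirm that these variables are actually set. The following is a minimal sketch that uses only the standard library; the helper name `find_missing_env_vars` is hypothetical and is not part of any Azure SDK. If you keep the values in `.credentials.env`, run the check after `load_dotenv`.

```python
import os

# Names taken from the list above; adjust if your setup uses different ones.
REQUIRED_ENV_VARS = [
    "PROJECT_CONNECTION_STRING",
    "MODEL_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SUBSCRIPTION_ID",
    "PROJECT_NAME",
    "RESOURCE_GROUP_NAME",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```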


The Tool Call Accuracy evaluator assesses how accurately an AI uses tools by examining:
- Relevance to the conversation
- Parameter correctness according to tool definitions
- Parameter value extraction from the conversation
- Potential usefulness of the tool call

The evaluator uses binary scoring (0 or 1) for each tool call:

- Score 0: The tool call is irrelevant or contains information not in the conversation/definition
- Score 1: The tool call is relevant with properly extracted parameters from the conversation

If there are multiple tool calls, the final score is the **average** of the individual tool call scores, which can be interpreted as the **passing rate** of the tool calls.
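
As an illustration of this averaging, the sketch below computes a passing rate from per-call results. It assumes each result is a `dict` with a boolean `tool_call_accurate` key, as in the evaluator outputs shown later in this notebook; the function name and the `call_a`/`call_b` IDs are hypothetical, and this is not the evaluator's internal code.

```python
from typing import Dict, List


def aggregate_tool_call_scores(details: List[Dict]) -> float:
    """Average binary per-call results into a passing rate between 0 and 1."""
    if not details:
        raise ValueError("At least one tool call result is required.")
    # Map each boolean verdict to the binary score described above.
    scores = [1 if d["tool_call_accurate"] else 0 for d in details]
    return sum(scores) / len(scores)


example_details = [
    {"tool_call_id": "call_a", "tool_call_accurate": False},
    {"tool_call_id": "call_b", "tool_call_accurate": True},
]
print(aggregate_tool_call_scores(example_details))  # 0.5
```

With one inaccurate and one accurate call, half of the tool calls pass, so the aggregate score is 0.5.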

This evaluation focuses on measuring whether tool calls meaningfully contribute to addressing the query while properly following the tool definitions and using information present in the conversation history.

Tool Call Accuracy requires the following inputs:
- Query - A single query or a list of messages (the conversation history with the agent). The latter helps determine whether the agent used information from the history to make the right tool calls.
- Tool Calls - (Optional) The tool call(s) made by the agent to answer the query. If not provided, the evaluator looks for tool calls in the response.
- Response - (Optional) The response from the agent (or any GenAI app). This can be a single text response or a list of messages generated as part of the agent's response. If tool calls are not provided, the evaluator looks for them in the response.
- Tool Definitions - The definition(s) of the tool(s) available to the agent for answering the query.
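
Because the evaluator itself is AI-assisted, a cheap deterministic pre-check on these inputs can catch obvious mismatches before any model call is made. The sketch below is one possible approach, not part of `azure-ai-evaluation`: the helper names `as_list` and `find_schema_problems` and the sample IDs are hypothetical, and the input shapes are assumed to match the `dict` examples used in this notebook.

```python
from typing import Any, Dict, List, Union

DictOrList = Union[Dict[str, Any], List[Dict[str, Any]]]


def as_list(value: DictOrList) -> List[Dict[str, Any]]:
    """Accept either a single dict or a list of dicts, as the samples do."""
    return value if isinstance(value, list) else [value]


def find_schema_problems(tool_calls: DictOrList, tool_definitions: DictOrList) -> List[str]:
    """Report tool calls with no definition or with undeclared arguments."""
    definitions = {d["name"]: d for d in as_list(tool_definitions)}
    problems = []
    for call in as_list(tool_calls):
        definition = definitions.get(call["name"])
        if definition is None:
            problems.append(f"{call['name']}: no matching tool definition")
            continue
        declared = definition.get("parameters", {}).get("properties", {})
        for arg in call.get("arguments", {}):
            if arg not in declared:
                problems.append(f"{call['name']}: undeclared argument '{arg}'")
    return problems


weather_definition = {
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
    },
}
calls = [
    {"type": "tool_call", "tool_call_id": "call_1", "name": "fetch_weather", "arguments": {"city": "London"}},
    {"type": "tool_call", "tool_call_id": "call_2", "name": "send_sms", "arguments": {}},
]
for problem in find_schema_problems(calls, weather_definition):
    print(problem)
```

A check like this only covers structure. Whether a parameter value was correctly extracted from the conversation still needs the evaluator.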


### Initialize Tool Call Accuracy Evaluator


In [16]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv(".credentials.env")

True

In [None]:
import os, json
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from typing import Set, Callable, Any
from azure.ai.projects.models import FunctionTool, ToolSet

from dotenv import load_dotenv

load_dotenv(".credentials.env")

# Define some custom python function
def fetch_weather(location: str) -> str:
    """
    Fetches the weather information for the specified location.

    :param location (str): The location to fetch weather for.
    :return: Weather information as a JSON string.
    :rtype: str
    """
    # In a real-world scenario, you'd integrate with a weather API.
    # Here, we'll mock the response.
    mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
    weather = mock_weather_data.get(location, "Weather data not available for this location.")
    weather_json = json.dumps({"weather": weather})
    return weather_json


user_functions: Set[Callable[..., Any]] = {
    fetch_weather,
}

# Adding Tools to be used by Agent 
functions = FunctionTool(user_functions)

toolset = ToolSet()
toolset.add(functions)


# Create the agent
AGENT_NAME = "London Tourist Assistant"

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
   # endpoint=os.environ["PROJECT_ENDPOINT"],
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

agent = project_client.agents.create_agent(
    model=os.environ["MODEL_DEPLOYMENT_NAME"],
    name=AGENT_NAME,
    instructions="You are a helpful assistant",
    toolset=toolset,
)
print(f"Created agent, ID: {agent.id}")

thread = project_client.agents.create_thread()
print(f"Created thread, ID: {thread.id}")

# Create message to thread
MESSAGE = "Can you fetch me the weather in London?"

message = project_client.agents.create_message(
    thread_id=thread.id,
    role="user",
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

run = project_client.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)

print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

# display messages
for message in project_client.agents.list_messages(thread.id, order="asc").data:
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)

Created agent, ID: asst_1ofkn7ZFT3sh5oaYt3d4tIIq
Created thread, ID: thread_vWCk8Q1AMnlhSNJCGl9CrnqT
Created message, ID: msg_jgt5ETgUuIeAI2nnTvVDkHXu
Run finished with status: completed
Run ID: run_E2b2Q7a9Jm1KKUgnrV2ppgmY
Role: user
Content: Can you fetch me the weather in London?
----------------------------------------
Role: assistant
Content: The current weather in London is cloudy with a temperature of 18°C.
----------------------------------------


In [19]:
import os
from azure.ai.evaluation import ToolCallAccuracyEvaluator, AzureOpenAIModelConfiguration
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
   # api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

result = tool_call_accuracy_evaluator(
    query="How is the weather in London?",
    response="The weather in London is sunny.",
    tool_calls={
        "type": "tool_call",
        "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
        "name": "fetch_weather",
        "arguments": {
            "location": "London"
        }
    },
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)
pprint(result)

{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': 'The TOOL CALL is '
                                                         'directly relevant to '
                                                         "the user's query "
                                                         'about the weather in '
                                                         'London, uses the '
                                                         'correct parameter as '
                                                         'defined, and '
                                                         'includes the correct '
                                                         'parameter value from '
                                                         'the conversation. It '
                                                         'is likely to provide '
                                                         'useful informat

In [20]:
# ToolCallAccuracyEvaluator, os, and pprint are already imported in the cells above.
# Here the model configuration is passed as a plain dict instead of AzureOpenAIModelConfiguration.
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),  # https://<account_name>.services.ai.azure.com
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    # Uses the deployment name variable listed under Prerequisite.
    "azure_deployment": os.environ.get("MODEL_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
}

tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

result = tool_call_accuracy_evaluator(
    query="How is the weather in London?",
    response="The weather in London is sunny.",
    tool_calls={
        "type": "tool_call",
        "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
        "name": "fetch_weather",
        "arguments": {
            "location": "London"
        }
    },
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)
pprint(result)

{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': 'The TOOL CALL is '
                                                         'directly relevant to '
                                                         "the user's query, "
                                                         'uses the correct '
                                                         'parameter and value, '
                                                         'and is likely to '
                                                         'provide useful '
                                                         'information to '
                                                         'advance the '
                                                         'conversation.',
                            'tool_call_id': 'call_eYtq7fMyHxDWIgeG2s26h0lJ'}],
 'tool_call_accuracy': 1.0,
 'tool_call_accuracy_result': 'pass',
 'tool_call_accuracy_threshold':

In [22]:
# Initialize the ToolCallAccuracyEvaluator with the model configuration
# and evaluate a tool call for fetching weather information in Seattle.
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)
metric = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "id": "fetch_weather",
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for.",
                    }
                },
            },
        }
    ],
)

pprint(metric)

{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': 'The TOOL CALL is '
                                                         'directly relevant to '
                                                         "the user's query "
                                                         'about the weather in '
                                                         'Seattle. The '
                                                         'parameters used are '
                                                         'appropriate and '
                                                         'correctly extracted '
                                                         'from the '
                                                         'conversation. The '
                                                         'tool call is likely '
                                                         'to provide useful '
              

### Samples

#### Evaluating Single Tool Call

In [23]:
query = "How is the weather in London ?"
tool_call = {
    "type": "tool_call",
    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
    "name": "fetch_weather",
    "arguments": {"location": "London"},
}

tool_definition = {
    "id": "fetch_weather",
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

In [24]:
response = tool_call_accuracy(query=query, tool_calls=tool_call, tool_definitions=tool_definition)
pprint(response)

{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': 'The TOOL CALL is '
                                                         'directly relevant to '
                                                         "the user's query "
                                                         'about the weather in '
                                                         'London, uses the '
                                                         'correct parameter as '
                                                         'per the TOOL '
                                                         'DEFINITION, and the '
                                                         'parameter value is '
                                                         'correctly extracted '
                                                         'from the '
                                                         'CONVERSATION. It is '
         

#### Multiple Tool Calls used by Agent to respond

In [38]:
query = "How is the weather in London ?"
tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {"location": "Seattle"},
    },
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {"location": "London"},
    },
]

tool_definition = {
    "id": "fetch_weather",
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

In [39]:
response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definition)
pprint(response)

#### Tool Calls passed as part of `Response` (common for agent case)
- Tool Call Accuracy Evaluator extracts tool calls from response
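
To make the extraction step concrete, here is a minimal sketch of how tool calls could be pulled out of a response message list. It assumes the message shape used in the cell that follows (assistant messages whose `content` is a list of items with a `type` field). The function `extract_tool_calls` and the `call_1`/`call_2` IDs are hypothetical; the evaluator's actual implementation may differ.

```python
from typing import Any, Dict, List


def extract_tool_calls(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect every content item of type "tool_call" from assistant messages."""
    tool_calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue  # tool results and user turns are skipped
        content = message.get("content")
        if not isinstance(content, list):
            continue  # a plain-text response carries no structured tool calls
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                tool_calls.append(item)
    return tool_calls


sample_messages = [
    {
        "role": "assistant",
        "content": [
            {"type": "tool_call", "tool_call_id": "call_1", "name": "fetch_weather", "arguments": {"location": "Seattle"}}
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy"}}],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "tool_call", "tool_call_id": "call_2", "name": "send_email", "arguments": {"recipient": "your_email@example.com"}}
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "Email sent."}]},
]

print([call["name"] for call in extract_tool_calls(sample_messages)])
# ['fetch_weather', 'send_email']
```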

In [32]:
query = "Can you send me an email with weather information for Seattle?"
response = [
    {
        "createdAt": "2025-03-26T17:27:35Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:37Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "role": "tool",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy, 14\u00b0C"}}],
    },
    {
        "createdAt": "2025-03-26T17:27:38Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_iq9RuPxqzykebvACgX8pqRW2",
                "name": "send_email",
                "arguments": {
                    "recipient": "your_email@example.com",
                    "subject": "Weather Information for Seattle",
                    "body": "The current weather in Seattle is rainy with a temperature of 14\u00b0C.",
                },
            }
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:41Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "tool_call_id": "call_iq9RuPxqzykebvACgX8pqRW2",
        "role": "tool",
        "content": [
            {"type": "tool_result", "tool_result": {"message": "Email successfully sent to your_email@example.com."}}
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:42Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "I have successfully sent you an email with the weather information for Seattle. The current weather is rainy with a temperature of 14\u00b0C.",
            }
        ],
    },
]

tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
        },
    },
]

In [33]:
response = tool_call_accuracy(query=query, response=response, tool_definitions=tool_definitions)
pprint(response)

{'per_tool_call_details': [{'tool_call_accurate': True,
                            'tool_call_accurate_reason': 'The TOOL CALL is '
                                                         'directly relevant to '
                                                         "the user's request, "
                                                         'uses the correct '
                                                         'parameter and value, '
                                                         'and is likely to '
                                                         'provide useful '
                                                         'information to '
                                                         'advance the '
                                                         'conversation.',
                            'tool_call_id': 'call_CUdbkBfvVBla2YP3p24uhElJ'},
                           {'tool_call_accurate': True,
                            'tool_call_ac

## Batch evaluate and visualize results on Azure AI Foundry
Use batch evaluation to run the evaluator asynchronously over a dataset.

Optionally, open the AI Foundry URL printed by the cell below to inspect the results in Azure AI Foundry. There you can review the evaluation scores and reasoning to quickly identify and fix issues with your agent. Make sure to authenticate to Azure using `az login` in your terminal before running this cell.
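
If you need to build a JSONL dataset yourself, the sketch below shows the general layout: one JSON object per line. The column names `query`, `response`, and `tool_definitions` are an assumption chosen to mirror the evaluator's keyword arguments used earlier; the file name and the helpers `write_jsonl` and `read_jsonl` are hypothetical. The example writes to a temporary directory so that it does not overwrite `evaluation_data.jsonl`.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line, the layout JSONL readers expect."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed row shape: keys mirror the evaluator's keyword arguments.
rows = [
    {
        "query": "How is the weather in London?",
        "response": [
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "tool_call",
                        "tool_call_id": "call_1",
                        "name": "fetch_weather",
                        "arguments": {"location": "London"},
                    }
                ],
            }
        ],
        "tool_definitions": [
            {
                "name": "fetch_weather",
                "description": "Fetches the weather information for the specified location.",
                "parameters": {"type": "object", "properties": {"location": {"type": "string"}}},
            }
        ],
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_evaluation_data.jsonl"
    write_jsonl(rows, path)
    loaded = read_jsonl(path)

print(len(loaded), loaded[0]["query"])
```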


In [34]:
from azure.ai.evaluation import evaluate

# This sample file contains the evaluation data in JSONL format, where each line is a run from the agent.
# It was saved using the agent thread and converter.
file_name = "evaluation_data.jsonl"

response = evaluate(
    data=file_name,
    evaluation_name="Tool Call Accuracy Evaluation",
    evaluators={
        "tool_call_accuracy": tool_call_accuracy,
    },
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    },
)
pprint(f'AI Foundry URL: {response.get("studio_url")}')

[2025-06-26 14:59:09 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_tool_call_accuracy_20250626_145909_431868, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_tool_call_accuracy_20250626_145909_431868\logs.txt
 Please check out C:/Users/sumohammed/.promptflow/.runs/azure_ai_evaluation_evaluators_tool_call_accuracy_20250626_145909_431868 for more details.


2025-06-26 14:59:09 +0100    5196 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-26 14:59:09 +0100    5196 execution.bulk     INFO     Finished 1 / 5 lines.
2025-06-26 14:59:09 +0100    5196 execution.bulk     INFO     Average execution time for completed lines: 0.08 seconds. Estimated time for incomplete lines: 0.32 seconds.
2025-06-26 14:59:13 +0100    5196 execution.bulk     INFO     Finished 2 / 5 lines.
2025-06-26 14:59:13 +0100    5196 execution.bulk     INFO     Average execution time for completed lines: 1.75 seconds. Estimated time for incomplete lines: 5.25 seconds.
2025-06-26 14:59:14 +0100    5196 execution.bulk     INFO     Finished 3 / 5 lines.
2025-06-26 14:59:14 +0100    5196 execution.bulk     INFO     Average execution time for completed lines: 1.76 seconds. Estimated time for incomplete lines: 3.52 seconds.
2025-06-26 14:59:16 +0100    5196 execution.bulk     INFO     Finished 4 / 5 lines.
2025-

  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", math.nan, inplace=True)



{
    "tool_call_accuracy": {
        "status": "Completed with Errors",
        "duration": "0:00:08.305983",
        "completed_lines": 4,
        "failed_lines": 1,
        "log_path": "C:\\Users\\sumohammed\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_tool_call_accuracy_20250626_145909_431868"
    }
}


('AI Foundry URL: '
 'https://ai.azure.com/build/evaluation/9c0b2e66-ea66-46d9-91a4-3ce2131275ca?wsid=/subscriptions/687537c9-1139-4975-85ff-c4822c224772/resourceGroups/rg-sumohammed-6118_ai/providers/Microsoft.MachineLearningServices/workspaces/sumohammed-0192')


{'per_tool_call_details': [{'tool_call_accurate': False,
                            'tool_call_accurate_reason': 'The TOOL CALL is not '
                                                         'relevant to the '
                                                         "user's query about "
                                                         'the weather in '
                                                         'London, as it '
                                                         'fetches weather for '
                                                         'Seattle instead. The '
                                                         'parameter value used '
                                                         'is incorrect based '
                                                         'on the conversation, '
                                                         'and the TOOL CALL '
                                                         'does not contribute 