# Evaluating Conversations and Tool Calls

In this notebook, we demonstrate how to evaluate AI agents on conversation and tool-calling metrics using Vijil Evaluate.

## Conversational Metrics

We support four conversational metrics.

1. **Role Adherence**: determines whether an agent adheres to its given role throughout a conversation.
2. **Knowledge Retention**: determines whether an agent retains factual information presented during a conversation.
3. **Conversation Completeness**: determines whether an agent completes an end-to-end conversation by satisfying the user's needs.
4. **Conversation Relevancy**: determines whether an agent consistently generates relevant responses throughout a conversation.

Below we show how you can evaluate conversations obtained from an agent using these metrics.

To do so, we first instantiate the Vijil client.

In [1]:
import os
from dotenv import load_dotenv
load_dotenv("../.env")

# import and instantiate the client
from vijil import Vijil
client = Vijil(
    base_url="https://dev-api.vijil.ai/v1",
    api_key=os.getenv("VIJIL_API_KEY_DEV")
)

In this example, we have loaded a number of canonical conversation histories as the `conversation` test harness. Running this harness asks the agent under test to complete these conversations, then calculates the above metrics on them.

In [None]:
# create the evaluation
evaluation = client.evaluations.create(
    name="conversation-evaluation", # optional
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0}, # optional
    harnesses=["conversation"],
)
print(evaluation)

{'id': '33a44d17-7de7-4706-8e72-2d9ac0cc9389', 'status': 'CREATED'}


You can use the `get_status` method to keep track of the progress of the evaluation.

In [15]:
client.evaluations.get_status(evaluation_id=evaluation["id"])

{'id': '33a44d17-7de7-4706-8e72-2d9ac0cc9389',
 'name': 'OpenAI-gpt-4o-mini',
 'tags': [''],
 'status': 'COMPLETED',
 'cause': None,
 'total_test_count': 22,
 'completed_test_count': 22,
 'error_test_count': 0,
 'total_response_count': 22,
 'completed_response_count': 22,
 'error_response_count': 0,
 'total_generation_time': '3.000000',
 'average_generation_time': '2.0000000000000000',
 'score': 0.40131944444444445,
 'hub': 'openai',
 'model': 'gpt-4o-mini',
 'url': '',
 'created_at': 1732219057,
 'created_by': 'f6e0b128-c075-4bc3-91da-34d03fa6c67e',
 'completed_at': 1732219113,
 'team_id': '00ccc042-1b41-4f02-ae5f-6a09b5e6e844',
 'restart_count': 0,
 'metadata': None,
 'completion_tokens': 2035,
 'prompt_tokens': 1862,
 'total_tokens': 3897}
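
Rather than re-running the status cell by hand, we can poll until the evaluation finishes. The helper below is a minimal sketch, not part of the Vijil client: `wait_for_evaluation` and its parameters are our own names, and it assumes only what the output above shows, namely that `get_status` accepts an `evaluation_id` keyword and returns a dict with a `status` key that eventually becomes `COMPLETED`. Any failure statuses the service may report are not shown in this notebook, so pass them in through `terminal_statuses` if you know them.

```python
import time


def wait_for_evaluation(get_status, evaluation_id, poll_seconds=10.0,
                        timeout_seconds=1800.0, terminal_statuses=("COMPLETED",)):
    """Poll `get_status` until the evaluation reaches a terminal status.

    `get_status` is any callable that takes `evaluation_id=...` and returns a
    dict with a "status" key (assumed to match client.evaluations.get_status).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(evaluation_id=evaluation_id)
        if status["status"] in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"evaluation {evaluation_id} still {status['status']!r} "
                f"after {timeout_seconds} seconds"
            )
        # Wait before asking again so we do not hammer the API.
        time.sleep(poll_seconds)
```

With the client from above, this would be called as `wait_for_evaluation(client.evaluations.get_status, evaluation["id"])`. Passing the method in as a callable keeps the helper independent of the client, so it can be tried out with a stub.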

After the status changes to `COMPLETED`, you can aggregate the values of all metrics.

To do so, we first download all inputs, outputs, and metric values.

In [16]:
df = client.evaluations.describe(evaluation_id=evaluation["id"])

As an example, let's print out a conversation history, its generated answer, and the metrics that were computed.

In [17]:
print("CONVERSATION:")
for p in df.prompt_list[0]:
    print(p)

print(f"\nCOMPLETION\n{df.response[0]}")
print(f"\nMETRICS\n{df.score[0]}")

CONVERSATION:
{'role': 'user', 'content': "Hi! I'm planning a trip to Paris next month from the 15th to 22nd"}
{'role': 'assistant', 'content': "I'll help you plan your Paris trip for next month. For your dates of 15th to 22nd, what specific aspects would you like help with?"}
{'role': 'user', 'content': 'I need help finding a hotel near the Eiffel Tower'}
{'role': 'assistant', 'content': "For your week-long stay from the 15th to 22nd next month, I can help you find hotels near the Eiffel Tower. What's your budget per night?"}
{'role': 'user', 'content': 'Around 200 euros per night'}
{'role': 'assistant', 'content': "I'll look for hotels near the Eiffel Tower within your 200 euros per night budget for your stay from the 15th to 22nd."}

COMPLETION
Here are some hotel options near the Eiffel Tower that typically fall within your budget of around 200 euros per night:

1. **Hotel La Comtesse**  
   - Located just a short walk from the Eiffel Tower, this hotel offers beautiful views and a 

### Summarize Results

In [19]:
from statistics import mean

metrics = {}
for _, row in df.iterrows():
    for met, score in row.score[0].items():
        if met not in metrics:
            metrics[met] = []
        metrics[met].append(score)
        
# cleanup and average
metrics = {met.split(".")[3]: mean(metrics[met]) for met in metrics.keys()}
metrics

{'ConversationCompleteness': 0.3333333333333333,
 'ConversationKnowledgeRetention': 0.5625,
 'ConversationRelevancy': 0.7222222222222222,
 'ConversationRoleAdherence': 0.7766666666666666}

Looks like the agent under test performed well on conversation relevancy (0.72) and role adherence (0.78), moderately well on knowledge retention (0.56), but rather poorly on completing conversations (0.33).
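
The group-and-average step can also be tried without an API call. The sketch below runs the same idea on toy data: the `rows` list and its `pkg.detectors.conversation.*` keys are hypothetical stand-ins for the per-test score dictionaries, and `summarize` is our own helper name. It takes the last dot-separated component of each key as the metric name, which coincides with index 3 when a key has exactly four parts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-test score dicts; the dotted key layout is an assumption.
rows = [
    {"pkg.detectors.conversation.ConversationRelevancy": 1.0,
     "pkg.detectors.conversation.ConversationCompleteness": 0.5},
    {"pkg.detectors.conversation.ConversationRelevancy": 0.5,
     "pkg.detectors.conversation.ConversationCompleteness": 0.0},
]


def summarize(score_dicts):
    """Average each metric across tests, keyed by its short name."""
    grouped = defaultdict(list)
    for scores in score_dicts:
        for key, value in scores.items():
            short_name = key.rsplit(".", 1)[-1]  # keep the last path component
            grouped[short_name].append(value)
    return {name: mean(values) for name, values in grouped.items()}


summary = summarize(rows)
print(summary)
```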

## Tool Correctness

Our tool correctness metric measures the concordance between the set of expected tools and the set of tools actually called by a function-calling agent.
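
To make the idea concrete, here is one plausible way to score the overlap between expected and called tools. This is a sketch under our own assumptions, not necessarily the formula Vijil Evaluate uses: it computes the Jaccard overlap of the two sets and treats the case where no tool is expected and none is called as a perfect score. The function name `tool_concordance` and the tool names in the example are hypothetical.

```python
def tool_concordance(expected_tools, called_tools):
    """Jaccard overlap between expected and called tool names (illustrative only)."""
    expected, called = set(expected_tools), set(called_tools)
    if not expected and not called:
        # Nothing was expected and nothing was called: the agent behaved correctly.
        return 1.0
    return len(expected & called) / len(expected | called)


# Hypothetical tool names, used only to exercise the function.
print(tool_concordance(["search_hotels"], ["search_hotels"]))
print(tool_concordance(["search_hotels"], []))
print(tool_concordance([], []))
```

A set-based score ignores call order and repeated calls. A stricter variant could compare ordered lists or the call arguments as well.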

Let's evaluate a tool-calling agent, hosted on DigitalOcean, using this metric. This agent performs a simple arithmetic function called `gonzo_value` on two integers. We first store the credentials of the agent, then kick off an evaluation using an example test harness for this agent comprising five questions.

In [None]:
client.api_keys.create(
    name="gonzo-agent",
    model_hub="digitalocean",
    hub_config={
        "agent_id": os.getenv("DO_AGENT_ID"),
        "agent_key": os.getenv("DO_AGENT_KEY")
    },
    rate_limit_per_interval=60, # optional
    rate_limit_interval=60 # optional
)

{'id': 'efa504cf-ba1e-4edc-97cc-2aafc8d6633b',
 'name': 'gonzo-agent',
 'hub': 'digitalocean',
 'rate_limit_per_interval': 60,
 'rate_limit_interval': 60,
 'display_value': '',
 'hub_config': {'region': None,
  'project_id': None,
  'client_id': None,
  'display_client_secret': None,
  'display_access_key': None,
  'display_secret_access_key': None,
  'agent_id': 'fbb80c9a-9d48-11ef-bf8f-4e013e2ddde4',
  'display_agent_key': 'Ro****************************bC'}}

In [None]:
# create the evaluation
evaluation = client.evaluations.create(
    name="gonzo-agent-evaluation", # optional
    api_key_name="gonzo-agent",
    model_hub="digitalocean",
    model_url=os.getenv("DO_AGENT_ENDPOINT"),
    harnesses=["tool_correctness"]
)
print(evaluation)

{'id': '94cc3ca5-9b4a-49e4-bd38-129b3a91ee62', 'status': 'CREATED'}


In [8]:
df = client.evaluations.describe(evaluation_id=evaluation["id"])
df[['prompt','response','avg_detector_score']]

| | prompt | response | avg_detector_score |
|---|---|---|---|
| 0 | gonzo_value(7,8) = ? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 1 | Give me the gonzo value of 1 and 2. | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 2 | Gonzo(3,4) = ? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 3 | What is the gonzo value of 5 and 6? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 4 | What is the capital of France? | The capital of France is Paris! | 1.0 |


Looks like our agent did great! The first four prompts ask the agent to calculate the gonzo value of two numbers in different ways. The last prompt asks a question that shouldn't require calling the `gonzo-value` function. In every case, the agent made the correct choice: it called the tool for the first four prompts and answered the last one directly.