# Evaluating Conversations and Tool Calls

This notebook demonstrates how to evaluate AI agents on conversation and tool-calling metrics using Vijil Evaluate.

## Conversational Metrics

We support four conversational metrics.

1. **Role Adherence**: determines whether an agent is able to adhere to its given role throughout a conversation.
2. **Knowledge Retention**: determines whether an agent is able to retain factual information presented throughout a conversation.
3. **Conversation Completeness**: determines whether an agent is able to complete an end-to-end conversation by satisfying the user's needs.
4. **Conversation Relevancy**: determines whether an agent is able to consistently generate relevant responses throughout a conversation.

Below we show how you can evaluate conversations obtained from an agent using these metrics.

To do so, we first instantiate the Vijil client.

In [None]:
import os
from dotenv import load_dotenv
load_dotenv("../.env")

# import and instantiate the client
from vijil import Vijil
client = Vijil()

In this example, we have loaded a number of canonical conversation histories as the `conversation` test harness. Running this harness asks the agent under test to complete these conversations, then calculates the above metrics on them.

In [2]:
# create the evaluation
evaluation = client.evaluations.create(
    name="conversation-evaluation", # optional
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0}, # optional
    harnesses=["conversation"],
)
print(evaluation)

{'id': '78ba9a24-af57-42fb-917d-f0adff09581d', 'status': 'CREATED'}


You can use the `get_status` method to keep track of the progress of the evaluation.

In [5]:
client.evaluations.get_status(evaluation_id=evaluation["id"])

{'id': '78ba9a24-af57-42fb-917d-f0adff09581d',
 'name': 'conversation-evaluation',
 'tags': [''],
 'status': 'COMPLETED',
 'cause': None,
 'total_test_count': 22,
 'completed_test_count': 22,
 'error_test_count': 0,
 'total_response_count': 22,
 'completed_response_count': 22,
 'error_response_count': 0,
 'total_generation_time': '7.000000',
 'average_generation_time': '2.6818181818181818',
 'score': 0.35618055555555556,
 'hub': 'openai',
 'model': 'gpt-4o-mini',
 'url': '',
 'created_at': 1732221884,
 'created_by': '887ef7e6-565b-454e-8dae-277643d6dbab',
 'completed_at': 1732221919,
 'team_id': 'ef6bdaec-f563-487c-b036-674c912da053',
 'restart_count': 0,
 'metadata': None,
 'completion_tokens': 2063,
 'prompt_tokens': 1862,
 'total_tokens': 3925}
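
Rather than re-running the status cell by hand, you can poll until the evaluation finishes. The following is a minimal sketch, not part of the Vijil client: `wait_for_evaluation` is a hypothetical helper, and it assumes only what the output above shows, namely that the status call accepts `evaluation_id` and returns a dict with a `status` key. `COMPLETED` is the only terminal status seen in this notebook; any other terminal values would need to be added to `terminal_statuses`.

```python
import time


def wait_for_evaluation(
    get_status,
    evaluation_id,
    terminal_statuses=("COMPLETED",),
    poll_seconds=5.0,
    timeout_seconds=600.0,
    sleep=time.sleep,
):
    """Call get_status repeatedly until the evaluation reaches a terminal status.

    get_status is any callable that accepts evaluation_id=... and returns a
    dict containing a "status" key (hypothetical helper, not a Vijil API).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(evaluation_id=evaluation_id)
        if status["status"] in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"evaluation {evaluation_id} still {status['status']!r} "
                f"after {timeout_seconds} seconds"
            )
        # Wait before asking again so we do not hammer the service.
        sleep(poll_seconds)
```

With the client from this notebook, this could be called as `wait_for_evaluation(client.evaluations.get_status, evaluation["id"])`.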

After the status changes to `COMPLETED`, you can aggregate the values of all metrics.

To do so, we first download all inputs, outputs, and metrics values.

In [6]:
df = client.evaluations.describe(evaluation_id=evaluation["id"])

As an example, let's print out a conversation history, its generated answer, and the metrics that were computed.

In [7]:
print("CONVERSATION:")
for p in df.prompt_list[0]:
    print(p)

print(f"\nCOMPLETION\n{df.response[0]}")
print(f"\nMETRICS\n{df.score[0]}")

CONVERSATION:
{'role': 'user', 'content': 'Hi! I have something I want to tell you'}
{'role': 'assistant', 'content': 'Sure, what is it?'}
{'role': 'user', 'content': "I've a sore throat, what meds should I take?"}
{'role': 'assistant', 'content': 'Have you tried drinking warm water with honey and lemon?'}
{'role': 'user', 'content': "Yes, but it doesn't help"}
{'role': 'assistant', 'content': 'I recommend you see a doctor'}

COMPLETION
If warm water with honey and lemon isn't helping, you might consider over-the-counter medications such as:

1. **Throat lozenges**: These can help soothe the throat.
2. **Pain relievers**: Ibuprofen or acetaminophen can help reduce pain and inflammation.
3. **Gargling salt water**: This can provide temporary relief.
4. **Throat sprays**: Some sprays contain numbing agents that can help alleviate pain.

If your sore throat persists for more than a few days, is severe, or is accompanied by other symptoms like fever, difficulty swallowing, or rash, it's

M

Let's now summarize the results across all samples and metrics.

In [8]:
from statistics import mean

metrics = {}
for _, row in df.iterrows():
    for met, score in row.score[0].items():
        if met not in metrics:
            metrics[met] = []
        metrics[met].append(score)
        
# cleanup and average
metrics = {met.split(".")[3]: mean(metrics[met]) for met in metrics.keys()}
metrics

{'ConversationCompleteness': 0.3333333333333333,
 'ConversationKnowledgeRetention': 0.6875,
 'ConversationRelevancy': 0.7777777777777778,
 'ConversationRoleAdherence': 0.7766666666666666}

It looks like the agent under test performed well on conversation relevancy (0.78) and role adherence (0.78), moderately well on knowledge retention (0.69), and poorly on conversation completeness (0.33).
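
The aggregation above can also be wrapped in a small reusable function. This is a sketch under stated assumptions: `average_metrics` is a hypothetical helper, the sample data and its dotted key prefix are made up for illustration, and it keeps the last dotted component of each metric name rather than the fourth component that the cell above selects with `split(".")[3]`.

```python
from collections import defaultdict
from statistics import mean


def average_metrics(score_dicts):
    """Average each metric across samples, keyed by the last dotted name part."""
    collected = defaultdict(list)
    for scores in score_dicts:
        for name, value in scores.items():
            # "a.b.c.Metric" -> "Metric"
            collected[name.rsplit(".", 1)[-1]].append(value)
    return {name: mean(values) for name, values in collected.items()}


# Hypothetical per-sample scores; the "pkg.mod.metrics" prefix is invented.
samples = [
    {
        "pkg.mod.metrics.ConversationRelevancy": 1.0,
        "pkg.mod.metrics.ConversationCompleteness": 0.5,
    },
    {"pkg.mod.metrics.ConversationRelevancy": 0.5},
]

print(average_metrics(samples))
# {'ConversationRelevancy': 0.75, 'ConversationCompleteness': 0.5}
```

Because a metric is averaged only over the samples that report it, a metric missing from some rows does not drag the mean toward zero.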

## Tool Correctness

Our tool correctness metric measures how closely the set of tools called by a function-calling agent matches the set of tools it was expected to call.
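
To build intuition for comparing expected and called tool sets, here is one simple way to score the overlap. This is an illustrative sketch only: it uses Jaccard similarity, which is an assumption chosen for clarity and not necessarily the formula Vijil Evaluate uses, and `tool_overlap_score` is a hypothetical function name.

```python
def tool_overlap_score(expected_tools, called_tools):
    """Jaccard similarity between expected and called tool names.

    Returns 1.0 when both sets are empty, since calling no tool is the
    correct behavior when no tool was expected.
    """
    expected, called = set(expected_tools), set(called_tools)
    if not expected and not called:
        return 1.0
    return len(expected & called) / len(expected | called)


# The agent called exactly the expected tool.
print(tool_overlap_score(["gonzo-value"], ["gonzo-value"]))  # 1.0
# No tool was expected and none was called.
print(tool_overlap_score([], []))  # 1.0
# A tool was expected but the agent answered without calling it.
print(tool_overlap_score(["gonzo-value"], []))  # 0.0
```

Under this scoring, an agent that correctly answers a plain question without calling any tool receives a perfect score, just as it does when it calls the one expected tool.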

Let's evaluate a tool-calling agent, hosted on DigitalOcean, using this metric. This agent performs a simple arithmetic function called `gonzo_value` on two integers. We first store the credentials of the agent, then kick off an evaluation using an example test harness for this agent consisting of five questions.

In [None]:
client.api_keys.create(
    name="gonzo-agent", # optional
    model_hub="digitalocean",
    hub_config={
        "agent_id": os.getenv("DO_AGENT_ID"),
        "agent_key": os.getenv("DO_AGENT_KEY")
    },
    rate_limit_per_interval=60, # optional
    rate_limit_interval=60 # optional
)

{'id': 'a78f7d33-75bd-4166-a432-9777da3294a0',
 'name': 'gonzo-agent',
 'hub': 'digitalocean',
 'rate_limit_per_interval': 60,
 'rate_limit_interval': 60,
 'display_value': '',
 'hub_config': {'region': None,
  'project_id': None,
  'client_id': None,
  'display_client_secret': None,
  'display_access_key': None,
  'display_secret_access_key': None,
  'agent_id': 'fbb80c9a-9d48-11ef-bf8f-4e013e2ddde4',
  'display_agent_key': 'Ro****************************bC'}}

In [2]:
# create the evaluation
evaluation = client.evaluations.create(
    name="gonzo-agent-evaluation", # optional
    api_key_name="gonzo-agent",
    model_hub="digitalocean",
    model_url=os.getenv("DO_AGENT_ENDPOINT"),
    harnesses=["tool_correctness"]
)
print(evaluation)

{'id': '9ee86b11-8ad1-43db-85b0-766124ff07bb', 'status': 'CREATED'}

Once the evaluation completes, we can download the prompts, responses, and scores.


In [3]:
df = client.evaluations.describe(evaluation_id=evaluation["id"])
df[['prompt','response','avg_detector_score']]

| | prompt | response | avg_detector_score |
|---|---|---|---|
| 0 | Give me the gonzo value of 1 and 2. | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 1 | Gonzo(3,4) = ? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 2 | gonzo_value(7,8) = ? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 3 | What is the gonzo value of 5 and 6? | [{'id': 'n/a', 'function': {'name': 'gonzo-val... | 1.0 |
| 4 | What is the capital of France? | The capital of France is Paris! | 1.0 |


Looks like our agent did great! The first four prompts ask the agent, in different ways, to calculate the gonzo value of two numbers. The last prompt asks a question that shouldn't require calling the `gonzo-value` function. In every case the agent behaved as expected: it called the tool for the first four prompts and answered the last one directly.