# Evaluating AI Agents Using LLMs as a Judge

## The Critical Need for Agent Evaluation

After building our agent, we faced a critical question: How do we know it's actually working as intended? Unlike simple chatbots, AI agents make decisions about when to use tools and how to interpret user requests. This requires specialized evaluation approaches.

In our Startup Strategist project, we implemented three specialized evaluation methods to ensure the agent performs as intended:

- **Routing Evaluation:** Does the agent correctly decide when to use tools versus responding directly?

- **Function Generation Evaluation:** Does the agent select the appropriate tool for each task?

- **Parameter Extraction Evaluation:** Does the agent properly extract and format parameters from user queries?

These evaluations do more than check that the system runs. They verify that it makes context-aware decisions that align with entrepreneurial best practices. Without this evaluation layer, we would have no way to tell whether our agent is providing real value or just producing plausible-sounding outputs.


## Setting Up the Evaluation Framework

To implement our evaluation system, we used Phoenix's evaluation toolkit with OpenAI's GPT-4o as the judge model. The setup involves several steps:

First, we install the necessary packages including **Phoenix** for observability and the **OpenAI** client library. The environment configuration ensures we can track all agent interactions for later analysis:


In [None]:
!pip install -qq "arize-phoenix[evals,llama-index]" "llama-index-llms-openai" "openai>=1" gcsfs nest_asyncio

In [None]:
!pip uninstall -y numpy
!pip install numpy

In [None]:
import os
from getpass import getpass

def _set_if_undefined(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Please provide your {var}")


_set_if_undefined("OPENAI_API_KEY")

In [None]:
import nest_asyncio
import pandas as pd

nest_asyncio.apply()

In [None]:
from phoenix.evals import OpenAIModel

## Generating Realistic Test Cases
To properly test our Startup Strategist agent, we needed queries that mirror how real entrepreneurs think and speak. We generated test cases using GPT-4o with instructions to create nuanced prompts that challenge the agent's capabilities. The dataset includes:

- **Mixed requests:** Combining brainstorming with document generation or email drafting

- **Vague explorations:** Early-stage ideas lacking clear direction

- **Indirect phrasing:** Natural language where intent isn't explicitly stated

- **Multi-step workflows:** Requests spanning multiple tools in one conversation

This approach surfaces how the agent handles real ambiguity rather than just optimized test cases. The dataset becomes a living benchmark that grows as we encounter new edge cases in production.

In [None]:
GEN_TEMPLATE = """
You are an AI startup strategist assistant that interacts with aspiring entrepreneurs and product teams. You help brainstorm startup ideas, validate concepts, create lean business canvases, write emails, and save content to DOCX files.

Generate complex and nuanced user queries that could be directed to such an assistant. The prompts should test multiple abilities of the agent and simulate real-world startup discussions.

Include:
- Mixed Intentions: Questions that blur brainstorming, validation, and action (e.g., save/send).
- Vague Goals: Queries with unclear or evolving business concepts that need refining.
- Multiple Tools: Prompts where the user expects several tasks in one (e.g., idea + email + doc).
- Polite or Indirect Phrasing: Users unsure of what they want or how to start.
- Hypothetical or Early-Stage Scenarios: People exploring an idea before committing.

Encourage queries such as:
- Startup idea brainstorming
- Market validation or competitive analysis
- Business model and lean canvas generation
- Professional email drafting for outreach or funding
- Requests to document/save notes, ideas, or canvas

Examples of More Challenging Prompts:

Mixed Intentions
"I have this rough idea for a marketplace app for freelancers—could we brainstorm it and also maybe draft something to pitch?"
"I'm trying to validate an AI-based tutor tool. Should I test with students or start with a lean canvas? Can we save our notes?"

Vague Goals
"I'm thinking of doing something with AI in mental health... not sure if it's a good direction?"
"What do you think about combining wearables and machine learning? Is there a startup in that?"

Indirect Language
"I was wondering if you could help me shape a business idea I have—it's a bit raw still."
"Maybe you could suggest a few startup directions if I told you I'm into education and automation?"

Tool Combination
"Can you help me draft an investor email for my B2B SaaS tool and also save a lean canvas in DOCX?"
"I want to log all our brainstorming today in a document. Also, can you suggest a few competitor angles?"

Straightforward Examples:
- "Generate a lean business canvas for a pet health app."
- "Save our idea summary and validation notes into a DOCX file."
- "Write a follow-up email to a potential mentor about my AI product."

Respond with a list, one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 50 questions.
"""


In [None]:
model = OpenAIModel(model="gpt-4o", max_tokens=1300)

In [None]:
resp = model(GEN_TEMPLATE)

In [None]:
resp

"I'm considering a startup in the renewable energy sector, but I'm not sure where to start. Could you help me brainstorm and maybe draft an initial concept note?\nI have an idea for a subscription box service for eco-friendly products. Can we explore this idea and also create a lean business canvas?\nCould you help me validate a concept for a virtual reality fitness app? I'm not sure if I should start with market research or a prototype.\nI'm interested in developing a platform for remote team building. Can you suggest some features and help me draft an email to potential partners?\nI have a vague idea about using AI for personalized learning experiences. Could you help refine this and maybe save our discussion in a document?\nI'm thinking about a startup in the elder care tech space. Could you help me brainstorm and also draft a pitch email for potential investors?\nCould you assist me in creating a lean canvas for a blockchain-based supply chain solution and save it as a DOCX file?\n

In [None]:
split_response = resp.strip().split('\n')
questions_df = pd.DataFrame(split_response, columns=['questions'])

In [None]:
questions_df

Unnamed: 0,questions
0,I'm considering a startup in the renewable ene...
1,I have an idea for a subscription box service ...
2,Could you help me validate a concept for a vir...
3,I'm interested in developing a platform for re...
4,I have a vague idea about using AI for persona...
5,I'm thinking about a startup in the elder care...
6,Could you assist me in creating a lean canvas ...
7,I'm exploring the idea of a digital detox app....
8,I have a concept for a smart home device but n...
9,I'm interested in a startup that combines AI a...


The generation instructions above intentionally ask for ambiguous requests, multi-step tasks, and incomplete information, which is exactly the kind of input real users provide. This gives us a dataset for evaluating how the agent handles messy, real-world scenarios.
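
The generation prompt asks for one question per line without numbering, but model output does not always comply. The sketch below shows one way to clean the raw response before building the DataFrame. The helper name `parse_questions` and the sample string are assumptions for illustration, not part of the original notebook.

```python
import re


def parse_questions(raw: str) -> list[str]:
    """Split a newline-separated LLM response into clean, unique questions."""
    seen = set()
    questions = []
    for line in raw.splitlines():
        # Drop list markers such as "1. ", "- " or "* " and surrounding quotes.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip().strip('"')
        if not text:
            continue  # skip blank lines left by the split
        key = text.lower()
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        questions.append(text)
    return questions


# Hypothetical raw response containing a blank line, a numbered line and a duplicate.
raw = (
    "Can we brainstorm an idea?\n\n"
    "1. Save our notes to a DOCX file.\n"
    "- Can we brainstorm an idea?\n"
)
cleaned = parse_questions(raw)
print(cleaned)
```

The cleaned list can be passed to `pd.DataFrame(cleaned, columns=['questions'])` in place of the plain `split('\n')` result.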



## Creating Our Function-Calling Agent

To properly test our Startup Strategist, we'll recreate the same LangChain agent architecture from our production system in this evaluation notebook.

We'll use the identical configuration:

- GPT-4o as our base LLM

- The same functions we gave it as tools

- The same prompt template establishing the assistant's personality

In [None]:
!pip install langchain
!pip install -qU arize-phoenix openinference-instrumentation-langchain

In [None]:
!pip install langchain_openai

In [None]:
from langchain.pydantic_v1 import BaseModel, Field
from langchain.tools import BaseTool, StructuredTool, tool
from langchain_openai import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    MessagesPlaceholder
)
from langchain.agents import AgentType, Tool, initialize_agent

from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

In [None]:
import phoenix as px
session = px.launch_app()

In [None]:
pip install requests

In [None]:
pip install python-docx

In [None]:
import requests
from docx import Document
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

EMAIL_CONFIG = {
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,
    "sender_email": "",
    "sender_password": ""
}

In [None]:

@tool
def brainstorm_startup_ideas(
    interests: str,
    industry: str = "",
    skills: str = "",
    pain_points: str = ""
) -> dict:
    """
    Generate creative startup ideas based on interests, industry, skills, and pain points.
    """
    url = "https://api.ubiai.tools:8443/api_v1/annotate"
    my_token = ""

    user_prompt = f"""Generate creative and actionable startup or project ideas based on the following:
- Interests: {interests}
- Industry: {industry}
- Skills: {skills}
- Pain Points: {pain_points}
Format the output clearly with bullet points or numbers."""

    data = {
        "input_text": "",
        "system_prompt": "You are a helpful assistant who generates startup ideas tailored to user input.",
        "user_prompt": user_prompt,
        "temperature": 0.7,
        "monitor_model": True,
        "knowledge_base_ids": [],
        "session_id": "",
        "images_urls": []
    }

    try:
        response = requests.post(url + my_token, json=data)
        if response.status_code == 200:
            res = response.json()
            return {"ideas": res.get("output")}
        else:
            return {"error": f"{response.status_code} - {response.text}"}
    except Exception as e:
        return {"error": str(e)}


@tool
def validate_idea(
    idea_description: str,
    target_audience: str = "",
    value_proposition: str = "",
    assumptions: str = ""
) -> dict:
    """
    Validate a startup idea using lean startup framework and provide validation steps.
    """
    url = "https://api.ubiai.tools:8443/api_v1/annotate"
    my_token = ""

    user_prompt = f"""Validate the following idea using a lean startup framework. Provide a checklist or set of validation questions that help assess its feasibility.

Idea: {idea_description}
Target Audience: {target_audience}
Value Proposition: {value_proposition}
Key Assumptions: {assumptions}

Give practical and direct validation steps or criteria.
"""

    data = {
        "input_text": "",
        "system_prompt": "You are a lean startup expert helping users validate ideas realistically.",
        "user_prompt": user_prompt,
        "temperature": 0.7,
        "monitor_model": True,
        "knowledge_base_ids": [],
        "session_id": "",
        "images_urls": []
    }

    try:
        response = requests.post(url + my_token, json=data)
        if response.status_code == 200:
            res = response.json()
            return {"validation": res.get("output")}
        else:
            return {"error": f"{response.status_code} - {response.text}"}
    except Exception as e:
        return {"error": str(e)}


@tool
def generate_lean_canvas(
    idea: str,
    target_market: str = "",
    value_proposition: str = ""
) -> dict:
    """
    Generate a Lean Canvas for a startup idea with detailed sections.
    """
    url = "https://api.ubiai.tools:8443/api_v1/annotate"
    my_token = ""

    user_prompt = f"""Create a Lean Canvas for the following startup idea. Provide clear sections such as Problem, Solution, Key Metrics, Unique Value Proposition, Channels, Customer Segments, Cost Structure, Revenue Streams, and Unfair Advantage.

Idea: {idea}
Target Market: {target_market}
Value Proposition: {value_proposition}

Format the canvas clearly with section headers.
"""

    data = {
        "input_text": "",
        "system_prompt": "You are a business strategist generating detailed Lean Canvas models for startups.",
        "user_prompt": user_prompt,
        "temperature": 0.7,
        "monitor_model": True,
        "knowledge_base_ids": [],
        "session_id": "",
        "images_urls": []
    }

    try:
        response = requests.post(url + my_token, json=data)
        if response.status_code == 200:
            res = response.json()
            return {"lean_canvas": res.get("output")}
        else:
            return {"error": f"{response.status_code} - {response.text}"}
    except Exception as e:
        return {"error": str(e)}


@tool
def save_to_word_doc(content: str, filename: str = "business_canvas.docx") -> dict:
    """
    Save text content to a Word document.
    """
    try:
        doc = Document()
        doc.add_heading("Business Canvas Document", 0)
        for line in content.split('\n'):
            doc.add_paragraph(line)
        doc.save(filename)
        return {"filename": filename, "message": f"Document saved as {filename}"}
    except Exception as e:
        return {"error": str(e)}


@tool
def send_email(
    recipient_email: str,
    subject: str,
    body: str,
    sender_email: str = None,
    sender_password: str = None
) -> dict:
    """
    Send an email using SMTP.
    """
    try:
        from_email = sender_email if sender_email else EMAIL_CONFIG["sender_email"]
        password = sender_password if sender_password else EMAIL_CONFIG["sender_password"]

        if not from_email or not password:
            return {"error": "Email credentials not configured"}

        msg = MIMEMultipart()
        msg['From'] = from_email
        msg['To'] = recipient_email
        msg['Subject'] = subject
        msg.attach(MIMEText(body, 'plain'))

        server = smtplib.SMTP(EMAIL_CONFIG["smtp_server"], EMAIL_CONFIG["smtp_port"])
        server.starttls()
        server.login(from_email, password)
        server.sendmail(from_email, recipient_email, msg.as_string())
        server.quit()

        return {"status": "success", "message": f"Email sent to {recipient_email}"}
    except Exception as e:
        return {"error": str(e)}




In [None]:
tools = [
    brainstorm_startup_ideas,
    validate_idea,
    generate_lean_canvas,
    save_to_word_doc,
    send_email
]

In [None]:
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant"),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

In [None]:
agent_executor = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS)

We then use the agent to answer all the questions we generated earlier. This might take some time, so go do something fun while you wait!

In [None]:
questions_df["response"] = questions_df["questions"].apply(agent_executor.invoke)
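
Running every question through `apply` means a single exception (a network error, a rate limit) aborts the whole batch. A minimal sketch of a defensive wrapper follows. `run_safely` and `fake_invoke` are hypothetical names, and `fake_invoke` only stands in for `agent_executor.invoke` so the example runs without an API key.

```python
from typing import Any, Callable


def run_safely(invoke: Callable[[str], Any], question: str) -> dict:
    """Call the agent and record failures instead of raising."""
    try:
        return {"input": question, "output": invoke(question), "error": None}
    except Exception as exc:  # keep the batch going and record what failed
        return {"input": question, "output": None, "error": str(exc)}


# Stand-in for agent_executor.invoke (an assumption for this sketch).
def fake_invoke(question: str) -> str:
    if "email" in question:
        raise RuntimeError("Email credentials not configured")
    return f"Handled: {question}"


results = [run_safely(fake_invoke, q) for q in ["Brainstorm an app", "Send an email"]]
for row in results:
    print(row)
```

With the real agent, the same idea would look like `questions_df["questions"].apply(lambda q: run_safely(agent_executor.invoke, q))`.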

Let's view the trace and results of each agent execution. This is what we will use to evaluate our agent.

In [None]:
trace_df = px.Client().get_spans_dataframe()

In [None]:
trace_df

Unnamed: 0_level_0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,attributes.input.value,attributes.metadata,attributes.llm.model_name,attributes.llm.input_messages,attributes.llm.token_count.prompt,attributes.output.value,attributes.llm.token_count.total,attributes.llm.token_count.prompt_details.audio,attributes.tool.description,attributes.tool.name
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
a6e06d6b4560b1df,ChatOpenAI,LLM,cec641271dc4a0d2,2025-06-18 06:41:47.940763+00:00,2025-06-18 06:41:49.249043+00:00,OK,,[],a6e06d6b4560b1df,ddeebbf9e61c4e6ffed5c0d0ff33bbb6,...,"{""messages"": [[{""lc"": 1, ""type"": ""constructor""...","{'ls_provider': 'openai', 'ls_model_name': 'gp...",gpt-4o,"[{'message.role': 'system', 'message.content':...",317.0,"{""generations"": [[{""text"": """", ""generation_inf...",343.0,0.0,,
057e0b62f9562abd,brainstorm_startup_ideas,TOOL,cec641271dc4a0d2,2025-06-18 06:41:49.267468+00:00,2025-06-18 06:42:05.122377+00:00,OK,,[],057e0b62f9562abd,ddeebbf9e61c4e6ffed5c0d0ff33bbb6,...,"{'interests': 'renewable energy', 'industry': ...",,,,,"{""output"": {""ideas"": ""Here are some creative a...",,,Generate creative startup ideas based on inter...,brainstorm_startup_ideas
52fbfd6baeacce0b,ChatOpenAI,LLM,cec641271dc4a0d2,2025-06-18 06:42:05.140027+00:00,2025-06-18 06:42:17.907953+00:00,OK,,[],52fbfd6baeacce0b,ddeebbf9e61c4e6ffed5c0d0ff33bbb6,...,"{""messages"": [[{""lc"": 1, ""type"": ""constructor""...","{'ls_provider': 'openai', 'ls_model_name': 'gp...",gpt-4o,"[{'message.role': 'system', 'message.content':...",1270.0,"{""generations"": [[{""text"": ""Here are some crea...",2127.0,0.0,,
cec641271dc4a0d2,AgentExecutor,AGENT,,2025-06-18 06:41:47.915617+00:00,2025-06-18 06:42:17.910891+00:00,OK,,[],cec641271dc4a0d2,ddeebbf9e61c4e6ffed5c0d0ff33bbb6,...,I'm considering a startup in the renewable ene...,,,,,Here are some creative and actionable startup ...,,,,
916565ca2d2c3038,ChatOpenAI,LLM,23953d87c74fb54b,2025-06-18 06:42:17.926053+00:00,2025-06-18 06:42:19.011231+00:00,OK,,[],916565ca2d2c3038,1700e241849ea29857919b9799bd8047,...,"{""messages"": [[{""lc"": 1, ""type"": ""constructor""...","{'ls_provider': 'openai', 'ls_model_name': 'gp...",gpt-4o,"[{'message.role': 'system', 'message.content':...",313.0,"{""generations"": [[{""text"": """", ""generation_inf...",336.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7cb37d47e8905328,brainstorm_startup_ideas,TOOL,89c624339d8f86e1,2025-06-18 06:55:45.838170+00:00,2025-06-18 06:56:03.854497+00:00,OK,,[],7cb37d47e8905328,a32207bf41efd9b98a3d3b39a3eabf55,...,"{'interests': 'digital wellness', 'industry': ...",,,,,"{""output"": {""ideas"": ""Here are some creative a...",,,Generate creative startup ideas based on inter...,brainstorm_startup_ideas
a93aa9d780723e81,ChatOpenAI,LLM,89c624339d8f86e1,2025-06-18 06:56:03.870303+00:00,2025-06-18 06:56:12.584963+00:00,OK,,[],a93aa9d780723e81,a32207bf41efd9b98a3d3b39a3eabf55,...,"{""messages"": [[{""lc"": 1, ""type"": ""constructor""...","{'ls_provider': 'openai', 'ls_model_name': 'gp...",gpt-4o,"[{'message.role': 'system', 'message.content':...",1301.0,"{""generations"": [[{""text"": ""Here are some crea...",1855.0,0.0,,
89c624339d8f86e1,AgentExecutor,AGENT,,2025-06-18 06:55:44.457910+00:00,2025-06-18 06:56:12.588774+00:00,OK,,[],89c624339d8f86e1,a32207bf41efd9b98a3d3b39a3eabf55,...,I'm exploring the idea of a digital wellness p...,,,,,Here are some creative and actionable startup ...,,,,
ec48319cf1dde3d2,ChatOpenAI,LLM,1fc221bde5e341b1,2025-06-18 06:56:12.619406+00:00,2025-06-18 06:56:15.099709+00:00,OK,,[],ec48319cf1dde3d2,dce553128cbb1ae9ce18a191cc1f1bac,...,"{""messages"": [[{""lc"": 1, ""type"": ""constructor""...","{'ls_provider': 'openai', 'ls_model_name': 'gp...",gpt-4o,"[{'message.role': 'system', 'message.content':...",310.0,"{""generations"": [[{""text"": ""Let's start by ref...",409.0,0.0,,
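
Each agent run produces several spans that share a `context.trace_id`, so grouping spans by trace shows which tools a query triggered. The sketch below does this in plain Python on a few rows copied from the table above, keeping only the columns it needs. The helper `tools_per_trace` is a hypothetical name, and because the table is truncated, the second trace may contain spans that are not shown.

```python
from collections import defaultdict

# Rows taken from the span table above (trace id, span kind and name only).
spans = [
    {"trace_id": "ddeebbf9e61c4e6ffed5c0d0ff33bbb6", "span_kind": "LLM", "name": "ChatOpenAI"},
    {"trace_id": "ddeebbf9e61c4e6ffed5c0d0ff33bbb6", "span_kind": "TOOL", "name": "brainstorm_startup_ideas"},
    {"trace_id": "ddeebbf9e61c4e6ffed5c0d0ff33bbb6", "span_kind": "LLM", "name": "ChatOpenAI"},
    {"trace_id": "ddeebbf9e61c4e6ffed5c0d0ff33bbb6", "span_kind": "AGENT", "name": "AgentExecutor"},
    {"trace_id": "dce553128cbb1ae9ce18a191cc1f1bac", "span_kind": "LLM", "name": "ChatOpenAI"},
]


def tools_per_trace(rows: list[dict]) -> dict[str, list[str]]:
    """Map each trace id to the tool names called within it, in span order."""
    tools = defaultdict(list)
    for row in rows:
        tools[row["trace_id"]]  # make sure traces without tool calls still appear
        if row["span_kind"] == "TOOL":
            tools[row["trace_id"]].append(row["name"])
    return dict(tools)


summary = tools_per_trace(spans)
for trace_id, names in summary.items():
    print(trace_id, names or "no tool calls")
```

Traces with an empty tool list are the direct responses that the routing evaluation inspects.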


## Implementing the Three Evaluation Layers
### Routing Evaluation: Knowing When to Act

The first critical evaluation checks whether our agent properly routes requests: deciding when to use tools versus responding directly. This is fundamental because incorrect routing leads to either missed opportunities for action or inappropriate tool usage.

We evaluate this by taking the cases where the agent responded directly (without tool calls) and having GPT-4o judge whether a tool should have been used instead:

In [None]:
from phoenix.trace import SpanEvaluations
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

In [None]:
# Keep LLM spans that produced a direct answer (no function call).
evals_df = trace_df[trace_df['attributes.llm.function_call'].isnull() & trace_df['attributes.llm.input_messages'].notnull()]
# Select and rename on a new DataFrame to avoid pandas' SettingWithCopyWarning.
evals_df_trimmed = evals_df[['attributes.llm.input_messages', 'attributes.output.value']].rename(columns={
    'attributes.llm.input_messages': 'question',
    'attributes.output.value': 'response'
})

In [None]:
evals_df_trimmed

Unnamed: 0_level_0,question,response
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1
52fbfd6baeacce0b,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""Here are some crea..."
bdecff0e9809cf0b,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""I've generated a L..."
fa9527cbd8b0b986,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""Validating a start..."
b8703cdd6ea28caf,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""Creating a platfor..."
80f2cceeebe59004,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""To refine your ide..."
7c8e85768600d50b,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""Here are some crea..."
ddc43ddf179a7e99,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""The Lean Canvas fo..."
fc0fef311872a607,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""To validate your d..."
992a5777275119ba,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""I can definitely h..."
535472320933e3b8,"[{'message.role': 'system', 'message.content':...","{""generations"": [[{""text"": ""I've brainstormed ..."


The following prompt lets us use a model as a judge to determine whether a function should have been called.

In [None]:
ROUTER_TEMPLATE = ''' You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

Compare the Question above to the response. You must determine whether the response
should have instead made a function call to one of the functions listed below.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the agent should have made a function call instead of responding directly, or the response is not related to the question, or the response includes information that was not provided.
"correct" means no function call should have been made, and the response directly answers the question without making up other info.

Here is more information on each function:
brainstorm_startup_ideas: Generate creative startup ideas based on interests, industry, skills, and pain points.
validate_idea: Validate a startup idea using lean startup framework and provide validation steps.
generate_lean_canvas: Generate a Lean Canvas for a startup idea with detailed sections.
save_to_word_doc: Save text content to a Word document.
send_email: Send an email using SMTP.

'''

In [None]:
eval_model = OpenAIModel(model="gpt-4o")

In [None]:
# Rails constrain the judge's output to the labels the template expects.
# Stray text such as ",,," or "..." is stripped,
# so the binary value expected from the template is returned.
rails = ["incorrect", "correct"]
# A multi-class evaluation would use e.g. rails = ["irrelevant", "relevant", "semi-relevant"]
response_classifications = llm_classify(
    dataframe=evals_df_trimmed,
    template=ROUTER_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
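
To make the idea of rails concrete, here is a simplified sketch of snapping free-form judge output onto an allowed label. It is not Phoenix's implementation, and `snap_label` is a hypothetical helper. It matches whole words because "correct" is a substring of "incorrect", so naive substring checks would mislabel outputs.

```python
import re
from typing import Optional


def snap_label(raw_output: str, rails: list[str]) -> Optional[str]:
    """Map free-form judge output onto one allowed label, or None if ambiguous."""
    # Whole-word matching: "correct" must not match inside "incorrect".
    words = set(re.findall(r"[a-z_-]+", raw_output.lower()))
    matches = [rail for rail in rails if rail.lower() in words]
    # Accept the output only when exactly one rail is present.
    return matches[0] if len(matches) == 1 else None


rails = ["incorrect", "correct"]
for text in ["Correct.", "...incorrect", "correct or incorrect", "maybe"]:
    print(repr(text), "->", snap_label(text, rails))
```

Outputs that mention no rail, or more than one, come back as `None` so they can be reviewed by hand rather than silently counted.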

In [None]:
response_classifications

Unnamed: 0_level_0,label,explanation,exceptions,execution_status,execution_seconds
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
52fbfd6baeacce0b,correct,The response provided a detailed list of start...,[],COMPLETED,1.764512
bdecff0e9809cf0b,correct,The response provided a detailed Lean Canvas f...,[],COMPLETED,4.012474
fa9527cbd8b0b986,correct,The response directly addresses the user's que...,[],COMPLETED,2.639585
b8703cdd6ea28caf,correct,The response directly addresses the user's req...,[],COMPLETED,2.026621
80f2cceeebe59004,correct,The response directly addresses the user's req...,[],COMPLETED,1.600247
7c8e85768600d50b,correct,The response directly answers the user's reque...,[],COMPLETED,1.67047
ddc43ddf179a7e99,correct,The response correctly summarizes the actions ...,[],COMPLETED,1.898266
fc0fef311872a607,correct,The response directly addresses the user's req...,[],COMPLETED,2.586299
992a5777275119ba,correct,The user's request is for help in refining a c...,[],COMPLETED,2.441575
535472320933e3b8,correct,The response provided a detailed list of start...,[],COMPLETED,2.795124


In [None]:
px.Client().log_evaluations(
   SpanEvaluations(eval_name="Routing Eval", dataframe=response_classifications),
)

The judge labeled all ten direct responses as correct, so it found no case in this sample where the agent should have called a tool but did not.
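
Reading labels row by row does not scale past a handful of examples, so it helps to aggregate them. The sketch below summarizes a list of judge labels in plain Python. `summarize` is a hypothetical helper, and the input mirrors the ten labels in the table above.

```python
from collections import Counter


def summarize(labels: list[str]) -> dict:
    """Count each label and compute the share judged "correct"."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "counts": dict(counts),
        "correct_rate": counts["correct"] / total if total else 0.0,
    }


labels = ["correct"] * 10  # the ten labels shown in the table above
report = summarize(labels)
print(report)
```

With the real results, the input would be `response_classifications["label"].tolist()`.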

To complement the LLM judge with a quantitative check, we used Ragas' ToolCallAccuracy metric, which evaluates whether the tool calls made in a conversation match a reference list of expected calls.

In [None]:
!pip install ragas
!pip install openai

In [None]:
import os
import openai

In [None]:
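# OPENAI_API_KEY was set at the top of the notebook; assigning "" here would overwrite it.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this section"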

In [None]:
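from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# Illustrative conversation (an assumption: the original cell did not define it).
# In practice, rebuild these messages from the traced agent runs.
conversation = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(
        content="Let me check the current weather in New York.",
        tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
    ),
    HumanMessage(content="Can you convert 75 degrees Fahrenheit to Celsius?"),
    AIMessage(
        content="Converting the temperature now.",
        tool_calls=[ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})],
    ),
]

sample = MultiTurnSample(
    user_input=conversation,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
    ],
)

# ToolCallAccuracy compares tool names and arguments against the reference,
# so no judge LLM needs to be attached to the scorer.
scorer = ToolCallAccuracy()
await scorer.multi_turn_ascore(sample)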


1.0

A score of 1.0 means every tool call in the evaluated sample matched the reference calls, in both tool name and arguments.
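
To build intuition for what a tool-call accuracy score measures, here is a deliberately simplified stand-in. It is not Ragas' implementation: it compares predicted and reference calls position by position and counts exact matches on name and arguments. The function name and the example arguments are assumptions for illustration.

```python
def tool_call_match_rate(predicted: list[dict], reference: list[dict]) -> float:
    """Fraction of reference calls matched, position by position, on name and args."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred["name"] == ref["name"] and pred["args"] == ref["args"]
    )
    return hits / len(reference)


# Hypothetical expected calls for a "brainstorm, then save" request.
reference = [
    {"name": "brainstorm_startup_ideas", "args": {"interests": "renewable energy"}},
    {"name": "save_to_word_doc", "args": {"content": "notes", "filename": "ideas.docx"}},
]
exact = [dict(call) for call in reference]
wrong_filename = [
    reference[0],
    {"name": "save_to_word_doc", "args": {"content": "notes", "filename": "other.docx"}},
]

print(tool_call_match_rate(exact, reference))
print(tool_call_match_rate(wrong_filename, reference))
```

The second case shows why a score below 1.0 does not necessarily mean the wrong tool was chosen: a single mismatched argument is enough to lose the point for that call.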

### Parameter Extraction: Precision in Execution

This evaluation layer examines whether the agent correctly extracts parameters from user queries to populate tool calls. Even with the right tool selected, incorrect parameters lead to useless outputs.

We validate this by comparing extracted parameters against the expected JSON schema for each tool:

In [None]:
# Keep only the LLM spans that produced a function call.
evals_df = trace_df[trace_df['attributes.llm.function_call'].notnull()]
# Select and rename on a new DataFrame to avoid pandas' SettingWithCopyWarning.
evals_df_trimmed = evals_df[['attributes.llm.input_messages', 'attributes.llm.function_call']].rename(columns={
    'attributes.llm.input_messages': 'question',
    'attributes.llm.function_call': 'generated_function'
})

In [None]:
json_function = """
functions = [
    {
        "name": "brainstorm_startup_ideas",
        "description": "Generate creative startup ideas based on interests, industry, skills, and pain points.",
        "parameters": {
            "type": "object",
            "properties": {
                "interests": {
                    "type": "string",
                    "description": "User's interests relevant to the startup."
                },
                "industry": {
                    "type": "string",
                    "description": "Industry context for the startup ideas.",
                    "default": ""
                },
                "skills": {
                    "type": "string",
                    "description": "Skills that the user can leverage.",
                    "default": ""
                },
                "pain_points": {
                    "type": "string",
                    "description": "Pain points or problems to solve.",
                    "default": ""
                }
            },
            "required": ["interests"]
        }
    },
    {
        "name": "validate_idea",
        "description": "Validate a startup idea using lean startup framework and provide validation steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "idea_description": {
                    "type": "string",
                    "description": "Description of the startup idea to validate."
                },
                "target_audience": {
                    "type": "string",
                    "description": "Target audience for the idea.",
                    "default": ""
                },
                "value_proposition": {
                    "type": "string",
                    "description": "Value proposition of the startup idea.",
                    "default": ""
                },
                "assumptions": {
                    "type": "string",
                    "description": "Key assumptions to be validated.",
                    "default": ""
                }
            },
            "required": ["idea_description"]
        }
    },
    {
        "name": "generate_lean_canvas",
        "description": "Generate a Lean Canvas for a startup idea with detailed sections.",
        "parameters": {
            "type": "object",
            "properties": {
                "idea": {
                    "type": "string",
                    "description": "The startup idea to create a Lean Canvas for."
                },
                "target_market": {
                    "type": "string",
                    "description": "Target market for the startup.",
                    "default": ""
                },
                "value_proposition": {
                    "type": "string",
                    "description": "Value proposition of the startup idea.",
                    "default": ""
                }
            },
            "required": ["idea"]
        }
    },
    {
        "name": "save_to_word_doc",
        "description": "Save text content to a Word document.",
        "parameters": {
            "type": "object",
            "properties": {
                "content": {
                    "type": "string",
                    "description": "Text content to save in the document."
                },
                "filename": {
                    "type": "string",
                    "description": "Name of the Word document file.",
                    "default": "business_canvas.docx"
                }
            },
            "required": ["content"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email using SMTP.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient_email": {
                    "type": "string",
                    "description": "Recipient email address."
                },
                "subject": {
                    "type": "string",
                    "description": "Subject of the email."
                },
                "body": {
                    "type": "string",
                    "description": "Body text of the email."
                },
                "sender_email": {
                    "type": "string",
                    "description": "Sender email address.",
                    "default": null
                },
                "sender_password": {
                    "type": "string",
                    "description": "Sender email password.",
                    "default": null
                }
            },
            "required": ["recipient_email", "subject", "body"]
        }
    }
]
"""


In [None]:
ENUM_CATEGORICAL_TEMPLATE_JSON = ''' You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Generated Function]: {generated_function}
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question.
You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"correct" means the function call parameters match the JSON below and provide only relevant information.

JSON describing each function:
{json_function}
'''

In [None]:
enum_eval_model = OpenAIModel(model="gpt-4o")

In [None]:
# Escape curly braces in json_function to prevent template parsing errors
escaped_json_function = json_function.replace("{", "{{").replace("}", "}}")

rails = ["incorrect", "correct"]

response_classifications_json = llm_classify(
    dataframe=evals_df_trimmed,
    template=ENUM_CATEGORICAL_TEMPLATE_JSON.replace("{json_function}", escaped_json_function),
    model=enum_eval_model,
    rails=rails,
    provide_explanation=True,
)
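To see why the escaping step matters, here is a minimal sketch that uses plain `str.format` as a stand-in for the template engine. That stand-in is an assumption made for illustration; the shortened schema and the question text are also made up. Without escaping, the schema's own braces are read as placeholders.

```python
# Illustrative only: plain str.format stands in for the template engine.
schema = '{"name": "send_email", "required": ["recipient_email"]}'
template = "Question: {question}\nSchema:\n{json_function}"

# Naive substitution: the braces inside the schema look like placeholders.
naive = template.replace("{json_function}", schema)
try:
    naive.format(question="Email my canvas to Sam")
    naive_failed = False
except (KeyError, IndexError, ValueError):
    naive_failed = True

# Doubling the braces marks them as literal characters.
escaped = schema.replace("{", "{{").replace("}", "}}")
safe = template.replace("{json_function}", escaped)
rendered = safe.format(question="Email my canvas to Sam")

print(naive_failed)
print(rendered)
```

After formatting, the doubled braces collapse back to single ones, so the judge sees the schema exactly as written.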


In [None]:
response_classifications_json

| context.span_id | label | explanation | exceptions | execution_status | execution_seconds |
|---|---|---|---|---|---|
| a6e06d6b4560b1df | correct | The generated function call correctly identifi... | [] | COMPLETED | 3.132282 |
| 916565ca2d2c3038 | correct | The generated function 'validate_idea' correct... | [] | COMPLETED | 4.148727 |
| 22edf146aef166e8 | correct | The generated function call 'generate_lean_can... | [] | COMPLETED | 3.205135 |
| e80c6f5eb7dd50a1 | incorrect | The generated function call includes parameter... | [] | COMPLETED | 3.731467 |
| 0446fcd382dd494d | correct | The generated function 'generate_lean_canvas' ... | [] | COMPLETED | 3.125785 |
| 8e8f7956c7e7a794 | correct | The generated function call 'save_to_word_doc'... | [] | COMPLETED | 2.53938 |
| 7ab498b93809ce8a | correct | The generated function 'validate_idea' correct... | [] | COMPLETED | 3.949395 |
| 3c51c5c47c7f36d3 | incorrect | The generated function call only addresses the... | [] | COMPLETED | 2.310368 |
| 3e9a6833b4b4b226 | correct | The generated function call correctly identifi... | [] | COMPLETED | 5.315232 |
| 96d4dcec920e0274 | correct | The generated function call 'save_to_word_doc'... | [] | COMPLETED | 4.475209 |


In [None]:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Parameter Extraction JSON", dataframe=response_classifications_json),
)

This catches cases where the agent might miss key parameters (like forgetting the target audience in validation requests) or hallucinate nonexistent ones.
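A cheap structural pre-check can complement the LLM judge for exactly these two failure modes. The sketch below is a hypothetical helper, not part of Phoenix or of our agent: it compares the argument names of a call against a tool schema and reports required parameters that are missing and parameters the schema does not define. The example call and its `cc` argument are invented for illustration.

```python
import json

def check_parameters(call_arguments: dict, schema: dict) -> dict:
    """Hypothetical helper: compare argument names against a tool schema."""
    properties = schema["parameters"]["properties"]
    required = schema["parameters"].get("required", [])
    # Required parameters the agent failed to extract
    missing = [name for name in required if name not in call_arguments]
    # Parameters the agent supplied that the schema does not define
    unexpected = [name for name in call_arguments if name not in properties]
    return {"missing": missing, "unexpected": unexpected}

# Trimmed copy of the send_email schema shown earlier
send_email_schema = json.loads("""
{
    "name": "send_email",
    "parameters": {
        "type": "object",
        "properties": {
            "recipient_email": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"}
        },
        "required": ["recipient_email", "subject", "body"]
    }
}
""")

# Made-up call: "body" is missing and "cc" is not in the schema
call = {"recipient_email": "sam@example.com", "subject": "Lean Canvas", "cc": "pat@example.com"}
report = check_parameters(call, send_email_schema)
print(report)
```

A check like this only looks at parameter names. Whether the values actually match the user's question is still a job for the judge model.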

## Function Generation: Choosing the Right Tool
Once we confirm the agent should use a tool, we need to verify it selects the appropriate one. Our Startup Strategist has five distinct tools, each serving a specific purpose in the startup development lifecycle.

The evaluation compares the user's question against the generated function call to assess fit:



In [None]:
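# Keep only the spans where the LLM produced a function call
evals_df = trace_df[trace_df['attributes.llm.function_call'].notnull()]
# Copy the slice so the in-place rename below does not raise SettingWithCopyWarning
evals_df_trimmed = evals_df[['attributes.llm.input_messages', 'attributes.llm.function_call']].copy()
# Rename the columns to match the {question} and {generated_function} template variables
evals_df_trimmed.rename(columns={
    'attributes.llm.input_messages': 'question',
    'attributes.llm.function_call': 'generated_function'
}, inplace=True)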

In [None]:
CATEGORICAL_TEMPLATE = ''' You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Generated Function]: {generated_function}
    [END DATA]

Compare the Question above to the function call. You must determine whether the function call
will return the answer to the Question. Please focus on whether the very specific
question can be answered by the function call.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the function call will not provide an answer to the Question.
"correct" means the function call will definitely provide an answer to the Question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.

'''

In [None]:
eval_model = OpenAIModel(model="gpt-4o")

In [None]:
# Rails constrain the judge's output to a fixed set of labels.
# Stray characters such as ",,," or "..." are removed, so only the
# binary value expected by the template is returned.
rails = ["incorrect", "correct"]
# A multi-class eval would use, e.g., rails = ["irrelevant", "relevant", "semi-relevant"]
response_classifications = llm_classify(
    dataframe=evals_df_trimmed,
    template=CATEGORICAL_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
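To make the idea of rails concrete, here is a simplified stand-in written for illustration. It is not Phoenix's actual implementation, and the function name and fallback value are assumptions. It maps a raw judge response onto exactly one allowed label and falls back to a sentinel when the response is ambiguous or matches nothing.

```python
import re

def snap_to_rail_sketch(raw_output: str, rails: list, fallback: str = "NOT_PARSABLE") -> str:
    """Simplified illustration of rails; not the library's own logic."""
    text = raw_output.strip().lower()
    # Whole-word matching keeps "incorrect" from also counting as "correct"
    matches = [rail for rail in rails if re.search(rf"\b{re.escape(rail)}\b", text)]
    # Accept the response only when exactly one rail matches
    return matches[0] if len(matches) == 1 else fallback

example_rails = ["incorrect", "correct"]
print(snap_to_rail_sketch("correct...", example_rails))
print(snap_to_rail_sketch(" Incorrect,,,", example_rails))
print(snap_to_rail_sketch("it could be correct or incorrect", example_rails))
```

Treating ambiguous responses as unparsable, rather than guessing, keeps noisy judge output from silently inflating either label.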

In [None]:
response_classifications

| context.span_id | label | explanation | exceptions | execution_status | execution_seconds |
|---|---|---|---|---|---|
| a6e06d6b4560b1df | correct | The question asks for help in brainstorming an... | [] | COMPLETED | 3.830446 |
| 916565ca2d2c3038 | incorrect | The user's question involves exploring an idea... | [] | COMPLETED | 1.789323 |
| 22edf146aef166e8 | correct | The question asks to explore the idea of a sub... | [] | COMPLETED | 2.939164 |
| e80c6f5eb7dd50a1 | incorrect | The question asks for help in brainstorming st... | [] | COMPLETED | 2.356395 |
| 0446fcd382dd494d | incorrect | The question asks for assistance in creating a... | [] | COMPLETED | 3.46931 |
| 8e8f7956c7e7a794 | correct | The user asked for assistance in creating a le... | [] | COMPLETED | 3.020613 |
| 7ab498b93809ce8a | incorrect | The question asks for help in validating a dig... | [] | COMPLETED | 3.458879 |
| 3c51c5c47c7f36d3 | incorrect | The user's question involves two parts: brains... | [] | COMPLETED | 4.201804 |
| 3e9a6833b4b4b226 | incorrect | The question asks for help in brainstorming id... | [] | COMPLETED | 3.943129 |
| 96d4dcec920e0274 | correct | The user asked for help brainstorming ideas fo... | [] | COMPLETED | 3.523712 |


In [None]:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Function Calls", dataframe=response_classifications),
)

This test checks that requests for idea validation don't mistakenly trigger the Lean Canvas generator, and it flags multi-step requests where the generated call covers only part of what the user asked for.

## Analyzing Evaluation Results

The Phoenix dashboard visualizes our evaluation results across all three dimensions. The evals appear at an aggregate level, and at a span level under the "Spans" tab in Phoenix. More importantly, the dashboard lets us drill into specific failures to understand their root causes.

Common failure patterns we identified and addressed include:

- **Over-eager tool usage:** The agent sometimes reached for tools when a simple response would suffice. We adjusted the system prompt to emphasize when direct answers are preferred.

- **Parameter under-extraction:** Early versions often missed implicit parameters in user queries. We improved this by adding examples of parameter inference to the tool descriptions.

- **Multi-step confusion:** For complex requests involving multiple tools, the agent occasionally sequenced steps incorrectly. We added explicit chain-of-thought prompting to fix this.

These evaluations aren't one-time checks; they're part of an ongoing process. As we add new tools or modify existing ones, we rerun the evaluations to ensure no regressions occur.
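A rerun is only useful if it is compared against something. As a minimal sketch, the snippet below computes the share of "correct" labels from the two result tables shown earlier and flags any eval that drops below a stored baseline. The baseline scores, the tolerance, and the helper names are hypothetical values chosen for illustration.

```python
from collections import Counter

def accuracy(labels: list) -> float:
    """Share of spans the judge labeled "correct"."""
    return Counter(labels)["correct"] / len(labels)

def find_regressions(current: dict, baseline: dict, tolerance: float = 0.05) -> list:
    """Return eval names whose score fell more than `tolerance` below baseline."""
    return [name for name, score in current.items()
            if score < baseline.get(name, 0.0) - tolerance]

# Labels copied from the result tables above, in row order
parameter_extraction = ["correct", "correct", "correct", "incorrect", "correct",
                        "correct", "correct", "incorrect", "correct", "correct"]
function_calls = ["correct", "incorrect", "correct", "incorrect", "incorrect",
                  "correct", "incorrect", "incorrect", "incorrect", "correct"]

current_scores = {
    "Parameter Extraction JSON": accuracy(parameter_extraction),  # 0.8
    "Function Calls": accuracy(function_calls),                   # 0.4
}
# Hypothetical baseline from an earlier run, for illustration only
baseline_scores = {"Parameter Extraction JSON": 0.8, "Function Calls": 0.6}

regressions = find_regressions(current_scores, baseline_scores)
print(current_scores)
print(regressions)
```

With ten spans per eval the scores are coarse, so a tolerance avoids flagging a single flipped label as a regression. A larger evaluation set would justify a tighter threshold.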
