### 🧠 Building & Evaluating Complex Agents with `crewai` and `flotorch-eval`

In this notebook, we'll walk through a complete example of evaluating agents using the **`flotorch-eval`** package across key metrics. These metrics help assess both agent quality and system performance.

---

#### 🔍 Evaluation Metrics

- **`AgentGoalAccuracy`**  
  Evaluates whether the agent successfully understood and achieved the user's goal.  
  - **Binary** (1 = goal achieved, 0 = not achieved)

- **`ToolCallAccuracy`**  
  Measures the correctness of tool usage by the agent, i.e., whether the agent called the right tools to complete a task.  
  - **Binary** (1 = relevant tools invoked, 0 = relevant tools not invoked)

- **`TrajectoryEvalWithLLM`**  
  Evaluates whether the trajectory (based on OpenTelemetry spans) is meaningful based on the steps taken by the agent.  
  - **Binary** (1 = meaningful, 0 = invalid)

- **`TrajectoryEvalWithLLMWithReference`**  
  Evaluates whether the trajectory is meaningful by comparing it with a reference trajectory.  
  - **Binary** (1 = meaningful, 0 = invalid)


- **`LatencyMetric`**  
  Measures agent latency: how fast the agent responds or completes tasks.  
  

- **`UsageMetric`**  
  Evaluates the cost-efficiency of the agent in terms of compute, tokens, or other usage dimensions.

---
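
Because four of the metrics above are binary, scores from several traces can be summarised as a pass rate per metric. The sketch below is plain Python and not part of `flotorch-eval`: the `pass_rates` helper and the `scores_by_metric` layout are assumptions made for illustration, and the scores would have to be collected from your own evaluation results.

```python
from typing import Dict, List


def pass_rates(scores_by_metric: Dict[str, List[int]]) -> Dict[str, float]:
    """Return the fraction of traces scored 1 for each binary metric."""
    rates: Dict[str, float] = {}
    for metric, scores in scores_by_metric.items():
        if not scores:
            continue  # no traces were evaluated for this metric
        if any(score not in (0, 1) for score in scores):
            raise ValueError(f"{metric}: binary metrics must be 0 or 1")
        rates[metric] = sum(scores) / len(scores)
    return rates


# Hypothetical scores gathered by hand from four evaluated traces.
example_scores = {
    "AgentGoalAccuracy": [1, 1, 0, 1],
    "ToolCallAccuracy": [1, 1, 1, 1],
}
print(pass_rates(example_scores))
```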


#### Setup and dependencies

In [None]:
!pip install numpy pandas langchain-aws -q
!pip install flotorch-eval crewai crewai-tools ddgs uv -q

In [None]:
# Import required libraries
import time
from typing import Dict, Any

from crewai.tools import tool
from crewai import Crew
from ddgs import DDGS


from flotorch_eval.agent_eval.core.client import FlotorchEvalClient
from flotorch.crewai.agent import FlotorchCrewAIAgent
from evaluation_utils import display_evaluation_results


##### Configure Flotorch Credentials

Set the Flotorch base URL and API key here so the notebook can connect to Flotorch for tracing and evaluation. `evaluation_llm_model` names the model that acts as the default LLM judge during evaluation.
These credentials allow you to:

- Access configured agents from the Flotorch console
- Enable tracing for your runs

In [None]:
FLOTORCH_GATEWAY_BASE_URL = ""
FLOTORCH_API_KEY = ""
evaluation_llm_model = ""

### Define the Evaluation Logic
This helper function manages the evaluation workflow.
It accepts the trace ID generated by the agent run and an optional reference trace, which can be provided either as a JSON object or as the ID of a previously created trace to use as a reference.

The function creates a client to retrieve the traces and evaluates them across different metrics.
By default, the trace is evaluated on all available metrics. To restrict the evaluation, pass a list of specific metrics to the `metrics` parameter of the `evaluate` method.

**Example:**

```python
metrics = [TrajectoryEvalWithLLM()]
# `traces` is the result of client.fetch_traces(trace_id)
results = await client.evaluate(trace=traces, metrics=metrics)
```

##### Reference Trajectory

A Reference Trajectory defines the ideal "golden path" for an agent's behavior. It outlines not just what the agent should do, but also why it does it, step by step. This structure is used to evaluate if the agent is reasoning and acting correctly.
You can write the reference yourself, based on how your agent is supposed to work. A reference trajectory is a single object with two key fields:

**`input`**: This is the initial prompt or question from the user that kicks off the entire process.

**`expected_steps`**: This is an ordered list where each item represents a single step in the agent's thought process and the resulting action.

Each step in the `expected_steps` list is an object with two parts: a thought and an action.

**`thought`**: A string representing the agent's internal reasoning or "inner monologue." It explains why the agent is about to take a specific action.

**action**: Each step must conclude with exactly one action, written as one of two keys on the step object (as in the sample below):

A. **`tool_call`**: The agent decides to use a tool. This object has two parts:

- **`name`**: The name of the function or tool to be executed.
- **`arguments`**: A dictionary of the parameters to pass to that tool.

B. **`final_response`**: The agent decides it has enough information to answer the user. This is a string containing the agent's final text reply, and it typically occurs as the very last step.

**Sample Trajectory**
```json
{
  "input": "What's the weather like in Mumbai?",
  "expected_steps": [
    {
      "thought": "The user is asking for the weather in a specific city. I should use the `get_weather` tool to find this information.",
      "tool_call": {
        "name": "get_weather",
        "arguments": {
          "city": "Mumbai"
        }
      }
    },
    {
      "thought": "I have successfully retrieved the weather information. Now I will formulate the final answer for the user.",
      "final_response": "The weather in Mumbai is currently warm and humid with a temperature of 31°C."
    }
  ]
}
```
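
Before handing a hand-written reference to the evaluator, it can help to check that it follows the structure described above. The following is a minimal sketch, not part of `flotorch-eval`: `validate_reference_trajectory` is a hypothetical helper, and accepting a JSON-encoded string for `arguments` (in addition to a dictionary) is an assumption added here for convenience.

```python
import json
from typing import Any, Dict, List


def validate_reference_trajectory(reference: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    if not isinstance(reference.get("input"), str):
        problems.append("'input' must be a string")
    steps = reference.get("expected_steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'expected_steps' must be a non-empty list")
        return problems
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: must be an object")
            continue
        if not isinstance(step.get("thought"), str):
            problems.append(f"step {i}: 'thought' must be a string")
        actions = [key for key in ("tool_call", "final_response") if key in step]
        if len(actions) != 1:
            problems.append(f"step {i}: needs exactly one of tool_call / final_response")
            continue
        if actions[0] == "tool_call":
            call = step["tool_call"]
            if not isinstance(call, dict) or not isinstance(call.get("name"), str):
                problems.append(f"step {i}: tool_call needs a string 'name'")
                continue
            args = call.get("arguments")
            if isinstance(args, str):  # assumption: JSON-encoded arguments are tolerated
                try:
                    args = json.loads(args)
                except json.JSONDecodeError:
                    args = None
            if not isinstance(args, dict):
                problems.append(f"step {i}: tool_call 'arguments' must be a dict")
        elif not isinstance(step["final_response"], str):
            problems.append(f"step {i}: 'final_response' must be a string")
    return problems


sample = {
    "input": "What's the weather like in Mumbai?",
    "expected_steps": [
        {"thought": "Look up the weather.",
         "tool_call": {"name": "get_weather", "arguments": {"city": "Mumbai"}}},
        {"thought": "Answer the user.", "final_response": "It is warm and humid."},
    ],
}
print(validate_reference_trajectory(sample))  # prints []
```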

In [None]:
async def evaluate_agent(trace_id: str, reference: Dict[str, Any]=None, reference_id: str=None):
    start_time = time.time()
    
    if reference and reference_id:
        raise ValueError("Provide either 'reference' or 'reference_id', not both.")

    client = FlotorchEvalClient(
        api_key=FLOTORCH_API_KEY,
        base_url=FLOTORCH_GATEWAY_BASE_URL,
        default_evaluator=evaluation_llm_model # Setting a default evaluator for all metrics that require an LLM.
        )

    traces = client.fetch_traces(trace_id)
    print(f"Traces: {traces}")
    results = await client.evaluate( # Metrics can be optionally provided as a list
        trace=traces,
        reference=reference,
        reference_trace_id=reference_id
    )
    
    display_evaluation_results(results)
    print(results.model_dump_json(indent=4))
    end_time = time.time()
    time_taken = round(end_time - start_time, 2)
    print(f"Time taken for evaluation: {time_taken} seconds")

##### Using Evaluators for Metrics

If a **`default_evaluator`** is set, that model will be used as the LLM judge for all metrics that require an LLM.

To evaluate a specific metric with a different model, define the metric individually and pass the desired model through its `llm` parameter. The model must be one that is configured in the Flotorch console. This overrides `default_evaluator` for that metric only.

Notes:

- If you provide metrics explicitly, only those metrics are evaluated.
- If no `default_evaluator` is set, you must provide an evaluator for every metric that requires one.

```python
metrics = [
    TrajectoryEvalWithLLM(
        llm="flotorch/gpt-4o"
    )
]
# `traces` is the result of client.fetch_traces(trace_id)
results = await client.evaluate(trace=traces, metrics=metrics)
```
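
The precedence described in this section (a per-metric `llm` wins, otherwise `default_evaluator` applies, otherwise there is no judge) can be written out as a small function. This is only an illustration of the rule as stated here, not the library's internal logic; `resolve_evaluator` and the model name `flotorch/default-judge` are hypothetical.

```python
from typing import Optional


def resolve_evaluator(metric_llm: Optional[str], default_evaluator: Optional[str]) -> str:
    """Pick the judge model for one LLM-based metric."""
    if metric_llm:
        # A model set on the metric overrides the client-wide default.
        return metric_llm
    if default_evaluator:
        # Otherwise fall back to the default evaluator.
        return default_evaluator
    raise ValueError("No evaluator: set default_evaluator or pass llm to the metric")


print(resolve_evaluator("flotorch/gpt-4o", "flotorch/default-judge"))
print(resolve_evaluator(None, "flotorch/default-judge"))
```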

### **Use Case: AWS Tech Agent**

Now, let's build, run, and evaluate our agent.

#### Build the Agent

Flotorch makes it simple to create agents that can be reused across different frameworks.
To set up an agent:

Go to the **`Flotorch console`** → **`Agent Builder`** → click **`Create Flotorch Agent`**.

Provide a name for your agent.

Configure the agent with:

- The model you want the agent to use

- Input and output structures

- Any MCP tools you want to integrate

- Agent details such as the system prompt and agent goal

Once configured, use the **same agent name here** to set up the **Flotorch CrewAI agent** and run it as a crew.

You can also define custom tools in addition to the MCP tools you may have already configured.

**`Tools`**:

- DuckDuckGoSearch: searches the web and retrieves information
- SalesforceIntegration: stands in for a Salesforce API call that fetches Salesforce data (the stub below returns a fixed string)

In [None]:
# Define the tools that will be used by the agents
@tool('DuckDuckGoSearch')
def search_tool(search_query: str):
    """Search the web for information on a given topic"""
    return DDGS().text(search_query, max_results=5)

@tool('SalesforceIntegration')
def salesforce_tool(soql_query: str):
    """Call Salesforce API to get data"""
    return "Salesforce Integration"

agent_name = "aws-tech-agent" # Name of the agent that you have set on the Flotorch console
agent_client = FlotorchCrewAIAgent(
    agent_name=agent_name,
    api_key=FLOTORCH_API_KEY,  # The Flotorch API key and base URL are optional here if they are set as environment variables
    base_url=FLOTORCH_GATEWAY_BASE_URL,
    custom_tools=[search_tool, salesforce_tool],
    tracer_config={
        "enabled": True,
        "sampling_rate": 1,
    },
)

agent = agent_client.get_agent()
task = agent_client.get_task()
crew = Crew(
    agents = [agent],
    tasks = [task],
    verbose = False
)

#### Run the Agent
Let's kick off the crew to perform its task on a topic. Flotorch Tracing will automatically capture the entire execution in the background.

In [None]:
query = "You must use the given tools to research and only then write about this topic"
result = crew.kickoff(inputs = {"query": query, "topic": "AWS CodeWhisperer"})

print(result)

#### Run the Evaluation
Finally, we can pass the generated trace ID to the `evaluate_agent` function.

We can also create a reference trajectory, which the **`TrajectoryEvalWithLLMWithReference`** metric compares the agent's trajectory against.
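
A reference can be typed out as a literal dictionary, or assembled from small helpers so that every step has the same shape. The sketch below takes the second route; `tool_step`, `final_step`, and `build_reference` are hypothetical helpers introduced for illustration and are not part of `flotorch-eval`. It passes `arguments` as a dictionary, as described in the Reference Trajectory section.

```python
import json
from typing import Any, Dict, List


def tool_step(thought: str, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """One step that ends in a tool call."""
    return {"thought": thought, "tool_call": {"name": name, "arguments": arguments}}


def final_step(thought: str, response: str) -> Dict[str, Any]:
    """One step that ends with the final reply to the user."""
    return {"thought": thought, "final_response": response}


def build_reference(user_input: str, steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap the user input and the ordered steps into one reference object."""
    return {"input": user_input, "expected_steps": steps}


reference = build_reference(
    "What's the weather like in Mumbai?",
    [
        tool_step("Use the weather tool for this city.", "get_weather", {"city": "Mumbai"}),
        final_step("Summarise the result for the user.", "It is warm and humid in Mumbai."),
    ],
)
print(json.dumps(reference, indent=2))
```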

In [None]:
REFERENCE_TRAJECTORY_OUTPUTS = {
    "input": "What is AWS Bedrock?",
    "expected_steps": [
        {
            "thought": "To answer this question about Amazon Bedrock, I first need to gather information about what Amazon Bedrock is. I will use the available tool to search for this information.",
            "tool_call": {
                "name": "Search the web for information on a given topic",
                "arguments": "{\"search_query\": \"Amazon Bedrock\"}"
            }
        },
        {
            "thought": "Now that I have the search results, I will synthesize the information to provide a comprehensive answer.",
            "final_response": "Based on the observation, I have learned that Amazon Bedrock is a fully managed service that makes it easy to use foundation models from third-party providers and Amazon. It allows users to build generative AI applications with a choice of foundation models from different AI companies, using a single API. Users can customize these models with their data, orchestrate multistep tasks, trace reasoning, and apply guardrails for responsible AI. Additionally, Amazon Bedrock enables the creation of generative AI workflows by connecting its features with other AWS services."
        }
    ]
}


In [None]:
async def main():
    trace_ids = agent_client.get_tracer_ids() 
    for trace_id in trace_ids:
        if trace_id:
            print(f"Evaluating trace id: {trace_id}")
            await evaluate_agent(
                trace_id=trace_id,
                reference=REFERENCE_TRAJECTORY_OUTPUTS
            )

await main()  
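
The bare `await main()` works in a notebook because Jupyter already runs an event loop. If this code is moved into a standalone script, the coroutine needs to be started explicitly, for example with `asyncio.run`. The sketch below uses a stand-in `main()` so that it runs on its own; in practice it would be the `main()` defined above.

```python
import asyncio


async def main() -> str:
    # Stand-in for the notebook's main(); the real one awaits evaluate_agent(...).
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    # A script has no running event loop, so asyncio.run creates one.
    print(asyncio.run(main()))
```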