### 🧠 Building & Evaluating Complex Agents with `CrewAI` and `flotorch-eval`

In this notebook, we'll walk through a complete example of evaluating agents using the **`flotorch-eval`** package across key metrics. These metrics help assess both agent quality and system performance.

---

#### 🔍 Evaluation Metrics

- **`AgentGoalAccuracyMetric`**  
  Evaluates whether the agent successfully understood and achieved the user's goal.  
  - **Binary** (1 = goal achieved, 0 = not achieved)

- **`ToolCallAccuracyMetric`**  
  Measures the correctness of tool usage by the agent—i.e., whether the agent called the right tools to complete a task.  
  - **Binary** (1 = relevant tools invoked, 0 = relevant tools not invoked)

- **`TrajectoryEvalWithLLMMetric`**  
  Evaluates whether the trajectory (based on OpenTelemetry spans) is meaningful, either with or without a reference trajectory.  
  - **Binary** (1 = meaningful, 0 = invalid)

- **`LatencyMetric`**  
  Measures agent latency: how quickly the agent responds or completes tasks.

- **`UsageMetric`**  
  Evaluates the cost-efficiency of the agent in terms of compute, tokens, or other usage dimensions.  
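
To make the binary scoring concrete, here is a minimal sketch of how the three 0/1 metrics above could be summarized for a single run. The score values and dictionary keys are made up for illustration and are not output from `flotorch-eval`.

```python
from typing import Dict

# Hypothetical scores for one agent run; the values are illustrative only.
binary_scores: Dict[str, int] = {
    "agent_goal_accuracy": 1,
    "tool_call_accuracy": 1,
    "trajectory_eval_with_llm": 0,
}


def pass_rate(scores: Dict[str, int]) -> float:
    """Return the fraction of binary metrics that scored 1."""
    if not scores:
        raise ValueError("no scores to summarize")
    return sum(scores.values()) / len(scores)


# Names of the metrics that scored 0, sorted for stable output.
failed = sorted(name for name, score in binary_scores.items() if score == 0)

print(f"pass rate: {pass_rate(binary_scores):.2f}")
print("failed metrics:", failed)
```

Latency and usage are not binary, so they are better reported alongside this summary than folded into it.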

---


#### Setup and dependencies

In [None]:
!pip install numpy pandas langchain-aws ragas openlit -q
!pip install flotorch-eval crewai duckduckgo-search uv -q

#### Configure Tracing with OpenLit
To evaluate our agent, we first need to record what it does. We'll use OpenLit to automatically create a detailed trace of the agent's execution, including every LLM call and tool usage.

We'll store these traces in memory for easy access during the evaluation phase.

In [None]:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Create an in-memory span exporter
memory_exporter = InMemorySpanExporter()
span_processor = SimpleSpanProcessor(memory_exporter)

# Set up the tracer provider and add the span processor
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(span_processor)
trace.set_tracer_provider(tracer_provider)

# Initialize OpenLit - this will automatically instrument CrewAI when it's imported
import openlit
openlit.init()

In [None]:
import sys, os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "../../..")))

# Import required libraries
from typing import List
import asyncio

from crewai.tools import tool
from crewai import LLM
from crewai import Agent, Task, Crew
from duckduckgo_search import DDGS
from langchain_aws import ChatBedrockConverse
from ragas.llms import LangchainLLMWrapper

from flotorch_eval.agent_eval.core.evaluator import Evaluator
from flotorch_eval.agent_eval.metrics.base import BaseMetric
from flotorch_eval.agent_eval.metrics.langchain_metrics import (
    TrajectoryEvalWithLLMMetric,
    TrajectoryEvalWithoutLLMMetric)
from flotorch_eval.agent_eval.metrics.ragas_metrics import (
    AgentGoalAccuracyMetric,
    ToolCallAccuracyMetric,)
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.base import MetricConfig
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric

### Evaluation Setup
Before we build the agent, let's set up the components needed for its evaluation.

#### Configure the LLM Judge
Some of our metrics require an LLM to "judge" the agent's output for quality and correctness.  
For this, we're choosing **`Amazon Nova Micro`**.

In [4]:
region = "us-east-1"
bedrock_model = ChatBedrockConverse(
    region_name=region,
    endpoint_url=f"https://bedrock-runtime.{region}.amazonaws.com",
    model_id="us.amazon.nova-micro-v1:0"
)

llm_judge = LangchainLLMWrapper(bedrock_model)

#### Define the Evaluation Logic
This helper function orchestrates the evaluation process. It takes the captured traces (spans) and the evaluation metrics, runs the evaluator, and displays the results.

In [5]:
from evaluation_utils import create_trajectory, initialize_evaluator, display_evaluation_results

async def evaluate_agent(metrics: List[BaseMetric], spans: List):
    """
    Converts captured spans into a trajectory, evaluates it with the given
    metrics, and displays the results.

    Args:
        metrics: A list of configured metrics for evaluation.
        spans: The finished OpenTelemetry spans captured during the agent run.

    Returns:
        The evaluation results, or None if no spans were provided.
    """

    # 1. Convert the captured spans into a trajectory
    if not spans:
        print("\nEvaluation failed: No spans were provided.")
        return

    trajectory = create_trajectory(spans)

    # 2. Initialize and run the evaluator
    evaluator = initialize_evaluator(metrics)
    print("\n--- Running Evaluation ---")
    results = await evaluator.evaluate(trajectory)

    # 3. Display results
    print("\n--- Evaluation Scores ---")
    display_evaluation_results(results)
    return results
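
The control flow of this helper can be tried without Bedrock access. The sketch below mirrors the same guard-then-evaluate pattern using a stub; `StubEvaluator` and `dry_run` are hypothetical names introduced here for illustration and are not part of `flotorch-eval` or `evaluation_utils`.

```python
import asyncio
from typing import Any, Dict, List, Optional


class StubEvaluator:
    """Hypothetical stand-in for the evaluator: returns a fixed score per metric name."""

    def __init__(self, metric_names: List[str]):
        self.metric_names = metric_names

    async def evaluate(self, trajectory: Any) -> Dict[str, float]:
        # A real evaluator would call an LLM judge here; the stub only echoes names.
        return {name: 1.0 for name in self.metric_names}


async def dry_run(metric_names: List[str], spans: List[Any]) -> Optional[Dict[str, float]]:
    # Same guard as evaluate_agent: nothing to evaluate without spans.
    if not spans:
        return None
    trajectory = list(spans)  # placeholder for the span-to-trajectory conversion
    return await StubEvaluator(metric_names).evaluate(trajectory)


empty_result = asyncio.run(dry_run(["goal_accuracy"], []))
full_result = asyncio.run(dry_run(["goal_accuracy", "latency"], ["span-1", "span-2"]))
print(empty_result)
print(full_result)
```

If the notebook kernel already has an event loop running, `await dry_run(...)` at the top level of a cell can be used in place of `asyncio.run(...)`.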

#### Use case: AWS Tech Agent

Now, let's build, run, and evaluate our agent.

#### Build the Agent
We'll define the agent's tools, its LLM, its role, and the task it needs to perform.

**`Tools`**: A DuckDuckGoSearch tool for looking up information. A placeholder Salesforce tool is also defined, but it is not attached to the agent.  
**`LLM`**: It will be powered by the `us.amazon.nova-pro-v1:0` model on Amazon Bedrock.  
**`Role`**: Its purpose is to be a Writer that simplifies GenAI concepts on AWS for beginners.

In [6]:
# Define the tools that will be used by the agents
@tool('DuckDuckGoSearch')
def search_tool(search_query: str):
    """Search the web for information on a given topic"""
    return DDGS().text(search_query, max_results=5)

@tool('SalesforceIntegration')
def salesforce_tool(soql_query: str):
    """Call Salesforce API to get data"""
    return "Salesforce Integration"

# Set up the model that the agent will use
model = LLM(
    # model="sagemaker/INSERT ENDPOINT NAME",
    model="bedrock/us.amazon.nova-pro-v1:0",
    temperature=0.7, max_tokens=4*1024,
)

writer = Agent(
        role="Writer",
        goal="You make GenAI concepts understandable for newbies exploring GenAI on AWS",
        backstory="You're an expert in writing crisp summaries about GenAI on AWS.",
        tools=[search_tool],
        llm=model
    )

task = Task(description=("What is {topic}?"),
            expected_output=("Compose a short summary that includes the answer."),
            agent=writer)

crew = Crew(
  agents=[writer],
  tasks=[task],
  share_crew=False
)

#### Run the Agent
Let's kick off the crew to perform its task with a topic. OpenLit will automatically capture the entire execution in the background.

In [None]:
result = crew.kickoff({"topic": "AWS Bedrock"})
print(result)
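
The `{topic}` placeholder in the task description is filled from the dictionary passed to `kickoff`. CrewAI performs this interpolation internally; the sketch below only illustrates the effect using plain `str.format` and is not CrewAI's implementation.

```python
# The task description template and inputs used in this notebook.
task_description = "What is {topic}?"
inputs = {"topic": "AWS Bedrock"}

# Illustrates the substitution with str.format (an assumption about the effect,
# not a copy of CrewAI's own interpolation code).
rendered = task_description.format(**inputs)
print(rendered)
```

The rendered prompt matches the user message that opens the reference trajectory later in the notebook.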

In [None]:
spans = memory_exporter.get_finished_spans()
print("Number of spans:", len(spans))
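
Each finished OpenTelemetry span carries a `name` plus `start_time` and `end_time` in nanoseconds, which is enough to reason about latency by hand. The sketch below uses stand-in objects with made-up span names and timestamps so that it runs without an agent; it is one way to read timings from spans and not necessarily how `LatencyMetric` computes its value.

```python
from types import SimpleNamespace

# Stand-ins for finished spans. Names and timestamps are hypothetical;
# real spans expose the same three attributes.
fake_spans = [
    SimpleNamespace(name="crew.kickoff", start_time=1_000_000_000, end_time=4_500_000_000),
    SimpleNamespace(name="llm.call", start_time=1_200_000_000, end_time=2_700_000_000),
    SimpleNamespace(name="tool.call", start_time=2_800_000_000, end_time=3_300_000_000),
]


def span_duration_ms(span) -> float:
    """Duration of a single span in milliseconds."""
    return (span.end_time - span.start_time) / 1_000_000


def wall_clock_ms(spans) -> float:
    """Elapsed time from the earliest span start to the latest span end."""
    earliest = min(s.start_time for s in spans)
    latest = max(s.end_time for s in spans)
    return (latest - earliest) / 1_000_000


durations = {s.name: span_duration_ms(s) for s in fake_spans}
total_ms = wall_clock_ms(fake_spans)
print(durations)
print("wall clock (ms):", total_ms)
```

Summing per-span durations would double count nested spans, which is why the sketch uses the earliest start and latest end for the overall figure.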

Evaluate the agent where no ground-truth answers or trajectories are provided. Instead, rely solely on an LLM-based evaluator to assess the agent's performance.

In [None]:
metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(llm=llm_judge),
    TrajectoryEvalWithLLMMetric(llm=bedrock_model),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-east-1"}
    )),
    LatencyMetric()
    ]

# Evaluate trajectory
async def evaluate_dataframe_agent():
    return await evaluate_agent(metrics, spans)

# Execute the evaluation
results = asyncio.run(evaluate_dataframe_agent())

#### Define Ground Truth for Evaluation
Evaluate the agent using reference data, including the expected agent response and trajectory. Use an LLM to compare the agent's output against this reference to assess performance.

**`A reference answer`**: The ideal final output we expect.  
**`A reference trajectory`**: The ideal sequence of thoughts and tool calls the agent should have taken.

In [8]:
REFERENCE_FINAL_ANSWER="Amazon Bedrock is a fully managed service that makes it easy to use foundation models from third-party providers and Amazon. It allows users to build generative AI applications with a choice of foundation models from different AI companies, using a single API. Users can customize these models with their data, orchestrate multistep tasks, trace reasoning, and apply guardrails for responsible AI. Additionally, Amazon Bedrock enables the creation of generative AI workflows by connecting its features with other AWS services."

REFERENCE_TRAJECTORY_OUTPUTS = [
                {"role": "user", "content": "What is AWS Bedrock?"},
                {
                    "role": "assistant",
                    "content": "To compose a short summary about Amazon Bedrock, I first need to gather information about what Amazon Bedrock is. I will use the available tool to search for this information.",
                    "tool_calls": [
                        {
                            "function": {
                                "name": "Search the web for information on a given topic",
                                "arguments": "{\"search_query\": \"Amazon Bedrock\"}"
                            }
                        }
                    ]
                },
                {"role": "tool", "content": "{\"searchParameters\": {\"q\": \"Amazon Bedrock\", \"type\": \"search\", \"num\": 5, \"engine\": \"google\"}, \"organic\": [{\"title\": \"Amazon Bedrock - Generative AI - AWS\", \"link\": \"https://aws.amazon.com/bedrock/\", \"snippet\": \"Amazon Bedrock Data Automation streamlines the generation of valuable insights from unstructured multimodal content such as documents, images, audio, and videos ...\", \"position\": 1, \"sitelinks\": [{\"title\": \"Amazon Bedrock\", \"link\": \"https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html\"}, {\"title\": \"Amazon Bedrock Pricing\", \"link\": \"https://aws.amazon.com/bedrock/pricing/\"}, {\"title\": \"Amazon Bedrock Documentation\", \"link\": \"https://docs.aws.amazon.com/bedrock/\"}, {\"title\": \"Amazon Bedrock FAQs\", \"link\": \"https://aws.amazon.com/bedrock/faqs/\"}, {\"title\": \"Amazon Bedrock Agents\", \"link\": \"https://aws.amazon.com/bedrock/agents/\"}]}, {\"title\": \"Getting Started with Amazon Bedrock - AWS\", \"link\": \"https://aws.amazon.com/awstv/watch/6ff4cd6fa97/\", \"snippet\": \"So check the region that you're currently in, make sure it's a region that's supported by Bedrock. Then I'm gonna scroll to the bottom of this ...\", \"position\": 2}], \"relatedSearches\": [{\"query\": \"Amazon Bedrock pricing\"}, {\"query\": \"Amazon Bedrock documentation\"}, {\"query\": \"Amazon Bedrock Claude\"}, {\"query\": \"Amazon Bedrock logo\"}, {\"query\": \"Amazon Bedrock DeepSeek\"}], \"credits\": 1}"},
                {"role": "assistant", "content": "Based on the observation, I have learned that Amazon Bedrock is a fully managed service that makes it easy to use foundation models from third-party providers and Amazon. It allows users to build generative AI applications with a choice of foundation models from different AI companies, using a single API. Users can customize these models with their data, orchestrate multistep tasks, trace reasoning, and apply guardrails for responsible AI. Additionally, Amazon Bedrock enables the creation of generative AI workflows by connecting its features with other AWS services."}
            ]
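
A malformed reference trajectory can skew the comparison, so a quick structural check before evaluation may help. The sketch below assumes the message shape used above (a `role`, a string `content`, and optional `tool_calls` whose `arguments` field is a JSON string); `validate_trajectory` is a hypothetical helper and not part of `flotorch-eval`.

```python
import json
from typing import Dict, List

# Roles that appear in the reference trajectory above (assumed to be the full set).
ALLOWED_ROLES = {"user", "assistant", "tool"}


def validate_trajectory(messages: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        for call in message.get("tool_calls", []):
            try:
                # Tool call arguments are stored as a JSON-encoded string.
                json.loads(call["function"]["arguments"])
            except (KeyError, TypeError, json.JSONDecodeError):
                problems.append(f"message {i}: tool call arguments are not valid JSON")
    return problems


sample = [
    {"role": "user", "content": "What is AWS Bedrock?"},
    {"role": "assistant", "content": "I will search.", "tool_calls": [
        {"function": {"name": "search", "arguments": "{\"search_query\": \"Amazon Bedrock\"}"}}
    ]},
]
broken = [
    {"role": "assistant", "content": "oops", "tool_calls": [
        {"function": {"name": "search", "arguments": "{not json"}}
    ]},
]

print(validate_trajectory(sample))
print(validate_trajectory(broken))
```

Calling `validate_trajectory(REFERENCE_TRAJECTORY_OUTPUTS)` in the notebook would apply the same check to the reference defined above.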

#### Run the Evaluation
Finally, we retrieve the captured traces (spans), configure our evaluation metrics, and run the `evaluate_agent` function.

In [None]:
spans = memory_exporter.get_finished_spans()
print("Number of spans:", len(spans))

In [None]:
metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(llm=llm_judge, config=MetricConfig(
        metric_params={
            "reference_answer": REFERENCE_FINAL_ANSWER
        }
    )),
    TrajectoryEvalWithLLMMetric(llm=bedrock_model, config=MetricConfig(
        metric_params={
            "reference_outputs": REFERENCE_TRAJECTORY_OUTPUTS
        }
    )),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-east-1"}
    )),
    LatencyMetric()
    ]

# Evaluate trajectory
async def evaluate_dataframe_agent():
    return await evaluate_agent(metrics, spans)

# Execute the evaluation
results = asyncio.run(evaluate_dataframe_agent())