### 🧠 Building & Evaluating Complex Agents with `strands` and `flotorch-eval`

In this notebook, we'll walk through complete examples of building agents with `strands` and evaluating them with the **`flotorch-eval`** package. The metrics below assess both agent quality and system performance.

---

#### 🔍 Evaluation Metrics

- **`AgentGoalAccuracyMetric`**  
  Evaluates whether the agent successfully understood and achieved the user's goal.  
  - **Binary** (1 = goal achieved, 0 = not achieved)

- **`ToolCallAccuracyMetric`**  
  Measures the correctness of tool usage by the agent—i.e., whether the agent called the right tools to complete a task.  
  - **Binary** (1 = relevant tools invoked, 0 = relevant tools not invoked)

- **`TrajectoryEvalWithLLMMetric`**  
  Evaluates whether the trajectory (based on OpenTelemetry spans) is meaningful, either with or without a reference trajectory.  
  - **Binary** (1 = meaningful, 0 = invalid)

- **`LatencyMetric`**  
  Measures agent latency—how fast the agent responds or completes tasks.  
  

- **`UsageMetric`**  
  Evaluates the cost-efficiency of the agent in terms of compute, tokens, or other usage dimensions.  
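Latency and usage are computed from the captured spans rather than judged by an LLM. `flotorch-eval`'s own implementation is not shown in this notebook, so the following is only a minimal sketch of the idea. It assumes a hypothetical, simplified `SpanRecord` class and hypothetical attribute keys (`input_tokens`, `output_tokens`). Latency is the time from the earliest span start to the latest span end, and usage is a sum of per-span token counts.

```python
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical, simplified view of a finished OpenTelemetry span."""
    name: str
    start_time_ns: int  # OpenTelemetry timestamps are integer nanoseconds
    end_time_ns: int
    attributes: dict = field(default_factory=dict)


def total_latency_seconds(spans):
    """Wall-clock time from the earliest span start to the latest span end."""
    start = min(s.start_time_ns for s in spans)
    end = max(s.end_time_ns for s in spans)
    return (end - start) / 1e9


def total_tokens(spans, input_key="input_tokens", output_key="output_tokens"):
    """Sum token counts stored on span attributes (attribute keys are assumptions)."""
    totals = {"input": 0, "output": 0}
    for s in spans:
        totals["input"] += s.attributes.get(input_key, 0)
        totals["output"] += s.attributes.get(output_key, 0)
    return totals


spans = [
    SpanRecord("agent", 0, 3_000_000_000),
    SpanRecord("llm_call_1", 100_000_000, 1_200_000_000,
               {"input_tokens": 500, "output_tokens": 120}),
    SpanRecord("llm_call_2", 1_500_000_000, 2_900_000_000,
               {"input_tokens": 700, "output_tokens": 300}),
]
print(total_latency_seconds(spans))  # 3.0
print(total_tokens(spans))           # {'input': 1200, 'output': 420}
```

A cost estimate would multiply these totals by per-token prices for the model and region, which is presumably why `UsageMetric` is configured with an `aws_region` below.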

---


#### Setup and dependencies

Install `flotorch-eval`, `strands-agents`, and the dependencies used to build Python agents and tools that execute and assess Python code.

Note: You may see warnings from pip's dependency resolver; you can ignore them.

In [None]:
# Install dependencies
!pip install numpy pandas langchain-aws ragas jinja2 -q
!pip install flotorch-eval strands-agents strands-agents-tools strands-agents-builder nest_asyncio agentevals uv -q

In [2]:
# Optionally install additional libraries needed for the use cases.
# If they are missing, the LLM will identify them and try to install them in the Python REPL environment.
!pip install yfinance matplotlib seaborn beautifulsoup4 requests pandas numpy scikit-learn joblib -q

In [None]:
# Import libraries

import asyncio
from typing import List

import nest_asyncio
import pandas as pd
from langchain_aws import ChatBedrockConverse
from ragas.llms import LangchainLLMWrapper
from strands import Agent, tool
from strands.models.bedrock import BedrockModel
from strands.telemetry.tracer import get_tracer

from flotorch_eval.agent_eval.metrics.base import BaseMetric, MetricConfig
from flotorch_eval.agent_eval.metrics.langchain_metrics import TrajectoryEvalWithLLMMetric
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.ragas_metrics import (
    AgentGoalAccuracyMetric,
    ToolCallAccuracyMetric,
)
from flotorch_eval.agent_eval.metrics.tool_accuracy import ToolAccuracyMetric
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric

# Jupyter already runs an event loop; allow the asyncio.run(...) calls below to nest inside it
nest_asyncio.apply()

#### Setting Up Tracing and Evaluation
To evaluate an agent, we first need to know what it did. We'll use OpenTelemetry to trace the agent's execution. The code below creates a custom `QueueSpanExporter` (imported from the local `opentel_utils` module) and attaches it to the Strands tracer, so every step (span) of the agent's run is stored in a queue for later analysis.
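The `QueueSpanExporter` in `opentel_utils` is not shown in this notebook. As a rough illustration only, here is a hypothetical, dependency-free stand-in (`InMemorySpanQueue`) that mirrors the exporter methods a span processor calls (`export`, `force_flush`, `shutdown`). A real exporter would subclass `opentelemetry.sdk.trace.export.SpanExporter` and return `SpanExportResult.SUCCESS` from `export()`; this sketch returns `True` to stay self-contained.

```python
import queue


class InMemorySpanQueue:
    """Hypothetical stand-in for QueueSpanExporter: collects exported spans FIFO."""

    def __init__(self):
        self._queue = queue.Queue()  # thread-safe, since batch processors export on a worker thread

    def export(self, spans):
        # Called by a span processor with a batch of finished spans
        for span in spans:
            self._queue.put(span)
        return True

    def drain(self):
        """Remove and return every span collected so far, oldest first."""
        drained = []
        while True:
            try:
                drained.append(self._queue.get_nowait())
            except queue.Empty:
                return drained

    def clear(self):
        self.drain()

    def force_flush(self, timeout_millis=30000):
        return True  # nothing is buffered outside the queue

    def shutdown(self):
        self.clear()


exporter = InMemorySpanQueue()
exporter.export(["span-a", "span-b"])
exporter.export(["span-c"])
print(exporter.drain())  # ['span-a', 'span-b', 'span-c']
print(exporter.drain())  # []
```

Draining empties the queue, which is why the notebook clears the exporter between use cases so spans from one agent run do not leak into the next evaluation.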

In [4]:
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentel_utils import QueueSpanExporter

# Create the queue exporter that will collect every finished span
queue_exporter = QueueSpanExporter()

# Get the Strands tracer, keeping console export for visibility
tracer = get_tracer(
    enable_console_export=True
)

# Attach the queue exporter to the Strands tracer provider so spans are captured alongside console output
tracer.tracer_provider.add_span_processor(BatchSpanProcessor(queue_exporter))

In [5]:
# Clear the exporter queue for any residue
queue_exporter.clear()

Finally, this asynchronous function ties everything together. It takes a list of metrics, the spans captured from an agent run, and an output path. It converts the spans into a trajectory, saves that trajectory to JSON, runs the evaluator, and displays the scores.
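The evaluator returned by `initialize_evaluator` is not shown, so how it schedules metrics is unknown here. The following hypothetical sketch shows one reasonable design: run every metric concurrently over the same trajectory with `asyncio.gather`, and record a failing metric as `None` instead of aborting the whole run. `ConstantMetric`, `BrokenMetric`, and `run_metrics` are illustrative names, not part of `flotorch-eval`.

```python
import asyncio


class ConstantMetric:
    """Hypothetical metric: returns a fixed score after a simulated delay."""

    def __init__(self, name, score, delay=0.01):
        self.name = name
        self.score = score
        self.delay = delay

    async def compute(self, trajectory):
        await asyncio.sleep(self.delay)  # stands in for an LLM-judge or pricing call
        return self.score


class BrokenMetric(ConstantMetric):
    """Hypothetical metric whose backing service fails."""

    async def compute(self, trajectory):
        raise RuntimeError("judge unavailable")


async def run_metrics(metrics, trajectory):
    """Run all metrics concurrently; a failed metric is recorded as None."""
    outcomes = await asyncio.gather(
        *(m.compute(trajectory) for m in metrics), return_exceptions=True
    )
    return {
        metric.name: None if isinstance(outcome, Exception) else outcome
        for metric, outcome in zip(metrics, outcomes)
    }


metrics = [
    ConstantMetric("goal_accuracy", 1.0),
    ConstantMetric("tool_call_accuracy", 0.0),
    BrokenMetric("broken", None),
]
print(asyncio.run(run_metrics(metrics, trajectory={"messages": []})))
# {'goal_accuracy': 1.0, 'tool_call_accuracy': 0.0, 'broken': None}
```

Run as a standalone script; inside a running Jupyter loop you would `await run_metrics(...)` directly (or apply `nest_asyncio` first).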

In [6]:
from typing import List
from evaluation_utils import get_all_spans, create_trajectory, initialize_evaluator, display_evaluation_results, save_trajectory_to_json

async def evaluate_agent(metrics: List[BaseMetric], spans: List, output_path: str):
    """
    Builds a trajectory from an agent's captured spans, saves it to JSON,
    evaluates it with the given metrics, and displays the results.

    Args:
        metrics: A list of configured metrics for evaluation.
        spans: List of OpenTelemetry spans from the agent's execution.
        output_path: Path of the JSON file the trajectory is saved to.

    Returns:
        The evaluation results produced by the evaluator.

    Raises:
        ValueError: If no metrics or no spans are provided.
        RuntimeError: If trajectory creation or evaluation fails.
    """
    # Input validation
    if not metrics:
        raise ValueError("No metrics provided for evaluation")
    if not spans:
        raise ValueError("No spans provided for evaluation")

    print(f"Starting agent evaluation with {len(metrics)} metrics")

    # Create the trajectory from spans and save it for inspection
    try:
        trajectory = create_trajectory(spans)
        print("Successfully created trajectory from spans")
        save_trajectory_to_json(trajectory, output_path)
    except Exception as e:
        print(f"Failed to create trajectory: {e}")
        raise RuntimeError(f"Trajectory creation failed: {e}") from e

    # Initialize and run the evaluator
    try:
        evaluator = initialize_evaluator(metrics)
        print("\n--- Running Evaluation ---")
        results = await evaluator.evaluate(trajectory)
    except Exception as e:
        print(f"Evaluation failed: {e}")
        raise RuntimeError(f"Evaluation process failed: {e}") from e

    # Display and return results
    print("\n--- Evaluation Scores ---")
    display_evaluation_results(results)
    print("Evaluation completed successfully")
    return results

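`Claude 3.7 Sonnet` on Amazon Bedrock is designated as the LLM judge for evaluating agent performance. `TrajectoryEvalWithLLMMetric` receives the LangChain chat model directly, while the ragas-based `AgentGoalAccuracyMetric` receives it wrapped in `LangchainLLMWrapper`.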
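The ragas and LangChain-based metrics ship their own judge prompts, which this notebook does not show. To illustrate the general pattern behind a binary LLM-judge score, here is a hypothetical sketch. It builds a prompt that asks for an explicit `VERDICT: 1` or `VERDICT: 0`, then parses that verdict and returns `None` when the judge's output is malformed. The prompt wording and the `VERDICT` convention are assumptions.

```python
import re


def build_judge_prompt(goal, final_answer):
    """Hypothetical prompt asking the judge for a single binary verdict."""
    return (
        "You are evaluating an AI agent.\n"
        f"User goal:\n{goal}\n\n"
        f"Agent final answer:\n{final_answer}\n\n"
        "Reply with VERDICT: 1 if the goal was achieved, otherwise VERDICT: 0."
    )


def parse_binary_verdict(judge_text):
    """Extract the 0/1 verdict; return None when the judge output is malformed."""
    match = re.search(r"VERDICT:\s*([01])\b", judge_text)
    return int(match.group(1)) if match else None


print(parse_binary_verdict("The script runs and meets every requirement. VERDICT: 1"))  # 1
print(parse_binary_verdict("I am not sure."))  # None
```

Returning `None` rather than defaulting to 0 keeps a judge failure distinguishable from a genuine "goal not achieved".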

In [7]:
# Initialize the LLM judge
bedrock_sonnet = ChatBedrockConverse(
    region_name="us-west-2",
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0"
)
llm_judge = LangchainLLMWrapper(bedrock_sonnet)

#### Use Case 1: DataFrame manipulation agent

In [None]:
dataframe_agent = Agent(model="us.amazon.nova-pro-v1:0")
dataframe_prompt = """
Write a Python script using the pandas library that performs the following tasks:

- Create a sample DataFrame with the columns: 'Name', 'Age', and 'Salary'.
- Add a new column named 'Bonus' that is 10% of the corresponding 'Salary' value.
- Filter the DataFrame to include only rows where the 'Age' is greater than 30.
- Group the data by age brackets (e.g., 20s, 30s, 40s) and calculate the average 'Salary' and 'Bonus' for each group.

Execute the python script and show the output.

Requirements:
- Include clear inline comments to explain the logic.
- Add a docstring for the function, describing its purpose, parameters, and return value.
- Provide an example of how to use the function.
- List any external libraries that need to be installed with pip (if any).
- Include brief documentation describing how the code works and how to run it.
"""

response = dataframe_agent(dataframe_prompt)
print("\n--- Agent Final Response ---")
print(response.message)

In [9]:
# Retrieve all spans from queue exporter
dataframe_agent_spans = get_all_spans(queue_exporter)

First, evaluate the agent without any ground-truth answer or reference trajectory, relying solely on the LLM judge to assess its performance.

In [None]:
dataframe_agent_eval_metrics = [
    ToolCallAccuracyMetric(),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_sonnet
    ),
    AgentGoalAccuracyMetric(
        llm=llm_judge
    ),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-west-2"}
    )),
    LatencyMetric()
]

async def evaluate_dataframe_agent():
    return await evaluate_agent(dataframe_agent_eval_metrics, dataframe_agent_spans, "dataframe_agent_trajectory.json")

# Execute the evaluation
dataframe_results = asyncio.run(evaluate_dataframe_agent())

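Next, evaluate the agent against reference data: an expected final answer (passed to `AgentGoalAccuracyMetric` as `reference_answer`) and an expected trajectory (passed to `TrajectoryEvalWithLLMMetric` as `reference_outputs`). The LLM judge compares the agent's output with these references.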
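The reference trajectories in this notebook are lists of chat messages with a `role`, string `content`, and, for assistant turns, optional `tool_calls` whose `function` carries a `name` and JSON-encoded `arguments`. A quick structural check can catch typos before a reference reaches the judge. The following is a hypothetical helper that validates the shape used here; it is not a schema that `flotorch-eval` is known to enforce.

```python
import json

ALLOWED_ROLES = {"user", "assistant", "tool"}


def validate_reference_trajectory(messages):
    """Return a list of problems in a reference trajectory (empty list means OK)."""
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        for j, call in enumerate(msg.get("tool_calls", [])):
            fn = call.get("function", {})
            if not fn.get("name"):
                problems.append(f"message {i}, tool call {j}: missing function name")
            try:
                json.loads(fn.get("arguments", "{}"))
            except (json.JSONDecodeError, TypeError):
                problems.append(f"message {i}, tool call {j}: arguments are not a JSON string")
    return problems


sample = [
    {"role": "user", "content": "Plan my budget"},
    {"role": "assistant", "content": "Consulting the budget tool.",
     "tool_calls": [{"function": {"name": "budget_optimizer_assistant",
                                  "arguments": '{"query": "Create a budget"}'}}]},
    {"role": "tool", "content": "Here is a budget...", "tool_calls": []},
]
print(validate_reference_trajectory(sample))  # []
```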

In [11]:
REFERENCE_FINAL_ANSWER = """
import pandas as pd
import numpy as np

def process_employee_data():
    \"""
    Creates a sample employee DataFrame and performs various data manipulations.
    
    This function:
    1. Creates a sample DataFrame with employee information
    2. Adds a bonus column based on salary
    3. Filters employees by age
    4. Groups data by age brackets and calculates statistics
    
    Returns:
        tuple: Contains the original DataFrame, filtered DataFrame, and grouped statistics
    \"""
    # Create a sample DataFrame with employee data
    data = {
        'Name': ['John', 'Alice', 'Bob', 'Emma', 'David', 'Sarah', 'Michael', 'Linda'],
        'Age': [25, 32, 45, 28, 35, 42, 29, 51],
        'Salary': [50000, 65000, 85000, 52000, 72000, 78000, 60000, 95000]
    }
    
    df = pd.DataFrame(data)
    print("Original DataFrame:")
    print(df)
    print("\\n")
    
    # Add a new column 'Bonus' that is 10% of 'Salary'
    df['Bonus'] = df['Salary'] * 0.1
    print("DataFrame with Bonus column added:")
    print(df)
    print("\\n")
    
    # Filter the DataFrame to include only rows where Age > 30
    filtered_df = df[df['Age'] > 30]
    print("Filtered DataFrame (Age > 30):")
    print(filtered_df)
    print("\\n")
    
    # Create a new column for age brackets
    df['Age_Bracket'] = pd.cut(df['Age'], 
                               bins=[20, 30, 40, 50, 60],
                               labels=['20s', '30s', '40s', '50s'])
    
    # Group by age brackets and calculate average Salary and Bonus
    grouped_df = df.groupby('Age_Bracket').agg({
        'Salary': 'mean',
        'Bonus': 'mean'
    }).reset_index()
    
    print("Average Salary and Bonus by Age Bracket:")
    print(grouped_df)
    
    return df, filtered_df, grouped_df

# Execute the function
if __name__ == "__main__":
    original_df, filtered_df, grouped_df = process_employee_data()
"""


In [12]:
REFERENCE_TRAJECTORY_OUTPUTS = [
    {
        "role": "user",
        "content": """
Write a Python script using the pandas library that performs the following tasks:

- Create a sample DataFrame with the columns: 'Name', 'Age', and 'Salary'.
- Add a new column named 'Bonus' that is 10% of the corresponding 'Salary' value.
- Filter the DataFrame to include only rows where the 'Age' is greater than 30.
- Group the data by age brackets (e.g., 20s, 30s, 40s) and calculate the average 'Salary' and 'Bonus' for each group.

Execute the python script and show the output.

Requirements:
- Include clear inline comments to explain the logic.
- Add a docstring for the function, describing its purpose, parameters, and return value.
- Provide an example of how to use the function.
- List any external libraries that need to be installed with pip (if any).
- Include brief documentation describing how the code works and how to run it.
"""
    },
    {
        "role": "assistant",
        "content": """
Certainly! Here is a complete, executable solution with all requirements met.

import pandas as pd
import numpy as np

def process_employee_data():
    '''
    Creates, processes, and analyzes a sample employee DataFrame.

    This function performs the following steps:
    1. Creates a pandas DataFrame with 'Name', 'Age', and 'Salary'.
    2. Adds a 'Bonus' column calculated as 10% of the 'Salary'.
    3. Filters the DataFrame to show only employees older than 30.
    4. Groups the data into age brackets (20s, 30s, 40s, 50s).
    5. Calculates the average salary and bonus for each age group.
    6. Prints each step's result to the console.

    Returns:
        tuple: A tuple containing four DataFrames:
               (df_original, df_with_bonus, filtered_df, grouped_stats)
    '''
    # 1. Create a sample DataFrame
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hank'],
        'Age': [25, 35, 45, 28, 52, 31, 38, 41],
        'Salary': [50000, 65000, 80000, 48000, 95000, 60000, 72000, 78000]
    }
    df = pd.DataFrame(data)
    print("--- Original DataFrame ---")
    print(df, "\\n")
    
    # Store a copy for the return value before further modifications
    df_original = df.copy()

    # 2. Add a 'Bonus' column that is 10% of the 'Salary'
    df['Bonus'] = df['Salary'] * 0.1
    print("--- DataFrame with 'Bonus' Column ---")
    print(df, "\\n")
    df_with_bonus = df.copy()

    # 3. Filter the DataFrame for Age > 30
    filtered_df = df[df['Age'] > 30]
    print("--- Filtered DataFrame (Age > 30) ---")
    print(filtered_df, "\\n")

    # 4. Group data by age brackets
    # Define the bins and labels for the age groups
    bins = [20, 30, 40, 50, 60]
    labels = ['20s', '30s', '40s', '50s']
    df['Age_Bracket'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

    # 5. Calculate the average 'Salary' and 'Bonus' for each group
    grouped_stats = df.groupby('Age_Bracket')[['Salary', 'Bonus']].mean().round(2)
    print("--- Average Salary and Bonus by Age Bracket ---")
    print(grouped_stats, "\\n")
    
    return df_original, df_with_bonus, filtered_df, grouped_stats

# Example of how to use the function
if __name__ == "__main__":
    # The function prints the output as it runs
    original, with_bonus, filtered, grouped = process_employee_data()

**Output:**
--- Original DataFrame ---
      Name  Age  Salary
0    Alice   25   50000
1      Bob   35   65000
2  Charlie   45   80000
3    David   28   48000
4      Eve   52   95000
5    Frank   31   60000
6    Grace   38   72000
7     Hank   41   78000

--- DataFrame with 'Bonus' Column ---
      Name  Age  Salary    Bonus
0    Alice   25   50000   5000.0
1      Bob   35   65000   6500.0
2  Charlie   45   80000   8000.0
3    David   28   48000   4800.0
4      Eve   52   95000   9500.0
5    Frank   31   60000   6000.0
6    Grace   38   72000   7200.0
7     Hank   41   78000   7800.0

--- Filtered DataFrame (Age > 30) ---
      Name  Age  Salary    Bonus
1      Bob   35   65000   6500.0
2  Charlie   45   80000   8000.0
4      Eve   52   95000   9500.0
5    Frank   31   60000   6000.0
6    Grace   38   72000   7200.0
7     Hank   41   78000   7800.0

--- Average Salary and Bonus by Age Bracket ---
                 Salary     Bonus
Age_Bracket
20s          49000.00   4900.00
30s          65666.67   6566.67
40s          79000.00   7900.00
50s          95000.00   9500.00

---

**External Libraries Required:**
- pandas

Install with:
pip install pandas


---

How the Code Works
Initialization: A dictionary data is created and then converted into a pandas DataFrame.
Vectorized Operation: A new Bonus column is created by multiplying the entire Salary column by 0.1. This is a fast, vectorized operation.
Boolean Filtering: A boolean mask (df['Age'] > 30) is created to select and display only the rows that meet this condition.
Binning: The pd.cut function is used to segment the Age data into discrete intervals (bins), which are then assigned descriptive labels.
Grouping & Aggregation: The groupby() method groups all rows by their Age_Bracket. Then, the average (mean()) of the Salary and Bonus columns is calculated for each group.
How to Run the Script
Save the code above into a file named salary_analyzer.py.
Open your terminal or command prompt.
Make sure you have installed the required libraries (pip install pandas numpy).
Navigate to the directory where you saved the file.
Execute the script using the command: python salary_analyzer.py.
"""
    }
]
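The two reference solutions bin ages differently. `REFERENCE_FINAL_ANSWER` calls `pd.cut` with the default `right=True`, which builds intervals (20, 30], (30, 40], and so on. The trajectory reference passes `right=False`, which builds [20, 30), [30, 40), and so on. With the sample data neither choice changes the result, because no age sits exactly on a bin edge. Ages such as 30 or 40 would land in different brackets, though, and `right=False` matches the prompt's reading of "30s". The following small check makes the difference concrete:

```python
import pandas as pd

ages = [25, 30, 31, 40, 52]
bins = [20, 30, 40, 50, 60]
labels = ["20s", "30s", "40s", "50s"]

# Default right=True: intervals (20, 30], (30, 40], ... so age 30 falls in "20s"
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: intervals [20, 30), [30, 40), ... so age 30 falls in "30s"
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print([str(x) for x in right_closed])  # ['20s', '20s', '30s', '30s', '50s']
print([str(x) for x in left_closed])   # ['20s', '30s', '30s', '40s', '50s']
```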

In [None]:
dataframe_agent_eval_metrics = [
    ToolCallAccuracyMetric(),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_sonnet,
        config=MetricConfig(metric_params={"reference_outputs": REFERENCE_TRAJECTORY_OUTPUTS})
    ),
    AgentGoalAccuracyMetric(
        llm=llm_judge,
        config=MetricConfig(
            metric_params={"reference_answer": REFERENCE_FINAL_ANSWER}
        )
    ),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-west-2"}
    )),
    LatencyMetric()
]

async def evaluate_dataframe_agent():
    return await evaluate_agent(dataframe_agent_eval_metrics, dataframe_agent_spans, "dataframe_agent_trajectory.json")

# Execute the evaluation
dataframe_agent_evalresults = asyncio.run(evaluate_dataframe_agent())

#### Use Case 2: Financial advisor orchestrator

In [14]:
# Clear the exporter queue for any residue
queue_exporter.clear()

In [15]:
# Define the specialist agents, each wrapped as a tool for the orchestrator.
# Strands uses each tool's docstring as its description when the orchestrator selects tools.

# Investment Research Assistant
model_research = BedrockModel(model_id="us.amazon.nova-pro-v1:0")

@tool
def investment_research_assistant(query: str) -> str:
    """Research investment options (stocks, ETFs, mutual funds, bonds), market trends, risks, and historical performance.

    Args:
        query: The user's investment research question.
    """
    investment_researcher = Agent(
        model=model_research,
        system_prompt="""
        You are a financial research analyst who helps users explore various investment opportunities. 
        You specialize in providing insights into stocks, ETFs, mutual funds, bonds, and other instruments.
        You also highlight key market trends, risk factors, and historical performance.
        Your goal is to equip users with comprehensive, objective information to support their investment decisions.
        """,
    )
    # str() on an agent result returns its text response, matching the -> str annotation
    return str(investment_researcher(query))

# Budget Optimizer Assistant
model_budget = BedrockModel(model_id="us.amazon.nova-lite-v1:0")

@tool
def budget_optimizer_assistant(query: str) -> str:
    """Analyze income, spending, and savings goals and recommend an optimized monthly budget.

    Args:
        query: The user's budgeting question, including income and expenses.
    """
    budget_coach = Agent(
        model=model_budget,
        system_prompt="""
        You are a smart budgeting assistant who helps users manage and optimize their monthly expenses.
        You analyze income, spending patterns, and savings goals, and suggest personalized recommendations 
        to cut unnecessary costs and improve savings. Your goal is to help users maintain a healthy financial balance.
        """,
    )
    return str(budget_coach(query))

# Financial Planner Assistant
model_planner = BedrockModel(model_id="us.amazon.nova-micro-v1:0")

@tool
def financial_planner_assistant(query: str) -> str:
    """Create a long-term financial plan covering saving, debt management, insurance, home buying, and retirement.

    Args:
        query: The user's goals, income, age, and risk tolerance.
    """
    financial_planner = Agent(
        model=model_planner,
        system_prompt="""
        You are a certified financial advisor bot who helps users create customized financial plans 
        based on their goals, income, age, and risk tolerance. 
        You guide them through budgeting, saving, debt management, insurance, and retirement planning.
        Your goal is to provide practical, step-by-step advice to help users achieve financial stability and growth.
        """,
    )
    return str(financial_planner(query))
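The orchestrator decides which specialist to call using only each tool's name, input schema, and description, and Strands' `@tool` decorator derives the description from the function's docstring. The following plain-Python sketch makes no Strands calls and uses hypothetical helper names (`demo_*`, `build_tool_catalog`). It shows how a catalog of tool descriptions might be derived from docstrings, and why an undocumented tool leaves the model nothing to route on.

```python
import inspect


def demo_budget_tool(query: str) -> str:
    """Suggest a monthly budget and savings plan for the user's income and expenses."""
    return f"(budget advice for: {query})"


def demo_research_tool(query: str) -> str:
    """Research stocks, ETFs, funds, and bonds, including trends and risks."""
    return f"(investment research for: {query})"


def demo_undocumented_tool(query: str) -> str:
    return f"(plan for: {query})"


def build_tool_catalog(functions):
    """Map each tool name to the first line of its docstring (empty when undocumented)."""
    catalog = {}
    for fn in functions:
        doc = inspect.getdoc(fn) or ""
        catalog[fn.__name__] = doc.splitlines()[0] if doc else ""
    return catalog


catalog = build_tool_catalog([demo_budget_tool, demo_research_tool, demo_undocumented_tool])
for name, description in catalog.items():
    print(f"{name}: {description or '<no description>'}")
```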

In [None]:
# Define the orchestrator: an advisor agent that can delegate to the three specialist tools.
# No model is specified, so the Strands default model is used.

MAIN_SYSTEM_PROMPT = """
        You are a certified financial advisor bot who helps users create customized financial plans 
        based on their goals, income, age, and risk tolerance. 
        You guide them through budgeting, saving, debt management, insurance, and retirement planning.
        Your goal is to provide practical, step-by-step advice to help users achieve financial stability and growth.
        """
financial_orchestrator = Agent(
    system_prompt=MAIN_SYSTEM_PROMPT,
    tools=[investment_research_assistant, budget_optimizer_assistant, financial_planner_assistant]
)

# Define the complex user query
user_query_financial = """
    I'm 30 years old, earning around $6,000 per month. 
    I have some student loans and moderate savings.
    I want to understand how I can better manage my monthly budget, 
    explore investment options, and build a solid long-term financial plan 
    for buying a house and retiring early. Can you help?
    """

financial_orchestrator(user_query_financial)

In [17]:
financial_agent_spans = get_all_spans(queue_exporter)

First, evaluate the agent without any ground-truth answer or reference trajectory, relying solely on the LLM judge to assess its performance.

In [None]:
financial_agent_eval_metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(
        llm=llm_judge
    ),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_sonnet
    ),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-west-2"}
    )),
    LatencyMetric()
]

# Evaluate the orchestrator run from its captured spans
async def evaluate_financial_agent():
    return await evaluate_agent(financial_agent_eval_metrics, financial_agent_spans, "financial_agent_trajectory.json")

# Execute the evaluation
financial_agent_eval_results = asyncio.run(evaluate_financial_agent())

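Next, evaluate the agent against reference data: an expected final answer (passed to `AgentGoalAccuracyMetric` as `reference_answer`) and an expected trajectory (passed to `TrajectoryEvalWithLLMMetric` as `reference_outputs`). The LLM judge compares the agent's output with these references.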
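`TrajectoryEvalWithLLMMetric` asks an LLM to judge the trajectory. A cheaper, deterministic complement is to compare the sequence of tool names directly. The following hypothetical sketch does this in two modes: `strict` (same tools, same order) and `unordered` (same tools and counts, any order). The modes are in the spirit of the trajectory-match evaluators in the `agentevals` package installed above, but this is not that package's implementation. It reads the `tool_calls` shape used by `REFERENCE_TRAJECTORY_OUTPUTS`.

```python
from collections import Counter


def tool_names(messages):
    """Tool names in call order, taken from assistant messages' 'tool_calls'."""
    return [
        call["function"]["name"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in msg.get("tool_calls", [])
    ]


def match_tool_calls(actual, reference, mode="strict"):
    """strict: same tools in the same order; unordered: same tools and counts, any order."""
    a, r = tool_names(actual), tool_names(reference)
    if mode == "strict":
        return a == r
    if mode == "unordered":
        return Counter(a) == Counter(r)
    raise ValueError(f"unknown mode: {mode}")


reference = [
    {"role": "assistant", "content": "",
     "tool_calls": [{"function": {"name": "budget_optimizer_assistant", "arguments": "{}"}}]},
    {"role": "assistant", "content": "",
     "tool_calls": [{"function": {"name": "investment_research_assistant", "arguments": "{}"}}]},
]
actual = list(reversed(reference))
print(match_tool_calls(actual, reference, "strict"))     # False
print(match_tool_calls(actual, reference, "unordered"))  # True
```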

In [19]:
REFERENCE_FINAL_ANSWER = """
# Your Comprehensive Financial Plan

Based on the detailed analysis from our financial tools, here is your comprehensive three-part financial plan:

## Part 1: Monthly Budget with Aggressive Loan Repayment

**Monthly Income:** $6,000

**Monthly Expenses:**
- Housing: $1,500
- Utilities: $200
- Groceries: $300
- Transportation: $300
- Food (dining out): $200
- Entertainment: $100
- Insurance: $200
- Personal Care: $150
- Miscellaneous: $150
- **Total Monthly Expenses:** $3,200

**Debt and Savings Allocation:**
- Student Loan Payment: $1,613.75
  - Monthly Interest ($113.75) + Extra Principal Payment ($1,500)
- High-Yield Savings Account: $1,000
- Remaining for Emergency Fund/Investment: $186.25

This aggressive payment plan will help you eliminate your student loan much faster than minimum payments while still building your savings.

## Part 2: Investment Strategy for Long-Term Growth

### ETF Comparison:
Both VOO (currently at $555.14) and QQQ (currently at $535.16) are popular ETFs with different focuses:
- **VOO (Vanguard S&P 500 ETF)** tracks the S&P 500 index, providing broad market exposure
- **QQQ (Invesco QQQ Trust)** tracks the Nasdaq-100 index, focusing more on tech companies

### Recommended Portfolio Allocation for Age 30:

**Stocks (70-80%):**
- VOO (Vanguard S&P 500 ETF): 40%
- QQQ (Invesco QQQ Trust): 30%
- VXUS (Vanguard Total International Stock ETF): 20-30%

**Bonds (20-30%):**
- AGG (iShares Core U.S. Aggregate Bond ETF): 20%
- BNDX (Vanguard Total International Bond ETF): 10%

This balanced portfolio provides a good foundation for long-term growth with appropriate diversification across domestic and international markets.

## Part 3: Long-Term Plan for Home Purchase and Retirement

### Timeline to Home Purchase ($450,000 in 7 years):
1. **Years 1-2:** 
   - Aggressively pay off your $25,000 student loan (approximately 12 months)
   - Build emergency fund of 3-6 months expenses ($7,740-$13,900)

2. **Years 3-7:**
   - Save $90,000 for a 20% down payment (approximately 5 years at $1,250/month)
   - Continue retirement contributions
   - Monitor real estate market trends

### Retirement by Age 55:
1. **Start Now:**
   - Contribute 15% of your income to retirement accounts ($10,800/year)
   - Maximize employer match if available
   - Use tax-advantaged accounts (401(k), IRA)

2. **After Home Purchase:**
   - Allocate former down payment savings toward retirement
   - Increase contributions as income grows
   - Review and rebalance investments annually

3. **Ages 40-55:**
   - Accelerate retirement savings
   - Gradually shift to more conservative investments
   - Consider additional income streams

## Action Steps:

1. **Immediate:** Implement the budget with aggressive loan payments while maintaining savings contributions
2. **Within 1 year:** Eliminate student loan debt
3. **Years 1-2:** Complete emergency fund
4. **Years 2-7:** Focus on house down payment while maintaining retirement contributions
5. **Years 7+:** Redirect house savings to retirement after purchase
6. **Ongoing:** Review and adjust plan annually or with major life changes

This comprehensive plan balances your immediate need to eliminate debt while setting you on a path to both homeownership and early retirement. As your income grows, consider increasing your savings rate to accelerate progress toward your goals.
"""


REFERENCE_TRAJECTORY_OUTPUTS = [
    {
        "role": "user",
        "content": """
        I'm a 30-year-old, earning $6,000 per month after taxes. I have $15,000 in a high-yield savings account and a $25,000 student loan with a 5.5% interest rate.
        I need a comprehensive financial plan with three parts:
        1.  A monthly budget that helps me aggressively pay down my loan while still investing.
        2.  An investment strategy for long-term growth. Please compare the recent performance of the VOO and QQQ ETFs and suggest a sample portfolio for my age.
        3.  A long-term plan outlining the steps to buy a $450,000 house in the next 7 years and retire by age 55.
        Can you help create this for me?
        """,
    },
    {
        "role": "assistant",
        "content": "I'll help you create a comprehensive financial plan by consulting with our specialist tools for each part of your request. Let's address each component systematically.",
        "tool_calls": [
            {
                "function": {
                    "name": "budget_optimizer_assistant",
                    "arguments": '{"query": "I earn $6,000 per month after taxes. I have $15,000 in a high-yield savings account and a $25,000 student loan with a 5.5% interest rate. Create a monthly budget that helps me aggressively pay down my loan while still investing."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        Certainly! Here's a structured monthly budget that helps you aggressively pay down your student loan while still allowing you to invest in your high-yield savings account.

        ### Monthly Income
        - **Total Monthly Income:** $6,000

        ### Monthly Expenses
        1. **Housing (Rent/Mortgage):** $1,500
        2. **Utilities (Electricity, Water, Internet, etc.):** $200
        3. **Groceries:** $300
        4. **Transportation (Car payments, Gas, Public Transport):** $300
        5. **Food (Dining out, Eating at restaurants):** $200
        6. **Entertainment (Subscriptions, Hobbies, etc.):** $100
        7. **Insurance (Health, Car, etc.):** $200
        8. **Personal Care (Clothing, Haircuts, etc.):** $150
        9. **Miscellaneous:** $150

        ### Total Monthly Expenses: $3,200

        ### Remaining Income for Loan and Savings
        - **Remaining Income:** $6,000 - $3,200 = $2,800

        ### Loan and Savings Allocation
        1. **Student Loan Payment (Extra Principal Payment):**
           - **Monthly Interest:** $25,000 * 5.5% / 12 = $113.75
           - **Extra Principal Payment:** $1,500 (This is an aggressive payment to reduce the principal faster)
           - **Total Loan Payment:** $113.75 + $1,500 = $1,613.75

        2. **High-Yield Savings Account:**
           - **Monthly Contribution:** $1,000

        ### Summary of Allocation
        - **Student Loan Payment:** $1,613.75
        - **High-Yield Savings Account:** $1,000
        - **Remaining Balance:** $2,800 - $1,613.75 - $1,000 = $186.25

        By following this budget, you'll be aggressively paying down your student loan while still contributing to your savings.
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": "Now I'll analyze investment options for your long-term growth strategy, including a comparison of VOO and QQQ ETFs.",
        "tool_calls": [
            {
                "function": {
                    "name": "investment_research_assistant",
                    "arguments": '{"query": "I\'m a 30-year-old looking for an investment strategy for long-term growth. Please compare the recent performance of the VOO and QQQ ETFs and suggest a sample portfolio for someone my age."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        For someone in their 30s looking for long-term growth, a balanced portfolio with a mix of stocks and bonds is typically recommended. Here's a sample portfolio:

        **Current ETF Prices:**
        - VOO (Vanguard S&P 500 ETF): $555.14
        - QQQ (Invesco QQQ Trust): $535.16

        **Recommended Portfolio:**
        1. **Stocks (70-80%)**:
           - **VOO (Vanguard S&P 500 ETF)**: 40%
           - **QQQ (Invesco QQQ Trust)**: 30%
           - **VXUS (Vanguard Total International Stock ETF)**: 20-30%

        2. **Bonds (20-30%)**:
           - **AGG (iShares Core U.S. Aggregate Bond ETF)**: 20%
           - **BNDX (Vanguard Total International Bond ETF)**: 10%

        This portfolio aims for growth with a moderate level of risk. The inclusion of international stocks and bonds helps diversify the portfolio geographically.
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": "Finally, I'll create a comprehensive long-term plan that integrates your budget with your home buying and retirement goals.",
        "tool_calls": [
            {
                "function": {
                    "name": "financial_planner_assistant",
                    "arguments": '{"query": "I\'m 30 years old, earning $6,000 per month after taxes. I have $15,000 in a high-yield savings account and a $25,000 student loan with a 5.5% interest rate. Create a long-term plan outlining the steps to buy a $450,000 house in the next 7 years and retire by age 55."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        Here's a step-by-step long-term financial plan:

        ### Step 1: Pay Off Student Loan
        **Student Loan Repayment Strategy:**
        - **Aggressive Repayment:** Monthly payment: $2,166.67 (Principal + Interest)
        - **Time to pay off:** ~12 months

        ### Step 2: Build Emergency Fund
        **Emergency Fund:** Aim to save 3-6 months of expenses ($7,740 - $13,900)
        - Start with $15,000 in savings, add monthly surplus

        ### Step 3: Save for Down Payment
        **Down Payment:** Target: $90,000 (20% of $450,000)
        - **Savings Rate:** Monthly surplus: $1,253.33
        - **Time to save:** $75,000 / $1,253.33 ≈ 60 months (5 years)

        ### Step 4: Retirement Savings
        **Retirement Savings:** Aim to save 15% of income for retirement (annual: $72,000 * 0.15 = $10,800)
        - **Monthly retirement savings:** $1,080

        ### Timeline:
        **Years 1-2:** Pay off student loan, build emergency fund
        **Years 3-7:** Save down payment, continue retirement investing
        **Years 8-15:** Accelerate retirement savings

        By following this plan, you should be well on your way to buying a house and achieving a comfortable retirement by age 55.
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": """# Your Comprehensive Financial Plan

Based on the detailed analysis from our financial tools, here is your comprehensive three-part financial plan:

## Part 1: Monthly Budget with Aggressive Loan Repayment

**Monthly Income:** $6,000

**Monthly Expenses:**
- Housing: $1,500
- Utilities: $200
- Groceries: $300
- Transportation: $300
- Food (dining out): $200
- Entertainment: $100
- Insurance: $200
- Personal Care: $150
- Miscellaneous: $150
- **Total Monthly Expenses:** $3,200

**Debt and Savings Allocation:**
- Student Loan Payment: $1,613.75
  - Monthly Interest ($113.75) + Extra Principal Payment ($1,500)
- High-Yield Savings Account: $1,000
- Remaining for Emergency Fund/Investment: $186.25

This aggressive payment plan will help you eliminate your student loan much faster than minimum payments while still building your savings.

## Part 2: Investment Strategy for Long-Term Growth

### ETF Comparison:
Both VOO (currently at $555.14) and QQQ (currently at $535.16) are popular ETFs with different focuses:
- **VOO (Vanguard S&P 500 ETF)** tracks the S&P 500 index, providing broad market exposure
- **QQQ (Invesco QQQ Trust)** tracks the Nasdaq-100 index, focusing more on tech companies

### Recommended Portfolio Allocation for Age 30:

**Stocks (70-80%):**
- VOO (Vanguard S&P 500 ETF): 40%
- QQQ (Invesco QQQ Trust): 30%
- VXUS (Vanguard Total International Stock ETF): 20-30%

**Bonds (20-30%):**
- AGG (iShares Core U.S. Aggregate Bond ETF): 20%
- BNDX (Vanguard Total International Bond ETF): 10%

This balanced portfolio provides a good foundation for long-term growth with appropriate diversification across domestic and international markets.

## Part 3: Long-Term Plan for Home Purchase and Retirement

### Timeline to Home Purchase ($450,000 in 7 years):
1. **Years 1-2:** 
   - Aggressively pay off your $25,000 student loan (approximately 12 months)
   - Build emergency fund of 3-6 months expenses ($7,740-$13,900)

2. **Years 3-7:**
   - Save $90,000 for a 20% down payment (approximately 5 years at $1,250/month)
   - Continue retirement contributions
   - Monitor real estate market trends

### Retirement by Age 55:
1. **Start Now:**
   - Contribute 15% of your income to retirement accounts ($10,800/year)
   - Maximize employer match if available
   - Use tax-advantaged accounts (401(k), IRA)

2. **After Home Purchase:**
   - Allocate former down payment savings toward retirement
   - Increase contributions as income grows
   - Review and rebalance investments annually

3. **Ages 40-55:**
   - Accelerate retirement savings
   - Gradually shift to more conservative investments
   - Consider additional income streams

## Action Steps:

1. **Immediate:** Implement the budget with aggressive loan payments while maintaining savings contributions
2. **Within 1 year:** Eliminate student loan debt
3. **Years 1-2:** Complete emergency fund
4. **Years 2-7:** Focus on house down payment while maintaining retirement contributions
5. **Years 7+:** Redirect house savings to retirement after purchase
6. **Ongoing:** Review and adjust plan annually or with major life changes

This comprehensive plan balances your immediate need to eliminate debt while setting you on a path to both homeownership and early retirement. As your income grows, consider increasing your savings rate to accelerate progress toward your goals."""
    }
]
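Because the judge treats these references as ground truth, their arithmetic is worth re-deriving. Using the same inputs ($25,000 at 5.5% APR, $6,000 income, $3,200 of expenses, a $1,500 extra principal payment, and $1,000 to savings), first-month interest comes to about $114.58, not the $113.75 quoted above. The leftover is then about $185.42 rather than $186.25. The following few lines reproduce the calculation:

```python
principal = 25_000
annual_rate = 0.055
monthly_income = 6_000
monthly_expenses = 3_200
extra_principal = 1_500
savings_contribution = 1_000

# First-month interest on the full outstanding balance
monthly_interest = round(principal * annual_rate / 12, 2)
loan_payment = round(monthly_interest + extra_principal, 2)
remaining = round(monthly_income - monthly_expenses - loan_payment - savings_contribution, 2)

print(monthly_interest)  # 114.58
print(loan_payment)      # 1614.58
print(remaining)         # 185.42
```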

In [None]:
financial_metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(
        llm=llm_judge,
        config=MetricConfig(metric_params={"reference_answer": REFERENCE_FINAL_ANSWER})
    ),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_sonnet,
        config=MetricConfig(metric_params={"reference_outputs": REFERENCE_TRAJECTORY_OUTPUTS})
    ),
    UsageMetric(config=MetricConfig(
        metric_params={"aws_region": "us-west-2"}
    )),
    LatencyMetric()
]

# Evaluate the orchestrator run from its captured spans
async def evaluate_financial_agent():
    return await evaluate_agent(financial_metrics, financial_agent_spans, "financial_agent_trajectory.json")

# Execute the evaluation
financial_results = asyncio.run(evaluate_financial_agent())