# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, we'll run the Energy Advisor agent with various real-world scenarios and see how it helps customers optimize their energy usage.

- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [27]:
from datetime import datetime
from agent import Agent
import os
import re
from typing import Dict, Any, List, Set

In [28]:
## Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = """
You are the EcoHome Energy Advisor, an AI assistant that helps homeowners optimize
their energy usage, electricity costs, and environmental impact.

You have access to these tools:
- get_weather_forecast(location, days): weather forecasts with solar-relevant info.
- get_electricity_prices(date): hourly time-of-use (TOU) electricity prices.
- query_energy_usage(start_date, end_date, device_type): historical household usage.
- query_solar_generation(start_date, end_date): historical solar generation.
- get_recent_energy_summary(hours): recent usage + solar overview by device and cost.
- search_energy_tips(query): retrieval-augmented energy-saving tips and best practices.
- calculate_energy_savings(device_type, current_usage_kwh, optimized_usage_kwh, price_per_kwh):
  estimate savings in kWh, dollars, and annual impact.

YOUR KEY CAPABILITIES

Weather Integration (solar-aware optimization):
- Use get_weather_forecast to understand cloud cover, temperature, and sunlight.
- Use this to:
  - Identify when solar production will be highest.
  - Suggest running flexible loads (EV charging, pool pumps, appliances) during sunny hours.
  - Recommend pre-cooling or pre-heating before extreme temperature periods.

Dynamic Pricing (time-of-use optimization):
- Use get_electricity_prices to:
  - Identify off-peak, mid-peak, and on-peak hours.
  - Shift flexible device usage from high-cost to low-cost periods.
- Always consider both price and solar availability when recommending schedules.

Historical Analysis (personalized advice):
- Use query_energy_usage and/or get_recent_energy_summary to:
  - Identify which devices consume the most energy and cost.
  - Detect patterns (e.g., EV charging at expensive times, HVAC dominating usage).
- Use query_solar_generation to understand how much solar is typically available.
- Base your suggestions on this history whenever the user asks about "my usage"
  or "based on my data/history."

RAG Pipeline (energy-saving knowledge):
- Use search_energy_tips when the user asks for:
  - “tips”, “ways to reduce energy”, “best practices”, “how can I save more”.
- Combine retrieved tips with the user’s historical data and context.
- Summarize tips in your own words and keep them specific and actionable.

Multi-device Optimization:
- Be able to reason about:
  - Electric vehicles (EVs) and when to charge them.
  - HVAC / thermostat settings and pre-cooling/pre-heating.
  - Dishwashers, washing machines, dryers, and other appliances.
  - Pool pumps and circulation schedules.
  - Solar systems and, if mentioned, energy storage (batteries).
- When appropriate, propose a coordinated schedule across multiple devices.

Cost Calculations & ROI:
- Use calculate_energy_savings to quantify:
  - kWh savings.
  - Dollar savings per day, month, or year.
- Clearly state your assumptions (e.g., kWh per cycle, hours of runtime, price per kWh).
- Whenever the user asks “how much can I save” or “what is the impact”, include numbers.

BEHAVIOR GUIDELINES

- Accuracy:
  - Call tools instead of guessing.
  - Use the most relevant tools for the user’s question (weather, prices, history, tips, savings).
  - Double-check that units (kWh, °C/°F, $, hours) are consistent and reasonable.

- Relevance:
  - Directly answer the user’s question first.
  - Then briefly expand with useful related recommendations (but don’t ramble).

- Completeness:
  - Provide actionable steps: specific hours, temperatures, or device schedules.
  - When possible, give at least 2–3 concrete recommendations instead of just one.

- Tool Usage:
  - For scheduling questions ("when should I run/charge…"), combine:
    - get_weather_forecast + get_electricity_prices, and optionally usage history.
  - For “based on my usage” questions, use:
    - query_energy_usage or get_recent_energy_summary.
  - For generic savings tips, use:
    - search_energy_tips and then personalize.
  - For numeric savings, use:
    - calculate_energy_savings.

- Reasoning:
  - Briefly explain WHY you are making a recommendation
    (e.g., “these hours are off-peak and have strong solar production”).
  - Keep explanations clear and non-technical.

EXAMPLE EXPECTATIONS

“When should I charge my electric car tomorrow to minimize cost and maximize solar power?”:
  - Use get_weather_forecast and get_electricity_prices (for tomorrow).
     Recommend specific charging hours that are both sunny and off-peak.

“What temperature should I set my thermostat on Wednesday afternoon if electricity prices spike?”:
  - Use get_electricity_prices (for that date) and optionally weather.
     Recommend a concrete thermostat range and pre-cooling strategy.

“Suggest three ways I can reduce energy use based on my usage history.”:
  - Use get_recent_energy_summary or query_energy_usage plus search_energy_tips.
     Provide 3 personalized, high-impact actions.

“How much can I save by running my dishwasher during off-peak hours?”:
  - Use get_electricity_prices and calculate_energy_savings with a reasonable
     kWh-per-cycle assumption. Return approximate $ savings per cycle and per month.

“What’s the best time to run my pool pump this week based on the weather forecast?”:
  - Use get_weather_forecast (and pricing if useful) to choose daily windows that
     balance good circulation/solar with acceptable electricity cost.

If data or tools are unavailable or return errors, say so briefly and provide
the best-practice advice you can, clearly noting that it is based on general
guidelines rather than the user’s specific data.
"""


In [29]:
from dotenv import load_dotenv
load_dotenv()

True

In [30]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [31]:
ecohome_agent.get_agent_tools()

['get_weather_forecast',
 'get_electricity_prices',
 'query_energy_usage',
 'query_solar_generation',
 'get_recent_energy_summary',
 'search_energy_tips',
 'calculate_energy_savings']

In [32]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [33]:
print(response["messages"][-1].content)

To minimize costs and maximize solar power when charging your electric car tomorrow (October 24, 2023), here are the best times to charge based on the weather forecast and electricity prices:

### Best Charging Times:
1. **Off-Peak Hours (Lowest Cost)**:
   - **12:00 AM - 5:00 AM**: $0.132 per kWh
   - **11:00 PM - 12:00 AM**: $0.132 per kWh

2. **Mid-Peak Hours (Moderate Cost)**:
   - **6:00 AM - 10:00 AM**: $0.22 per kWh
   - **10:00 PM - 11:00 PM**: $0.242 per kWh

### Solar Power Availability:
- **Solar Generation**: The forecast indicates that solar generation will start to pick up around **9:00 AM** and peak around **3:00 PM**. However, charging during this time will be more expensive due to on-peak rates.

### Recommendations:
- **Best Option**: Charge your electric car during the **off-peak hours** from **12:00 AM to 5:00 AM**. This will ensure you are using electricity at the lowest rate.
- If you prefer to charge during the day when solar power is available, consider charging

In [34]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.model_dump()
    if obj.get("tool_call_id"):
        print("-", msg.name)

TOOLS:
- get_weather_forecast
- get_electricity_prices


## 2. Define Test Cases

In [14]:
# Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations

In [35]:
test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "Must recommend specific charging hours, reference sunny periods and off-peak pricing."
    },
    {
        "id": "thermostat_2",
        "question": "What temperature should I set my thermostat on Wednesday afternoon to save money?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "Should suggest a numeric thermostat range and explain price/weather reasoning."
    },
    {
        "id": "dishwasher_3",
        "question": "How much can I save by running my dishwasher at night instead of 6 PM?",
        "expected_tools": ["get_electricity_prices", "calculate_energy_savings"],
        "expected_response": "Must estimate savings per cycle and per month using TOU rate difference."
    },
    {
        "id": "laundry_4",
        "question": "When is the best time to run my washing machine this weekend?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "Should detect weekend pricing and recommend cheapest hours."
    },
    {
        "id": "solar_forecast_5",
        "question": "How much solar energy can I expect tomorrow in San Francisco?",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "Should reference sunny hours, irradiance levels, or expected generation patterns."
    },
    {
        "id": "usage_history_6",
        "question": "Which appliance used the most electricity last month?",
        "expected_tools": ["query_energy_usage"],
        "expected_response": "Must identify highest-consumption device and provide kWh + cost breakdown."
    },
    {
        "id": "optimization_multi_device_7",
        "question": "Help me schedule my EV, dishwasher, and dryer tomorrow for lowest electricity cost.",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "Should propose a coordinated time schedule minimizing on-peak usage."
    },
    {
        "id": "energy_tips_8",
        "question": "Give me three ways to reduce electricity usage at home.",
        "expected_tools": ["search_energy_tips"],
        "expected_response": "Should return 3 actionable, personalized efficiency recommendations."
    },
    {
        "id": "recent_summary_9",
        "question": "Summarize my energy usage over the past 48 hours.",
        "expected_tools": ["get_recent_energy_summary"],
        "expected_response": "Should return total kWh, cost, device breakdown, and insights."
    },
    {
        "id": "pool_pump_10",
        "question": "What is the best time to run my pool pump this week?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "Must recommend a daily schedule balancing sunlight and off-peak pricing."
    }
]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")

## 3. Run Agent Tests

In [36]:
CONTEXT = "Location: San Francisco, CA"

In [37]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases):
    print(f"\nTest {i+1}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'messages': response['messages'] if isinstance(response, dict) and 'messages' in response else [],
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'failed': False,
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: thermostat_2
Question: What temperature should I set my thermostat on Wednesday afternoon to save money?
--------------------------------------------------

Test 3: dishwasher_3
Question: How much can I save by running my dishwasher at night instead of 6 PM?
--------------------------------------------------

Test 4: laundry_4
Question: When is the best time to run my washing machine this weekend?
--------------------------------------------------

Test 5: solar_forecast_5
Question: How much solar energy can I expect tomorrow in San Francisco?
--------------------------------------------------

Test 6: usage_history_6
Question: Which appliance used the most electricity last month?
--------------------------------------------------

Test 7: optimization_multi_device_7
Question: He

In [None]:
test_results

## 4. Evaluate Responses

In [None]:
# Implement evaluation functions
# Create functions to evaluate:
# - Final Response
# - Tool usage

In [39]:
def _normalize(text: str) -> str:
    """Lowercase + remove extra spaces and punctuation for simple matching."""
    if text is None:
        return ""
    text = str(text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s:.%-]", " ", text)  # keep simple symbols like % and :
    text = re.sub(r"\s+", " ", text).strip()
    return text

In [40]:
def _tokenize(text: str) -> List[str]:
    return [t for t in _normalize(text).split(" ") if t]


In [41]:
def _overlap_score(a: str, b: str) -> float:
    """
    Simple token overlap between two strings.
    Returns a score between 0 and 1.
    """
    ta = set(_tokenize(a))
    tb = set(_tokenize(b))
    if not ta or not tb:
        return 0.0
    inter = len(ta.intersection(tb))
    union = len(ta.union(tb))
    return inter / union if union > 0 else 0.0


In [45]:
def evaluate_response(
    question: str,
    final_response: str,
    expected_response: str,
) -> Dict[str, Any]:
    """
    Evaluate a single response against an expected description.

    Metrics (0.0 – 1.0):
      - ACCURACY: How well the response matches the expected_response content.
      - RELEVANCE: How well the response stays on topic with the question.
      - COMPLETENESS: Does it cover the key aspects implied in expected_response?
      - USEFULNESS: Practical, actionable quality of the answer.

    Returns a dict with per-metric scores + feedback and an overall summary.
    """
    if not isinstance(final_response, str):
        final_response = str(final_response or "")
    if not isinstance(expected_response, str):
        expected_response = str(expected_response or "")
    if not isinstance(question, str):
        question = str(question or "")

    resp_norm = _normalize(final_response)
    q_norm = _normalize(question)
    exp_norm = _normalize(expected_response)

    # --- 1. ACCURACY ---
    accuracy_raw = _overlap_score(final_response, expected_response)

    if accuracy_raw >= 0.75:
        acc_feedback = (
            "The response closely aligns with the expected behavior and content."
        )
    elif accuracy_raw >= 0.4:
        acc_feedback = (
            "The response partially aligns with expectations but misses some details."
        )
    else:
        acc_feedback = (
            "The response diverges significantly from the expected guidance or omits "
            "important elements."
        )

    # --- 2. RELEVANCE ---
    relevance_raw = _overlap_score(final_response, question)

    # Penalize clearly generic error responses
    if any(x in resp_norm for x in ["internal error", "try again", "cannot", "sorry"]):
        relevance_raw *= 0.5

    if relevance_raw >= 0.75:
        rel_feedback = "The response is highly relevant to the user’s question."
    elif relevance_raw >= 0.4:
        rel_feedback = (
            "The response is somewhat relevant but includes tangential or generic content."
        )
    else:
        rel_feedback = "The response does not adequately address the user’s question."

    # --- 3. COMPLETENESS ---
    completeness_overlap = _overlap_score(final_response, expected_response)
    resp_len = len(resp_norm.split())

    # Very short answers can’t be fully complete
    length_factor = 1.0
    if resp_len < 40:
        length_factor = 0.4
    elif resp_len < 80:
        length_factor = 0.7

    completeness_raw = min(1.0, completeness_overlap * 1.2 * length_factor)

    if completeness_raw >= 0.75:
        comp_feedback = (
            "The response is comprehensive and covers the key elements implied by the expectations."
        )
    elif completeness_raw >= 0.4:
        comp_feedback = (
            "The response is partially complete; it covers some important aspects but "
            "misses others."
        )
    else:
        comp_feedback = (
            "The response feels incomplete and lacks several important details or steps."
        )

    # --- 4. USEFULNESS ---
    has_numbers = bool(re.search(r"\d", resp_norm))
    has_actions = any(
        kw in resp_norm
        for kw in [
            "recommend",
            "should",
            "you can",
            "best time",
            "set your",
            "run your",
            "schedule",
            "try",
            "consider",
        ]
    )

    base_usefulness = (accuracy_raw + relevance_raw + completeness_raw) / 3.0
    action_boost = 0.1 if has_numbers and has_actions else 0.0
    usefulness_raw = min(1.0, base_usefulness + action_boost)

    if usefulness_raw >= 0.75:
        use_feedback = (
            "The response is highly useful, with clear, actionable recommendations."
        )
    elif usefulness_raw >= 0.4:
        use_feedback = (
            "The response is somewhat useful but could be more concrete or actionable."
        )
    else:
        use_feedback = (
            "The response offers limited practical value and needs clearer, more "
            "actionable guidance."
        )

    # ---Overall---
    overall = (accuracy_raw + relevance_raw + completeness_raw + usefulness_raw) / 4.0

    if overall >= 0.8:
        overall_feedback = (
            "Overall, this is a strong response: it is accurate, relevant, and actionable "
            "for the user’s energy optimization question."
        )
    elif overall >= 0.5:
        overall_feedback = (
            "Overall, the response is decent but has room for improvement in either "
            "accuracy, completeness, or practical usefulness."
        )
    else:
        overall_feedback = (
            "Overall, the response does not sufficiently meet the expectations. It should "
            "be more accurate, better aligned with the question, and more concrete."
        )

    return {
        "accuracy": {
            "score": round(accuracy_raw, 3),
            "feedback": acc_feedback,
        },
        "relevance": {
            "score": round(relevance_raw, 3),
            "feedback": rel_feedback,
        },
        "completeness": {
            "score": round(completeness_raw, 3),
            "feedback": comp_feedback,
        },
        "usefulness": {
            "score": round(usefulness_raw, 3),
            "feedback": use_feedback,
        },
        "overall_score": round(overall, 3),
        "overall_feedback": overall_feedback,
    }


In [46]:
def _extract_tools_used(messages) -> List[str]:
    """
    Extract tool names used from a LangGraph / LangChain messages list.
    Works with message objects or dicts.
    """
    if messages is None:
        return []

    used: List[str] = []

    for msg in messages:
        obj = msg
        if not isinstance(obj, dict) and hasattr(msg, "model_dump"):
            obj = msg.model_dump()

        # Tool calls are embedded in AIMessage
        tool_calls = obj.get("tool_calls") if isinstance(obj, dict) else None
        if tool_calls:
            for tc in tool_calls:
                name = tc.get("name")
                if name:
                    used.append(name)

        # ToolMessage rows
        #   In dict form: type == "tool", has name
        if isinstance(obj, dict) and obj.get("type") == "tool" and obj.get("name"):
            used.append(obj["name"])

        # Fallback: if it's a ToolMessage-like object
        if not isinstance(obj, dict):
            if getattr(msg, "tool_call_id", None) and getattr(msg, "name", None):
                used.append(msg.name)

    # Deduplicate while preserving order
    seen: Set[str] = set()
    unique_used: List[str] = []
    for t in used:
        if t not in seen:
            seen.add(t)
            unique_used.append(t)

    return unique_used

In [47]:
def evaluate_tool_usage(messages, expected_tools: List[str]) -> Dict[str, Any]:
    """
    Evaluate tool usage based on messages_list and expected_tools.

    Metrics:
      - Tool Appropriateness: Were the tools used are the right ones?
      - Tool Completeness: Were all necessary tools (expected_tools) used?

    Returns metrics + detailed feedback.
    """
    expected_set = set(expected_tools or [])
    used_tools = _extract_tools_used(messages)
    used_set = set(used_tools)

    correct_used = used_set.intersection(expected_set)
    extra_used = used_set.difference(expected_set)
    missing = expected_set.difference(used_set)

    # Appropriateness: of the tools the agent did call, how many were expected?
    if used_tools:
        appropriateness_score = len(correct_used) / len(used_set)
    else:
        # No tools used at all
        appropriateness_score = 1.0 if not expected_set else 0.0

    # Completeness: of the tools we expected, how many did the agent use?
    if expected_set:
        completeness_score = len(correct_used) / len(expected_set)
    else:
        # No tools required for this test
        completeness_score = 1.0

    # --- Feedback for Appropriateness ---
    if appropriateness_score >= 0.9:
        app_feedback = (
            "The agent selected the appropriate tools for this scenario. "
            f"Used: {sorted(list(used_set)) or ['<none>']}."
        )
    elif appropriateness_score >= 0.5:
        app_feedback = (
            "The agent used some appropriate tools, but also included unnecessary "
            f"or less relevant tools. Extra tools: {sorted(list(extra_used)) or ['<none>']}."
        )
    else:
        app_feedback = (
            "The agent often chose inappropriate tools or failed to rely on the right ones. "
            f"Expected: {sorted(list(expected_set)) or ['<none>']}, "
            f"but used: {sorted(list(used_set)) or ['<none>']}."
        )

    # --- Feedback for Completeness ---
    if completeness_score >= 0.9:
        comp_feedback_tool = (
            "The agent called all of the necessary tools for this task."
        )
    elif completeness_score >= 0.5:
        comp_feedback_tool = (
            "The agent used some, but not all, of the necessary tools. "
            f"Missing: {sorted(list(missing)) or ['<none>']}."
        )
    else:
        comp_feedback_tool = (
            "The agent missed most of the key tools needed for this scenario. "
            f"Expected: {sorted(list(expected_set)) or ['<none>']}, "
            f"Missing: {sorted(list(missing)) or ['<none>']}."
        )

    overall_tool_score = (appropriateness_score + completeness_score) / 2.0
    if overall_tool_score >= 0.8:
        overall_tool_feedback = (
            "Overall, tool usage is strong: the agent generally picks the right tools "
            "and uses all that are needed."
        )
    elif overall_tool_score >= 0.5:
        overall_tool_feedback = (
            "Overall, tool usage is acceptable but could be more consistent in picking "
            "all the right tools and avoiding unnecessary calls."
        )
    else:
        overall_tool_feedback = (
            "Overall, tool usage needs significant improvement. The agent should rely "
            "more on the appropriate tools and ensure all required tools are invoked."
        )

    return {
        "used_tools": used_tools,
        "expected_tools": list(expected_set),
        "tool_appropriateness": {
            "score": round(appropriateness_score, 3),
            "feedback": app_feedback,
        },
        "tool_completeness": {
            "score": round(completeness_score, 3),
            "feedback": comp_feedback_tool,
        },
        "overall_score": round(overall_tool_score, 3),
        "overall_feedback": overall_tool_feedback,
    }


In [48]:
def generate_evaluation_report(test_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Generate a comprehensive evaluation report from test_results.

    Each item in test_results is expected to have:
      - test_id
      - question
      - response (raw agent output or error string)
      - expected_response (text description of desired behavior)
      - expected_tools (list of tool names)
      - messages (optional LangChain/LangGraph messages list)
      - failed (optional bool)
    """
    if not test_results:
        return {
            "summary": {
                "num_tests": 0,
                "avg_accuracy": 0.0,
                "avg_relevance": 0.0,
                "avg_completeness": 0.0,
                "avg_usefulness": 0.0,
                "avg_tool_appropriateness": 0.0,
                "avg_tool_completeness": 0.0,
                "avg_overall_response_score": 0.0,
                "avg_overall_tool_score": 0.0,
                "num_failed": 0,
            },
            "per_test": [],
            "strengths": [],
            "weaknesses": [],
            "recommendations": [],
        }

    total_acc = total_rel = total_comp = total_use = 0.0
    total_resp_overall = 0.0
    total_tool_approp = total_tool_comp = 0.0
    total_tool_overall = 0.0

    num_tests = len(test_results)
    num_failed = 0

    per_test_details = []

    for tr in test_results:
        question = tr.get("question", "")
        expected_resp = tr.get("expected_response", "")
        raw_resp = tr.get("response", "")
        messages = tr.get("messages", [])
        failed = tr.get("failed", False)

        # If failed, treat as very low-scoring with clear feedback
        if failed:
            num_failed += 1
            resp_eval = {
                "accuracy": {"score": 0.0, "feedback": "Agent failed to produce a valid response."},
                "relevance": {"score": 0.0, "feedback": "No relevant response due to failure."},
                "completeness": {"score": 0.0, "feedback": "Response is incomplete or missing."},
                "usefulness": {"score": 0.0, "feedback": "No actionable guidance due to failure."},
                "overall_score": 0.0,
                "overall_feedback": "Agent execution failed.",
            }
            tool_eval = {
                "used_tools": [],
                "expected_tools": tr.get("expected_tools", []),
                "tool_appropriateness": {"score": 0.0, "feedback": "No tools were used due to failure."},
                "tool_completeness": {"score": 0.0, "feedback": "Expected tools were not invoked."},
                "overall_score": 0.0,
                "overall_feedback": "Tool execution did not occur.",
            }
        else:
            # Evaluate response
            resp_eval = evaluate_response(
                question=question,
                final_response=str(raw_resp),
                expected_response=expected_resp,
            )

            # Evaluate tool usage
            tool_eval = evaluate_tool_usage(
                messages=messages,
                expected_tools=tr.get("expected_tools", []),
            )

        # Aggregate scores
        total_acc += resp_eval["accuracy"]["score"]
        total_rel += resp_eval["relevance"]["score"]
        total_comp += resp_eval["completeness"]["score"]
        total_use += resp_eval["usefulness"]["score"]
        total_resp_overall += resp_eval["overall_score"]

        total_tool_approp += tool_eval["tool_appropriateness"]["score"]
        total_tool_comp += tool_eval["tool_completeness"]["score"]
        total_tool_overall += tool_eval["overall_score"]

        per_test_details.append(
            {
                "test_id": tr.get("test_id"),
                "question": question,
                "response_snippet": str(raw_resp)[:280],
                "response_metrics": resp_eval,
                "tool_metrics": tool_eval,
            }
        )

    # Compute averages
    avg_accuracy = round(total_acc / num_tests, 2)
    avg_relevance = round(total_rel / num_tests, 2)
    avg_completeness = round(total_comp / num_tests, 2)
    avg_usefulness = round(total_use / num_tests, 2)
    avg_resp_overall = round(total_resp_overall / num_tests, 2)
    avg_tool_approp = round(total_tool_approp / num_tests, 2)
    avg_tool_comp = round(total_tool_comp / num_tests, 2)
    avg_tool_overall = round(total_tool_overall / num_tests, 2)

    # High-level strengths & weaknesses from averages
    strengths = []
    weaknesses = []
    recommendations = []

    if avg_relevance > 0.7:
        strengths.append("Responses are generally well-aligned with the user questions.")
    else:
        weaknesses.append("Relevance is moderate/low; some answers drift off-topic.")
        recommendations.append(
            "Tighten the instructions so the agent always directly answers the core question first."
        )

    if avg_completeness > 0.7:
        strengths.append("Most responses cover the key aspects expected for the scenarios.")
    else:
        weaknesses.append("Completeness is lacking; some responses miss time, cost, or solar details.")
        recommendations.append(
            "Emphasize in the system prompt that answers must include time windows, pricing, and solar considerations when applicable."
        )

    if avg_tool_approp > 0.7:
        strengths.append("The agent usually selects appropriate tools for each task.")
    else:
        weaknesses.append("Tool selection is sometimes suboptimal or incomplete.")
        recommendations.append(
            "Refine tool call examples in the system prompt so the agent knows exactly which tools to use for EV charging, thermostat, tips, and history-based questions."
        )

    if avg_tool_comp < 0.7:
        weaknesses.append("Not all expected tools are being used in multi-tool scenarios.")
        recommendations.append(
            "Encourage explicit combination of weather + pricing + history tools for optimization questions."
        )

    if num_failed > 0:
        weaknesses.append(f"{num_failed} test(s) resulted in internal errors.")
        recommendations.append(
            "Harden the agent against tool errors and validation issues; catch exceptions in tools and return structured error messages the agent can interpret."
        )

    return {
        "summary": {
            "num_tests": num_tests,
            "num_failed": num_failed,
            "avg_accuracy": avg_accuracy,
            "avg_relevance": avg_relevance,
            "avg_completeness": avg_completeness,
            "avg_usefulness": avg_usefulness,
            "avg_overall_response_score": avg_resp_overall,
            "avg_tool_appropriateness": avg_tool_approp,
            "avg_tool_completeness": avg_tool_comp,
            "avg_overall_tool_score": avg_tool_overall,
        },
        "per_test": per_test_details,
        "strengths": strengths,
        "weaknesses": weaknesses,
        "recommendations": recommendations,
    }


In [51]:
def display_evaluation_report(report: Dict[str, Any]) -> None:
    """Pretty print the evaluation report."""
    summary = report.get("summary", {})
    per_test = report.get("per_test", [])
    strengths = report.get("strengths", [])
    weaknesses = report.get("weaknesses", [])
    recommendations = report.get("recommendations", [])

    print("====================================================")
    print("      EcoHome Energy Advisor - Evaluation Report    ")
    print("====================================================\n")

    print("=== Summary Metrics ===")
    print(f"Total tests        : {summary.get('num_tests', 0)}")
    print(f"Failed tests       : {summary.get('num_failed', 0)}")
    print(f"Avg Accuracy       : {summary.get('avg_accuracy', 0.0):.2f}")
    print(f"Avg Relevance      : {summary.get('avg_relevance', 0.0):.2f}")
    print(f"Avg Completeness   : {summary.get('avg_completeness', 0.0):.2f}")
    print(f"Avg Usefulness     : {summary.get('avg_usefulness', 0.0):.2f}")
    print(f"Avg Resp. Overall  : {summary.get('avg_overall_response_score', 0.0):.2f}")
    print(f"Avg Tool Appropri. : {summary.get('avg_tool_appropriateness', 0.0):.2f}")
    print(f"Avg Tool Completeness: {summary.get('avg_tool_completeness', 0.0):.2f}")
    print(f"Avg Tool Overall   : {summary.get('avg_overall_tool_score', 0.0):.2f}")
    print()

    print("=== Per-Test Details  ===")
    for t in per_test:
        print(f"- Test ID  : {t['test_id']}")
        print(f"  Question : {t['question']}")
        print(f"  Response : {t['response_snippet']!r}")
        rm = t["response_metrics"]
        tm = t["tool_metrics"]
        print(
            f"  Response Scores -> acc={rm['accuracy']['score']:.2f}, "
            f"rel={rm['relevance']['score']:.2f}, "
            f"comp={rm['completeness']['score']:.2f}, "
            f"use={rm['usefulness']['score']:.2f}, "
            f"overall={rm['overall_score']:.2f}"
        )
        print(
            f"  Tool Scores     -> appr={tm['tool_appropriateness']['score']:.2f}, "
            f"comp={tm['tool_completeness']['score']:.2f}, "
            f"overall={tm['overall_score']:.2f}"
        )
        print()

    print("=== Strengths ===")
    if strengths:
        for s in strengths:
            print(f"- {s}")
    else:
        print("- (None identified yet)")
    print()

    print("=== Weaknesses ===")
    if weaknesses:
        for w in weaknesses:
            print(f"- {w}")
    else:
        print("- (None identified yet)")
    print()

    print("=== Recommendations ===")
    if recommendations:
        for r in recommendations:
            print(f"- {r}")
    else:
        print("- (No specific recommendations yet)")
    print()


In [52]:
report = generate_evaluation_report(test_results)
display_evaluation_report(report)

      EcoHome Energy Advisor - Evaluation Report    

=== Summary Metrics ===
Total tests        : 10
Failed tests       : 0
Avg Accuracy       : 0.01
Avg Relevance      : 0.03
Avg Completeness   : 0.01
Avg Usefulness     : 0.12
Avg Resp. Overall  : 0.04
Avg Tool Appropri. : 0.85
Avg Tool Completeness: 0.90
Avg Tool Overall   : 0.88

=== Per-Test Details  ===
- Test ID  : ev_charging_1
  Question : When should I charge my electric car tomorrow to minimize cost and maximize solar power?
  Response : "{'messages': [SystemMessage(content='Additional user context for personalization: Location: San Francisco, CA', additional_kwargs={}, response_metadata={}, id='41b07427-a1d9-4b1e-af56-3ef8d6627158'), HumanMessage(content='When should I charge my electric car tomorrow to minimize "
  Response Scores -> acc=0.01, rel=0.04, comp=0.02, use=0.12, overall=0.05
  Tool Scores     -> appr=1.00, comp=1.00, overall=1.00

- Test ID  : thermostat_2
  Question : What temperature should I set my thermosta

## Summary of the Report

. All 10 test cases executed successfully

. No runtime or tool-execution failures

. Agent is stable and responsive

### Performance Insights

***Strength — Tool Usage***

Tool Appropriateness: 0.85 avg
- The agent usually selects the right tools

Tool Completeness: 0.90 avg
- It often uses all tools needed for the task

Tool Overall Score: 0.88
- Strong reasoning about when to call tools

This indicates good understanding of:
Weather + pricing optimization
Usage-history questions
EV scheduling
RAG-based energy tips

### Weakness — Response Quality

Average response scoring:

| Metric        | Score |
|--------------|-------|
| Accuracy     | 0.01  |
| Relevance    | 0.03  |
| Completeness | 0.01  |
| Usefulness   | 0.12  |
| Overall      | 0.04  |

This means:
Answers don’t fully address the user question
Missing key details (hours, prices, savings, solar data)
Responses sometimes drift or remain generic
Not integrating retrieved tool outputs into explanations

### Recommendations for Improvement

Improve post-tool reasoning requiring the agent to summarize tool results in sentences
Example: “Solar peaks at 1–3 PM, rates lowest at midnight—so charge at ___”
Update system prompt to explicitly include Specific time windows, Pricing numbers, Solar irradiance references
Enforce structured response format
such as  {best_time, rationale, cost_estimate, solar_factor}
