# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent against realistic customer scenarios, then evaluate the quality of its responses and its tool usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [17]:
from datetime import datetime
from agent import Agent

In [None]:
## TODO: Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = """You are the EcoHome Energy Advisor, an AI assistant specialized in helping homeowners optimize their energy usage, reduce costs, and maximize the benefits of solar power.

Your primary goals are:
1. Help users minimize electricity costs by leveraging time-of-use pricing
2. Maximize the use of solar-generated electricity
3. Provide actionable recommendations for energy optimization
4. Calculate potential savings from energy-efficient practices

Available Tools:
- get_weather_forecast: Get weather forecasts including solar irradiance data
- get_electricity_prices: Get time-of-use electricity pricing by hour
- query_energy_usage: Query historical energy consumption data
- query_solar_generation: Query historical solar generation data
- get_recent_energy_summary: Get summary of recent energy usage and generation
- search_energy_tips: Search for energy-saving tips and best practices
- calculate_energy_savings: Calculate potential savings from optimizations. IMPORTANT: When users ask about savings from reducing usage (e.g., "if I reduce X by Y%, how much would I save?"), you MUST:
  1. First try get_recent_energy_summary to get current usage data (it provides device breakdown with consumption_kwh)
  2. If that doesn't have the specific device, use query_energy_usage with a recent date range (e.g., last 30 days)
  3. Extract monthly usage from the data
  4. Calculate the optimized usage based on the reduction percentage
  5. Then use calculate_energy_savings tool with the current usage, optimized usage, and device type
  6. Present the calculated savings to the user

Guidelines:
- Always use relevant tools to gather current data before making recommendations
- Consider weather forecasts when recommending solar-dependent activities
- Factor in electricity pricing (peak/off-peak) when suggesting optimal times
- Provide specific time recommendations when relevant
- Include cost calculations and savings estimates when possible
- Be clear, concise, and actionable in your responses
- If you need location information, ask the user or use the context provided
- Always explain your reasoning behind recommendations
- When users ask about potential savings from reducing energy usage, you MUST use the calculate_energy_savings tool to provide accurate calculations

When making recommendations:
- Prioritize using solar power when available
- Suggest off-peak hours for high-energy activities when solar isn't available
- Consider both cost savings and environmental impact
- Provide specific time windows for optimal actions

For savings calculations:
- If a user asks "how much would I save if I reduce [device] usage by X%", you MUST:
  1. First try get_recent_energy_summary to get current usage data (it provides device breakdown)
  2. If that doesn't have the specific device, then use query_energy_usage with a recent date range (e.g., last 30 days)
  3. Extract the monthly usage from the data (if daily, multiply by 30; if weekly, multiply by 4.33)
  4. Calculate the reduced usage (current * (1 - X/100))
  5. Use calculate_energy_savings tool with the current usage, optimized usage, and device type
  6. Report the savings including monthly and annual projections
- IMPORTANT: If query_energy_usage returns no data, use get_recent_energy_summary instead. Do not keep trying different date ranges.
"""

In [19]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)
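
The savings procedure in the system prompt above (normalize usage to a monthly figure, then apply the reduction percentage) reduces to a few lines of arithmetic. The block below is a minimal sketch of that arithmetic only. The weekly reading of 35 kWh and the flat rate of $0.20/kWh are hypothetical values chosen for illustration, and the helper names are not part of the notebook. The agent itself is expected to delegate the price-aware calculation to the `calculate_energy_savings` tool, whose signature is not shown here.

```python
# Illustrative arithmetic only; every input value below is hypothetical.
DAYS_PER_MONTH = 30      # multiplier the prompt uses for daily data
WEEKS_PER_MONTH = 4.33   # multiplier the prompt uses for weekly data


def to_monthly_kwh(usage_kwh: float, period: str) -> float:
    """Normalize a daily, weekly, or monthly reading to a monthly figure."""
    factors = {"daily": DAYS_PER_MONTH, "weekly": WEEKS_PER_MONTH, "monthly": 1}
    return usage_kwh * factors[period]


def reduced_usage_kwh(current_kwh: float, reduction_pct: float) -> float:
    """Apply the prompt's formula: current * (1 - X/100)."""
    return current_kwh * (1 - reduction_pct / 100)


weekly_ac_kwh = 35.0   # hypothetical weekly air-conditioning usage
rate_per_kwh = 0.20    # hypothetical flat electricity rate in $/kWh

current = to_monthly_kwh(weekly_ac_kwh, "weekly")
optimized = reduced_usage_kwh(current, 30)
monthly_savings = (current - optimized) * rate_per_kwh

print(f"Current: {current:.2f} kWh/month")
print(f"Optimized: {optimized:.2f} kWh/month")
print(f"Estimated savings: ${monthly_savings:.2f}/month")
```

A flat rate is a simplification: with time-of-use pricing, the actual savings depend on which hours the reduced usage falls in.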

In [20]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [21]:
print(response["messages"][-1].content)

To minimize costs and maximize solar power when charging your electric car tomorrow (October 7, 2023), here are the key insights:

### Solar Power Generation
- **Peak Solar Generation**: The solar irradiance data indicates that the best solar generation will occur between **10 AM and 3 PM**. The highest solar irradiance is expected at **1 PM** with a value of **614.1 W/m²**.

### Electricity Pricing
- **Off-Peak Rates**: The off-peak rates are significantly lower than peak rates. Here are the relevant rates:
  - **Off-Peak (Midnight to 6 AM)**: Rates range from **$0.1105 to $0.1295**.
  - **Peak (6 AM to 7 PM)**: Rates are higher, peaking at **$0.2494 at 5 PM**.

### Recommendations
1. **Charge During Peak Solar Hours**: 
   - **Best Time to Charge**: **10 AM to 3 PM**. This is when solar generation is at its highest, allowing you to utilize solar power for charging your electric car.
   - **Optimal Hour**: Start charging around **10 AM** to take advantage of the increasing solar gener

In [22]:
from langchain_core.messages import AIMessage

print("TOOLS:")
for msg in response["messages"]:
    # Tool calls are recorded on AIMessages; ToolMessages only carry the tool results,
    # so they are skipped here.
    if isinstance(msg, AIMessage) and getattr(msg, "tool_calls", None):
        for tool_call in msg.tool_calls:
            print("-", tool_call.get("name", "unknown"))

TOOLS:
- get_weather_forecast
- get_electricity_prices


## 2. Define Test Cases

In [23]:
# TODO: Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations

test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should contain time recommendation, cost analysis and solar consideration",
    },
    {
        "id": "ev_charging_2",
        "question": "I need to charge my EV from 20% to 80% battery. When is the cheapest time to do this today?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "The response should recommend specific hours with cost analysis",
    },
    {
        "id": "thermostat_1",
        "question": "What's the optimal thermostat setting for today to balance comfort and energy costs?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices", "get_recent_energy_summary"],
        "expected_response": "The response should provide temperature recommendations with cost considerations",
    },
    {
        "id": "appliance_scheduling_1",
        "question": "When should I run my dishwasher and washing machine to save the most money?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "The response should recommend optimal times considering pricing and solar availability",
    },
    {
        "id": "solar_maximization_1",
        "question": "How can I maximize my use of solar power tomorrow?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation", "get_electricity_prices"],
        "expected_response": "The response should provide strategies to align energy usage with solar generation",
    },
    {
        "id": "cost_analysis_1",
        "question": "How much did I spend on electricity last week and what was my biggest energy expense?",
        "expected_tools": ["query_energy_usage", "get_recent_energy_summary"],
        "expected_response": "The response should provide cost breakdown and identify highest consumption areas",
    },
    {
        "id": "energy_tips_1",
        "question": "What are the best ways to reduce my HVAC energy consumption?",
        "expected_tools": ["search_energy_tips", "query_energy_usage"],
        "expected_response": "The response should provide actionable tips for HVAC energy savings",
    },
    {
        "id": "savings_calculation_1",
        "question": "If I reduce my air conditioning usage by 30%, how much money would I save per month? Please calculate the exact savings using my current AC usage data.",
        "expected_tools": ["query_energy_usage", "calculate_energy_savings"],
        "expected_response": "The response should calculate and present potential monthly savings using the calculate_energy_savings tool",
    },
    {
        "id": "solar_generation_1",
        "question": "How much solar power did I generate last month and how does it compare to my usage?",
        "expected_tools": ["query_solar_generation", "query_energy_usage"],
        "expected_response": "The response should compare generation vs consumption with analysis",
    },
    {
        "id": "comprehensive_optimization_1",
        "question": "Give me a comprehensive plan to optimize my energy usage for the next 3 days considering weather and pricing.",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices", "get_recent_energy_summary", "search_energy_tips"],
        "expected_response": "The response should provide a detailed multi-day optimization plan",
    },
]

In [24]:
if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
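
The length check above only counts the test cases. A slightly stricter check can catch malformed entries before any agent calls are made. The block below is a sketch under the assumption that every case should carry the four keys used in `test_cases` and a unique `id`; the `validate_test_cases` helper and the sample data are hypothetical and not part of the notebook.

```python
REQUIRED_KEYS = {"id", "question", "expected_tools", "expected_response"}


def validate_test_cases(cases):
    """Return a list of problems; an empty list means the cases look well-formed."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        # Every case needs the same four fields the test runner reads later.
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
        # Duplicate ids would make the evaluation report ambiguous.
        case_id = case.get("id")
        if case_id in seen_ids:
            problems.append(f"case {index}: duplicate id {case_id!r}")
        seen_ids.add(case_id)
        # The tool evaluator iterates over expected_tools, so it must be a list.
        if not isinstance(case.get("expected_tools", []), list):
            problems.append(f"case {index}: expected_tools must be a list")
    return problems


# Hypothetical sample: the second case is deliberately malformed.
sample_cases = [
    {"id": "a", "question": "q1", "expected_tools": [], "expected_response": "r"},
    {"id": "a", "question": "q2", "expected_tools": "get_weather_forecast"},
]
for problem in validate_test_cases(sample_cases):
    print(problem)
```

Running `validate_test_cases(test_cases)` on the list defined above should return an empty list if every entry follows the same shape.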

## 3. Run Agent Tests

In [25]:
CONTEXT = "Location: San Francisco, CA"

In [26]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases):
    print(f"\nTest {i+1}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: ev_charging_2
Question: I need to charge my EV from 20% to 80% battery. When is the cheapest time to do this today?
--------------------------------------------------

Test 3: thermostat_1
Question: What's the optimal thermostat setting for today to balance comfort and energy costs?
--------------------------------------------------

Test 4: appliance_scheduling_1
Question: When should I run my dishwasher and washing machine to save the most money?
--------------------------------------------------

Test 5: solar_maximization_1
Question: How can I maximize my use of solar power tomorrow?
--------------------------------------------------

Test 6: cost_analysis_1
Question: How much did I spend on electricity last week and what was my biggest energy expense?
-----------------------

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Test 8: savings_calculation_1
Question: If I reduce my air conditioning usage by 30%, how much money would I save per month? Please calculate the exact savings using my current AC usage data.
--------------------------------------------------

Test 9: solar_generation_1
Question: How much solar power did I generate last month and how does it compare to my usage?
--------------------------------------------------

Test 10: comprehensive_optimization_1
Question: Give me a comprehensive plan to optimize my energy usage for the next 3 days considering weather and pricing.
--------------------------------------------------

Completed 10 tests


In [27]:
test_results

[{'test_id': 'ev_charging_1',
  'question': 'When should I charge my electric car tomorrow to minimize cost and maximize solar power?',
  'response': {'messages': [SystemMessage(content='You are the EcoHome Energy Advisor, an AI assistant specialized in helping homeowners optimize their energy usage, reduce costs, and maximize the benefits of solar power.\n\nYour primary goals are:\n1. Help users minimize electricity costs by leveraging time-of-use pricing\n2. Maximize the use of solar-generated electricity\n3. Provide actionable recommendations for energy optimization\n4. Calculate potential savings from energy-efficient practices\n\nAvailable Tools:\n- get_weather_forecast: Get weather forecasts including solar irradiance data\n- get_electricity_prices: Get time-of-use electricity pricing by hour\n- query_energy_usage: Query historical energy consumption data\n- query_solar_generation: Query historical solar generation data\n- get_recent_energy_summary: Get summary of recent energy u

## 4. Evaluate Responses

In [28]:
# TODO: Implement evaluation functions
# Create functions to evaluate:
# - Final Response
# - Tool usage

def extract_tools_used(messages):
    """Extract list of tools used from agent messages"""
    # Only AIMessage is needed: tool calls are recorded on the AI messages
    from langchain_core.messages import AIMessage
    tools_used = []
    for msg in messages:
        # Check if it's an AIMessage with tool calls
        if isinstance(msg, AIMessage) and hasattr(msg, 'tool_calls') and msg.tool_calls:
            for tool_call in msg.tool_calls:
                tool_name = tool_call.get("name", "")
                if tool_name and tool_name not in tools_used:
                    tools_used.append(tool_name)
    return tools_used

In [29]:
# TODO: Create a response evaluator
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response"""
    if isinstance(final_response, str) and final_response.startswith("Error:"):
        return {
            "score": 0,
            "accuracy": 0,
            "relevance": 0,
            "completeness": 0,
            "reasoning": 0,
            "feedback": "Error occurred during execution"
        }
    
    # Extract the final message content
    if isinstance(final_response, dict) and "messages" in final_response:
        messages = final_response["messages"]
        if messages:
            content = messages[-1].content if hasattr(messages[-1], 'content') else str(messages[-1])
        else:
            content = ""
    else:
        content = str(final_response)
    
    content_lower = content.lower()
    expected_lower = expected_response.lower()
    
    # Check for key terms in expected response
    expected_keywords = []
    if "time" in expected_lower or "hour" in expected_lower:
        expected_keywords.append("time")
    if "cost" in expected_lower or "price" in expected_lower or "saving" in expected_lower:
        expected_keywords.append("cost")
    if "solar" in expected_lower:
        expected_keywords.append("solar")
    if "recommend" in expected_lower or "suggest" in expected_lower:
        expected_keywords.append("recommendation")
    if "analysis" in expected_lower or "breakdown" in expected_lower:
        expected_keywords.append("analysis")
    
    # Score based on keyword presence
    keyword_score = 0
    for keyword in expected_keywords:
        if keyword in content_lower:
            keyword_score += 1
    
    keyword_accuracy = keyword_score / len(expected_keywords) if expected_keywords else 0.5
    
    # Relevance: Check if response addresses the question
    question_keywords = set(question.lower().split())
    response_keywords = set(content_lower.split())
    relevance = len(question_keywords.intersection(response_keywords)) / len(question_keywords) if question_keywords else 0.5
    relevance = min(1.0, relevance * 2)  # Scale up
    
    # Completeness: Check response length and structure
    word_count = len(content.split())
    completeness = min(1.0, word_count / 50)  # Expect at least 50 words for completeness
    
    # Reasoning: Check for explanation words
    reasoning_words = ["because", "since", "due to", "considering", "based on", "therefore", "as a result"]
    has_reasoning = any(word in content_lower for word in reasoning_words)
    reasoning = 0.8 if has_reasoning else 0.4
    
    # Overall score (weighted average)
    accuracy = keyword_accuracy
    overall_score = (accuracy * 0.3 + relevance * 0.3 + completeness * 0.2 + reasoning * 0.2)
    
    feedback = []
    if accuracy < 0.7:
        feedback.append("Missing some expected information")
    if relevance < 0.6:
        feedback.append("Response may not fully address the question")
    if completeness < 0.6:
        feedback.append("Response could be more comprehensive")
    if reasoning < 0.6:
        feedback.append("Response lacks clear reasoning")
    
    return {
        "score": round(overall_score, 2),
        "accuracy": round(accuracy, 2),
        "relevance": round(relevance, 2),
        "completeness": round(completeness, 2),
        "reasoning": round(reasoning, 2),
        "feedback": "; ".join(feedback) if feedback else "Good response"
    }
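
To make the weighting in `evaluate_response` concrete, the sketch below recomputes the combined score from four sub-scores. The inputs are the average sub-scores printed in the evaluation report at the end of this notebook (0.66, 0.94, 1.00, 0.72). Because the weights sum to 1 and the formula is linear, the weighted sum of the averages should agree with the reported average response score up to rounding. The `WEIGHTS` dictionary and `weighted_score` helper are illustrative names, not part of the notebook.

```python
# Same weights as in evaluate_response: 30% accuracy, 30% relevance,
# 20% completeness, 20% reasoning.
WEIGHTS = {"accuracy": 0.3, "relevance": 0.3, "completeness": 0.2, "reasoning": 0.2}


def weighted_score(sub_scores):
    """Combine the four sub-scores into a single score in [0, 1]."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Average sub-scores taken from the evaluation report output.
reported_averages = {
    "accuracy": 0.66,
    "relevance": 0.94,
    "completeness": 1.00,
    "reasoning": 0.72,
}

combined = weighted_score(reported_averages)
print(round(combined, 2))  # 0.82
```

Note that "accuracy" here only measures whether expected keywords appear in the response; it does not verify that the numbers in the response are correct.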

In [30]:
# TODO: Create a tool usage evaluator
def evaluate_tool_usage(messages, expected_tools):
    """Evaluate if the right tools were used"""
    if isinstance(messages, dict) and "messages" in messages:
        messages = messages["messages"]
    
    tools_used = extract_tools_used(messages)
    
    if not expected_tools:
        return {
            "score": 1.0,
            "tools_used": tools_used,
            "expected_tools": expected_tools,
            "missing_tools": [],
            "unexpected_tools": [],
            "feedback": "No specific tools expected"
        }
    
    # Check which expected tools were used
    missing_tools = [tool for tool in expected_tools if tool not in tools_used]
    unexpected_tools = [tool for tool in tools_used if tool not in expected_tools]
    
    # Calculate score: fraction of expected tools that were actually called.
    # expected_tools is guaranteed non-empty here (the empty case returned early above).
    used_expected = len(expected_tools) - len(missing_tools)
    score = used_expected / len(expected_tools)
    
    # Generate feedback
    feedback_parts = []
    if missing_tools:
        feedback_parts.append(f"Missing tools: {', '.join(missing_tools)}")
    if unexpected_tools:
        feedback_parts.append(f"Unexpected tools used: {', '.join(unexpected_tools)}")
    if not missing_tools and not unexpected_tools:
        feedback_parts.append("All expected tools were used correctly")
    
    return {
        "score": round(score, 2),
        "tools_used": tools_used,
        "expected_tools": expected_tools,
        "missing_tools": missing_tools,
        "unexpected_tools": unexpected_tools,
        "feedback": "; ".join(feedback_parts)
    }
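
The score in `evaluate_tool_usage` is effectively recall: it counts how many expected tools were called, and unexpected tools appear in the feedback without lowering the score. If over-calling should also be penalized, one option is to report precision and F1 alongside recall. The block below is a sketch of that alternative, not the notebook's method; the `tool_precision_recall` helper and the sample call trace are hypothetical.

```python
def tool_precision_recall(tools_used, expected_tools):
    """Compare the set of tools called with the set of tools expected."""
    used, expected = set(tools_used), set(expected_tools)
    hits = used & expected

    # Recall: share of expected tools that were called (the notebook's score).
    recall = len(hits) / len(expected) if expected else 1.0
    # Precision: share of called tools that were expected.
    if used:
        precision = len(hits) / len(used)
    else:
        precision = 1.0 if not expected else 0.0
    # F1: harmonic mean of the two, 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical trace: the agent used the summary tool instead of query_energy_usage.
metrics = tool_precision_recall(
    tools_used=["get_recent_energy_summary", "calculate_energy_savings"],
    expected_tools=["query_energy_usage", "calculate_energy_savings"],
)
print(metrics)
```

A trace like this one is plausible for savings questions, since the system prompt tells the agent to try `get_recent_energy_summary` first; in such cases it may be the expected tool list, rather than the agent, that needs adjusting.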

In [31]:
# TODO: Generate a comprehensive evaluation report
# Calculate overall scores and metrics
# Identify strengths and weaknesses
# Provide recommendations for improvement
def generate_evaluation_report(test_results):
    """Generate a comprehensive evaluation report from test results"""
    
    if not test_results:
        return "No test results to evaluate"
    
    # Evaluate each test result
    evaluations = []
    for result in test_results:
        if isinstance(result.get('response'), str) and result['response'].startswith("Error:"):
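            eval_result = {
                'test_id': result['test_id'],
                # Keep the question so the detailed report does not show "N/A" for failed runs
                'question': result['question'],
                'response_eval': {"score": 0, "feedback": "Error occurred"},
                'tool_eval': {"score": 0, "feedback": "Error occurred"},
                'overall_score': 0
            }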
        else:
            # Evaluate response
            response_eval = evaluate_response(
                result['question'],
                result['response'],
                result.get('expected_response', '')
            )
            
            # Evaluate tool usage
            tool_eval = evaluate_tool_usage(
                result['response'],
                result.get('expected_tools', [])
            )
            
            # Overall score (weighted: 60% response, 40% tools)
            overall_score = (response_eval['score'] * 0.6 + tool_eval['score'] * 0.4)
            
            eval_result = {
                'test_id': result['test_id'],
                'question': result['question'],
                'response_eval': response_eval,
                'tool_eval': tool_eval,
                'overall_score': round(overall_score, 2)
            }
        
        evaluations.append(eval_result)
    
    # Calculate aggregate metrics
    total_tests = len(evaluations)
    successful_tests = sum(1 for e in evaluations if e['overall_score'] > 0)
    
    avg_overall = sum(e['overall_score'] for e in evaluations) / total_tests if total_tests > 0 else 0
    avg_response = sum(e['response_eval']['score'] for e in evaluations) / total_tests if total_tests > 0 else 0
    avg_tool = sum(e['tool_eval']['score'] for e in evaluations) / total_tests if total_tests > 0 else 0
    
    # Calculate average sub-scores
    avg_accuracy = sum(e['response_eval'].get('accuracy', 0) for e in evaluations) / total_tests if total_tests > 0 else 0
    avg_relevance = sum(e['response_eval'].get('relevance', 0) for e in evaluations) / total_tests if total_tests > 0 else 0
    avg_completeness = sum(e['response_eval'].get('completeness', 0) for e in evaluations) / total_tests if total_tests > 0 else 0
    avg_reasoning = sum(e['response_eval'].get('reasoning', 0) for e in evaluations) / total_tests if total_tests > 0 else 0
    
    # Identify strengths and weaknesses
    high_scores = [e for e in evaluations if e['overall_score'] >= 0.8]
    low_scores = [e for e in evaluations if e['overall_score'] < 0.6]
    
    # Tool usage analysis
    all_tools_used = []
    for e in evaluations:
        all_tools_used.extend(e['tool_eval'].get('tools_used', []))
    tool_frequency = {}
    for tool in all_tools_used:
        tool_frequency[tool] = tool_frequency.get(tool, 0) + 1
    
    # Generate report
    report = []
    report.append("=" * 80)
    report.append("ECOHOME ENERGY ADVISOR - EVALUATION REPORT")
    report.append("=" * 80)
    report.append("")
    report.append(f"Total Tests: {total_tests}")
    report.append(f"Successful Tests: {successful_tests} ({successful_tests/total_tests*100:.1f}%)")
    report.append("")
    report.append("OVERALL METRICS")
    report.append("-" * 80)
    report.append(f"Average Overall Score: {avg_overall:.2f}/1.00")
    report.append(f"Average Response Score: {avg_response:.2f}/1.00")
    report.append(f"Average Tool Usage Score: {avg_tool:.2f}/1.00")
    report.append("")
    report.append("RESPONSE QUALITY BREAKDOWN")
    report.append("-" * 80)
    report.append(f"Average Accuracy: {avg_accuracy:.2f}/1.00")
    report.append(f"Average Relevance: {avg_relevance:.2f}/1.00")
    report.append(f"Average Completeness: {avg_completeness:.2f}/1.00")
    report.append(f"Average Reasoning: {avg_reasoning:.2f}/1.00")
    report.append("")
    
    if high_scores:
        report.append("STRENGTHS (Tests with score >= 0.8)")
        report.append("-" * 80)
        for e in high_scores[:5]:  # First 5 in test order (not sorted by score)
            report.append(f"  • {e['test_id']}: {e['overall_score']:.2f}")
        report.append("")
    
    if low_scores:
        report.append("AREAS FOR IMPROVEMENT (Tests with score < 0.6)")
        report.append("-" * 80)
        for e in low_scores[:5]:  # First 5 in test order (not sorted by score)
            report.append(f"  • {e['test_id']}: {e['overall_score']:.2f}")
            report.append(f"    Response: {e['response_eval'].get('feedback', 'N/A')}")
            report.append(f"    Tools: {e['tool_eval'].get('feedback', 'N/A')}")
        report.append("")
    
    if tool_frequency:
        report.append("TOOL USAGE FREQUENCY")
        report.append("-" * 80)
        sorted_tools = sorted(tool_frequency.items(), key=lambda x: x[1], reverse=True)
        for tool, count in sorted_tools:
            report.append(f"  • {tool}: {count} times")
        report.append("")
    
    report.append("DETAILED TEST RESULTS")
    report.append("-" * 80)
    for e in evaluations:
        report.append(f"\nTest: {e['test_id']}")
        report.append(f"  Question: {e.get('question', 'N/A')[:60]}...")
        report.append(f"  Overall Score: {e['overall_score']:.2f}/1.00")
        report.append(f"  Response Score: {e['response_eval']['score']:.2f}/1.00")
        report.append(f"    - {e['response_eval'].get('feedback', 'N/A')}")
        report.append(f"  Tool Usage Score: {e['tool_eval']['score']:.2f}/1.00")
        report.append(f"    - Used: {', '.join(e['tool_eval'].get('tools_used', []))}")
        report.append(f"    - Expected: {', '.join(e['tool_eval'].get('expected_tools', []))}")
        report.append(f"    - {e['tool_eval'].get('feedback', 'N/A')}")
    
    report.append("")
    report.append("=" * 80)
    report.append("RECOMMENDATIONS")
    report.append("=" * 80)
    
    recommendations = []
    if avg_accuracy < 0.7:
        recommendations.append("Improve accuracy by ensuring all expected information is included in responses")
    if avg_relevance < 0.7:
        recommendations.append("Enhance relevance by better understanding user questions and context")
    if avg_completeness < 0.7:
        recommendations.append("Increase completeness by providing more detailed and comprehensive answers")
    if avg_reasoning < 0.7:
        recommendations.append("Strengthen reasoning by explaining the 'why' behind recommendations")
    if avg_tool < 0.7:
        recommendations.append("Improve tool usage by ensuring all relevant tools are called when needed")
    if not recommendations:
        recommendations.append("Agent is performing well! Continue monitoring and fine-tuning.")
    
    for i, rec in enumerate(recommendations, 1):
        report.append(f"{i}. {rec}")
    
    report.append("")
    report.append("=" * 80)
    
    return "\n".join(report)
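
The tool-frequency tally inside `generate_evaluation_report` can also be written with `collections.Counter` from the standard library. The block below is a sketch of that variant using a hypothetical list of per-test tool lists; it is meant to show the idiom, not to replace the function above.

```python
from collections import Counter

# Hypothetical per-test tool lists, shaped like e['tool_eval']['tools_used'].
tools_per_test = [
    ["get_weather_forecast", "get_electricity_prices"],
    ["get_electricity_prices"],
    ["get_recent_energy_summary", "get_electricity_prices"],
]

# Flatten the lists and count occurrences in one pass.
tool_frequency = Counter(tool for tools in tools_per_test for tool in tools)

# most_common() returns (tool, count) pairs sorted by count, highest first.
for tool, count in tool_frequency.most_common():
    print(f"  • {tool}: {count} times")
```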

In [32]:
# Run evaluation on test results
if 'test_results' in locals() and test_results:
    print("Generating evaluation report...")
    report = generate_evaluation_report(test_results)
    print(report)
else:
    print("No test results available. Please run the agent tests first.")


Generating evaluation report...
ECOHOME ENERGY ADVISOR - EVALUATION REPORT

Total Tests: 10
Successful Tests: 10 (100.0%)

OVERALL METRICS
--------------------------------------------------------------------------------
Average Overall Score: 0.79/1.00
Average Response Score: 0.82/1.00
Average Tool Usage Score: 0.73/1.00

RESPONSE QUALITY BREAKDOWN
--------------------------------------------------------------------------------
Average Accuracy: 0.66/1.00
Average Relevance: 0.94/1.00
Average Completeness: 1.00/1.00
Average Reasoning: 0.72/1.00

STRENGTHS (Tests with score >= 0.8)
--------------------------------------------------------------------------------
  • ev_charging_1: 0.89
  • thermostat_1: 0.84
  • appliance_scheduling_1: 0.98
  • solar_maximization_1: 0.84
  • cost_analysis_1: 0.81

AREAS FOR IMPROVEMENT (Tests with score < 0.6)
--------------------------------------------------------------------------------
  • savings_calculation_1: 0.56
    Response: Missing some expecte