# 🤖 DeepEval - Evaluating AI Agents and Chain-of-Thought with LangGraph

Welcome to **Notebook 4** of the DeepEval framework series!

## 🎯 Goals of This Notebook

1. **Build LangGraph Agents**: Multi-tool agents with complex reasoning
2. **Chain-of-Thought Evaluation**: Evaluate the reasoning process step by step
3. **Agent-Specific Metrics**: LogicalFlow, ToolUsage, PlanExecution, Adaptability
4. **Multi-Step Analysis**: Capture and evaluate intermediate reasoning steps
5. **Comprehensive Agent Benchmarking**: End-to-end agent performance assessment

## 📖 Why Is Agent Evaluation Complex?

AI agents that use Chain-of-Thought reasoning raise evaluation challenges that single-turn LLM outputs do not:

### 🔍 Unique Challenges:
- **Multi-Step Reasoning**: Both the process and the final result must be evaluated
- **Tool Usage**: Agents call multiple tools, so the appropriateness of each call must be assessed
- **Dynamic Planning**: Plans change based on intermediate results
- **Context Awareness**: Agents maintain state across multiple interactions
- **Error Recovery**: How well agents handle failures and adapt

### ✅ DeepEval Approach:
- **Intermediate Step Capture**: Reasoning steps can be passed to `LLMTestCase` through fields such as `context`
- **Custom G-Eval Metrics**: Specialized for agent evaluation
- **Multi-dimensional Assessment**: Logic, tool usage, planning, execution
- **Temporal Analysis**: Track performance over interaction sequences
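
The intermediate-step capture idea can be sketched without any framework. The `AgentTrace` class and its `to_eval_fields` method below are hypothetical helpers, not part of DeepEval. They only show one way to collect a run's steps and map them onto the field names that `LLMTestCase` accepts (`input`, `actual_output`, `context`, `retrieval_context`).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentTrace:
    """Hypothetical container for one agent run (not a DeepEval class)."""
    query: str
    final_answer: str = ""
    reasoning_steps: List[str] = field(default_factory=list)
    tool_outputs: List[str] = field(default_factory=list)

    def to_eval_fields(self) -> Dict[str, object]:
        """Map the trace onto the field names LLMTestCase accepts."""
        return {
            "input": self.query,
            "actual_output": self.final_answer,
            "context": list(self.reasoning_steps),
            "retrieval_context": list(self.tool_outputs),
        }


# Record steps as the run progresses, then flatten once for evaluation
trace = AgentTrace(query="What is the factorial of 5?")
trace.reasoning_steps.append("PLANNING: use the factorial tool")
trace.tool_outputs.append("factorial(5) = Factorial of 5 is 120")
trace.final_answer = "5! = 120"
fields = trace.to_eval_fields()
print(sorted(fields))
```

The resulting dictionary could be unpacked into a test case constructor, assuming the installed DeepEval version accepts these four keyword arguments.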

## 🛠️ Part 1: Setup and Imports

In [None]:
# Core imports
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional, Tuple, TypedDict
import warnings
import asyncio
import time
from datetime import datetime
warnings.filterwarnings('ignore')

# DeepEval imports
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import GEval

# LangGraph imports
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolExecutor, ToolInvocation
from langgraph.checkpoint.sqlite import SqliteSaver

# LangChain imports
from langchain.tools import Tool
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

# Custom tools
import requests
import math
import random

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
plt.style.use('default')

print(f"✅ DeepEval version: {deepeval.__version__}")
print("✅ All imports successful!")

In [None]:
# Setup environment
from dotenv import load_dotenv
load_dotenv()

# Check API keys
api_keys_status = {
    "OpenAI": "✅ Configured" if os.getenv("OPENAI_API_KEY") else "❌ Missing",
    "Anthropic": "✅ Configured" if os.getenv("ANTHROPIC_API_KEY") else "❌ Missing"
}

print("🔑 API Keys Status:")
for provider, status in api_keys_status.items():
    print(f"  {provider}: {status}")

if not os.getenv("OPENAI_API_KEY"):
    print("\n⚠️  OPENAI_API_KEY is required to run the agent evaluation!")
    print("   Create a .env file containing: OPENAI_API_KEY=your_key_here")

## 🔧 Part 2: Building Custom Tools for Agents

### 2.1 Mathematical Tools

In [None]:
def create_math_tools():
    """
    Create mathematical tools for the agents.
    """
    
    def calculate(expression: str) -> str:
        """Calculate basic arithmetic expressions with restricted input."""
        try:
            # Whitelist digits, operators, parentheses, and spaces only
            allowed_chars = set('0123456789+-*/().^ ')
            if not set(expression) <= allowed_chars:
                return "Error: Invalid characters in expression"
            
            # Replace ^ with ** for Python's power operator
            expression = expression.replace('^', '**')
            
            # Evaluate with no builtins available; the whitelist already blocks names and calls.
            # Note: very large powers such as 9^9^9 are still not bounded here.
            result = eval(expression, {"__builtins__": {}}, {})
            return f"Result: {result}"
        except Exception as e:
            return f"Error: {str(e)}"
    
    def factorial(n: str) -> str:
        """Calculate factorial of a number"""
        try:
            num = int(n)
            if num < 0:
                return "Error: Factorial is not defined for negative numbers"
            if num > 20:
                return "Error: Number too large (max 20)"
            
            result = math.factorial(num)
            return f"Factorial of {num} is {result}"
        except Exception as e:
            return f"Error: {str(e)}"
    
    def prime_check(n: str) -> str:
        """Check if a number is prime"""
        try:
            num = int(n)
            if num < 2:
                return f"{num} is not prime"
            
            for i in range(2, math.isqrt(num) + 1):  # integer square root avoids float rounding
                if num % i == 0:
                    return f"{num} is not prime (divisible by {i})"
            
            return f"{num} is prime"
        except Exception as e:
            return f"Error: {str(e)}"
    
    # Create LangChain tools
    tools = [
        Tool(
            name="calculator",
            description="Calculate mathematical expressions. Input: mathematical expression as string (e.g., '2+3*4', '10^2')",
            func=calculate
        ),
        Tool(
            name="factorial",
            description="Calculate factorial of a number. Input: positive integer as string (e.g., '5')",
            func=factorial
        ),
        Tool(
            name="prime_checker",
            description="Check if a number is prime. Input: positive integer as string (e.g., '17')",
            func=prime_check
        )
    ]
    
    return tools

# Create math tools
math_tools = create_math_tools()
print(f"✅ Created {len(math_tools)} math tools:")
for tool in math_tools:
    print(f"  - {tool.name}: {tool.description}")
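
The calculator tool relies on a character whitelist in front of `eval`. A stricter alternative, sketched here as an assumption rather than as the notebook's method, is to parse the expression with the standard `ast` module and evaluate only a fixed set of arithmetic nodes. The name `safe_calculate` and the `max_exponent` cap are illustrative choices.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str, max_exponent: int = 100) -> float:
    """Evaluate +, -, *, /, ^ and parentheses by walking the AST."""
    tree = ast.parse(expression.replace("^", "**"), mode="eval")

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain int/float literals only (bool and str are rejected)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            left, right = _eval(node.left), _eval(node.right)
            # Cap exponents so inputs like 9^9^9 cannot stall the tool
            if isinstance(node.op, ast.Pow) and abs(right) > max_exponent:
                raise ValueError("exponent too large")
            return _BINARY_OPS[type(node.op)](left, right)
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(tree)


print(safe_calculate("2+3*4"))
print(safe_calculate("10^2"))
```

Anything that is not a number or one of the listed operators, such as a function call or attribute access, raises `ValueError` instead of being executed.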

### 2.2 Information Retrieval Tools

In [None]:
def create_info_tools():
    """
    Create information retrieval tools.
    """
    
    # Mock knowledge base
    knowledge_base = {
        "python": "Python is a high-level, interpreted programming language with dynamic typing and automatic memory management.",
        "ai": "Artificial Intelligence (AI) is the simulation of human intelligence in machines to perform tasks that require human cognition.",
        "machine learning": "Machine Learning is a subset of AI that lets systems learn and improve from experience without explicit programming.",
        "deep learning": "Deep Learning uses neural networks with multiple layers to learn complex patterns in large amounts of data.",
        "langchain": "LangChain is a framework for developing applications powered by language models, with components for chains, agents, and memory.",
        "langgraph": "LangGraph is a library for building stateful, multi-actor applications with LLMs, using a graph-based approach."
    }
    }
    
    def search_knowledge(query: str) -> str:
        """Search knowledge base for information"""
        query_lower = query.lower().strip()
        
        # Exact match
        if query_lower in knowledge_base:
            return f"Found: {knowledge_base[query_lower]}"
        
        # Partial match
        matches = []
        for key, value in knowledge_base.items():
            if query_lower in key or any(word in key for word in query_lower.split()):
                matches.append(f"{key}: {value}")
        
        if matches:
            return f"Partial matches found:\n" + "\n".join(matches)
        
        return f"No information found for '{query}'. Available topics: {', '.join(knowledge_base.keys())}"
    
    def get_current_time() -> str:
        """Get current time and date"""
        now = datetime.now()
        return f"Current time: {now.strftime('%Y-%m-%d %H:%M:%S')}"
    
    def generate_random_number(range_str: str) -> str:
        """Generate random number in specified range"""
        try:
            if '-' in range_str:
                min_val, max_val = map(int, range_str.split('-'))
            else:
                min_val, max_val = 1, int(range_str)
            
            result = random.randint(min_val, max_val)
            return f"Random number between {min_val} and {max_val}: {result}"
        except Exception as e:
            return f"Error: {str(e)}. Format: 'min-max' or just 'max'"
    
    # Create tools
    tools = [
        Tool(
            name="knowledge_search",
            description="Search knowledge base for information about topics. Input: topic name or keyword (e.g., 'python', 'ai', 'machine learning')",
            func=search_knowledge
        ),
        Tool(
            name="get_time",
            description="Get current date and time. No input required.",
            func=lambda x: get_current_time()
        ),
        Tool(
            name="random_number",
            description="Generate random number. Input: range as 'min-max' (e.g., '1-100') or just max value (e.g., '50')",
            func=generate_random_number
        )
    ]
    
    return tools

# Create info tools
info_tools = create_info_tools()
print(f"✅ Created {len(info_tools)} information tools:")
for tool in info_tools:
    print(f"  - {tool.name}: {tool.description}")

## 🤖 Part 3: Building the LangGraph Agent

### 3.1 Agent State Definition

In [None]:
# Define agent state
class AgentState(TypedDict):
    messages: List[dict]
    current_task: str
    reasoning_steps: List[str]
    tool_calls: List[dict]
    intermediate_results: List[str]
    final_answer: str
    execution_status: str  # 'running', 'completed', 'error'
    error_message: str

def create_initial_state(user_input: str) -> AgentState:
    """
    Create initial agent state
    """
    return AgentState(
        messages=[{"role": "user", "content": user_input}],
        current_task=user_input,
        reasoning_steps=[],
        tool_calls=[],
        intermediate_results=[],
        final_answer="",
        execution_status="running",
        error_message=""
    )

print("✅ Agent state structure defined")
print("State components:", list(AgentState.__annotations__.keys()))
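
The node functions in this notebook append to the state's lists in place. LangGraph also documents a reducer style, where a state field is annotated with a merge function (for example `Annotated[list, operator.add]`) and nodes return only partial updates. The sketch below imitates that merging in plain Python so the behavior is easy to inspect. `ReducedState` and `apply_update` are hypothetical names, and this is not LangGraph's own implementation.

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict, get_type_hints


class ReducedState(TypedDict):
    # The Annotated metadata names the function that merges old and new values
    reasoning_steps: Annotated[List[str], operator.add]
    execution_status: str  # no reducer: the latest value overwrites


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a partial update into the state without mutating either dict."""
    hints = get_type_hints(ReducedState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        merged[key] = reducers[0](state[key], value) if reducers else value
    return merged


state = {"reasoning_steps": ["PLANNING: ..."], "execution_status": "running"}
state = apply_update(state, {"reasoning_steps": ["EXECUTION: ..."]})
state = apply_update(state, {"execution_status": "completed"})
print(state)
```

Returning partial updates keeps each node free of side effects, which makes a node easier to test in isolation.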

### 3.2 Agent Node Functions

In [None]:
def create_agent_nodes():
    """
    Create LangGraph agent node functions
    """
    
    # Combine all tools
    all_tools = math_tools + info_tools
    
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ Cannot create agent nodes without OpenAI API key")
        return None, None, None, None
    
    # Create LLM
    llm = ChatOpenAI(
        model="gpt-3.5-turbo",
        temperature=0.1,
        max_tokens=1000
    )
    
    def planning_node(state: AgentState) -> AgentState:
        """
        Planning node - analyze the task and create an execution plan
        """
        try:
            task = state["current_task"]
            
            planning_prompt = f"""
            Analyze the following task and create a detailed execution plan:
            Task: {task}
            
            Available tools:
            {chr(10).join([f"- {tool.name}: {tool.description}" for tool in all_tools])}
            
            Please:
            1. Analyze the task requirements
            2. Identify which tools are needed
            3. Create a step-by-step plan
            4. Explain the reasoning for each step
            
            Format: a detailed plan with clear reasoning steps.
            """
            
            response = llm.invoke([HumanMessage(content=planning_prompt)])
            
            reasoning_step = f"PLANNING: {response.content}"
            
            state["reasoning_steps"].append(reasoning_step)
            state["messages"].append({"role": "assistant", "content": f"Plan created: {response.content[:200]}..."})
            
            return state
            
        except Exception as e:
            state["execution_status"] = "error"
            state["error_message"] = f"Planning error: {str(e)}"
            return state
    
    def execution_node(state: AgentState) -> AgentState:
        """
        Execution node - execute tools based on plan
        """
        try:
            task = state["current_task"]
            
            # Simple execution logic - analyze the task and decide which tools to use
            execution_prompt = f"""
            Execute this task step by step: {task}
            
            Based on the task, determine which tools to use and in what order.
            For each tool call, explain why you're using it.
            
            Available tools: {[tool.name for tool in all_tools]}
            
            Provide tool calls in this format:
            TOOL_CALL: tool_name(input)
            REASONING: why you're using this tool
            """
            
            response = llm.invoke([HumanMessage(content=execution_prompt)])
            
            # Mock tool execution (simplified for demo): keyword routing picks one tool
            import re  # local import keeps this cell self-contained
            tool_results = []
            task_lower = task.lower()
            numbers = re.findall(r'\d+', task)
            
            # Check for math operations
            if any(op in task_lower for op in ['calculate', 'compute', '+', '-', '*', '/', 'factorial', 'prime']):
                if 'factorial' in task_lower:
                    if numbers:
                        result = math_tools[1].func(numbers[0])  # factorial tool
                        tool_results.append(f"factorial({numbers[0]}) = {result}")
                        state["tool_calls"].append({"tool": "factorial", "input": numbers[0], "output": result})
                
                elif 'prime' in task_lower:
                    if numbers:
                        result = math_tools[2].func(numbers[0])  # prime checker
                        tool_results.append(f"prime_check({numbers[0]}) = {result}")
                        state["tool_calls"].append({"tool": "prime_checker", "input": numbers[0], "output": result})
                
                else:
                    # Extract the expression; it must start at a digit or '(' so that
                    # a lone space or hyphen in the task text is not matched
                    math_expr = re.search(r'[\d(][\d+\-*/().^\s]*', task)
                    if math_expr:
                        expr = math_expr.group().strip()
                        result = math_tools[0].func(expr)  # calculator
                        tool_results.append(f"calculate({expr}) = {result}")
                        state["tool_calls"].append({"tool": "calculator", "input": expr, "output": result})
            
            # Check for knowledge queries
            elif any(keyword in task_lower for keyword in ['what is', 'explain', 'about', 'python', 'ai', 'machine learning']):
                # Try longer topic names first so 'ai' does not shadow 'langchain'
                for topic in ['machine learning', 'deep learning', 'langchain', 'langgraph', 'python', 'ai']:
                    if topic in task_lower:
                        result = info_tools[0].func(topic)  # knowledge search
                        tool_results.append(f"knowledge_search({topic}) = {result}")
                        state["tool_calls"].append({"tool": "knowledge_search", "input": topic, "output": result})
                        break
            
            # Check for time request
            elif 'time' in task.lower() or 'date' in task.lower():
                result = info_tools[1].func("")  # get_time
                tool_results.append(f"get_time() = {result}")
                state["tool_calls"].append({"tool": "get_time", "input": "", "output": result})
            
            state["intermediate_results"].extend(tool_results)
            state["reasoning_steps"].append(f"EXECUTION: {response.content}")
            
            return state
            
        except Exception as e:
            state["execution_status"] = "error"
            state["error_message"] = f"Execution error: {str(e)}"
            return state
    
    def synthesis_node(state: AgentState) -> AgentState:
        """
        Synthesis node - combine results and generate the final answer
        """
        try:
            task = state["current_task"]
            intermediate_results = state["intermediate_results"]
            
            synthesis_prompt = f"""
            Synthesize the final answer from the tool results:
            
            Original task: {task}
            
            Tool results:
            {chr(10).join(intermediate_results) if intermediate_results else 'No tool results'}
            
            Provide a comprehensive, well-structured answer that:
            1. Directly addresses the original question
            2. Incorporates relevant tool results
            3. Explains the reasoning process
            4. Provides clear conclusions
            """
            
            response = llm.invoke([HumanMessage(content=synthesis_prompt)])
            
            state["final_answer"] = response.content
            state["execution_status"] = "completed"
            state["reasoning_steps"].append(f"SYNTHESIS: Created final answer from {len(intermediate_results)} tool results")
            
            return state
            
        except Exception as e:
            state["execution_status"] = "error"
            state["error_message"] = f"Synthesis error: {str(e)}"
            return state
    
    def should_continue(state: AgentState) -> str:
        """
        Decide whether to continue or end
        """
        if state["execution_status"] == "error":
            return "end"
        elif state["execution_status"] == "completed":
            return "end"
        elif len(state["reasoning_steps"]) >= 10:  # Max steps
            return "end"
        else:
            return "continue"
    
    return planning_node, execution_node, synthesis_node, should_continue

# Create agent nodes
planning_node, execution_node, synthesis_node, should_continue = create_agent_nodes()

if planning_node:
    print("✅ Agent nodes created successfully")
    print("Available nodes: planning, execution, synthesis")
else:
    print("❌ Failed to create agent nodes - check API key")
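
The execution prompt asks the model to answer with `TOOL_CALL: tool_name(input)` lines, while the demo node routes by keywords instead of reading that output. A minimal parser for the requested format might look like the sketch below. The function name `parse_tool_calls` and the sample text are illustrative assumptions, and a real model response may not follow the format exactly.

```python
import re
from typing import List, Tuple

# Matches lines such as "TOOL_CALL: factorial(8)"
_TOOL_CALL_RE = re.compile(r"^\s*TOOL_CALL:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)


def parse_tool_calls(llm_text: str) -> List[Tuple[str, str]]:
    """Return (tool_name, raw_input) pairs in the order they appear."""
    calls = []
    for name, raw_input in _TOOL_CALL_RE.findall(llm_text):
        # Drop surrounding whitespace and optional quotes around the input
        calls.append((name, raw_input.strip().strip("'\"")))
    return calls


sample = """TOOL_CALL: factorial(8)
REASONING: the task asks for 8!
TOOL_CALL: calculator('15 + 25 * 3')
REASONING: arithmetic expression"""
print(parse_tool_calls(sample))
```

Each parsed pair could then be looked up by name in the tool list, which would let the model's own plan drive execution instead of the keyword heuristics.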

### 3.3 Build LangGraph Agent

In [None]:
def create_langgraph_agent():
    """
    Create the complete LangGraph agent with state management
    """
    
    if not planning_node:
        print("❌ Cannot create LangGraph agent without node functions")
        return None
    
    # Create the graph
    workflow = StateGraph(AgentState)
    
    # Add nodes
    workflow.add_node("planning", planning_node)
    workflow.add_node("execution", execution_node)
    workflow.add_node("synthesis", synthesis_node)
    
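    # Add edges; route through should_continue so a node error ends the run early
    workflow.set_entry_point("planning")
    workflow.add_conditional_edges("planning", should_continue, {"continue": "execution", "end": END})
    workflow.add_conditional_edges("execution", should_continue, {"continue": "synthesis", "end": END})
    workflow.add_edge("synthesis", END)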
    
    # Compile the graph
    agent = workflow.compile()
    
    return agent

# Create the agent
langgraph_agent = create_langgraph_agent()

if langgraph_agent:
    print("✅ LangGraph agent created successfully!")
    print("Agent workflow: planning → execution → synthesis → end")
else:
    print("❌ Failed to create LangGraph agent")

### 3.4 Test the Agent with Sample Tasks

In [None]:
def test_agent_execution(agent, test_queries: List[str]):
    """
    Test the agent with different types of queries
    """
    
    if not agent:
        print("❌ No agent available for testing")
        return []
    
    results = []
    
    print("🧪 Testing LangGraph Agent with Multiple Queries\n")
    
    for i, query in enumerate(test_queries, 1):
        print(f"🔍 Test {i}: {query}")
        
        try:
            # Create initial state
            initial_state = create_initial_state(query)
            
            # Execute agent
            start_time = time.time()
            final_state = agent.invoke(initial_state)
            execution_time = time.time() - start_time
            
            # Extract results
            result = {
                "query": query,
                "final_answer": final_state.get("final_answer", "No answer generated"),
                "reasoning_steps": final_state.get("reasoning_steps", []),
                "tool_calls": final_state.get("tool_calls", []),
                "intermediate_results": final_state.get("intermediate_results", []),
                "execution_status": final_state.get("execution_status", "unknown"),
                "execution_time": execution_time,
                "error_message": final_state.get("error_message", "")
            }
            
            results.append(result)
            
            # Print summary
            status_emoji = "✅" if result["execution_status"] == "completed" else "❌"
            print(f"  {status_emoji} Status: {result['execution_status']}")
            print(f"  ⏱️  Execution time: {execution_time:.2f}s")
            print(f"  🔧 Tools used: {len(result['tool_calls'])}")
            print(f"  🧠 Reasoning steps: {len(result['reasoning_steps'])}")
            
            if result["final_answer"]:
                print(f"  💡 Answer: {result['final_answer'][:150]}...")
            
            if result["error_message"]:
                print(f"  ⚠️  Error: {result['error_message']}")
                
        except Exception as e:
            print(f"  ❌ Execution failed: {e}")
            results.append({
                "query": query,
                "execution_status": "failed",
                "error_message": str(e)
            })
        
        print()
    
    return results

# Test queries
test_queries = [
    "What is the factorial of 8?",
    "Is 17 a prime number?",
    "Calculate 15 + 25 * 3",
    "What is machine learning?",
    "What time is it now?"
]

# Run tests
agent_test_results = test_agent_execution(langgraph_agent, test_queries)
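
Once several runs are collected, a few aggregate numbers make regressions easier to spot than per-query printouts. The helper below is a sketch that assumes result dictionaries shaped like the ones built in `test_agent_execution`. The name `summarize_runs` and the demo data are made up for illustration.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_runs(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate success rate, latency and tool usage over agent runs."""
    if not results:
        return {"runs": 0, "success_rate": 0.0, "avg_time_s": 0.0, "avg_tool_calls": 0.0}
    completed = [r for r in results if r.get("execution_status") == "completed"]
    return {
        "runs": len(results),
        "success_rate": len(completed) / len(results),
        # Failed runs may lack these keys, so fall back to neutral defaults
        "avg_time_s": mean(r.get("execution_time", 0.0) for r in results),
        "avg_tool_calls": mean(len(r.get("tool_calls", [])) for r in results),
    }


demo = [
    {"query": "q1", "execution_status": "completed", "execution_time": 2.0, "tool_calls": [{}]},
    {"query": "q2", "execution_status": "failed", "error_message": "timeout"},
]
summary = summarize_runs(demo)
print(summary)
```

Calling `summarize_runs(agent_test_results)` after the test cell would give the same summary for the real runs.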

## 📊 Part 4: Agent-Specific Evaluation Metrics

### 4.1 LogicalFlowMetric - Evaluating the Reasoning Chain

In [None]:
def create_logical_flow_metric():
    """
    Create a G-Eval metric that evaluates the logical flow of the agent's reasoning.
    """
    
    evaluation_criteria = """
    You will evaluate the logical flow and reasoning quality of an AI agent.
    
    Evaluation criteria:
    1. LOGICAL COHERENCE (30%):
       - Reasoning steps follow logical sequence
       - No contradictions between steps
       - Clear cause-and-effect relationships
    
    2. PROBLEM DECOMPOSITION (25%):
       - Complex problems broken down appropriately
       - Sub-problems identified correctly
       - Hierarchical thinking demonstrated
    
    3. STEP CLARITY (25%):
       - Each reasoning step clearly explained
       - Purpose of each step evident
       - Transitions between steps smooth
    
    4. COMPLETENESS (20%):
       - All necessary steps included
       - No critical reasoning gaps
       - Thorough analysis of the problem
    
    Scoring Guide:
    - 9-10: Exceptional logical flow, crystal clear reasoning
    - 7-8: Good logical structure, minor gaps
    - 5-6: Adequate reasoning but some unclear steps
    - 3-4: Poor logical flow, significant gaps
    - 1-2: Incoherent or illogical reasoning
    """
    
    evaluation_steps = [
        "Analyze the sequence of reasoning steps",
        "Check for logical consistency and coherence",
        "Evaluate problem decomposition approach",
        "Assess clarity and explanation quality",
        "Identify any reasoning gaps or issues",
        "Score overall logical flow quality"
    ]
    
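    # GEval expects LLMTestCaseParams members, not LLMTestCase attributes or free-form strings
    from deepeval.test_case import LLMTestCaseParams
    
    logical_flow_metric = GEval(
        name="Logical Flow",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Original query
            LLMTestCaseParams.ACTUAL_OUTPUT,  # Final answer
            LLMTestCaseParams.CONTEXT  # Reasoning steps, passed via the test case's context
        ],
        threshold=0.7,  # GEval reports scores on a 0-1 scale
        model="gpt-4"
    )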
    
    return logical_flow_metric

# Create logical flow metric
logical_flow_metric = create_logical_flow_metric()
print("✅ Logical Flow Metric created")
print(f"Threshold: {logical_flow_metric.threshold}")
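
The Logical Flow rubric assigns weights of 30%, 25%, 25%, and 20% to its four criteria and scores each on a 1-10 scale. The arithmetic behind such a weighted rubric can be written out directly. This sketch assumes sub-scores are available per criterion, which the LLM judge does not necessarily report, and it divides by 10 to reach a 0-1 scale, so a cutoff of 7 out of 10 corresponds to 0.7.

```python
from typing import Dict

# Weights copied from the Logical Flow rubric
WEIGHTS = {
    "logical_coherence": 0.30,
    "problem_decomposition": 0.25,
    "step_clarity": 0.25,
    "completeness": 0.20,
}


def weighted_score(sub_scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine 1-10 sub-scores into one weighted score on a 0-1 scale."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub_scores must cover exactly the rubric criteria")
    total = sum(weights[name] * sub_scores[name] for name in weights)
    return total / 10.0


example = {
    "logical_coherence": 8,
    "problem_decomposition": 7,
    "step_clarity": 9,
    "completeness": 6,
}
score = weighted_score(example)
print(round(score, 2), score >= 0.7)
```

Here the weighted sum is 2.4 + 1.75 + 2.25 + 1.2 = 7.6 out of 10, which is 0.76 on the normalized scale.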

### 4.2 ToolUsageMetric - Evaluating Tool Usage Efficiency

In [None]:
def create_tool_usage_metric():
    """
    Create a metric that evaluates tool usage efficiency.
    """
    
    evaluation_criteria = """
    Evaluate how the AI agent uses tools to solve problems.
    
    Tool Usage Criteria:
    1. TOOL SELECTION APPROPRIATENESS (35%):
       - Correct tools chosen for the task
       - No unnecessary tool calls
       - Optimal tool sequence
    
    2. INPUT QUALITY (25%):
       - Tool inputs properly formatted
       - Relevant parameters provided
       - No malformed inputs
    
    3. EFFICIENCY (25%):
       - Minimal number of tool calls needed
       - No redundant operations
       - Direct path to solution
    
    4. ERROR HANDLING (15%):
       - Graceful handling of tool errors
       - Appropriate fallback strategies
       - Recovery from failures
    
    Scoring:
    - 9-10: Perfect tool usage, optimal efficiency
    - 7-8: Good tool selection and usage
    - 5-6: Acceptable but suboptimal usage
    - 3-4: Poor tool choices or inefficient usage
    - 1-2: Inappropriate or failed tool usage
    """
    
    evaluation_steps = [
        "Analyze tool selection appropriateness for the task",
        "Evaluate tool input quality and formatting",
        "Assess efficiency of tool usage pattern",
        "Check error handling and recovery",
        "Calculate overall tool usage effectiveness"
    ]
    
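    # GEval expects LLMTestCaseParams members, not LLMTestCase attributes or free-form strings
    from deepeval.test_case import LLMTestCaseParams
    
    tool_usage_metric = GEval(
        name="Tool Usage Efficiency",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Task description
            LLMTestCaseParams.RETRIEVAL_CONTEXT  # Tool calls and outputs, passed via retrieval_context
        ],
        threshold=0.65,  # GEval reports scores on a 0-1 scale
        model="gpt-4"
    )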
    
    return tool_usage_metric

# Create tool usage metric
tool_usage_metric = create_tool_usage_metric()
print("✅ Tool Usage Metric created")
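
Some of the tool usage criteria, such as redundant calls and failed calls, can also be checked deterministically from the tool-call log and used as a sanity check next to the LLM judge. The sketch below assumes log entries with `tool`, `input`, and `output` keys, as recorded by the execution node, and tool outputs that begin with `Error` on failure. `tool_usage_stats` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Dict, List


def tool_usage_stats(tool_calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count total, redundant and failed calls in a tool-call log."""
    # A call is redundant when the same tool was already run with the same input
    seen = Counter((c.get("tool", ""), c.get("input", "")) for c in tool_calls)
    redundant = sum(count - 1 for count in seen.values())
    errors = sum(1 for c in tool_calls if str(c.get("output", "")).startswith("Error"))
    return {
        "total_calls": len(tool_calls),
        "redundant_calls": redundant,
        "error_calls": errors,
        "tools_used": sorted({c.get("tool", "") for c in tool_calls}),
    }


log = [
    {"tool": "factorial", "input": "8", "output": "Factorial of 8 is 40320"},
    {"tool": "factorial", "input": "8", "output": "Factorial of 8 is 40320"},
    {"tool": "calculator", "input": "1/0", "output": "Error: division by zero"},
]
stats = tool_usage_stats(log)
print(stats)
```

These counts do not replace the judged score, but a large gap between the two is a useful signal that the rubric or the log needs a closer look.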

### 4.3 PlanExecutionMetric - Evaluating Plan Execution Capability

In [None]:
def create_plan_execution_metric():
    """
    Create a metric that evaluates plan execution capability.
    """
    
    evaluation_criteria = """
    Evaluate the planning and execution ability of the AI agent.
    
    Plan Execution Criteria:
    1. PLAN QUALITY (30%):
       - Comprehensive and well-structured plan
       - Realistic and achievable steps
       - Proper task decomposition
    
    2. EXECUTION FIDELITY (30%):
       - Plan followed accurately
       - Steps executed in logical order
       - Minimal deviation from plan
    
    3. ADAPTABILITY (25%):
       - Adjustments made when needed
       - Handling of unexpected results
       - Plan refinement during execution
    
    4. GOAL ACHIEVEMENT (15%):
       - Original objective met
       - Complete task completion
       - Quality of final outcome
    
    Scoring:
    - 9-10: Excellent planning and flawless execution
    - 7-8: Good plan with solid execution
    - 5-6: Adequate planning, some execution issues
    - 3-4: Poor planning or significant execution problems
    - 1-2: Failed planning or execution
    """
    
    evaluation_steps = [
        "Evaluate initial plan quality and structure",
        "Assess execution fidelity to the plan",
        "Check adaptability and plan adjustments",
        "Measure goal achievement and completion",
        "Score overall plan execution performance"
    ]
    
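    # GEval expects LLMTestCaseParams members, not LLMTestCase attributes or free-form strings
    from deepeval.test_case import LLMTestCaseParams
    
    plan_execution_metric = GEval(
        name="Plan Execution",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Original task
            LLMTestCaseParams.ACTUAL_OUTPUT,  # Final result
            LLMTestCaseParams.CONTEXT,  # Planning and execution steps, passed via context
            LLMTestCaseParams.RETRIEVAL_CONTEXT  # Tool outputs produced while executing the plan
        ],
        threshold=0.7,  # GEval reports scores on a 0-1 scale
        model="gpt-4"
    )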
    
    return plan_execution_metric

# Create plan execution metric
plan_execution_metric = create_plan_execution_metric()
print("✅ Plan Execution Metric created")

### 4.4 AdaptabilityMetric - Evaluating Adaptability

In [None]:
def create_adaptability_metric():
    """
    Create a metric that evaluates the agent's adaptability.
    """
    
    evaluation_criteria = """
    Evaluate the adaptability and flexibility of the AI agent.
    
    Adaptability Criteria:
    1. CONTEXT AWARENESS (30%):
       - Understanding of changing context
       - Recognition of new information
       - Appropriate response to context shifts
    
    2. STRATEGY ADJUSTMENT (25%):
       - Modification of approach when needed
       - Alternative strategy exploration
       - Dynamic problem-solving
    
    3. ERROR RECOVERY (25%):
       - Graceful handling of errors
       - Learning from mistakes
       - Resilient continuation
    
    4. FLEXIBLE THINKING (20%):
       - Creative problem-solving approaches
       - Multiple solution pathways
       - Open-minded reasoning
    
    Scoring:
    - 9-10: Highly adaptable, excellent flexibility
    - 7-8: Good adaptability, handles changes well
    - 5-6: Moderate adaptability, some rigidity
    - 3-4: Poor adaptability, struggles with changes
    - 1-2: Inflexible, cannot adapt to new situations
    """
    
    evaluation_steps = [
        "Assess context awareness and recognition",
        "Evaluate strategy adjustment capabilities",
        "Check error recovery and resilience",
        "Analyze flexible thinking patterns",
        "Score overall adaptability performance"
    ]
    
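    # GEval expects LLMTestCaseParams members, not LLMTestCase attributes or free-form strings
    from deepeval.test_case import LLMTestCaseParams
    
    adaptability_metric = GEval(
        name="Adaptability",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Task/scenario
            LLMTestCaseParams.CONTEXT,  # Decision-making process (reasoning steps)
            LLMTestCaseParams.RETRIEVAL_CONTEXT  # Tool outputs, including any error strings
        ],
        threshold=0.65,  # GEval reports scores on a 0-1 scale
        model="gpt-4"
    )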
    
    return adaptability_metric

# Create adaptability metric
adaptability_metric = create_adaptability_metric()
print("✅ Adaptability Metric created")
print("\n🎯 All Agent Metrics Created:")
print("  1. Logical Flow - Reasoning quality")
print("  2. Tool Usage - Tool efficiency")
print("  3. Plan Execution - Planning & execution")
print("  4. Adaptability - Flexibility & recovery")

## 🧪 Part 5: Comprehensive Agent Evaluation

### 5.1 Agent Evaluation Pipeline

In [None]:
class AgentEvaluationPipeline:
    """
    Comprehensive evaluation pipeline for AI agents
    """
    
    def __init__(self, model="gpt-4"):
        self.model = model
        self.evaluation_history = []
        
        # Initialize agent-specific metrics
        self.metrics = {
            "logical_flow": logical_flow_metric,
            "tool_usage": tool_usage_metric,
            "plan_execution": plan_execution_metric,
            "adaptability": adaptability_metric
        }
    
    def evaluate_agent_execution(self, agent_result: Dict[str, Any]) -> Dict[str, Any]:
        """
        Evaluate single agent execution result
        """
        
        # Build the test case. GEval only reads the fields named in evaluation_params,
        # so the agent trace is mapped onto standard LLMTestCase fields:
        # reasoning steps -> context; tool outputs, tool-call log, run status -> retrieval_context.
        tool_log = json.dumps(agent_result.get("tool_calls", []), indent=2)
        status = agent_result.get("execution_status", "unknown")
        error = agent_result.get("error_message", "")
        test_case = LLMTestCase(
            input=agent_result["query"],
            actual_output=agent_result.get("final_answer", "No answer"),
            context=agent_result.get("reasoning_steps", []),
            retrieval_context=agent_result.get("intermediate_results", []) + [
                f"TOOL_CALLS: {tool_log}",
                f"STATUS: {status}; ERROR: {error or 'none'}",
            ],
        )
        
        results = {
            "query": agent_result["query"],
            "execution_time": agent_result.get("execution_time", 0),
            "execution_status": agent_result.get("execution_status", "unknown"),
            "tool_calls_count": len(agent_result.get("tool_calls", [])),
            "reasoning_steps_count": len(agent_result.get("reasoning_steps", [])),
            "metrics": {},
            "overall_score": 0,
            "pass_count": 0,
            "total_metrics": len(self.metrics)
        }
        
        # Evaluate each metric
        metric_scores = []
        
        for metric_name, metric in self.metrics.items():
            try:
                # Create a fresh metric instance so state from an earlier
                # measurement does not leak into this one. GEval fills in
                # `.reason` on its own, so no extra flag is needed.
                fresh_metric = GEval(
                    name=metric.name,
                    criteria=metric.criteria,
                    evaluation_steps=metric.evaluation_steps,
                    evaluation_params=metric.evaluation_params,
                    threshold=metric.threshold,
                    model=metric.model
                )
                
                fresh_metric.measure(test_case)
                
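                # GEval reports scores on a 0-1 scale; convert to the 0-10
                # scale that the rest of this notebook prints and plots.
                score_out_of_10 = fresh_metric.score * 10
                
                results["metrics"][metric_name] = {
                    "score": score_out_of_10,
                    "passed": fresh_metric.is_successful(),
                    "reason": fresh_metric.reason,
                    "threshold": fresh_metric.threshold * 10
                }
                
                metric_scores.append(score_out_of_10)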
                if fresh_metric.is_successful():
                    results["pass_count"] += 1
                
            except Exception as e:
                results["metrics"][metric_name] = {
                    "score": 0,
                    "passed": False,
                    "error": str(e)
                }
                metric_scores.append(0)
        
        # Calculate overall score
        if metric_scores:
            results["overall_score"] = np.mean(metric_scores)
        
        # Store in history
        self.evaluation_history.append(results)
        
        return results
    
    def batch_evaluate(self, agent_results: List[Dict]) -> List[Dict[str, Any]]:
        """
        Batch evaluation of multiple agent executions
        """
        
        print(f"🚀 Evaluating {len(agent_results)} Agent Executions\n")
        
        evaluation_results = []
        
        for i, agent_result in enumerate(agent_results):
            if "query" not in agent_result:
                continue
                
            print(f"🔍 Evaluating {i+1}/{len(agent_results)}: {agent_result['query'][:50]}...")
            
            try:
                result = self.evaluate_agent_execution(agent_result)
                evaluation_results.append(result)
                
                # Quick summary
                print(f"  ✅ Overall Score: {result['overall_score']:.1f}/10")
                print(f"  📊 Passed: {result['pass_count']}/{result['total_metrics']} metrics")
                print(f"  ⏱️  Execution: {result['execution_time']:.2f}s")
                print(f"  🔧 Tools: {result['tool_calls_count']}, Steps: {result['reasoning_steps_count']}")
                
            except Exception as e:
                print(f"  ❌ Evaluation error: {e}")
                evaluation_results.append({
                    "query": agent_result["query"],
                    "error": str(e)
                })
            
            print()
        
        return evaluation_results
    
    def get_performance_summary(self) -> Dict[str, Any]:
        """
        Get comprehensive performance summary
        """
        
        if not self.evaluation_history:
            return {"message": "No evaluations performed yet"}
        
        # Collect metrics data
        metric_summaries = {}
        overall_scores = []
        execution_times = []
        tool_usage_counts = []
        
        for result in self.evaluation_history:
            if "overall_score" in result:
                overall_scores.append(result["overall_score"])
            if "execution_time" in result:
                execution_times.append(result["execution_time"])
            if "tool_calls_count" in result:
                tool_usage_counts.append(result["tool_calls_count"])
            
            if "metrics" in result:
                for metric_name, metric_data in result["metrics"].items():
                    if "score" in metric_data:
                        if metric_name not in metric_summaries:
                            metric_summaries[metric_name] = []
                        metric_summaries[metric_name].append(metric_data["score"])
        
        # Calculate summary statistics
        summary = {
            "total_evaluations": len(self.evaluation_history),
            "overall_performance": {
                "average_score": round(np.mean(overall_scores), 2) if overall_scores else 0,
                "score_std": round(np.std(overall_scores), 2) if overall_scores else 0,
                "min_score": round(min(overall_scores), 2) if overall_scores else 0,
                "max_score": round(max(overall_scores), 2) if overall_scores else 0
            },
            "performance_metrics": {
                "average_execution_time": round(np.mean(execution_times), 2) if execution_times else 0,
                "average_tool_usage": round(np.mean(tool_usage_counts), 1) if tool_usage_counts else 0
            },
            "metric_performance": {}
        }
        
        for metric_name, scores in metric_summaries.items():
            summary["metric_performance"][metric_name] = {
                "average_score": round(np.mean(scores), 2),
                "min_score": round(min(scores), 2),
                "max_score": round(max(scores), 2),
                "std_dev": round(np.std(scores), 2)
            }
        
        return summary

# Create agent evaluation pipeline
agent_eval_pipeline = AgentEvaluationPipeline()
print("✅ Agent Evaluation Pipeline created successfully!")
print(f"Available metrics: {list(agent_eval_pipeline.metrics.keys())}")
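
The aggregation rule inside `evaluate_agent_execution` can be checked without calling an LLM judge. The sketch below is a simplified stand-in, not the DeepEval API: `aggregate_scores`, the `Scorer` alias, the two toy scorers, and the pass threshold of 7.0 are all hypothetical names and values chosen for illustration. It shows that a metric which raises is recorded with score 0 and still pulls the average down.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical stand-in for a metric: any callable returning a 0-10 score.
Scorer = Callable[[dict], float]


def aggregate_scores(agent_result: dict, scorers: Dict[str, Scorer],
                     threshold: float = 7.0) -> dict:
    """Score one agent run with every scorer and aggregate the results."""
    metrics = {}
    for name, scorer in scorers.items():
        try:
            score = float(scorer(agent_result))
            metrics[name] = {"score": score, "passed": score >= threshold}
        except Exception as exc:
            # A failing metric is recorded as 0, as in the pipeline above.
            metrics[name] = {"score": 0.0, "passed": False, "error": str(exc)}

    scores = [m["score"] for m in metrics.values()]
    return {
        "query": agent_result["query"],
        "metrics": metrics,
        "overall_score": mean(scores) if scores else 0.0,
        "pass_count": sum(m["passed"] for m in metrics.values()),
        "total_metrics": len(metrics),
    }


# Toy scorers (assumptions for illustration only, not DeepEval metrics).
def tool_usage_scorer(result: dict) -> float:
    return 9.0 if len(result.get("tool_calls", [])) <= 2 else 5.0


def broken_scorer(result: dict) -> float:
    raise RuntimeError("judge model unavailable")


sample_run = {
    "query": "Is 29 prime?",
    "final_answer": "Yes",
    "tool_calls": [{"tool": "prime_checker"}],
}
report = aggregate_scores(
    sample_run, {"tool_usage": tool_usage_scorer, "broken": broken_scorer}
)
print(report["overall_score"], report["pass_count"], report["total_metrics"])
```

Whether a judge error should count as 0 or be left out of the mean is a design choice. Counting it as 0, as here, makes infrastructure failures look like agent failures, so it is worth checking the `error` entries before reading the averages.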

### 5.2 Run Comprehensive Agent Evaluation

In [None]:
# Run comprehensive evaluation on agent test results
if agent_test_results:
    comprehensive_agent_results = agent_eval_pipeline.batch_evaluate(agent_test_results)
    agent_performance_summary = agent_eval_pipeline.get_performance_summary()
else:
    print("❌ No agent test results available for evaluation")
    comprehensive_agent_results = []
    agent_performance_summary = {}

In [None]:
# Display comprehensive agent evaluation results
def display_agent_evaluation_results(results, summary):
    """
    Display and analyze agent evaluation results
    """
    
    if not results:
        print("❌ No agent evaluation results to display")
        return
    
    print("📊 Comprehensive Agent Evaluation Results\n")
    
    # Create summary DataFrame
    summary_data = []
    
    for result in results:
        if "overall_score" in result:
            row = {
                "Query": result["query"][:30] + "...",
                "Overall_Score": result["overall_score"],
                "Pass_Rate": f"{result['pass_count']}/{result['total_metrics']}",
                "Exec_Time": result["execution_time"],
                "Tools_Used": result["tool_calls_count"],
                "Reasoning_Steps": result["reasoning_steps_count"]
            }
            
            # Add individual metric scores
            for metric_name, metric_data in result.get("metrics", {}).items():
                if "score" in metric_data:
                    row[f"{metric_name.replace('_', ' ').title()}"] = metric_data["score"]
            
            summary_data.append(row)
    
    if summary_data:
        df = pd.DataFrame(summary_data)
        print(df.round(1).to_string(index=False))
    
    # Display performance summary
    print(f"\n📈 Agent Performance Summary:")
    if "overall_performance" in summary:
        perf = summary["overall_performance"]
        print(f"  Total Evaluations: {summary['total_evaluations']}")
        print(f"  Average Overall Score: {perf['average_score']}/10")
        print(f"  Score Range: {perf['min_score']} - {perf['max_score']}")
        print(f"  Score Std Dev: {perf['score_std']}")
    
    if "performance_metrics" in summary:
        perf_metrics = summary["performance_metrics"]
        print(f"\n⚡ Performance Metrics:")
        print(f"  Average Execution Time: {perf_metrics['average_execution_time']}s")
        print(f"  Average Tool Usage: {perf_metrics['average_tool_usage']} tools/query")
    
    print(f"\n🎯 Agent-Specific Metric Performance:")
    if "metric_performance" in summary:
        for metric_name, stats in summary["metric_performance"].items():
            print(f"  {metric_name.replace('_', ' ').title()}:")
            print(f"    Average: {stats['average_score']}/10")
            print(f"    Range: {stats['min_score']} - {stats['max_score']}")
            print(f"    Std Dev: {stats['std_dev']}")
    
    # Agent-specific insights
    print(f"\n💡 Agent Performance Insights:")
    
    if "metric_performance" in summary:
        metric_avgs = {name: stats['average_score'] for name, stats in summary["metric_performance"].items()}
        
        if metric_avgs:
            best_metric = max(metric_avgs.keys(), key=lambda k: metric_avgs[k])
            worst_metric = min(metric_avgs.keys(), key=lambda k: metric_avgs[k])
            
            print(f"  • Strongest capability: {best_metric.replace('_', ' ').title()} ({metric_avgs[best_metric]:.1f}/10)")
            print(f"  • Improvement area: {worst_metric.replace('_', ' ').title()} ({metric_avgs[worst_metric]:.1f}/10)")
            
            if metric_avgs[worst_metric] < 6.0:
                print(f"  • ⚠️  {worst_metric.replace('_', ' ').title()} needs significant improvement")
    
    if "performance_metrics" in summary:
        perf_metrics = summary["performance_metrics"]
        if perf_metrics["average_execution_time"] > 5.0:
            print(f"  • ⚠️  Long execution times - consider optimization")
        if perf_metrics["average_tool_usage"] > 3.0:
            print(f"  • ⚠️  High tool usage - may indicate inefficiency")
        elif perf_metrics["average_tool_usage"] < 1.0:
            print(f"  • ⚠️  Low tool usage - agent may not be leveraging available tools")
    
    return df if 'df' in locals() else None

# Display agent evaluation results
agent_results_df = display_agent_evaluation_results(comprehensive_agent_results, agent_performance_summary)

## 📊 Part 6: Agent Performance Visualization

### 6.1 Agent Evaluation Visualizations

In [None]:
def visualize_agent_performance(results, summary):
    """
    Create comprehensive visualizations for agent performance
    """
    
    if not results or not summary.get("metric_performance"):
        print("❌ Insufficient data for agent visualization")
        return
    
    # Setup plots
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Agent Performance Analysis', fontsize=16, fontweight='bold')
    
    # 1. Agent Metric Performance Radar Chart
    metric_names = list(summary["metric_performance"].keys())
    metric_scores = [summary["metric_performance"][name]["average_score"] for name in metric_names]
    
    # Convert to radar chart data
    angles = np.linspace(0, 2 * np.pi, len(metric_names), endpoint=False)
    metric_scores += metric_scores[:1]  # Complete the circle
    angles = np.concatenate((angles, [angles[0]]))  # Complete the circle
    
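    # Replace the Cartesian placeholder with a polar axes; otherwise the
    # empty rectangular axes can remain visible behind the radar chart
    axes[0, 0].remove()
    ax_radar = fig.add_subplot(2, 3, 1, projection='polar')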
    ax_radar.plot(angles, metric_scores, 'o-', linewidth=2, color='blue')
    ax_radar.fill(angles, metric_scores, alpha=0.25, color='blue')
    ax_radar.set_xticks(angles[:-1])
    ax_radar.set_xticklabels([name.replace('_', '\n').title() for name in metric_names])
    ax_radar.set_ylim(0, 10)
    ax_radar.set_title('Agent Capability Radar', pad=20)
    ax_radar.grid(True)
    
    # 2. Execution Time vs Performance
    exec_times = []
    overall_scores = []
    tool_counts = []
    
    for result in results:
        if "overall_score" in result and "execution_time" in result:
            exec_times.append(result["execution_time"])
            overall_scores.append(result["overall_score"])
            tool_counts.append(result.get("tool_calls_count", 0))
    
    if exec_times and overall_scores:
        scatter = axes[0,1].scatter(exec_times, overall_scores, c=tool_counts, 
                                  cmap='viridis', alpha=0.7, s=100)
        axes[0,1].set_xlabel('Execution Time (seconds)')
        axes[0,1].set_ylabel('Overall Score')
        axes[0,1].set_title('Execution Time vs Performance')
        
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=axes[0,1])
        cbar.set_label('Tool Usage Count')
        
        # Add trend line
        if len(exec_times) > 1:
            z = np.polyfit(exec_times, overall_scores, 1)
            p = np.poly1d(z)
            axes[0,1].plot(exec_times, p(exec_times), "r--", alpha=0.8)
    
    # 3. Metric Score Distribution
    all_metric_scores = []
    all_metric_names = []
    
    for result in results:
        if "metrics" in result:
            for metric_name, metric_data in result["metrics"].items():
                if "score" in metric_data:
                    all_metric_scores.append(metric_data["score"])
                    all_metric_names.append(metric_name)
    
    if all_metric_scores:
        axes[0,2].hist(all_metric_scores, bins=10, alpha=0.7, color='lightcoral', edgecolor='black')
        axes[0,2].set_xlabel('Metric Scores')
        axes[0,2].set_ylabel('Frequency')
        axes[0,2].set_title('Agent Metric Score Distribution')
        axes[0,2].axvline(x=np.mean(all_metric_scores), color='red', linestyle='--', 
                         label=f'Mean: {np.mean(all_metric_scores):.1f}')
        axes[0,2].legend()
    
    # 4. Tool Usage Efficiency
    tool_efficiency = []
    query_names = []
    
    for result in results:
        if "tool_calls_count" in result and "overall_score" in result:
            tools_used = result["tool_calls_count"]
            score = result["overall_score"]
            # Efficiency = score per tool used (with minimum 1 to avoid division by zero)
            efficiency = score / max(tools_used, 1)
            tool_efficiency.append(efficiency)
            query_names.append(result["query"][:15] + "...")
    
    if tool_efficiency:
        bars = axes[1,0].bar(range(len(tool_efficiency)), tool_efficiency, 
                            color=['green' if eff >= 5 else 'orange' if eff >= 3 else 'red' for eff in tool_efficiency])
        axes[1,0].set_xlabel('Queries')
        axes[1,0].set_ylabel('Score per Tool Used')
        axes[1,0].set_title('Tool Usage Efficiency')
        axes[1,0].set_xticks(range(len(query_names)))
        axes[1,0].set_xticklabels(query_names, rotation=45, ha='right')
        
        # Add value labels
        for i, (bar, eff) in enumerate(zip(bars, tool_efficiency)):
            axes[1,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                          f'{eff:.1f}', ha='center', va='bottom', fontsize=8)
    
    # 5. Reasoning Steps vs Success Rate
    reasoning_steps = []
    success_rates = []
    
    for result in results:
        if "reasoning_steps_count" in result and "pass_count" in result and "total_metrics" in result:
            steps = result["reasoning_steps_count"]
            success_rate = result["pass_count"] / result["total_metrics"] * 100
            reasoning_steps.append(steps)
            success_rates.append(success_rate)
    
    if reasoning_steps and success_rates:
        axes[1,1].scatter(reasoning_steps, success_rates, alpha=0.7, s=100, color='purple')
        axes[1,1].set_xlabel('Number of Reasoning Steps')
        axes[1,1].set_ylabel('Success Rate (%)')
        axes[1,1].set_title('Reasoning Depth vs Success')
        
        # Add trend line
        if len(reasoning_steps) > 1:
            z = np.polyfit(reasoning_steps, success_rates, 1)
            p = np.poly1d(z)
            axes[1,1].plot(reasoning_steps, p(reasoning_steps), "r--", alpha=0.8)
    
    # 6. Agent Performance Heatmap
    if len(metric_names) > 1 and len(results) > 1:
        # Create performance matrix
        performance_matrix = []
        
        for result in results:
            if "metrics" in result:
                row = []
                for metric_name in metric_names:
                    if metric_name in result["metrics"] and "score" in result["metrics"][metric_name]:
                        row.append(result["metrics"][metric_name]["score"])
                    else:
                        row.append(0)
                if len(row) == len(metric_names):
                    performance_matrix.append(row)
        
        if len(performance_matrix) > 1:
            performance_df = pd.DataFrame(performance_matrix, 
                                        columns=[name.replace('_', ' ').title() for name in metric_names],
                                        index=[f"Query {i+1}" for i in range(len(performance_matrix))])
            
            sns.heatmap(performance_df, annot=True, cmap='RdYlGn', center=5, 
                       square=False, ax=axes[1,2], cbar_kws={'shrink': 0.8})
            axes[1,2].set_title('Performance Heatmap by Query')
        else:
            axes[1,2].text(0.5, 0.5, 'Insufficient data\nfor heatmap', 
                          ha='center', va='center', transform=axes[1,2].transAxes, fontsize=12)
            axes[1,2].set_title('Performance Heatmap by Query')
    else:
        axes[1,2].text(0.5, 0.5, 'Need multiple metrics\nand queries for heatmap', 
                      ha='center', va='center', transform=axes[1,2].transAxes, fontsize=12)
        axes[1,2].set_title('Performance Heatmap by Query')
    
    plt.tight_layout()
    plt.show()
    
    # Print insights
    print("\n🔍 Agent Performance Insights:")
    
    if metric_scores:
        strongest_capability = metric_names[metric_scores[:-1].index(max(metric_scores[:-1]))]
        weakest_capability = metric_names[metric_scores[:-1].index(min(metric_scores[:-1]))]
        
        print(f"  • Strongest capability: {strongest_capability.replace('_', ' ').title()}")
        print(f"  • Weakest capability: {weakest_capability.replace('_', ' ').title()}")
        
        capability_range = max(metric_scores[:-1]) - min(metric_scores[:-1])
        if capability_range > 3.0:
            print(f"  • ⚠️  Large capability variance ({capability_range:.1f}) - uneven agent development")
    
    if len(exec_times) > 1:
        # The correlation is undefined for a single run
        correlation = np.corrcoef(exec_times, overall_scores)[0, 1]
        if correlation < -0.3:
            print(f"  • ⚠️  Negative correlation between time and performance - longer runs tend to score lower")
        elif correlation > 0.3:
            print(f"  • ✅ Positive correlation between time and performance - longer runs tend to score higher")
    
    if tool_efficiency:
        avg_efficiency = np.mean(tool_efficiency)
        print(f"  • Average tool efficiency: {avg_efficiency:.1f} score per tool")
        
        if avg_efficiency < 3.0:
            print(f"  • ⚠️  Low tool efficiency - agent may be overusing tools")
        elif avg_efficiency > 7.0:
            print(f"  • ✅ High tool efficiency - agent uses tools effectively")

# Create agent performance visualizations
visualize_agent_performance(comprehensive_agent_results, agent_performance_summary)
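
The two derived quantities behind the insights above, score per tool call and the time/score correlation, are easy to verify by hand on a few runs. The following is a minimal standard-library sketch under stated assumptions: `pearson`, `tool_efficiency`, and the three toy runs are hypothetical and only mirror the field names used in this notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def tool_efficiency(score, tool_calls):
    """Score per tool call; runs without tool calls are divided by 1."""
    return score / max(tool_calls, 1)


# Toy runs (made-up numbers): slower runs use more tools and score lower.
runs = [
    {"execution_time": 1.0, "overall_score": 8.0, "tool_calls_count": 1},
    {"execution_time": 2.0, "overall_score": 6.0, "tool_calls_count": 2},
    {"execution_time": 3.0, "overall_score": 4.0, "tool_calls_count": 4},
]

times = [r["execution_time"] for r in runs]
scores = [r["overall_score"] for r in runs]
correlation = pearson(times, scores)
efficiencies = [tool_efficiency(r["overall_score"], r["tool_calls_count"]) for r in runs]

print(f"correlation: {correlation:.2f}")
print(f"efficiencies: {efficiencies}")
```

With only a handful of runs the correlation is very noisy, so it is better read as a prompt to inspect individual runs than as evidence that time causes better or worse answers.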

## 🔄 Part 7: Advanced Agent Scenarios

### 7.1 Complex Multi-Step Reasoning Tasks

In [None]:
def create_complex_agent_scenarios():
    """
    Create complex scenarios to test agent reasoning capabilities
    """
    
    complex_scenarios = [
        {
            "name": "Mathematical Problem Solving",
            "query": "I need to find all prime numbers between 10 and 30, then calculate the factorial of the largest prime found, and finally determine if the result is divisible by 12.",
            "expected_steps": [
                "Identify prime numbers between 10 and 30",
                "Find the largest prime",
                "Calculate factorial of largest prime",
                "Check divisibility by 12"
            ],
            "expected_tools": ["prime_checker", "factorial", "calculator"]
        },
        {
            "name": "Information Synthesis",
            "query": "What is the relationship between machine learning and deep learning? After explaining this, generate a random number between 1 and 100 and tell me if that number is prime.",
            "expected_steps": [
                "Search for machine learning information",
                "Search for deep learning information",
                "Synthesize relationship",
                "Generate random number",
                "Check if number is prime"
            ],
            "expected_tools": ["knowledge_search", "random_number", "prime_checker"]
        },
        {
            "name": "Time-Based Calculation",
            "query": "What's the current time? Based on the current hour, calculate the factorial of that hour number. If the current minute is even, add 50 to the factorial result.",
            "expected_steps": [
                "Get current time",
                "Extract hour from time",
                "Calculate factorial of hour",
                "Check if minute is even",
                "Conditionally add 50"
            ],
            "expected_tools": ["get_time", "factorial", "calculator"]
        },
        {
            "name": "Error Recovery Scenario",
            "query": "Calculate the factorial of 25, then find the square root of that result. If there are any issues, try alternative approaches.",
            "expected_steps": [
                "Calculate factorial of 25",
                "Attempt square root calculation",
                "Handle potential errors",
                "Provide alternative solution"
            ],
            "expected_tools": ["factorial", "calculator"]
        }
    ]
    
    return complex_scenarios

# Create complex scenarios
complex_scenarios = create_complex_agent_scenarios()

print("🧩 Created Complex Agent Scenarios:")
for i, scenario in enumerate(complex_scenarios, 1):
    print(f"  {i}. {scenario['name']}")
    print(f"     Query: {scenario['query'][:60]}...")
    print(f"     Expected tools: {', '.join(scenario['expected_tools'])}")
    print()
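
Two of these scenarios have deterministic answers, so reference values can be computed directly and compared with the agent's `final_answer`. The sketch below does this with the standard library only; `is_prime` is a hypothetical helper written for this example and is unrelated to the agent's `prime_checker` tool.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; sufficient for the small numbers in these scenarios."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True


# Scenario "Mathematical Problem Solving"
primes = [n for n in range(10, 31) if is_prime(n)]
largest_prime = max(primes)
factorial_divisible_by_12 = math.factorial(largest_prime) % 12 == 0

# Scenario "Error Recovery Scenario"
fact_25 = math.factorial(25)
approx_root = math.sqrt(fact_25)    # float approximation
floor_root = math.isqrt(fact_25)    # exact integer floor of the square root

print("Primes between 10 and 30:", primes)
print("Largest prime:", largest_prime)
print("Factorial divisible by 12:", factorial_divisible_by_12)
print("25! =", fact_25)
print("sqrt(25!) ≈", approx_root, "| floor:", floor_root)
```

Python integers have arbitrary precision, so `math.factorial(25)` is exact. An agent whose calculator tool works in floating point may return a rounded value, which is one plausible trigger for the alternative approaches the error-recovery query asks for.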

### 7.2 Test Complex Scenarios

In [None]:
def test_complex_agent_scenarios(agent, scenarios):
    """
    Test the agent with complex scenarios
    """
    
    if not agent:
        print("❌ No agent available for complex scenario testing")
        return []
    
    print("🚀 Testing Complex Agent Scenarios\n")
    
    complex_results = []
    
    for i, scenario in enumerate(scenarios, 1):
        print(f"🧩 Complex Scenario {i}: {scenario['name']}")
        print(f"Query: {scenario['query']}")
        
        try:
            # Create initial state
            initial_state = create_initial_state(scenario['query'])
            
            # Execute agent
            start_time = time.time()
            final_state = agent.invoke(initial_state)
            execution_time = time.time() - start_time
            
            # Extract results with scenario analysis
            result = {
                "scenario_name": scenario['name'],
                "query": scenario['query'],
                "final_answer": final_state.get("final_answer", "No answer generated"),
                "reasoning_steps": final_state.get("reasoning_steps", []),
                "tool_calls": final_state.get("tool_calls", []),
                "intermediate_results": final_state.get("intermediate_results", []),
                "execution_status": final_state.get("execution_status", "unknown"),
                "execution_time": execution_time,
                "error_message": final_state.get("error_message", ""),
                
                # Scenario-specific analysis
                "expected_steps": scenario['expected_steps'],
                "expected_tools": scenario['expected_tools'],
                "tools_used": [call['tool'] for call in final_state.get("tool_calls", [])],
                "step_coverage": 0,  # Will calculate below
                "tool_coverage": 0   # Will calculate below
            }
            
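            # Calculate step coverage (how many expected steps were addressed).
            # Match only on words longer than 3 characters; filler words such as
            # "of" or "if" would otherwise make almost every step count as covered.
            reasoning_text = " ".join(result["reasoning_steps"]).lower()
            covered_steps = 0
            for step in scenario['expected_steps']:
                keywords = [word.lower() for word in step.split() if len(word) > 3]
                if any(keyword in reasoning_text for keyword in keywords):
                    covered_steps += 1
            result["step_coverage"] = covered_steps / len(scenario['expected_steps']) if scenario['expected_steps'] else 0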
            
            # Calculate tool coverage (how many expected tools were used)
            used_expected_tools = set(result["tools_used"]) & set(scenario['expected_tools'])
            result["tool_coverage"] = len(used_expected_tools) / len(scenario['expected_tools']) if scenario['expected_tools'] else 0
            
            complex_results.append(result)
            
            # Print detailed analysis
            status_emoji = "✅" if result["execution_status"] == "completed" else "❌"
            print(f"  {status_emoji} Status: {result['execution_status']}")
            print(f"  ⏱️  Execution time: {execution_time:.2f}s")
            print(f"  🎯 Step coverage: {result['step_coverage']:.1%} ({covered_steps}/{len(scenario['expected_steps'])})")
            print(f"  🔧 Tool coverage: {result['tool_coverage']:.1%} ({len(used_expected_tools)}/{len(scenario['expected_tools'])})")
            print(f"  📊 Tools used: {', '.join(result['tools_used']) if result['tools_used'] else 'None'}")
            print(f"  🧠 Reasoning steps: {len(result['reasoning_steps'])}")
            
            if result["final_answer"]:
                print(f"  💡 Answer: {result['final_answer'][:100]}...")
            
            if result["error_message"]:
                print(f"  ⚠️  Error: {result['error_message']}")
                
        except Exception as e:
            print(f"  ❌ Execution failed: {e}")
            complex_results.append({
                "scenario_name": scenario['name'],
                "query": scenario['query'],
                "execution_status": "failed",
                "error_message": str(e)
            })
        
        print()
    
    # Overall complex scenario analysis
    if complex_results:
        successful_scenarios = [r for r in complex_results if r.get("execution_status") == "completed"]
        
        if successful_scenarios:
            avg_step_coverage = np.mean([r.get("step_coverage", 0) for r in successful_scenarios])
            avg_tool_coverage = np.mean([r.get("tool_coverage", 0) for r in successful_scenarios])
            avg_execution_time = np.mean([r.get("execution_time", 0) for r in successful_scenarios])
            
            print(f"📊 Complex Scenario Summary:")
            print(f"  Successful scenarios: {len(successful_scenarios)}/{len(complex_results)}")
            print(f"  Average step coverage: {avg_step_coverage:.1%}")
            print(f"  Average tool coverage: {avg_tool_coverage:.1%}")
            print(f"  Average execution time: {avg_execution_time:.2f}s")
            
            if avg_step_coverage < 0.7:
                print(f"  ⚠️  Low step coverage - agent may miss important reasoning steps")
            if avg_tool_coverage < 0.8:
                print(f"  ⚠️  Low tool coverage - agent may not be using available tools effectively")
    
    return complex_results

# Test complex scenarios
complex_scenario_results = test_complex_agent_scenarios(langgraph_agent, complex_scenarios)
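
Step coverage and tool coverage are keyword and set-overlap heuristics, so it helps to see them on a small input. The sketch below is one possible stricter variant, not the function used above: `content_words`, `step_coverage`, `tool_coverage`, and the 50% overlap threshold are assumptions made for illustration. A step counts as covered only when at least half of its content words appear in the reasoning trace.

```python
import string


def content_words(text: str, min_len: int = 4) -> set:
    """Lower-cased words of at least `min_len` characters, punctuation stripped."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if len(w) >= min_len}


def step_coverage(reasoning_steps, expected_steps, min_overlap: float = 0.5) -> float:
    """Fraction of expected steps whose content words mostly appear in the reasoning."""
    if not expected_steps:
        return 0.0
    reasoning_vocab = content_words(" ".join(reasoning_steps))
    covered = 0
    for step in expected_steps:
        keywords = content_words(step)
        if keywords and len(keywords & reasoning_vocab) / len(keywords) >= min_overlap:
            covered += 1
    return covered / len(expected_steps)


def tool_coverage(tools_used, expected_tools) -> float:
    """Fraction of expected tools that were called at least once."""
    expected = set(expected_tools)
    if not expected:
        return 0.0
    return len(set(tools_used) & expected) / len(expected)


reasoning = [
    "First I identify the prime numbers between 10 and 30",
    "Then I calculate the factorial of 29",
]
expected = [
    "Identify prime numbers between 10 and 30",
    "Calculate factorial of largest prime",
    "Check divisibility by 12",
]

steps_covered = step_coverage(reasoning, expected)
tools_covered = tool_coverage(
    ["prime_checker", "factorial"],
    ["prime_checker", "factorial", "calculator"],
)
print(f"step coverage: {steps_covered:.1%}, tool coverage: {tools_covered:.1%}")
```

Exact word matching misses paraphrases ("verify" versus "check"), so low coverage from a heuristic like this is a reason to read the trace rather than proof that the agent skipped a step.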

### 7.3 Evaluate Complex Scenarios

In [None]:
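# Evaluate complex scenarios with agent metrics
if complex_scenario_results:
    print("🔍 Evaluating Complex Scenarios with Agent Metrics\n")
    
    # Use a separate pipeline so the summary covers only the complex scenarios;
    # get_performance_summary() aggregates a pipeline's whole evaluation history
    complex_eval_pipeline = AgentEvaluationPipeline()
    complex_evaluation_results = complex_eval_pipeline.batch_evaluate(complex_scenario_results)
    complex_performance_summary = complex_eval_pipeline.get_performance_summary()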
    
    # Compare simple vs complex performance
    print("\n📊 Simple vs Complex Scenario Comparison:")
    
    # Get performance from both evaluations
    simple_scores = [r.get("overall_score", 0) for r in comprehensive_agent_results if "overall_score" in r]
    complex_scores = [r.get("overall_score", 0) for r in complex_evaluation_results if "overall_score" in r]
    
    if simple_scores and complex_scores:
        simple_avg = np.mean(simple_scores)
        complex_avg = np.mean(complex_scores)
        
        print(f"  Simple scenarios average: {simple_avg:.1f}/10")
        print(f"  Complex scenarios average: {complex_avg:.1f}/10")
        print(f"  Performance difference: {simple_avg - complex_avg:+.1f}")
        
        if complex_avg < simple_avg - 1.0:
            print(f"  ⚠️  Significant performance drop on complex tasks - agent struggles with complexity")
        elif abs(complex_avg - simple_avg) < 0.5:
            print(f"  ✅ Consistent performance across complexity levels")
        elif complex_avg > simple_avg:
            print(f"  📈 Agent handles complexity well")
        else:
            print(f"  ⚠️  Moderate performance drop on complex tasks")

else:
    print("❌ No complex scenario results to evaluate")
    complex_evaluation_results = []
    complex_performance_summary = {}
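
`get_performance_summary` aggregates over a pipeline's entire `evaluation_history`, so a per-batch view is easiest to compute from the list that `batch_evaluate` returns. A minimal sketch, assuming only the result shape used in this notebook (entries with an `overall_score`, or an `error` for failed evaluations); `summarize_batch` and the toy batch are hypothetical.

```python
from statistics import mean, pstdev


def summarize_batch(results):
    """Summary statistics for one batch, ignoring entries without a score."""
    scores = [r["overall_score"] for r in results if "overall_score" in r]
    if not scores:
        return {"evaluated": 0, "skipped": len(results)}
    return {
        "evaluated": len(scores),
        "skipped": len(results) - len(scores),
        "average_score": round(mean(scores), 2),
        "score_std": round(pstdev(scores), 2),  # population std, like np.std's default
        "min_score": min(scores),
        "max_score": max(scores),
    }


# Toy batch (made-up scores): two scored runs and one failed evaluation.
complex_batch = [
    {"query": "primes and factorial", "overall_score": 8.0},
    {"query": "time-based calculation", "overall_score": 6.0},
    {"query": "error recovery", "error": "judge timeout"},
]

batch_summary = summarize_batch(complex_batch)
print(batch_summary)
```

Reporting `skipped` alongside the averages keeps failed evaluations visible instead of silently shrinking the sample.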

## 🎓 Part 8: Exercises and Practice

### Exercise 1: Custom Agent Metric

In [None]:
# Exercise 1: Create a custom metric for agent creativity
def exercise_1_creativity_metric():
    """
    TODO: Create a metric that evaluates creativity and innovation in agent reasoning
    Requirements:
    1. Evaluate creative problem-solving approaches
    2. Assess innovative use of available tools
    3. Measure originality in reasoning steps
    4. Test with various creative scenarios
    """
    
    # TODO: Define creativity evaluation criteria
    creativity_criteria = """
    Your creativity evaluation criteria here...
    Focus on:
    - Novel problem-solving approaches
    - Innovative tool combinations
    - Original reasoning patterns
    - Creative solution pathways
    """
    
    # TODO: Create G-Eval metric
    
    # TODO: Test with creative scenarios
    
    return None

print("💡 Exercise 1 Template created. Complete the function above!")
print("Hints:")
print("- Focus on non-standard approaches to problem solving")
print("- Consider tool usage patterns and combinations")
print("- Evaluate reasoning originality and innovation")

### Exercise 2: Agent Performance Optimization

In [None]:
# Exercise 2: Optimize agent performance based on evaluation results
def exercise_2_agent_optimization():
    """
    TODO: Analyze evaluation results and propose agent improvements
    Requirements:
    1. Identify performance bottlenecks from evaluation data
    2. Propose specific improvements to the agent architecture
    3. Design enhanced reasoning prompts
    4. Create optimized tool selection strategies
    """
    
    # TODO: Analyze current performance data
    performance_analysis = {
        "weak_areas": [],
        "strong_areas": [],
        "improvement_opportunities": []
    }
    
    # TODO: Design improvements
    
    # TODO: Create enhanced agent version
    
    # TODO: Compare performance improvements
    
    return None

print("💡 Exercise 2 Template created. Complete the function above!")
print("Hints:")
print("- Analyze metric scores to identify improvement areas")
print("- Consider prompt engineering improvements")
print("- Design better tool selection logic")
print("- Implement enhanced error recovery")

### Exercise 3: Multi-Agent Evaluation

In [None]:
# Exercise 3: Create multi-agent evaluation framework
def exercise_3_multi_agent_evaluation():
    """
    TODO: Build a framework to compare multiple agents
    Requirements:
    1. Create different agent configurations
    2. Design comparative evaluation metrics
    3. Run head-to-head agent comparisons
    4. Analyze strengths/weaknesses of each agent
    5. Determine optimal agent configuration
    """
    
    class MultiAgentEvaluator:
        def __init__(self):
            # TODO: Initialize multiple agent configurations
            pass
        
        def create_agent_variants(self):
            # TODO: Create different agent configurations
            # - Different reasoning strategies
            # - Different tool sets
            # - Different prompt templates
            pass
        
        def comparative_evaluation(self, test_scenarios):
            # TODO: Run same scenarios across all agents
            # TODO: Compare performance metrics
            # TODO: Identify best agent for each scenario type
            pass
        
        def generate_recommendations(self):
            # TODO: Recommend optimal agent configuration
            pass
    
    # TODO: Implement and test
    
    return None

print("💡 Exercise 3 Template created. Complete the class above!")
print("Hints:")
print("- Create agents with different capabilities")
print("- Use consistent evaluation criteria")
print("- Compare both performance and efficiency")
print("- Consider task-specific agent selection")
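
The core loop of `comparative_evaluation` might look like the sketch below. It treats each agent as a plain callable and takes the scoring function as a parameter, so the same scenarios and criteria are applied to every variant. The stub agents, the keyword scorer, and the function names are hypothetical stand-ins. In practice the scorer would wrap one of the DeepEval metrics.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Scenario = Tuple[str, str]            # (scenario_type, query)
Agent = Callable[[str], str]          # query -> answer
Scorer = Callable[[str, str], float]  # (query, answer) -> score in [0, 1]


def comparative_evaluation(
    agents: Dict[str, Agent], scenarios: List[Scenario], scorer: Scorer
) -> Dict[str, Dict[str, float]]:
    """Return the mean score per scenario type for each agent."""
    totals = defaultdict(lambda: defaultdict(list))
    for scenario_type, query in scenarios:
        for name, agent in agents.items():
            totals[scenario_type][name].append(scorer(query, agent(query)))
    return {
        s_type: {name: sum(vals) / len(vals) for name, vals in per_agent.items()}
        for s_type, per_agent in totals.items()
    }


def best_agent_per_type(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the highest-scoring agent for each scenario type."""
    return {s_type: max(per_agent, key=per_agent.get) for s_type, per_agent in results.items()}


# Stub agents and a toy keyword scorer, for illustration only.
agents = {
    "calculator_agent": lambda q: "42",
    "explainer_agent": lambda q: "Rayleigh scattering makes the sky look blue.",
}
expected_keyword = {"What is 6 * 7?": "42", "Why is the sky blue?": "scattering"}
scenarios = [("math", "What is 6 * 7?"), ("explain", "Why is the sky blue?")]


def keyword_scorer(query: str, answer: str) -> float:
    return 1.0 if expected_keyword[query] in answer else 0.0


results = comparative_evaluation(agents, scenarios, keyword_scorer)
print(best_agent_per_type(results))
```

Grouping by scenario type rather than averaging over everything keeps task-specific strengths visible, which is what the last hint above asks for.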

## 🎯 Summary and Next Steps

### 🏆 What You Learned in This Notebook:

1. **✅ LangGraph Agent Construction**
   - Multi-node agent architecture with StateGraph
   - Custom tool creation and integration
   - Agent state management and workflow control
   - Planning → Execution → Synthesis pipeline

2. **✅ Agent-Specific Evaluation Metrics**
   - **LogicalFlowMetric**: Reasoning coherence and clarity
   - **ToolUsageMetric**: Tool selection and efficiency
   - **PlanExecutionMetric**: Planning and execution fidelity
   - **AdaptabilityMetric**: Flexibility and error recovery

3. **✅ Chain-of-Thought Evaluation**
   - Intermediate step capture and analysis
   - Multi-dimensional reasoning assessment
   - Reasoning quality measurement
   - Step-by-step logical flow evaluation

4. **✅ Comprehensive Agent Assessment**
   - End-to-end agent evaluation pipeline
   - Complex scenario testing
   - Performance benchmarking
   - Comparative analysis frameworks

5. **✅ Advanced Analysis Techniques**
   - Agent performance visualization
   - Tool usage efficiency analysis
   - Reasoning depth vs success correlation
   - Multi-scenario performance tracking
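
As a reminder of what the depth-versus-success analysis involves, here is a minimal Pearson correlation sketch in plain Python. The `pearson` helper and the sample numbers are illustrative assumptions. The notebook's own analysis may compute this differently, for example with pandas.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: number of reasoning steps per run and the run's overall score.
reasoning_steps = [2, 3, 5, 6, 8]
overall_scores = [0.4, 0.5, 0.7, 0.65, 0.9]
r = pearson(reasoning_steps, overall_scores)
print(f"depth vs. success correlation: {r:.2f}")
```

A high coefficient on a handful of runs is only suggestive. Deeper reasoning and higher scores can both be driven by scenario difficulty.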

### 🚀 Next Steps - Notebook 5: Feedback Loop & Regenerating CoT

In the final notebook, we will cover:

- 🔄 **Automated Feedback Loops**: Self-improving agent systems
- 🧠 **CoT Regeneration**: Dynamic reasoning improvement
- 📈 **Performance Optimization**: Iterative agent enhancement
- 🎯 **Production Deployment**: Real-world agent evaluation
- 🔍 **Continuous Monitoring**: Long-term performance tracking

### 💡 Key Insights from Agent Evaluation:

- **Multi-Step Reasoning**: Evaluate both the process and the outcome
- **Tool Usage Patterns**: Efficiency matters more than quantity
- **Adaptability**: Critical for real-world deployment
- **Context Awareness**: Essential for complex scenarios
- **Error Recovery**: Separates good agents from great ones

### 🎯 Agent Evaluation Best Practices:

1. **Capture Intermediate Steps** for detailed analysis
2. **Test with Complex Scenarios** beyond simple queries
3. **Measure Multiple Dimensions** of agent performance
4. **Compare Agent Variants** to optimize configuration
5. **Monitor Tool Usage Efficiency** to avoid waste
6. **Evaluate Reasoning Quality** not just final answers
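
Practice 5 can be monitored with a very simple proxy, sketched below: the share of tool calls that are not exact repeats of an earlier call. This definition, the `tool_efficiency` and `redundant_calls` helpers, and the sample calls are assumptions for illustration. They are not the notebook's ToolUsageMetric.

```python
from collections import Counter
from typing import Dict, List, Tuple

ToolCall = Tuple[str, str]  # (tool_name, tool_input)


def tool_efficiency(calls: List[ToolCall]) -> float:
    """Share of calls that are not exact repeats of an earlier call."""
    if not calls:
        return 1.0  # no calls means nothing was wasted
    return len(set(calls)) / len(calls)


def redundant_calls(calls: List[ToolCall]) -> Dict[ToolCall, int]:
    """Map each call made more than once to the number of times it was made."""
    counts = Counter(calls)
    return {call: n for call, n in counts.items() if n > 1}


# Illustrative call log; a real one would come from the agent's captured steps.
calls = [
    ("search", "weather in Hanoi"),
    ("calculator", "30 * 1.8 + 32"),
    ("search", "weather in Hanoi"),
    ("search", "weather in Hue"),
]
print(tool_efficiency(calls))
print(redundant_calls(calls))
```

This only catches exact duplicates. Near-duplicate queries or calls to an unnecessary tool still need an LLM-based judgment.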

### 📊 Agent vs Traditional System Evaluation:

| Aspect | Traditional Systems | AI Agents with CoT |
|--------|--------------------|-----------------|
| **Evaluation Focus** | Final output only | Process + outcome |
| **Reasoning** | Black box | Transparent steps |
| **Tool Usage** | Fixed workflow | Dynamic selection |
| **Adaptability** | Static rules | Learning capability |
| **Error Handling** | Predefined paths | Creative recovery |
| **Complexity** | Linear scaling | Emergent behavior |

### 🔍 Performance Insights Summary:

The evaluation results suggest:
- Agents perform better with **well-defined tools**
- **Planning quality** directly impacts execution success
- **Tool efficiency** is a key differentiator
- **Complex scenarios** reveal agent limitations
- **Reasoning transparency** enables better debugging

---

## 🎉 Exceptional Achievement!

You have mastered advanced AI agent evaluation with LangGraph and Chain-of-Thought reasoning!

Ready for the final challenge: **Notebook 5: Feedback Loop & Regenerating CoT**? 🚀🔄