In [22]:
import os

from dotenv import load_dotenv

load_dotenv(os.path.join("..", ".env"), override=True)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Evaluation & Testing Deep Agents

<img src="./assets/agent_header.png" width="800" style="display:block; margin-left:0;">

Building reliable agents requires systematic testing and evaluation. Unlike traditional software, where you test specific code paths, agent evaluation focuses on emergent behaviors: did the agent solve the task? Did it use tools efficiently? Did it maintain context appropriately?

This lesson covers practical evaluation techniques using LangSmith, building on the full deep agent from the previous notebook. You'll learn:

1. **Creating Evaluation Datasets** - Structuring test cases for agent behavior
2. **Custom Evaluators** - Writing functions to assess TODO usage, tool selection, and task completion
3. **Running Evaluations** - Using LangSmith's `evaluate()` API for systematic testing
4. **Regression Testing** - Tracking agent performance across versions


### Why Evaluate Agents?

Agent behavior is non-deterministic and complex. Evaluation helps:
- **Catch Regressions**: Ensure prompt changes don't break existing capabilities
- **Optimize Costs**: Identify unnecessary tool calls or excessive token usage
- **Validate Patterns**: Confirm agents use TODO lists and files appropriately
- **Compare Approaches**: A/B test different prompts, models, or architectures

Production systems like [Manus](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus) and [Claude Code](https://www.anthropic.com/engineering/claude-code-best-practices) rely heavily on evaluation to maintain quality as they evolve.

### Setup: LangSmith Client

First, verify LangSmith is configured correctly:

In [2]:
import os
from langsmith import Client

# Verify environment
assert os.getenv("LANGSMITH_API_KEY"), "LANGSMITH_API_KEY not found in environment"
assert os.getenv("LANGSMITH_PROJECT"), "LANGSMITH_PROJECT not found in environment"

# Initialize client
client = Client()

print(f"✓ LangSmith configured")
print(f"  Project: {os.getenv('LANGSMITH_PROJECT')}")
print(f"  URL: https://smith.langchain.com")

✓ LangSmith configured
  Project: deep-agents-from-scratch
  URL: https://smith.langchain.com


### Create Test Agent

We'll use a simplified version of the deep agent with TODO and file tools:

In [4]:
from langchain_sambanova import ChatSambaNova
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

from deep_agents_from_scratch.state import DeepAgentState
from deep_agents_from_scratch.todo_tools import write_todos, read_todos
from deep_agents_from_scratch.file_tools import ls, read_file, write_file

# Mock search tool
search_results = {
    "langgraph": """LangGraph is a framework for building stateful, multi-actor applications with LLMs. 
    Key features: state management, cyclic graphs, persistence, human-in-the-loop.""",
    "mcp": """Model Context Protocol (MCP) is an open standard for connecting AI models to external tools and data sources.""",
    "default": "Search results for your query."
}

@tool(parse_docstring=True)
def web_search(query: str):
    """Search the web for information.
    
    Args:
        query: Search query
        
    Returns:
        Search results
    """
    query_lower = query.lower()
    for key, result in search_results.items():
        if key in query_lower:
            return result
    return search_results["default"]

# Create agent
model = ChatSambaNova(model="Llama-4-Maverick-17B-128E-Instruct", temperature=0.0)

tools = [write_todos, read_todos, web_search, ls, read_file, write_file]

AGENT_PROMPT = """You are a helpful research assistant with access to:
- TODO list management (write_todos, read_todos)
- Web search (web_search)
- File system (ls, read_file, write_file)

For complex tasks:
1. Create a TODO list at the start
2. Search for information and save to files
3. Mark tasks complete as you progress
"""

agent = create_react_agent(
    model,
    tools,
    prompt=AGENT_PROMPT,
    state_schema=DeepAgentState
)

print("✓ Agent created with TODO, search, and file tools")

✓ Agent created with TODO, search, and file tools


### Creating Evaluation Datasets

A good evaluation dataset covers:
- **Simple tasks**: Single-step queries that don't need TODOs
- **Complex tasks**: Multi-step workflows requiring planning
- **Edge cases**: Error handling, missing files, unclear requests

Each example has:
- `input`: The user's query (as messages)
- `expected`: Expected behaviors (not exact outputs, but patterns)

In [12]:
# Define test cases
test_cases = [
    {
        "input": {
            "messages": [{"role": "user", "content": "What is 2+2?"}]
        },
        "expected": {
            "should_create_todos": False,
            "should_use_search": False,
            "should_create_files": False,
            "task_completed": True
        },
        "description": "Simple math - no tools needed"
    },
    {
        "input": {
            "messages": [{"role": "user", "content": "Research LangGraph and create a quick summary file"}]
        },
        "expected": {
            "should_create_todos": True,
            "should_use_search": True,
            "should_create_files": True,
            "min_todos": 2,
            "expected_tools": ["write_todos", "web_search", "write_file"],
            "task_completed": True
        },
        "description": "Complex research task"
    },
    {
        "input": {
            "messages": [{"role": "user", "content": "Search for MCP, save the results, and quickly summarize the key points"}]
        },
        "expected": {
            "should_create_todos": True,
            "should_use_search": True,
            "should_create_files": True,
            "min_todos": 3,
            "expected_tools": ["write_todos", "web_search", "write_file"],
            "task_completed": True
        },
        "description": "Multi-step research with file save"
    },
]

print(f"Defined {len(test_cases)} test cases")
for i, tc in enumerate(test_cases):
    print(f"  {i+1}. {tc['description']}")

Defined 3 test cases
  1. Simple math - no tools needed
  2. Complex research task
  3. Multi-step research with file save
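The list above names edge cases, but the three test cases defined here cover only simple and complex tasks. Below is a minimal sketch of how an edge case could be expressed in the same schema, together with a small validator that catches malformed cases before upload. The `edge_case` content, the `REQUIRED_EXPECTED_KEYS` set, and `validate_test_case` are hypothetical names introduced for illustration; they are not part of the course package.

```python
# Hypothetical edge case written in the same schema as the test cases above.
edge_case = {
    "input": {
        "messages": [{"role": "user", "content": "Read notes.txt and summarize it"}]
    },
    "expected": {
        "should_create_todos": False,
        "should_use_search": False,
        "should_create_files": False,
        "task_completed": True,
    },
    "description": "Missing file - agent should report that notes.txt does not exist",
}

# Assumed minimum set of expectation flags that every case should define.
REQUIRED_EXPECTED_KEYS = {
    "should_create_todos",
    "should_use_search",
    "should_create_files",
    "task_completed",
}


def validate_test_case(tc: dict) -> list[str]:
    """Return a list of problems found in a test case (empty list means valid)."""
    problems = []
    messages = tc.get("input", {}).get("messages")
    if not messages:
        problems.append("input.messages is missing or empty")
    missing = REQUIRED_EXPECTED_KEYS - set(tc.get("expected", {}))
    if missing:
        problems.append(f"expected is missing keys: {sorted(missing)}")
    if not tc.get("description"):
        problems.append("description is missing")
    return problems


print(validate_test_case(edge_case))
```

Running a check like this over every case before calling `create_example` keeps typos in expectation keys from silently falling back to evaluator defaults.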


### Upload Dataset to LangSmith

Now we'll create a dataset in LangSmith and upload our test cases:

In [13]:
from datetime import datetime

# Create unique dataset name
dataset_name = f"deep-agent-evaluation-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Create dataset
try:
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Evaluation dataset for deep agents with TODO, search, and file tools"
    )
    print(f"✓ Created dataset: {dataset_name}")
    
    # Upload examples
    for tc in test_cases:
        client.create_example(
            inputs=tc["input"],
            outputs=tc["expected"],
            dataset_id=dataset.id,
            metadata={"description": tc["description"]}
        )
    
    print(f"✓ Uploaded {len(test_cases)} examples")
    print(f"  View at: https://smith.langchain.com")
    
except Exception as e:
    print(f"⚠ Error creating dataset: {e}")
    print(f"  Using local test cases only")

✓ Created dataset: deep-agent-evaluation-20251103-225203
✓ Uploaded 3 examples
  View at: https://smith.langchain.com


### Custom Evaluators

Evaluators are functions that score agent runs. They receive:
- `run`: The agent execution trace (includes outputs, inputs, metadata)
- `example`: The test case (includes expected outputs)

They return a dict with:
- `key`: Metric name
- `score`: Float between 0.0 and 1.0
- `comment`: Optional explanation
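
Because an evaluator of this shape only reads `run.outputs` and `example.outputs`, it can be exercised without running the agent at all. The following is a minimal sketch under that assumption: `SimpleNamespace` stands in for the LangSmith `Run` and `Example` objects, and `evaluate_has_messages` is a hypothetical evaluator used only to illustrate the contract.

```python
from types import SimpleNamespace


def evaluate_has_messages(run, example) -> dict:
    """Hypothetical evaluator: score 1.0 if the run produced any messages."""
    outputs = run.outputs or {}
    messages = outputs.get("messages", [])
    return {
        "key": "has_messages",
        "score": 1.0 if messages else 0.0,
        "comment": f"{len(messages)} message(s) in final state",
    }


# Stand-ins exposing only the `.outputs` attribute that the evaluator reads.
fake_run = SimpleNamespace(outputs={"messages": [{"type": "ai", "content": "4"}]})
fake_example = SimpleNamespace(outputs={"task_completed": True})

result = evaluate_has_messages(fake_run, fake_example)
print(result)
```

Testing evaluators this way separates bugs in the scoring logic from failures in the agent itself.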

Let's create evaluators for our deep agent patterns:

In [14]:
from langsmith.schemas import Run, Example

def evaluate_todo_usage(run: Run, example: Example) -> dict:
    """Evaluate if agent appropriately used TODO list."""
    # Get final state
    outputs = run.outputs or {}
    todos = outputs.get("todos", [])
    
    # Get expectations
    expected = example.outputs or {}
    should_create = expected.get("should_create_todos", False)
    min_todos = expected.get("min_todos", 0)
    
    # Evaluate
    if should_create:
        if not todos:
            return {
                "key": "todo_usage",
                "score": 0.0,
                "comment": "Should have created TODO list for complex task"
            }
        elif len(todos) < min_todos:
            return {
                "key": "todo_usage",
                "score": 0.5,
                "comment": f"Created {len(todos)} TODOs but expected at least {min_todos}"
            }
        else:
            # Check if tasks were completed
            completed = sum(1 for t in todos if t.get("status") == "completed")
            completion_rate = completed / len(todos) if todos else 0
            
            return {
                "key": "todo_usage",
                "score": 1.0,
                "comment": f"Created {len(todos)} TODOs, {completed} completed ({completion_rate:.0%})"
            }
    else:
        # Simple task - TODOs not needed
        if todos:
            return {
                "key": "todo_usage",
                "score": 0.7,
                "comment": "Created unnecessary TODO list for simple task"
            }
        else:
            return {
                "key": "todo_usage",
                "score": 1.0,
                "comment": "Correctly avoided TODO overhead"
            }

print("✓ Defined evaluate_todo_usage")

✓ Defined evaluate_todo_usage


In [15]:
def evaluate_tool_selection(run: Run, example: Example) -> dict:
    """Evaluate if agent used the right tools."""
    # Extract tool calls from messages
    outputs = run.outputs or {}
    messages = outputs.get("messages", [])
    
    tools_called = []
    for msg in messages:
        # Handle both dict and object formats
        if isinstance(msg, dict):
            tool_calls = msg.get("tool_calls", [])
        else:
            tool_calls = getattr(msg, "tool_calls", [])
        
        if tool_calls:
            for tc in tool_calls:
                # Tool calls may be dicts or objects exposing a `name` attribute
                name = tc.get("name") if isinstance(tc, dict) else getattr(tc, "name", None)
                if name:
                    tools_called.append(name)
    
    # Get expected tools
    expected = example.outputs or {}
    expected_tools = set(expected.get("expected_tools", []))
    actual_tools = set(tools_called)
    
    if not expected_tools:
        # No specific tools expected
        return {"key": "tool_selection", "score": 1.0, "comment": "No specific tools required"}
    
    # Check coverage
    missing = expected_tools - actual_tools
    extra = actual_tools - expected_tools
    
    if not missing and not extra:
        return {
            "key": "tool_selection",
            "score": 1.0,
            "comment": f"Perfect tool selection: {list(actual_tools)}"
        }
    elif not missing:
        return {
            "key": "tool_selection",
            "score": 0.8,
            "comment": f"Called extra tools: {list(extra)}"
        }
    else:
        return {
            "key": "tool_selection",
            "score": 0.0,
            "comment": f"Missing required tools: {list(missing)}"
        }

print("✓ Defined evaluate_tool_selection")

✓ Defined evaluate_tool_selection
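
The evaluator above scores 0.0 whenever any expected tool is missing. Where graded feedback is more useful, one alternative is to score by the fraction of expected tools that were actually called. This is a sketch of that alternative, not the scoring rule used in this notebook; `tool_recall_score` is a hypothetical helper.

```python
def tool_recall_score(expected_tools: set[str], actual_tools: set[str]) -> dict:
    """Score tool selection by the fraction of expected tools that were called."""
    if not expected_tools:
        return {"key": "tool_recall", "score": 1.0, "comment": "No specific tools required"}
    called = expected_tools & actual_tools
    missing = expected_tools - actual_tools
    score = len(called) / len(expected_tools)
    comment = f"Called {len(called)}/{len(expected_tools)} expected tools"
    if missing:
        comment += f"; missing: {sorted(missing)}"
    return {"key": "tool_recall", "score": score, "comment": comment}


partial = tool_recall_score(
    {"write_todos", "web_search", "write_file"},
    {"web_search", "write_file", "ls"},
)
print(partial)
```

A graded score makes it easier to tell whether a prompt change moved the agent closer to the expected workflow, at the cost of a less strict pass/fail signal.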


In [16]:
def evaluate_file_operations(run: Run, example: Example) -> dict:
    """Evaluate if agent properly used file system."""
    outputs = run.outputs or {}
    files = outputs.get("files", {})
    
    expected = example.outputs or {}
    should_create = expected.get("should_create_files", False)
    
    if should_create:
        if not files:
            return {
                "key": "file_operations",
                "score": 0.0,
                "comment": "Should have created files for research results"
            }
        else:
            # Check if files have content
            empty_files = [name for name, content in files.items() if not content.strip()]
            
            if empty_files:
                return {
                    "key": "file_operations",
                    "score": 0.5,
                    "comment": f"Created files but some are empty: {empty_files}"
                }
            else:
                total_chars = sum(len(content) for content in files.values())
                return {
                    "key": "file_operations",
                    "score": 1.0,
                    "comment": f"Created {len(files)} files with {total_chars} chars"
                }
    else:
        # Files not needed
        if files:
            return {
                "key": "file_operations",
                "score": 0.7,
                "comment": "Created unnecessary files"
            }
        else:
            return {
                "key": "file_operations",
                "score": 1.0,
                "comment": "Correctly avoided file operations"
            }

print("✓ Defined evaluate_file_operations")

✓ Defined evaluate_file_operations


In [17]:
def evaluate_task_completion(run: Run, example: Example) -> dict:
    """Evaluate if agent completed the task."""
    outputs = run.outputs or {}
    messages = outputs.get("messages", [])
    
    if not messages:
        return {
            "key": "task_completion",
            "score": 0.0,
            "comment": "No messages generated"
        }
    
    # Check if last message is from AI (not a tool call)
    last_msg = messages[-1]
    
    # Handle both dict and object formats
    if isinstance(last_msg, dict):
        # Dict messages may carry the sender under "type" or under "role"
        msg_type = last_msg.get("type") or last_msg.get("role", "")
        has_content = bool(last_msg.get("content", ""))
        has_tool_calls = bool(last_msg.get("tool_calls", []))
    else:
        msg_type = last_msg.__class__.__name__
        has_content = bool(getattr(last_msg, "content", ""))
        has_tool_calls = bool(getattr(last_msg, "tool_calls", []))
    
    # Agent completed if:
    # - Last message is from AI/Assistant
    # - Has content (not empty)
    # - No pending tool calls
    is_ai = "ai" in msg_type.lower() or msg_type == "assistant"
    
    if is_ai and has_content and not has_tool_calls:
        return {
            "key": "task_completion",
            "score": 1.0,
            "comment": "Agent provided final response"
        }
    else:
        return {
            "key": "task_completion",
            "score": 0.0,
            "comment": f"No final response (last message: {msg_type})"
        }

print("✓ Defined evaluate_task_completion")

✓ Defined evaluate_task_completion


### Running Evaluations

Now we'll run our agent on each test case and apply the evaluators. We'll do this locally first to see the results:

In [18]:
# Manual evaluation (without LangSmith API)
print("Running manual evaluation...\n")

results = []

for i, tc in enumerate(test_cases):
    print(f"[{i+1}/{len(test_cases)}] {tc['description']}")
    
    # Run agent
    try:
        # Initialize state with empty todos and files
        initial_state = {
            **tc["input"],
            "todos": [],
            "files": {}
        }
        
        result = agent.invoke(initial_state)
        
        # Create mock Run and Example for evaluators
        class MockRun:
            def __init__(self, outputs):
                self.outputs = outputs
        
        class MockExample:
            def __init__(self, outputs):
                self.outputs = outputs
        
        mock_run = MockRun(result)
        mock_example = MockExample(tc["expected"])
        
        # Run evaluators
        evals = [
            evaluate_todo_usage(mock_run, mock_example),
            evaluate_tool_selection(mock_run, mock_example),
            evaluate_file_operations(mock_run, mock_example),
            evaluate_task_completion(mock_run, mock_example),
        ]
        
        # Calculate average score
        avg_score = sum(e["score"] for e in evals) / len(evals)
        
        # Print results
        print(f"  Overall: {avg_score:.1%}")
        for e in evals:
            emoji = "✓" if e["score"] >= 0.8 else "⚠" if e["score"] >= 0.5 else "✗"
            print(f"    {emoji} {e['key']}: {e['score']:.1%} - {e['comment']}")
        
        results.append({
            "test_case": tc["description"],
            "score": avg_score,
            "evaluations": evals
        })
        
    except Exception as e:
        print(f"  ✗ Error: {e}")
        results.append({
            "test_case": tc["description"],
            "score": 0.0,
            "error": str(e)
        })
    
    print()

# Summary
print("=" * 80)
print("EVALUATION SUMMARY")
print("=" * 80)
overall_avg = sum(r["score"] for r in results) / len(results)
print(f"Overall Score: {overall_avg:.1%}")
print(f"\nResults by test case:")
for r in results:
    print(f"  {r['score']:.1%} - {r['test_case']}")

Running manual evaluation...

[1/3] Simple math - no tools needed
  Overall: 100.0%
    ✓ todo_usage: 100.0% - Correctly avoided TODO overhead
    ✓ tool_selection: 100.0% - No specific tools required
    ✓ file_operations: 100.0% - Correctly avoided file operations
    ✓ task_completion: 100.0% - Agent provided final response

[2/3] Complex research task
  ✗ Error: Error code: 400 - {'error': 'Encountered JSONDecodeError:"Expecting property name enclosed in double quotes: line 1 column 2 (char 1)" when trying to decode function call string: {\'content\': \'Search for information about LangGraph\', \'status\': \'completed\'}: line 1 column 2 (char 1)', 'error_code': None, 'error_model_output': '{"name": "write_todos", "parameters": {"todos": [{"content": "Search for information about LangGraph", "status": "completed"}, {"content": "Create a summary of LangGraph", "status": "in_progress"}, {"content": "Write the summary to a file", "status": "pending"}]}}\n<|python_start|>\n<|python_sta
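
The summary in the cell above averages every metric into one number per test case, which can hide which behavior is failing. Below is a sketch of a per-metric breakdown over a `results` list shaped like the one built in that cell. The sample data is made up for illustration, and `scores_by_metric` is a hypothetical helper. Note that it skips cases that errored, whereas the cell above counts them as 0.0.

```python
from collections import defaultdict
from statistics import mean


def scores_by_metric(results: list[dict]) -> dict[str, float]:
    """Average each metric across test cases, skipping cases that errored."""
    collected = defaultdict(list)
    for r in results:
        for e in r.get("evaluations", []):
            collected[e["key"]].append(e["score"])
    return {key: mean(scores) for key, scores in collected.items()}


# Illustrative data only: two scored cases and one that raised an error.
sample_results = [
    {"test_case": "a", "score": 1.0, "evaluations": [
        {"key": "todo_usage", "score": 1.0, "comment": ""},
        {"key": "tool_selection", "score": 1.0, "comment": ""},
    ]},
    {"test_case": "b", "score": 0.25, "evaluations": [
        {"key": "todo_usage", "score": 0.5, "comment": ""},
        {"key": "tool_selection", "score": 0.0, "comment": ""},
    ]},
    {"test_case": "c", "score": 0.0, "error": "HTTP 400"},
]

metric_averages = scores_by_metric(sample_results)
for key, avg in metric_averages.items():
    print(f"{key}: {avg:.1%}")
```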

### LangSmith Evaluation API

For production workflows, use LangSmith's `evaluate()` API. This provides:
- Automatic result tracking
- Comparison across runs
- Team collaboration
- Historical trends

Here's how to run the same evaluation using LangSmith:

In [21]:
from langsmith.evaluation import evaluate

# Wrapper for agent to match evaluate() expectations
def agent_runner(inputs: dict) -> dict:
    """Wrapper to run agent with proper initialization."""
    # Initialize state
    initial_state = {
        **inputs,
        "todos": [],
        "files": {}
    }
    return agent.invoke(initial_state)

try:
    # Run evaluation
    eval_results = evaluate(
        agent_runner,
        data=dataset_name,  # Use the dataset we created earlier
        evaluators=[
            evaluate_todo_usage,
            evaluate_tool_selection,
            evaluate_file_operations,
            evaluate_task_completion,
        ],
        experiment_prefix="deep-agent-eval",
        max_concurrency=1,  # Run examples one at a time: easier-to-read logs, fewer rate-limit errors
        metadata={
            "model": "Llama-4-Maverick-17B-128E-Instruct",  # Matches the model configured above
            "version": "1.0",
            "notebook": "5_evaluation_testing"
        }
    )
    
    print(f"✓ Evaluation complete!")
    print(f"  Experiment: {eval_results.experiment_name}")
    print(f"  View results at: https://smith.langchain.com")
    
except Exception as e:
    print(f"⚠ LangSmith evaluation failed: {e}")
    print(f"  Using manual evaluation results from above")

View the evaluation results for experiment: 'deep-agent-eval-8c3505c0' at:
https://smith.langchain.com/o/ded507f6-6806-4cc6-9f5b-b82571db4416/datasets/6bd54ab0-07ef-4aec-8445-c5bb7b5148d2/compare?selectedSessions=aff0c66c-bccc-49ed-88dd-dac0f53e5c6a




0it [00:00, ?it/s]

Error running target function: Error code: 400 - {'error': 'Encountered JSONDecodeError:"Expecting property name enclosed in double quotes: line 1 column 2 (char 1)" when trying to decode function call string: {\'content\': \'Search for information about LangGraph\', \'status\': \'completed\'}: line 1 column 2 (char 1)', 'error_code': None, 'error_model_output': '{"name": "write_todos", "parameters": {"todos": [{"content": "Search for information about LangGraph", "status": "completed"}, {"content": "Create a summary of LangGraph", "status": "in_progress"}, {"content": "Write the summary to a file", "status": "pending"}]}}\n<|python_start|>\n<|python_start|>assistant<|header_end|>\n\n{"name": "write_file", "parameters": {"file_path": "langgraph_summary.txt", "content": "LangGraph is a framework for building stateful, multi-actor applications with LLMs. Key features: state management, cyclic graphs, persistence, human-in-the-loop."}}\n{"name": "write_todos", "parameters": {"todos": [{"c

✓ Evaluation complete!
  Experiment: deep-agent-eval-8c3505c0
  View results at: https://smith.langchain.com


### Regression Testing Pattern

For continuous integration, create a regression test suite that runs on every change:

In [None]:
def run_regression_suite(agent_version: str = "dev", threshold: float = 0.85):
    """Run regression tests and fail if performance drops.
    
    Args:
        agent_version: Version identifier for tracking
        threshold: Minimum acceptable average score (0.0 to 1.0)
        
    Returns:
        bool: True if tests pass, False otherwise
    """
    print(f"Running regression suite for version: {agent_version}")
    print(f"Pass threshold: {threshold:.1%}\n")
    
    # Run manual evaluation
    test_results = []
    
    for tc in test_cases:
        initial_state = {
            **tc["input"],
            "todos": [],
            "files": {}
        }
        
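        # A crashed run counts as a failed test instead of aborting the whole suite
        try:
            result = agent.invoke(initial_state)
        except Exception as e:
            print(f"  ✗ Test errored: {e}")
            test_results.append(0.0)
            continue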
        
        # Mock objects for evaluators
        class MockRun:
            def __init__(self, outputs):
                self.outputs = outputs
        
        class MockExample:
            def __init__(self, outputs):
                self.outputs = outputs
        
        mock_run = MockRun(result)
        mock_example = MockExample(tc["expected"])
        
        # Evaluate
        evals = [
            evaluate_todo_usage(mock_run, mock_example),
            evaluate_tool_selection(mock_run, mock_example),
            evaluate_file_operations(mock_run, mock_example),
            evaluate_task_completion(mock_run, mock_example),
        ]
        
        avg_score = sum(e["score"] for e in evals) / len(evals)
        test_results.append(avg_score)
    
    # Calculate overall
    overall_score = sum(test_results) / len(test_results)
    passed = overall_score >= threshold
    
    # Report
    print("=" * 80)
    print(f"REGRESSION TEST {'PASSED' if passed else 'FAILED'}")
    print("=" * 80)
    print(f"Overall Score: {overall_score:.1%} (threshold: {threshold:.1%})")
    print(f"\nBreakdown:")
    for i, score in enumerate(test_results):
        status = "✓" if score >= threshold else "✗"
        print(f"  {status} Test {i+1}: {score:.1%}")
    
    return passed

# Run it
passed = run_regression_suite(agent_version="workshop-demo", threshold=0.75)

if not passed:
    print("\n⚠ Regression detected - investigate before deploying!")
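
A single average threshold can mask a regression in one metric that is offset by gains elsewhere. One way to tighten this in CI is to compare per-metric scores against a stored baseline. The following is a minimal sketch, assuming scores are kept as a `{metric: score}` dict in a JSON file; `find_regressions`, its `tolerance` parameter, and the file layout are hypothetical, and the numbers are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return metrics whose score dropped by more than `tolerance` vs. the baseline."""
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric, 0.0)  # a metric that disappeared counts as 0.0
        if old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


# Write an illustrative baseline to a temporary file, then read it back.
baseline_path = Path(tempfile.mkdtemp()) / "baseline_scores.json"
baseline_path.write_text(json.dumps({"todo_usage": 0.9, "tool_selection": 0.8}))

baseline = json.loads(baseline_path.read_text())
current = {"todo_usage": 0.92, "tool_selection": 0.6}

regressions = find_regressions(baseline, current)
for metric, (old, new) in regressions.items():
    print(f"{metric}: {old:.1%} -> {new:.1%}")
```

In a CI job, a non-empty `regressions` dict could be turned into a failing exit code (for example with `sys.exit(1)`), which is what blocks the deployment.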

### Advanced: Comparing Agent Versions

To compare different prompts, models, or architectures, run evaluations with different configurations:

In [None]:
# Example: Compare different prompts
prompts_to_test = [
    {
        "name": "baseline",
        "prompt": AGENT_PROMPT
    },
    {
        "name": "verbose",
        "prompt": AGENT_PROMPT + "\n\nIMPORTANT: Always create TODOs for ANY task, even simple ones."
    },
]

print("Comparing agent prompts...\n")

comparison_results = {}

for config in prompts_to_test:
    print(f"Testing: {config['name']}")
    
    # Create agent with this prompt
    test_agent = create_react_agent(
        model,
        tools,
        prompt=config["prompt"],
        state_schema=DeepAgentState
    )
    
    # Run on one test case
    tc = test_cases[0]  # Simple math test
    initial_state = {**tc["input"], "todos": [], "files": {}}
    result = test_agent.invoke(initial_state)
    
    # Evaluate
    class MockRun:
        def __init__(self, outputs):
            self.outputs = outputs
    
    class MockExample:
        def __init__(self, outputs):
            self.outputs = outputs
    
    todo_eval = evaluate_todo_usage(MockRun(result), MockExample(tc["expected"]))
    
    print(f"  TODO usage: {todo_eval['score']:.1%} - {todo_eval['comment']}")
    comparison_results[config["name"]] = todo_eval["score"]
    print()

print("Comparison Summary:")
for name, score in comparison_results.items():
    print(f"  {name}: {score:.1%}")
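
The comparison above uses a single test case and a single evaluator. Below is a sketch of how the same loop could be generalized over all cases and evaluators. To keep it runnable without a model, the two runner functions are stubs that return a fixed state instead of invoking an agent; `compare_configs`, `todo_overhead`, and the stub names are hypothetical.

```python
from statistics import mean
from types import SimpleNamespace


def compare_configs(runners: dict, cases: list[dict], evaluators: list) -> dict:
    """Return {config_name: {metric_key: mean score across cases}}."""
    summary = {}
    for name, runner in runners.items():
        per_metric = {}
        for tc in cases:
            run = SimpleNamespace(outputs=runner(tc["input"]))
            example = SimpleNamespace(outputs=tc["expected"])
            for evaluator in evaluators:
                result = evaluator(run, example)
                per_metric.setdefault(result["key"], []).append(result["score"])
        summary[name] = {key: mean(scores) for key, scores in per_metric.items()}
    return summary


def todo_overhead(run, example) -> dict:
    """Toy evaluator: score 1.0 when TODO usage matches what the case calls for."""
    has_todos = bool(run.outputs.get("todos"))
    wanted = example.outputs.get("should_create_todos", False)
    return {"key": "todo_overhead", "score": 1.0 if has_todos == wanted else 0.0}


# Stub runners standing in for agents built from two different prompts.
def baseline_runner(inputs: dict) -> dict:
    return {"messages": inputs["messages"], "todos": []}


def always_todo_runner(inputs: dict) -> dict:
    return {"messages": inputs["messages"], "todos": [{"content": "x", "status": "pending"}]}


cases = [{"input": {"messages": []}, "expected": {"should_create_todos": False}}]
summary = compare_configs(
    {"baseline": baseline_runner, "always_todo": always_todo_runner},
    cases,
    [todo_overhead],
)
print(summary)
```

With real agents, each runner would wrap `agent.invoke` for one configuration, and the evaluators defined earlier in this notebook could be passed in unchanged since they share the `(run, example)` signature.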

### Key Takeaways

**Evaluation Datasets:**
- Cover simple, complex, and edge-case scenarios
- Define expected *behaviors*, not exact outputs
- Include metadata for test organization
- Store in LangSmith for team access

**Custom Evaluators:**
- Return `{key, score, comment}` dicts
- Score from 0.0 (fail) to 1.0 (perfect)
- Check agent *patterns*: TODO usage, tool selection, file operations
- Provide actionable comments for debugging

**Running Evaluations:**
- Manual: Good for debugging single runs
- LangSmith API: Best for tracking over time
- Sequential execution (`max_concurrency=1`) keeps traces easy to follow and avoids rate-limit errors
- Add metadata for version tracking

**Regression Testing:**
- Set score thresholds for pass/fail
- Run on every code change
- Compare across agent versions
- Block deployments if quality drops

**Production Patterns:**
- Start with 10-20 core test cases
- Add failing examples as you find bugs
- Use LLM-as-judge for qualitative metrics
- Track costs and latency alongside accuracy
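
Latency can be tracked with a thin wrapper around the function that runs the agent. The following is a minimal sketch: `timed` is a hypothetical helper, and `fake_agent` stands in for a target such as the `agent_runner` used earlier so that the block runs without a model. Token cost tracking depends on the usage metadata each provider returns and is not shown.

```python
import time
from typing import Callable


def timed(target: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a target so each call records wall-clock latency in its output dict."""
    def wrapper(inputs: dict) -> dict:
        start = time.perf_counter()
        outputs = target(inputs)
        elapsed = time.perf_counter() - start
        return {**outputs, "latency_seconds": elapsed}
    return wrapper


def fake_agent(inputs: dict) -> dict:
    """Stand-in for a real agent call."""
    time.sleep(0.01)
    return {"messages": inputs["messages"]}


timed_result = timed(fake_agent)({"messages": [{"role": "user", "content": "hi"}]})
print(f"latency: {timed_result['latency_seconds']:.3f}s")
```

Because the latency lands in the output dict, an evaluator can read it from `run.outputs` and score it against a budget like any other metric.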

This evaluation framework helps your deep agents maintain quality as they evolve.