## Trajectory Evaluation - Understanding How Agents Think and Execute Tasks

This tutorial demonstrates how to evaluate agent trajectories - the sequences of tool calls, reasoning steps, and decisions agents make to complete tasks. Trajectory evaluation helps you understand not just what agents produce, but how they think and execute tasks.

### What You'll Learn
- Understand agent trajectories (tool calls, reasoning steps, sequences)
- Visualize execution trajectories (Session → Trace → Spans)
- Use TrajectoryEvaluator with custom rubrics
- Evaluate optimal, suboptimal, and incorrect trajectories
- Implement trajectory scoring functions (exact_match, in_order, any_order)
- Analyze HOW agents think, not just WHAT they produce

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:----------------------------------------------------------------------------------|
| Tutorial type       | Intermediate - Understanding and evaluating agent execution paths                 |
| Tutorial components | Multi-agent system, TrajectoryEvaluator, trajectory visualization                |
| Tutorial vertical   | Agent Evaluation                                                                  |
| Example complexity  | Medium                                                                            |
| SDK used            | Strands Agents, Strands Evals                                                     |

### Understanding Trajectories

A **trajectory** is the complete sequence of actions an agent takes to solve a task, including tool calls, reasoning steps, and execution flow.

#### Why Evaluate Trajectories?

| Output Evaluation Alone | Trajectory Evaluation Adds |
|:------------------------|:---------------------------|
| Multiple paths can produce same output | Shows efficiency of execution |
| Can't detect inefficient reasoning | Reveals correctness of tool usage |
| Misses optimization opportunities | Identifies where agent can improve |

Use trajectory evaluation during development, optimization, and production monitoring to validate agent design and detect reasoning quality degradation.

#### Trajectory Hierarchy

```
Session → Trace (per invocation) → Spans (individual steps: tool calls, LLM calls)
```
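
To make the hierarchy concrete, here is a minimal sketch of how the three levels might be modeled in plain Python. The `Span`, `Trace`, and `Session` classes are hypothetical illustrations, not types from Strands or any tracing SDK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One step inside an invocation (hypothetical type, for illustration only)."""
    name: str
    kind: str  # e.g. "tool_call" or "llm_call"


@dataclass
class Trace:
    """All spans recorded for a single invocation."""
    spans: List[Span] = field(default_factory=list)

    def trajectory(self) -> List[str]:
        # The trajectory is the ordered list of span names.
        return [span.name for span in self.spans]


@dataclass
class Session:
    """All traces that belong to one conversation or user session."""
    session_id: str
    traces: List[Trace] = field(default_factory=list)


demo_trace = Trace(spans=[Span("plan", "llm_call"), Span("search", "tool_call")])
demo_session = Session(session_id="demo", traces=[demo_trace])
print(demo_session.traces[0].trajectory())
```

Reading a trajectory then amounts to walking one trace's spans in order, which is what the scoring functions later in this tutorial operate on.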

#### Three Trajectory Scenarios

| Scenario | Description | Characteristics |
|:---------|:------------|:----------------|
| Optimal | Correct tools in correct order | Efficient, minimal steps, expected behavior |
| Suboptimal | Extra or redundant steps | Correct answer but inefficient, redundant calls |
| Incorrect | Wrong tools or wrong order | May produce wrong output, requires fixes |

### Environment Setup

Configure AWS region and model settings for this tutorial.

In [None]:
import boto3

# AWS Configuration (inline - no config.py)
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'
DEFAULT_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

### Setup and Imports

Import all necessary libraries for agent creation, trajectory evaluation, and visualization.

In [None]:
# Standard imports
from typing import List, Dict, Any

# Strands imports
from strands import Agent
from strands.multiagent import GraphBuilder

# Strands Evals imports
from strands_evals import Experiment, Case
from strands_evals.evaluators import TrajectoryEvaluator

# Display utilities
from IPython.display import Markdown, display

### Multi-Agent System for Trajectory Analysis

We'll create a multi-agent business decision system with parallel execution. This system demonstrates clear trajectory patterns as different agents collaborate on analysis tasks.

#### Architecture

```
financial_advisor (entry)
     ├──> technical_architect ──┐
     └──> market_researcher ────┴──> risk_analyst (final)
```
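
The diagram implies three execution stages. As a rough sketch, the stages can be derived from an edge list with a level-by-level topological sort (Kahn's algorithm). The `execution_stages` helper is hypothetical and works on plain strings; it uses the node IDs that the graph builder registers in the agent code and does not call Strands.

```python
from typing import Dict, List, Set, Tuple


def execution_stages(edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group nodes into stages; nodes in one stage do not depend on each other."""
    nodes = {node for edge in edges for node in edge}
    indegree: Dict[str, int] = {node: 0 for node in nodes}
    for _, dst in edges:
        indegree[dst] += 1

    stages: List[Set[str]] = []
    ready = {node for node, degree in indegree.items() if degree == 0}
    while ready:
        stages.append(ready)
        next_ready: Set[str] = set()
        for src, dst in edges:
            if src in ready:
                # One prerequisite of dst has now completed.
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    next_ready.add(dst)
        ready = next_ready
    return stages


edges = [
    ("finance_expert", "tech_expert"),
    ("finance_expert", "market_expert"),
    ("tech_expert", "risk_analyst"),
    ("market_expert", "risk_analyst"),
]
stages = execution_stages(edges)
print(stages)
```

Nodes that share a stage can run concurrently, so their relative order in a recorded trajectory is not guaranteed.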

#### Agent Code

Multi-agent code adapted from: /strands-samples/01-tutorials/02-multi-agent-systems/03-graph-agent/

In [None]:
# Create specialized agents for parallel analysis
financial_advisor = Agent(
    name="financial_advisor",
    system_prompt="You are a financial advisor focused on cost-benefit analysis, budget implications, and ROI calculations. Provide concise financial assessment in 2-3 sentences.",
    model=DEFAULT_MODEL
)

technical_architect = Agent(
    name="technical_architect",
    system_prompt="You are a technical architect who evaluates feasibility, implementation challenges, and technical risks. Provide concise technical assessment in 2-3 sentences.",
    model=DEFAULT_MODEL
)

market_researcher = Agent(
    name="market_researcher",
    system_prompt="You are a market researcher who analyzes market conditions, user needs, and competitive landscape. Provide concise market assessment in 2-3 sentences.",
    model=DEFAULT_MODEL
)

risk_analyst = Agent(
    name="risk_analyst",
    system_prompt="You are a risk analyst who synthesizes input from finance, technical, and market experts to identify potential risks and provide a final recommendation in 2-3 sentences.",
    model=DEFAULT_MODEL
)

# Build the agent graph with parallel execution
builder = GraphBuilder()

# Add nodes (the second argument is the node ID that appears in trajectories)
builder.add_node(financial_advisor, "finance_expert")
builder.add_node(technical_architect, "tech_expert")
builder.add_node(market_researcher, "market_expert")
builder.add_node(risk_analyst, "risk_analyst")

# Add edges - parallel execution pattern
builder.add_edge("finance_expert", "tech_expert")
builder.add_edge("finance_expert", "market_expert")
builder.add_edge("tech_expert", "risk_analyst")
builder.add_edge("market_expert", "risk_analyst")

# Set entry point
builder.set_entry_point("finance_expert")

# Build the graph
decision_graph = builder.build()

### Test the Multi-Agent System

Before evaluating trajectories, let's verify the system works and examine its execution flow.

In [None]:
# Test the multi-agent system
test_query = "Should we invest $500K in developing an AI-powered customer service chatbot?"
result = decision_graph(test_query)

print(f"\nFinal Decision: {result}\n")
print("="*80)

# Show execution flow
print("\nExecution Order (Trajectory):")
for node in result.execution_order:
    print(f"  - {node.node_id}")

### Understanding Trajectory Scoring Functions

| Scorer | Use Case | Scoring Logic |
|:-------|:---------|:--------------|
| **Exact Match** | Strict compliance (medical, financial) | 1.0 if exact sequence, 0.0 otherwise |
| **In-Order Match** | Sequence matters, extra steps OK | Proportion of expected steps in correct order |
| **Any-Order Match** | Only presence matters | Proportion of expected steps present |

#### Implementation

In [None]:
def exact_match_scorer(expected_trajectory: List[str], actual_trajectory: List[str]) -> float:
    """
    Score 1.0 if trajectories match exactly, 0.0 otherwise.
    
    Args:
        expected_trajectory: List of expected step identifiers
        actual_trajectory: List of actual step identifiers from execution
    
    Returns:
        1.0 if exact match, 0.0 otherwise
    """
    return 1.0 if expected_trajectory == actual_trajectory else 0.0


def in_order_match_scorer(expected_trajectory: List[str], actual_trajectory: List[str]) -> float:
    """
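    Score based on the proportion of expected steps matched in order.
    Extra steps in the actual trajectory are skipped and do not lower the score.
    Matching is greedy: an expected step can only match after every earlier
    expected step has matched, so a missing early step blocks all later ones.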
    
    Args:
        expected_trajectory: List of expected step identifiers
        actual_trajectory: List of actual step identifiers from execution
    
    Returns:
        Proportion of expected steps found in order (0.0 to 1.0)
    """
    if not expected_trajectory:
        return 1.0
    
    expected_idx = 0
    matches = 0
    
    for actual_step in actual_trajectory:
        if expected_idx < len(expected_trajectory) and actual_step == expected_trajectory[expected_idx]:
            matches += 1
            expected_idx += 1
    
    return matches / len(expected_trajectory)


def any_order_match_scorer(expected_trajectory: List[str], actual_trajectory: List[str]) -> float:
    """
    Score based on proportion of expected steps present, regardless of order.
    
    Args:
        expected_trajectory: List of expected step identifiers
        actual_trajectory: List of actual step identifiers from execution
    
    Returns:
        Proportion of expected steps found (0.0 to 1.0)
    """
    if not expected_trajectory:
        return 1.0
    
    actual_set = set(actual_trajectory)
    matches = sum(1 for step in expected_trajectory if step in actual_set)
    
    return matches / len(expected_trajectory)
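
The in-order scorer above matches greedily, so a missing early step prevents later steps from being credited. If partial credit for those later steps is wanted, one alternative is a longest-common-subsequence (LCS) score. The `lcs_match_scorer` below is a hypothetical variant, not part of Strands Evals.

```python
from typing import List


def lcs_match_scorer(expected_trajectory: List[str], actual_trajectory: List[str]) -> float:
    """LCS length of the two trajectories divided by the number of expected steps."""
    if not expected_trajectory:
        return 1.0

    rows, cols = len(expected_trajectory), len(actual_trajectory)
    # table[i][j] holds the LCS length of expected[:i] and actual[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if expected_trajectory[i - 1] == actual_trajectory[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    return table[rows][cols] / rows


# The first expected step is missing: greedy matching would score 0.0 here,
# while the LCS variant credits the two steps that still appear in order.
score = lcs_match_scorer(["a", "b", "c"], ["b", "c"])
print(f"{score:.2f}")
```

The trade-off is cost: the LCS table takes time proportional to the product of the two lengths, which is negligible for trajectories of a few dozen steps.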

### Trajectory Extraction Helper

Create a helper function to extract trajectories from agent execution results.

In [None]:
def extract_trajectory(result) -> List[str]:
    """
    Extract the execution trajectory from a graph result.
    
    Args:
        result: Graph execution result
    
    Returns:
        List of node IDs in execution order
    """
    return [node.node_id for node in result.execution_order]
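
To try the helper without invoking a model, a stand-in result object can be built by hand. The sketch below assumes only what the helper itself relies on: an `execution_order` attribute whose items expose `node_id`. The `SimpleNamespace` objects are stubs, not real graph results.

```python
from types import SimpleNamespace
from typing import List


def extract_trajectory(result) -> List[str]:
    """Same logic as the helper above, repeated so this cell runs on its own."""
    return [node.node_id for node in result.execution_order]


# Stub that mimics the two attributes the helper reads.
fake_result = SimpleNamespace(
    execution_order=[
        SimpleNamespace(node_id="finance_expert"),
        SimpleNamespace(node_id="risk_analyst"),
    ]
)

print(extract_trajectory(fake_result))
```

Stubs like this are handy for unit-testing scoring logic, since they avoid model calls and their cost.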

### Scenario 1: Optimal Trajectory

An optimal trajectory uses the correct tools in the correct order with minimal steps. This represents the expected, efficient execution path.

#### Expected Trajectory
```
finance_expert → tech_expert → market_expert → risk_analyst
```

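This is one natural flow for our multi-agent decision system. Because `tech_expert` and `market_expert` run on parallel branches, they may also be recorded in the opposite order, in which case exact match scores 0.0 even though the run is valid.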
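
If both orderings of the parallel branch should count as optimal, the comparison can be done stage by stage rather than step by step. `stage_match_scorer` is a hypothetical helper, and the stage layout is an assumption read off the graph's edge definitions.

```python
from typing import List, Set


def stage_match_scorer(expected_stages: List[Set[str]], actual_trajectory: List[str]) -> float:
    """Return 1.0 if the trajectory visits each stage's nodes exactly once, stage by stage."""
    position = 0
    for stage in expected_stages:
        chunk = actual_trajectory[position:position + len(stage)]
        # Order inside a stage is ignored; membership and count must match.
        if len(chunk) != len(stage) or set(chunk) != stage:
            return 0.0
        position += len(stage)
    # Any leftover steps mean the run did more than expected.
    return 1.0 if position == len(actual_trajectory) else 0.0


expected_stages = [{"finance_expert"}, {"tech_expert", "market_expert"}, {"risk_analyst"}]
run_a = ["finance_expert", "tech_expert", "market_expert", "risk_analyst"]
run_b = ["finance_expert", "market_expert", "tech_expert", "risk_analyst"]
print(stage_match_scorer(expected_stages, run_a), stage_match_scorer(expected_stages, run_b))
```

This keeps the strictness of exact match (no missing, extra, or out-of-stage steps) while accepting either completion order for the two middle experts.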

In [None]:
# Define optimal test case
optimal_case = Case(
    name="Optimal Decision Path",
    input="Should we invest $1M in expanding our cloud infrastructure to support 10x user growth?",
    expected_output="A comprehensive analysis considering financial, technical, and market factors with a clear recommendation."
)

# Expected trajectory for this case
expected_optimal_trajectory = [
    "finance_expert",
    "tech_expert",
    "market_expert",
    "risk_analyst"
]

print(f"\nExpected trajectory: {' → '.join(expected_optimal_trajectory)}")

In [None]:
# Execute the optimal case
optimal_result = decision_graph(optimal_case.input)
optimal_trajectory = extract_trajectory(optimal_result)

print(f"\nInput: {optimal_case.input}")
print(f"\nFinal Output: {optimal_result}")
print("\n" + "="*80)
print("\nTrajectory Analysis:")
print(f"  Expected: {' → '.join(expected_optimal_trajectory)}")
print(f"  Actual:   {' → '.join(optimal_trajectory)}")

# Score the trajectory
exact_score = exact_match_scorer(expected_optimal_trajectory, optimal_trajectory)
in_order_score = in_order_match_scorer(expected_optimal_trajectory, optimal_trajectory)
any_order_score = any_order_match_scorer(expected_optimal_trajectory, optimal_trajectory)

print(f"\nTrajectory Scores:")
print(f"  Exact Match:    {exact_score:.2f} (requires exact sequence)")
print(f"  In-Order Match: {in_order_score:.2f} (allows extra steps)")
print(f"  Any-Order Match: {any_order_score:.2f} (ignores order)")

# Interpretation
if exact_score == 1.0:
    interpretation = "**OPTIMAL**: Trajectory perfectly matches expected path"
elif in_order_score >= 0.75:
    interpretation = "**GOOD**: Most expected steps present in correct order"
else:
    interpretation = "**SUBOPTIMAL**: Significant deviations from expected path"

print(f"\nInterpretation: {interpretation}")

### Scenario 2: Suboptimal Trajectory

A suboptimal trajectory reaches the correct answer but takes unnecessary steps or makes redundant tool calls. This might indicate inefficiency or confusion in the agent's decision-making.

#### Simulation Note
Because the graph's topology fixes which nodes run, it will not produce redundant calls on its own, so we simulate a suboptimal trajectory by hand to show what redundant calls would look like.

In [None]:
# Simulate a suboptimal trajectory scenario
suboptimal_case = Case(
    name="Suboptimal Decision Path - Extra Analysis Steps",
    input="Evaluate a $250K investment in updating our mobile app with new features.",
    expected_output="A comprehensive analysis with clear recommendation."
)

# Expected optimal trajectory
expected_suboptimal_trajectory = [
    "finance_expert",
    "tech_expert",
    "market_expert",
    "risk_analyst"
]

# Simulate what a suboptimal trajectory might look like with redundant steps
# In a real scenario, this might happen if agent repeats analysis or backtracks
simulated_suboptimal_trajectory = [
    "finance_expert",
    "finance_expert",  # Redundant call
    "tech_expert",
    "market_expert",
    "market_expert",   # Redundant call
    "risk_analyst"
]

print("Suboptimal trajectory scenario defined")
print(f"\nExpected optimal: {' → '.join(expected_suboptimal_trajectory)}")
print(f"Simulated actual: {' → '.join(simulated_suboptimal_trajectory)}")
print("\nNote: Simulated trajectory shows redundant calls to finance_expert and market_expert")

In [None]:
# Analyze the suboptimal trajectory
print(f"\nCase: {suboptimal_case.name}")
print(f"Input: {suboptimal_case.input}")
print("\n" + "="*80)
print("\nTrajectory Analysis:")
print(f"  Expected: {' → '.join(expected_suboptimal_trajectory)}")
print(f"  Simulated: {' → '.join(simulated_suboptimal_trajectory)}")

# Score the simulated trajectory
exact_score = exact_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory)
in_order_score = in_order_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory)
any_order_score = any_order_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory)

print(f"\nTrajectory Scores:")
print(f"  Exact Match:    {exact_score:.2f} (requires exact sequence)")
print(f"  In-Order Match: {in_order_score:.2f} (allows extra steps)")
print(f"  Any-Order Match: {any_order_score:.2f} (ignores order)")

# Analysis
analysis = """
**Analysis of Suboptimal Trajectory:**

- **Exact Match: 0.00** - Trajectory doesn't match expected sequence exactly
- **In-Order Match: 1.00** - All expected steps are present in correct order (redundant steps ignored)
- **Any-Order Match: 1.00** - All expected steps are present

**Key Observations:**
- Redundant calls to finance_expert (called twice)
- Redundant calls to market_expert (called twice)
- Total steps: 6 (expected 4)
- Efficiency: 67% (4 necessary steps out of 6 executed)

**Impact:**
- Increased latency (extra LLM calls)
- Higher costs (more token usage)
- Correct final output but inefficient path

**Recommendations:**
- Investigate why redundant calls occur
- Add caching or memoization for repeated analyses
- Review agent orchestration logic
"""

display(Markdown(analysis))
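
The observations above (which nodes repeat, and the resulting efficiency) can also be computed rather than written by hand. A minimal sketch, using a hypothetical helper named `redundancy_report`:

```python
from collections import Counter
from typing import Dict, List


def redundancy_report(expected: List[str], actual: List[str]) -> Dict[str, object]:
    """Count surplus calls per node and compute expected/actual step efficiency."""
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)
    # Counter subtraction keeps only positive surpluses.
    extra = actual_counts - expected_counts
    efficiency = len(expected) / len(actual) if actual else 0.0
    return {"extra_calls": dict(extra), "efficiency": efficiency}


report_demo = redundancy_report(
    ["finance_expert", "tech_expert", "market_expert", "risk_analyst"],
    ["finance_expert", "finance_expert", "tech_expert",
     "market_expert", "market_expert", "risk_analyst"],
)
print(report_demo["extra_calls"])
print(f"{report_demo['efficiency']:.0%}")
```

Note that this efficiency ratio is only meaningful when all expected steps are present; for trajectories with missing steps it can exceed 100%.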

### Scenario 3: Incorrect Trajectory

An incorrect trajectory uses the wrong tools or the wrong order, potentially leading to incorrect outputs. This might indicate fundamental issues with agent design or understanding.

#### Simulation Note
We'll simulate an incorrect trajectory where the agent skips a critical analysis step and calls the remaining tools in the wrong order.

In [None]:
# Simulate an incorrect trajectory scenario
incorrect_case = Case(
    name="Incorrect Decision Path - Missing Critical Analysis",
    input="Should we acquire a competitor for $5M?",
    expected_output="A comprehensive analysis covering all decision factors."
)

# Expected optimal trajectory
expected_incorrect_trajectory = [
    "finance_expert",
    "tech_expert",
    "market_expert",
    "risk_analyst"
]

# Simulate what an incorrect trajectory might look like
# Missing market analysis and jumping straight to risk assessment
simulated_incorrect_trajectory = [
    "finance_expert",
    "risk_analyst",     # Wrong: skipped technical and market analysis
    "tech_expert"       # Wrong: analyzing after making risk decision
]

print("Incorrect trajectory scenario defined")
print(f"\nExpected: {' → '.join(expected_incorrect_trajectory)}")
print(f"Simulated: {' → '.join(simulated_incorrect_trajectory)}")
print("\nNote: Simulated trajectory skips market_expert and analyzes in wrong order")

In [None]:
# Analyze the incorrect trajectory
print(f"\nCase: {incorrect_case.name}")
print(f"Input: {incorrect_case.input}")
print("\n" + "="*80)
print("\nTrajectory Analysis:")
print(f"  Expected: {' → '.join(expected_incorrect_trajectory)}")
print(f"  Simulated: {' → '.join(simulated_incorrect_trajectory)}")

# Score the simulated trajectory
exact_score = exact_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory)
in_order_score = in_order_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory)
any_order_score = any_order_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory)

print(f"\nTrajectory Scores:")
print(f"  Exact Match:    {exact_score:.2f} (requires exact sequence)")
print(f"  In-Order Match: {in_order_score:.2f} (allows extra steps)")
print(f"  Any-Order Match: {any_order_score:.2f} (ignores order)")

# Analysis
analysis = """
**Analysis of Incorrect Trajectory:**

- **Exact Match: 0.00** - Sequence does not match the expected one
- **In-Order Match: 0.50** - Only 2 of 4 expected steps matched in order (finance_expert, tech_expert)
- **Any-Order Match: 0.75** - 3 of 4 expected steps present (missing market_expert)

**Key Observations:**
- **Missing step**: market_expert analysis completely skipped
- **Wrong order**: risk_analyst called before tech_expert
- **Incomplete analysis**: Decision made without market perspective
- Coverage: 75% (3/4 expected analyses ran, but only 2 in the expected order)

**Critical Issues:**
1. **Missing market analysis**: No competitive or demand assessment
2. **Premature risk decision**: Made before gathering all inputs
3. **Wrong execution order**: Technical analysis after risk assessment is too late

**Impact:**
- **High risk**: Decision made without complete information
- **Low confidence**: Missing critical market perspective
- **Potential errors**: Risk analysis based on incomplete data

**Recommendations:**
- **Critical fix required**: Ensure all required analyses are performed
- Review orchestration logic to enforce dependencies
- Add validation checks before risk assessment
- Consider making market analysis a hard requirement
"""

display(Markdown(analysis))
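
The recommendation to enforce dependencies can be turned into a check that runs on any recorded trajectory. The sketch below flags steps that ran before their prerequisites and required steps that never ran. `find_violations` and the `prerequisites` mapping are hypothetical; the mapping mirrors the graph edges defined earlier.

```python
from typing import Dict, List, Set


def find_violations(trajectory: List[str], prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Report steps that ran before their prerequisites, and required steps that never ran."""
    problems: List[str] = []
    seen: Set[str] = set()
    for step in trajectory:
        missing = prerequisites.get(step, set()) - seen
        if missing:
            problems.append(f"{step} ran before {sorted(missing)}")
        seen.add(step)
    for step in prerequisites:
        if step not in seen:
            problems.append(f"{step} never ran")
    return problems


prerequisites = {
    "finance_expert": set(),
    "tech_expert": {"finance_expert"},
    "market_expert": {"finance_expert"},
    "risk_analyst": {"tech_expert", "market_expert"},
}
violations = find_violations(["finance_expert", "risk_analyst", "tech_expert"], prerequisites)
for problem in violations:
    print(problem)
```

Unlike the proportional scorers, this check names the specific dependency that was broken, which tends to be more useful when debugging orchestration logic.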

### Trajectory Evaluation with TrajectoryEvaluator

Now let's use the built-in TrajectoryEvaluator to automatically evaluate trajectories. The TrajectoryEvaluator uses LLM-based assessment to score trajectory quality.

In [None]:
# Create custom rubric for trajectory evaluation
trajectory_rubric = """
Evaluate the agent's execution trajectory based on the following criteria:

Score 1 (Poor):
- Missing critical analysis steps
- Wrong order of operations
- Skipped essential expert consultations
- Incomplete information gathering

Score 2 (Fair):
- Some required steps present but incomplete
- Partially correct order
- Missing one key analysis
- Redundant or inefficient steps

Score 3 (Good):
- Most required steps present
- Generally correct order
- Minor inefficiencies acceptable
- All critical analyses performed

Score 4 (Very Good):
- All required steps present
- Correct order maintained
- Efficient execution
- Proper expert collaboration

Score 5 (Excellent):
- Optimal execution path
- Perfect order and completeness
- No unnecessary steps
- Maximum efficiency and correctness
"""

# Create TrajectoryEvaluator with custom rubric
trajectory_evaluator = TrajectoryEvaluator(
    rubric=trajectory_rubric,
    model=DEFAULT_MODEL, 
    include_inputs=True
)

In [None]:
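# Create evaluation dataset from the three scenario cases.
# Note: the task function runs the real graph for every case, so the evaluator
# scores the actual execution trajectory, not the simulated lists defined above.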
evaluation_cases = [
    optimal_case,
    suboptimal_case,
    incorrect_case
]

dataset = Experiment(
    cases=evaluation_cases,
    evaluators=[trajectory_evaluator]
)

In [None]:
# Define agent task function for evaluation
def agent_task(case: Case) -> dict:
    """
    Execute the multi-agent decision system and return the result with trajectory.
    """
    result = decision_graph(case.input)
    
    return {
        "output": str(result),
        "trajectory": [node.node_id for node in result.execution_order]
    }

In [None]:
# Run trajectory evaluation
report = dataset.run_evaluations(agent_task)

### Evaluation Results

Display the trajectory evaluation results with detailed analysis.

In [None]:
# Display evaluation report
report[0].run_display()

### Visualizing Trajectory Metrics

Let's create a summary table comparing all three trajectory scenarios.

In [None]:
# Compile trajectory metrics for comparison
trajectories_summary = [
    {
        "scenario": "Optimal",
        "expected": expected_optimal_trajectory,
        "actual": optimal_trajectory,
        "exact_match": exact_match_scorer(expected_optimal_trajectory, optimal_trajectory),
        "in_order_match": in_order_match_scorer(expected_optimal_trajectory, optimal_trajectory),
        "any_order_match": any_order_match_scorer(expected_optimal_trajectory, optimal_trajectory),
        "efficiency": len(expected_optimal_trajectory) / len(optimal_trajectory) if optimal_trajectory else 0
    },
    {
        "scenario": "Suboptimal",
        "expected": expected_suboptimal_trajectory,
        "actual": simulated_suboptimal_trajectory,
        "exact_match": exact_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory),
        "in_order_match": in_order_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory),
        "any_order_match": any_order_match_scorer(expected_suboptimal_trajectory, simulated_suboptimal_trajectory),
        "efficiency": len(expected_suboptimal_trajectory) / len(simulated_suboptimal_trajectory)
    },
    {
        "scenario": "Incorrect",
        "expected": expected_incorrect_trajectory,
        "actual": simulated_incorrect_trajectory,
        "exact_match": exact_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory),
        "in_order_match": in_order_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory),
        "any_order_match": any_order_match_scorer(expected_incorrect_trajectory, simulated_incorrect_trajectory),
        # Note: for this scenario the value is coverage of expected steps, not expected/actual length
        "efficiency": len([s for s in simulated_incorrect_trajectory if s in expected_incorrect_trajectory]) / len(expected_incorrect_trajectory)
    }
]

# Create comparison table
comparison_md = """
## Trajectory Comparison Summary

| Scenario | Exact Match | In-Order Match | Any-Order Match | Efficiency |
|:---------|:------------|:---------------|:----------------|:-----------|
"""

for t in trajectories_summary:
    comparison_md += f"| {t['scenario']} | {t['exact_match']:.2f} | {t['in_order_match']:.2f} | {t['any_order_match']:.2f} | {t['efficiency']:.2%} |\n"

comparison_md += """

### Key Insights

**Optimal Trajectory:**
- Perfect scores across all metrics (1.00) when the run follows the expected order
- 100% efficiency - no wasted steps
- Represents ideal execution pattern

**Suboptimal Trajectory:**
- Exact match fails due to extra steps
- In-order and any-order matches are perfect (1.00)
- 67% efficiency due to redundant calls
- Correct but inefficient execution

**Incorrect Trajectory:**
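- Exact match fails (0.00) and in-order match drops to 0.50
- 3 of 4 expected steps present (any-order match 0.75); market_expert is missing
- Efficiency column shows 75% because it is computed here as coverage of expected steps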
- Represents problematic execution requiring fixes

### Choosing the Right Scorer

**Use Exact Match when:**
- Strict compliance is required (medical, financial)
- Regulatory requirements demand specific steps
- Safety-critical applications

**Use In-Order Match when:**
- Sequence matters but flexibility is acceptable
- Extra logging or validation steps are okay
- Focus is on correctness over efficiency

**Use Any-Order Match when:**
- Steps are independent
- Order doesn't affect outcome
- Parallel execution is possible
"""

display(Markdown(comparison_md))

### Best Practices for Trajectory Evaluation

#### When to Evaluate Trajectories

| Situation | Why |
|:----------|:----|
| Debugging unexpected behavior | Identify where agent deviates |
| Optimizing performance | Find inefficiencies and redundant steps |
| Ensuring compliance | Validate required workflow steps |
| Comparing implementations | Benchmark different agent versions |

#### Choosing Evaluation Criteria

| Factor | Consideration |
|:-------|:--------------|
| Domain requirements | Some domains require strict order (medical, legal) |
| Efficiency goals | Balance correctness with performance |
| Cost constraints | Redundant steps increase API costs |

#### Combining Trajectory and Output Evaluation

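Evaluate both trajectory (HOW) and output (WHAT) for complete visibility, for example as a weighted sum of two scores on the same scale (the weights shown are illustrative):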
```python
combined_score = (trajectory_score * 0.4) + (output_score * 0.6)
```
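
When the two scores are on different scales, for instance a 1 to 5 rubric score for the trajectory and a 0 to 1 score for the output, they need to be normalized before weighting. A small sketch; the function name, the scale bounds, and the default weight are illustrative assumptions:

```python
def combined_score(trajectory_score: float, output_score: float,
                   trajectory_weight: float = 0.4,
                   rubric_min: float = 1.0, rubric_max: float = 5.0) -> float:
    """Blend a rubric-scale trajectory score with a 0-1 output score."""
    if not 0.0 <= trajectory_weight <= 1.0:
        raise ValueError("trajectory_weight must be between 0 and 1")
    # Map the rubric score onto 0-1 so both terms share a scale.
    normalized = (trajectory_score - rubric_min) / (rubric_max - rubric_min)
    return trajectory_weight * normalized + (1.0 - trajectory_weight) * output_score


print(f"{combined_score(4, 0.9):.3f}")
```

Raising `trajectory_weight` makes sense where the process itself is the requirement (for example, compliance workflows); lowering it suits tasks where any path to a correct answer is acceptable.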

#### Production Monitoring

In production, track trajectory lengths, tool usage patterns, execution times, and failure points over time. A sudden shift in any of these can signal degraded reasoning quality.
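
As a starting point, these signals can be aggregated from a batch of recorded runs. The sketch assumes each run is a dict with hypothetical `trajectory` and `seconds` keys; adapt it to whatever your logging actually stores.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize_runs(runs: List[Dict]) -> Dict[str, object]:
    """Aggregate trajectory length, node usage, and latency over recorded runs."""
    lengths = [len(run["trajectory"]) for run in runs]
    usage = Counter(step for run in runs for step in run["trajectory"])
    return {
        "mean_length": mean(lengths),
        "max_length": max(lengths),
        "node_usage": dict(usage),
        "mean_seconds": mean(run["seconds"] for run in runs),
    }


runs = [
    {"trajectory": ["finance_expert", "tech_expert", "risk_analyst"], "seconds": 12.0},
    {"trajectory": ["finance_expert", "finance_expert", "tech_expert", "risk_analyst"], "seconds": 20.0},
]
summary = summarize_runs(runs)
print(summary["mean_length"], summary["max_length"], summary["mean_seconds"])
```

The function expects at least one run; `statistics.mean` raises `StatisticsError` on an empty batch, so guard for that case if batches can be empty.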

### Summary

You've successfully learned how to evaluate agent trajectories using Strands Evals. You now understand:

- **What trajectories are**: Sequences of tool calls and reasoning steps agents use to solve tasks
- **Why trajectory evaluation matters**: Understanding HOW agents think, not just WHAT they produce
- **Trajectory hierarchy**: Session → Trace → Spans structure in Strands
- **Three trajectory patterns**:
  - Optimal: Correct tools in correct order
  - Suboptimal: Extra or redundant steps
  - Incorrect: Wrong tools or wrong order
- **Scoring functions**:
  - exact_match_scorer: Strict sequence matching
  - in_order_match_scorer: Allows extra steps
  - any_order_match_scorer: Ignores order
- **TrajectoryEvaluator**: LLM-based trajectory assessment with custom rubrics
- **Best practices**: When and how to evaluate trajectories effectively

Trajectory evaluation is essential for understanding agent behavior, optimizing performance, and ensuring reliable execution. By evaluating both the process (trajectory) and the result (output), you gain complete visibility into agent systems.