# Public Benchmark Evaluation for Agent Systems

This notebook evaluates agent performance on test sets modeled after two public benchmarks:
- **BFCL-style**: Tool calling accuracy (Berkeley Function-Calling Leaderboard)
- **PlanBench-style**: Planning quality and multi-step reasoning

## Learning Objectives
1. Understand public benchmark evaluation methodologies
2. Measure tool calling accuracy and argument validation
3. Evaluate planning quality with trajectory metrics
4. Compare agent performance across benchmark types

## Execution Modes
- **DEMO**: 10 examples (5 BFCL + 5 PlanBench), ~2-3 minutes, <$1 cost
- **FULL**: 50 examples (25 BFCL + 25 PlanBench), ~8-10 minutes, <$3 cost

## Prerequisites
- OpenAI API key configured
- Completed trajectory evaluation tutorial
- Understanding of tool calling and planning concepts

## 1. Setup and Imports

In [None]:
# Standard library imports
import json
import os
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Any

# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

# Add backend to path
backend_path = Path(".").resolve().parent / "backend"
if str(backend_path) not in sys.path:
    sys.path.insert(0, str(backend_path))

# Import trajectory evaluation
from trajectory_evaluation import TrajectoryEvaluator, TrajectoryVisualizer

print("✅ Setup complete")
print(f"Current directory: {Path.cwd()}")
print(f"Backend path: {backend_path}")

## 2. Configuration and Mode Selection

In [None]:
# ============================================
# MODE SELECTION
# ============================================
# Change MODE to "FULL" for comprehensive evaluation
MODE = "DEMO"  # Options: "DEMO" or "FULL"

# Configure based on mode
if MODE == "DEMO":
    N_BFCL = 5  # BFCL-style tool calling examples
    N_PLANBENCH = 5  # PlanBench-style planning examples
    MODEL = "gpt-3.5-turbo"  # Cheaper model
    ESTIMATED_COST = "$0.50-$1.00"
    ESTIMATED_TIME = "2-3 minutes"
elif MODE == "FULL":
    N_BFCL = 25
    N_PLANBENCH = 25
    MODEL = "gpt-4o-mini"  # Better quality
    ESTIMATED_COST = "$2.00-$3.00"
    ESTIMATED_TIME = "8-10 minutes"
else:
    raise ValueError(f"Invalid MODE: {MODE}. Must be 'DEMO' or 'FULL'")

# API Configuration
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError(
        "OPENAI_API_KEY not found. Set it with: export OPENAI_API_KEY='your-key'"
    )

client = OpenAI(api_key=api_key)

# Display configuration
print(f"🔧 MODE: {MODE}")
print(f"📊 Samples: {N_BFCL} BFCL + {N_PLANBENCH} PlanBench = {N_BFCL + N_PLANBENCH} total")
print(f"🤖 Model: {MODEL}")
print(f"💰 Estimated cost: {ESTIMATED_COST}")
print(f"⏱️  Estimated time: {ESTIMATED_TIME}")
print("\n⚠️  WARNING: This will make API calls and incur costs.")
print("Continue only if you accept these charges.")

## 3. Load Benchmark Data

In [None]:
def load_benchmark_data(benchmark_type: str, n_samples: int) -> list[dict[str, Any]]:
    """Load benchmark data with stratified sampling.

    Args:
        benchmark_type: "bfcl" or "planbench"
        n_samples: Number of samples to load

    Returns:
        List of benchmark test cases

    Raises:
        TypeError: If inputs are invalid types
        ValueError: If benchmark_type is invalid or data loading fails
    """
    # Step 1: Type checking
    if not isinstance(benchmark_type, str):
        raise TypeError("benchmark_type must be a string")
    if not isinstance(n_samples, int):
        raise TypeError("n_samples must be an int")

    # Step 2: Input validation
    if benchmark_type not in ["bfcl", "planbench"]:
        raise ValueError("benchmark_type must be 'bfcl' or 'planbench'")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")

    # Step 3: Load data file
    data_dir = Path("data")
    if benchmark_type == "bfcl":
        filepath = data_dir / "agent_tool_call_benchmark.json"
    else:
        filepath = data_dir / "agent_planning_benchmark.json"

    if not filepath.exists():
        raise ValueError(f"Benchmark file not found: {filepath}")

    with open(filepath) as f:
        data = json.load(f)

    test_cases = data.get("test_cases", [])
    if not test_cases:
        raise ValueError(f"No test cases found in {filepath}")

    # Step 4: Stratified sampling by difficulty
    # Group by difficulty
    by_difficulty = defaultdict(list)
    for case in test_cases:
        difficulty = case.get("difficulty", "unknown")
        by_difficulty[difficulty].append(case)

    # Calculate samples per difficulty (proportional)
    total_cases = len(test_cases)
    samples = []
    for difficulty, cases in by_difficulty.items():
        proportion = len(cases) / total_cases
        n_difficulty = max(1, int(n_samples * proportion))
        samples.extend(cases[:n_difficulty])

    # Step 5: Ensure we have exactly n_samples
    if len(samples) > n_samples:
        samples = samples[:n_samples]
    elif len(samples) < n_samples:
        # Add more from remaining cases
        remaining = [c for c in test_cases if c not in samples]
        samples.extend(remaining[: n_samples - len(samples)])

    return samples[:n_samples]


# Load benchmark data
print("📥 Loading benchmark data...")
bfcl_data = load_benchmark_data("bfcl", N_BFCL)
planbench_data = load_benchmark_data("planbench", N_PLANBENCH)

print(f"✅ Loaded {len(bfcl_data)} BFCL test cases")
print(f"✅ Loaded {len(planbench_data)} PlanBench test cases")

# Show sample
print("\n📋 BFCL Sample:")
print(json.dumps(bfcl_data[0], indent=2))
print("\n📋 PlanBench Sample:")
print(json.dumps(planbench_data[0], indent=2))

## 4. BFCL Evaluation Engine

In [None]:
def evaluate_bfcl_case(
    test_case: dict[str, Any], client: OpenAI, model: str
) -> dict[str, Any]:
    """Evaluate a single BFCL tool calling test case.

    Args:
        test_case: BFCL benchmark test case
        client: OpenAI client
        model: Model name

    Returns:
        Evaluation results with accuracy scores

    Raises:
        TypeError: If inputs are invalid types
        ValueError: If test case is missing required fields
    """
    # Step 1: Type checking
    if not isinstance(test_case, dict):
        raise TypeError("test_case must be a dict")
    if not isinstance(model, str):
        raise TypeError("model must be a string")

    # Step 2: Extract test case data
    task = test_case.get("task")
    expected_tool = test_case.get("tool_call", {}).get("tool")
    expected_args = test_case.get("tool_call", {}).get("args", {})
    labels = test_case.get("labels", {})

    if not task or not expected_tool:
        raise ValueError("Test case missing required fields: task or tool_call")

    # Step 3: Generate agent response with tool calling
    try:
        # Define available tools (simplified recipe domain)
        tools = [
            {
                "type": "function",
                "function": {
                    "name": "search_recipes",
                    "description": "Search for recipes based on criteria",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "ingredients": {
                                "type": "array",
                                "items": {"type": "string"},
                                "description": "Ingredient filters",
                            },
                            "dietary_restrictions": {
                                "type": "array",
                                "items": {"type": "string"},
                                "description": "Dietary restrictions (vegan, gluten-free, etc.)",
                            },
                            "max_cook_time": {
                                "type": "integer",
                                "description": "Maximum cooking time in minutes",
                            },
                        },
                    },
                },
            }
        ]

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task}],
            tools=tools,
            tool_choice="auto",
        )

        # Extract tool call (tool_calls is None when the model answers in plain text)
        message = response.choices[0].message
        tool_calls = getattr(message, "tool_calls", None)

        if not tool_calls:
            # No tool call made
            return {
                "id": test_case.get("id"),
                "task": task,
                "tool_selection_correct": False,
                "args_correct": False,
                "overall_correct": False,
                "error": "No tool call made",
                "expected": {"tool": expected_tool, "args": expected_args},
                "predicted": None,
                "difficulty": test_case.get("difficulty"),
            }

        # Parse tool call
        tool_call = tool_calls[0]
        predicted_tool = tool_call.function.name
        predicted_args = json.loads(tool_call.function.arguments)

        # Step 4: Evaluate accuracy
        tool_correct = predicted_tool == expected_tool
        args_correct = predicted_args == expected_args
        overall_correct = tool_correct and args_correct

        # Step 5: Return results
        return {
            "id": test_case.get("id"),
            "task": task,
            "tool_selection_correct": tool_correct,
            "args_correct": args_correct,
            "overall_correct": overall_correct,
            "expected": {"tool": expected_tool, "args": expected_args},
            "predicted": {"tool": predicted_tool, "args": predicted_args},
            "difficulty": test_case.get("difficulty"),
        }

    except Exception as e:
        # Handle API and JSON parsing errors gracefully; keep the same keys as the success path
        return {
            "id": test_case.get("id"),
            "task": task,
            "tool_selection_correct": False,
            "args_correct": False,
            "overall_correct": False,
            "error": str(e),
            "expected": {"tool": expected_tool, "args": expected_args},
            "predicted": None,
            "difficulty": test_case.get("difficulty"),
        }


# Run BFCL evaluation
print("\n🔧 Running BFCL evaluation...")
bfcl_results = []
for case in tqdm(bfcl_data, desc="BFCL"):
    result = evaluate_bfcl_case(case, client, MODEL)
    bfcl_results.append(result)

# Calculate aggregate metrics
n_tool_correct = sum(r["tool_selection_correct"] for r in bfcl_results)
n_args_correct = sum(r["args_correct"] for r in bfcl_results)
n_overall_correct = sum(r["overall_correct"] for r in bfcl_results)
n_total = len(bfcl_results)

print("\n📊 BFCL Results:")
print(f"  Tool Selection Accuracy: {n_tool_correct}/{n_total} ({100*n_tool_correct/n_total:.1f}%)")
print(f"  Args Validation Accuracy: {n_args_correct}/{n_total} ({100*n_args_correct/n_total:.1f}%)")
print(f"  Overall Accuracy: {n_overall_correct}/{n_total} ({100*n_overall_correct/n_total:.1f}%)")

## 5. PlanBench Evaluation Engine

In [None]:
def evaluate_planbench_case(
    test_case: dict[str, Any], client: OpenAI, model: str, evaluator: TrajectoryEvaluator
) -> dict[str, Any]:
    """Evaluate a single PlanBench planning test case.

    Args:
        test_case: PlanBench test case
        client: OpenAI client
        model: Model name
        evaluator: TrajectoryEvaluator for metric calculation

    Returns:
        Evaluation results with trajectory metrics

    Raises:
        TypeError: If inputs are invalid types
        ValueError: If test case is missing required fields
    """
    # Step 1: Type checking
    if not isinstance(test_case, dict):
        raise TypeError("test_case must be a dict")
    if not isinstance(model, str):
        raise TypeError("model must be a string")

    # Step 2: Extract test case data
    task = test_case.get("task")
    goal = test_case.get("goal")
    gold_plan = test_case.get("gold_plan", {})
    gold_steps = gold_plan.get("steps", [])

    if not task or not gold_steps:
        raise ValueError("Test case missing required fields: task or gold_plan.steps")

    # Convert gold plan to trajectory (list of tool names)
    reference_trajectory = [step["tool"] for step in gold_steps]

    # Step 3: Generate agent plan
    try:
        prompt = f"""Task: {task}
Goal: {goal}

Generate a step-by-step plan to accomplish this goal. For each step, specify:
1. The tool to use (search_recipes, filter_results, etc.)
2. The arguments for that tool
3. The rationale for this step

Format your response as a JSON array of steps:
[{{"tool": "tool_name", "args": {{...}}, "rationale": "..."}}]
"""

        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}], temperature=0.0
        )

        # Parse response
        content = response.choices[0].message.content

        # Extract JSON (handle markdown code blocks)
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0].strip()
        elif "```" in content:
            content = content.split("```")[1].split("```")[0].strip()

        predicted_plan = json.loads(content)
        predicted_trajectory = [step["tool"] for step in predicted_plan]

        # Step 4: Calculate trajectory metrics
        metrics = {
            "exact_match": evaluator.exact_match(reference_trajectory, predicted_trajectory),
            "in_order_match": evaluator.in_order_match(
                reference_trajectory, predicted_trajectory
            ),
            "any_order_match": evaluator.any_order_match(
                reference_trajectory, predicted_trajectory
            ),
            "precision": evaluator.precision(reference_trajectory, predicted_trajectory),
            "recall": evaluator.recall(reference_trajectory, predicted_trajectory),
            "single_tool_use": evaluator.single_tool_use(
                reference_trajectory, predicted_trajectory
            ),
        }

        # Step 5: Return results
        return {
            "id": test_case.get("id"),
            "task": task,
            "goal": goal,
            "reference_trajectory": reference_trajectory,
            "predicted_trajectory": predicted_trajectory,
            "metrics": metrics,
            "difficulty": test_case.get("difficulty"),
        }

    except Exception as e:
        # Handle errors gracefully
        return {
            "id": test_case.get("id"),
            "task": task,
            "goal": goal,
            "reference_trajectory": reference_trajectory,
            "predicted_trajectory": [],
            "metrics": {
                "exact_match": 0.0,
                "in_order_match": 0.0,
                "any_order_match": 0.0,
                "precision": 0.0,
                "recall": 0.0,
                "single_tool_use": 0.0,
            },
            "error": str(e),
            "difficulty": test_case.get("difficulty"),
        }


# Initialize evaluator
evaluator = TrajectoryEvaluator()

# Run PlanBench evaluation
print("\n📝 Running PlanBench evaluation...")
planbench_results = []
for case in tqdm(planbench_data, desc="PlanBench"):
    result = evaluate_planbench_case(case, client, MODEL, evaluator)
    planbench_results.append(result)

# Calculate aggregate metrics
avg_metrics = defaultdict(float)
for result in planbench_results:
    for metric, value in result["metrics"].items():
        avg_metrics[metric] += value

n_planbench = len(planbench_results)
for metric in avg_metrics:
    avg_metrics[metric] /= n_planbench

print("\n📊 PlanBench Results (Average Trajectory Metrics):")
for metric, value in avg_metrics.items():
    print(f"  {metric}: {value:.3f}")

## 6. Results Analysis and Aggregation

In [None]:
# Aggregate results
results_summary = {
    "mode": MODE,
    "timestamp": datetime.now().isoformat(),
    "model": MODEL,
    "n_samples": {"bfcl": N_BFCL, "planbench": N_PLANBENCH, "total": N_BFCL + N_PLANBENCH},
    "bfcl_results": {
        "tool_selection_accuracy": n_tool_correct / n_total,
        "args_validation_accuracy": n_args_correct / n_total,
        "overall_accuracy": n_overall_correct / n_total,
        "details": bfcl_results,
    },
    "planbench_results": {
        "avg_trajectory_metrics": dict(avg_metrics),
        "details": planbench_results,
    },
}

# Create summary DataFrame
summary_df = pd.DataFrame(
    {
        "Benchmark": ["BFCL (Tool Calling)", "PlanBench (Planning)"],
        "N_Samples": [N_BFCL, N_PLANBENCH],
        "Primary Metric": [
            f"{100*n_overall_correct/n_total:.1f}% Overall Accuracy",
            f"{100*avg_metrics['any_order_match']:.1f}% Any-Order Match",
        ],
        "Tool Selection": [
            f"{100*n_tool_correct/n_total:.1f}%",
            f"{100*avg_metrics['precision']:.1f}% Precision",
        ],
        "Efficiency": [
            f"{100*n_args_correct/n_total:.1f}% Args",
            f"{100*avg_metrics['single_tool_use']:.1f}% Single-Tool",
        ],
    }
)

print("\n" + "=" * 80)
print("📊 BENCHMARK EVALUATION SUMMARY")
print("=" * 80)
print(summary_df.to_string(index=False))
print("=" * 80)

## 7. Visualizations

In [None]:
# Create visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 7.1 Benchmark Comparison Bar Chart
ax1 = axes[0]
benchmarks = ["BFCL\nOverall", "BFCL\nTool", "BFCL\nArgs", "PlanBench\nAny-Order", "PlanBench\nPrecision", "PlanBench\nRecall"]
accuracies = [
    100 * n_overall_correct / n_total,
    100 * n_tool_correct / n_total,
    100 * n_args_correct / n_total,
    100 * avg_metrics["any_order_match"],
    100 * avg_metrics["precision"],
    100 * avg_metrics["recall"],
]
colors = ["#3498db", "#5dade2", "#85c1e9", "#e74c3c", "#ec7063", "#f1948a"]
bars = ax1.bar(benchmarks, accuracies, color=colors, alpha=0.8)
ax1.set_ylabel("Accuracy (%)", fontsize=12)
ax1.set_title(f"Benchmark Comparison ({MODE} Mode)", fontsize=14, fontweight="bold")
ax1.set_ylim(0, 100)
ax1.axhline(y=70, color="gray", linestyle="--", alpha=0.5, label="70% Threshold")
ax1.legend()
ax1.grid(axis="y", alpha=0.3)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    ax1.text(
        bar.get_x() + bar.get_width() / 2.0,
        height + 2,
        f"{acc:.1f}%",
        ha="center",
        va="bottom",
        fontsize=10,
    )

# 7.2 Trajectory Metrics Radar Chart
visualizer = TrajectoryVisualizer()
radar_data = visualizer.generate_radar_chart(avg_metrics)
labels = [label.replace("_", "\n") for label in radar_data["labels"]]
values = list(radar_data["values"])

# Radar chart setup
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values_plot = values + [values[0]]  # Close the polygon
angles_plot = angles + [angles[0]]

# Swap the Cartesian axes created by plt.subplots for a polar one in the same slot
axes[1].remove()
ax2 = fig.add_subplot(1, 3, 2, projection="polar")
ax2.plot(angles_plot, values_plot, "o-", linewidth=2, color="#e74c3c")
ax2.fill(angles_plot, values_plot, alpha=0.25, color="#e74c3c")
ax2.set_xticks(angles)
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylim(0, 1)
ax2.set_yticks([0.25, 0.5, 0.75, 1.0])
ax2.set_yticklabels(["25%", "50%", "75%", "100%"], fontsize=8)
ax2.set_title("PlanBench Trajectory Metrics", fontsize=14, fontweight="bold", pad=20)
ax2.grid(True)

# 7.3 Error Type Distribution
ax3 = axes[2]
error_types = ["BFCL\nTool Error", "BFCL\nArgs Error", "PlanBench\nLow Recall", "PlanBench\nLow Precision"]
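# The PlanBench entries are estimates: the average recall/precision shortfall
# scaled by the number of cases and truncated to an integer, not counts of failed cases.
error_counts = [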
    n_total - n_tool_correct,
    n_total - n_args_correct,
    int(n_planbench * (1 - avg_metrics["recall"])),
    int(n_planbench * (1 - avg_metrics["precision"])),
]
colors_err = ["#e74c3c", "#f39c12", "#9b59b6", "#3498db"]
ax3.barh(error_types, error_counts, color=colors_err, alpha=0.8)
ax3.set_xlabel("Error Count", fontsize=12)
ax3.set_title("Error Type Distribution", fontsize=14, fontweight="bold")
ax3.grid(axis="x", alpha=0.3)

# Add value labels
for i, count in enumerate(error_counts):
    ax3.text(count + 0.2, i, str(count), va="center", fontsize=10)

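plt.tight_layout()
# Create the output directory first; savefig fails if it does not exist yet
Path("results").mkdir(parents=True, exist_ok=True)
plt.savefig("results/benchmark_evaluation_visualizations.png", dpi=300, bbox_inches="tight")
plt.show()

print("✅ Visualizations saved to: results/benchmark_evaluation_visualizations.png")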

## 8. Export Results

In [None]:
# Ensure results directory exists
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

# Export to JSON
output_path = results_dir / "benchmark_results.json"
with open(output_path, "w") as f:
    json.dump(results_summary, f, indent=2)

print(f"✅ Results exported to: {output_path}")
print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")

## 9. Validation and Quality Checks

In [None]:
# Validation checks
print("\n🔍 Running validation checks...\n")

checks_passed = 0
total_checks = 7

# Check 1: Data completeness
if len(bfcl_results) == N_BFCL and len(planbench_results) == N_PLANBENCH:
    print("✅ Check 1: Data completeness - All test cases evaluated")
    checks_passed += 1
else:
    print(
        f"❌ Check 1: Data completeness - Missing results (BFCL: {len(bfcl_results)}/{N_BFCL}, PlanBench: {len(planbench_results)}/{N_PLANBENCH})"
    )

# Check 2: BFCL accuracy threshold
if n_overall_correct / n_total >= 0.5:  # 50% threshold for basic functionality
    print(f"✅ Check 2: BFCL accuracy - Above 50% threshold ({100*n_overall_correct/n_total:.1f}%)")
    checks_passed += 1
else:
    print(
        f"⚠️  Check 2: BFCL accuracy - Below 50% threshold ({100*n_overall_correct/n_total:.1f}%)"
    )

# Check 3: PlanBench any-order match
if avg_metrics["any_order_match"] >= 0.4:  # 40% threshold
    print(
        f"✅ Check 3: PlanBench any-order match - Above 40% threshold ({100*avg_metrics['any_order_match']:.1f}%)"
    )
    checks_passed += 1
else:
    print(
        f"⚠️  Check 3: PlanBench any-order match - Below 40% threshold ({100*avg_metrics['any_order_match']:.1f}%)"
    )

# Check 4: Results JSON schema
required_keys = ["mode", "timestamp", "model", "n_samples", "bfcl_results", "planbench_results"]
if all(key in results_summary for key in required_keys):
    print("✅ Check 4: Results JSON schema - All required fields present")
    checks_passed += 1
else:
    print("❌ Check 4: Results JSON schema - Missing required fields")

# Check 5: No critical errors
n_bfcl_errors = sum(1 for r in bfcl_results if "error" in r)
n_planbench_errors = sum(1 for r in planbench_results if "error" in r)
if n_bfcl_errors == 0 and n_planbench_errors == 0:
    print("✅ Check 5: Error rate - No critical errors")
    checks_passed += 1
else:
    print(
        f"⚠️  Check 5: Error rate - {n_bfcl_errors} BFCL errors, {n_planbench_errors} PlanBench errors"
    )

# Check 6: Trajectory metrics valid range
all_valid = all(0.0 <= v <= 1.0 for v in avg_metrics.values())
if all_valid:
    print("✅ Check 6: Trajectory metrics - All values in valid range [0.0, 1.0]")
    checks_passed += 1
else:
    print("❌ Check 6: Trajectory metrics - Some values out of range")

# Check 7: Output files exist
json_exists = (results_dir / "benchmark_results.json").exists()
viz_exists = (results_dir / "benchmark_evaluation_visualizations.png").exists()
if json_exists and viz_exists:
    print("✅ Check 7: Output files - All files generated successfully")
    checks_passed += 1
else:
    print(f"❌ Check 7: Output files - Missing files (JSON: {json_exists}, Viz: {viz_exists})")

# Final summary
print(f"\n{'=' * 60}")
print(f"Validation Summary: {checks_passed}/{total_checks} checks passed")
if checks_passed == total_checks:
    print("✅ All validation checks passed!")
elif checks_passed >= total_checks - 2:
    print("⚠️  Most checks passed with minor issues")
else:
    print("❌ Multiple validation issues detected")
print(f"{'=' * 60}")

## 10. Key Insights and Recommendations

### BFCL (Tool Calling) Insights:
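- **Tool Selection**: Measures whether the agent picks the right tool for the task
- **Args Validation**: Checks whether the predicted arguments match the expected ones; this notebook uses strict dictionary equality, so any missing, extra, or differently typed argument counts as a failure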
- **Common Failures**:
  - Missing required arguments
  - Incorrect argument types
  - Wrong tool selection for ambiguous tasks
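
The argument-related failure modes above can be told apart by comparing expected and predicted arguments key by key. The following is a minimal sketch, not part of the notebook's evaluation code: `diagnose_args` is a hypothetical helper, the category names are assumptions chosen for illustration, and the sample arguments are invented.

```python
from typing import Any


def diagnose_args(expected: dict[str, Any], predicted: dict[str, Any]) -> dict[str, list[str]]:
    """Classify argument mismatches (hypothetical helper, illustrative categories)."""
    report: dict[str, list[str]] = {
        "missing": [],      # expected keys the model did not provide
        "unexpected": [],   # keys the model added that were not expected
        "wrong_type": [],   # same key, value of a different Python type
        "wrong_value": [],  # same key and type, different value
    }
    for key, expected_value in expected.items():
        if key not in predicted:
            report["missing"].append(key)
        elif type(predicted[key]) is not type(expected_value):
            report["wrong_type"].append(key)
        elif predicted[key] != expected_value:
            report["wrong_value"].append(key)
    report["unexpected"] = sorted(set(predicted) - set(expected))
    return report


# Invented example: the cook time arrives as a string and an extra key is added.
expected_args = {"ingredients": ["tofu"], "max_cook_time": 30}
predicted_args = {"ingredients": ["tofu"], "max_cook_time": "30", "cuisine": "thai"}
print(diagnose_args(expected_args, predicted_args))
```

Storing such a report next to `args_correct` would make it easier to see which failure mode dominates, at the cost of a slightly larger results file.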

### PlanBench (Planning) Insights:
- **Any-Order Match**: The most forgiving metric; checks whether all reference steps are present, regardless of order
- **In-Order Match**: Validates correct step sequencing
- **Efficiency**: Single-tool-use metric penalizes redundant steps
- **Common Failures**:
  - Missing intermediate steps
  - Incorrect step ordering
  - Over-planning (too many redundant steps)
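
To make the differences between these metrics concrete, here is a small standalone sketch. The implementation of `TrajectoryEvaluator` is not shown in this notebook, so the definitions below are assumptions based on the common reading of each metric and may differ in detail from the real class. The single-tool-use metric is omitted because its exact definition is not given here, and `rank_results` is a hypothetical tool name used only for the example.

```python
from collections import Counter


def exact_match(reference: list[str], predicted: list[str]) -> float:
    """1.0 only if both trajectories are identical, step for step."""
    return float(reference == predicted)


def in_order_match(reference: list[str], predicted: list[str]) -> float:
    """1.0 if the reference steps occur in the prediction in order (extra steps allowed)."""
    remaining = iter(predicted)  # each membership test consumes the iterator
    return float(all(step in remaining for step in reference))


def any_order_match(reference: list[str], predicted: list[str]) -> float:
    """1.0 if every reference step occurs in the prediction, ignoring order."""
    missing = Counter(reference) - Counter(predicted)
    return float(not missing)


def precision(reference: list[str], predicted: list[str]) -> float:
    """Fraction of predicted steps that also occur in the reference."""
    if not predicted:
        return 0.0
    overlap = Counter(reference) & Counter(predicted)
    return sum(overlap.values()) / len(predicted)


def recall(reference: list[str], predicted: list[str]) -> float:
    """Fraction of reference steps that occur in the prediction."""
    if not reference:
        return 0.0
    overlap = Counter(reference) & Counter(predicted)
    return sum(overlap.values()) / len(reference)


# All steps present, but two are swapped and one is repeated.
reference = ["search_recipes", "filter_results", "rank_results"]
predicted = ["search_recipes", "rank_results", "filter_results", "search_recipes"]

for metric in (exact_match, in_order_match, any_order_match, precision, recall):
    print(f"{metric.__name__}: {metric(reference, predicted):.2f}")
```

Under these assumed definitions, the example plan passes any-order match and has full recall, fails exact and in-order match because of the swap, and loses precision because of the redundant final step.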

### Production Recommendations:
1. **Baseline**: Achieve >70% BFCL accuracy and >60% PlanBench any-order match
2. **Model Selection**: Prefer a stronger model (for example, a GPT-4-class model) for planning tasks; GPT-3.5 may struggle with multi-step plans
3. **Prompt Engineering**: Provide clear tool descriptions and examples
4. **Fallback Strategy**: Implement retry logic for low-confidence predictions
5. **Continuous Eval**: Re-run benchmarks after prompt/model changes
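
The baseline and continuous-evaluation points can be combined into a simple gate that runs after each benchmark pass. This is a sketch under assumptions: `check_baseline` is a hypothetical helper, it expects a dictionary shaped like the `results_summary` built in section 6, and the numbers in the example are invented rather than taken from a real run.

```python
from typing import Any


def check_baseline(
    summary: dict[str, Any],
    min_bfcl: float = 0.70,
    min_any_order: float = 0.60,
) -> list[str]:
    """Return readable failures; an empty list means both baselines are exceeded."""
    bfcl = summary["bfcl_results"]["overall_accuracy"]
    any_order = summary["planbench_results"]["avg_trajectory_metrics"]["any_order_match"]

    failures = []
    if bfcl <= min_bfcl:
        failures.append(f"BFCL overall accuracy {bfcl:.1%} is not above {min_bfcl:.0%}")
    if any_order <= min_any_order:
        failures.append(f"PlanBench any-order match {any_order:.1%} is not above {min_any_order:.0%}")
    return failures


# Invented numbers for illustration only.
example_summary = {
    "bfcl_results": {"overall_accuracy": 0.80},
    "planbench_results": {"avg_trajectory_metrics": {"any_order_match": 0.55}},
}

for failure in check_baseline(example_summary):
    print(failure)
```

Returning messages instead of raising keeps the helper usable both in a notebook and in a CI job, where a non-empty list could be turned into a failing exit code.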

### Next Steps:
- Review error cases to identify patterns
- Fine-tune prompts for low-performing categories
- Implement autoraters for production monitoring
- Expand benchmark coverage with domain-specific test cases
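
As a starting point for reviewing error cases, failed BFCL results can be grouped by difficulty and failure kind. The sketch below is an assumption-laden illustration: `summarize_bfcl_failures` is a hypothetical helper, it relies on the result keys produced by `evaluate_bfcl_case` above, and the sample records are invented.

```python
from collections import Counter
from typing import Any


def summarize_bfcl_failures(results: list[dict[str, Any]]) -> Counter:
    """Count failed cases by (difficulty, failure kind)."""
    counts: Counter = Counter()
    for result in results:
        if result.get("overall_correct"):
            continue  # only failures are of interest here
        if "error" in result:
            kind = "error"  # API failure, parse failure, or no tool call
        elif not result.get("tool_selection_correct"):
            kind = "wrong_tool"
        else:
            kind = "wrong_args"
        difficulty = result.get("difficulty") or "unknown"
        counts[(difficulty, kind)] += 1
    return counts


# Invented records for illustration only.
sample_results = [
    {"overall_correct": True, "tool_selection_correct": True, "difficulty": "easy"},
    {"overall_correct": False, "tool_selection_correct": True, "difficulty": "hard"},
    {"overall_correct": False, "tool_selection_correct": True, "difficulty": "hard"},
    {"overall_correct": False, "tool_selection_correct": False, "error": "No tool call made"},
]

for (difficulty, kind), count in summarize_bfcl_failures(sample_results).most_common():
    print(f"{difficulty:>8} {kind:<10} {count}")
```

A cluster such as many `wrong_args` failures on hard cases would suggest focusing prompt changes on argument descriptions rather than on tool selection.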