# Advanced Metrics Deep Dive

This notebook provides an in-depth look at Toolscore's metrics and how to interpret them.

## What You'll Learn

1. Understanding each metric in detail
2. When to use which metric
3. How to optimize for specific metrics
4. Real-world metric interpretation

In [None]:
import sys
sys.path.insert(0, '../..')  # Development only: import Toolscore from the local checkout

from toolscore import evaluate_trace
from toolscore.adapters.base import ToolCall
from toolscore.metrics import (
    calculate_invocation_accuracy,
    calculate_selection_accuracy,
    calculate_edit_distance,
    calculate_argument_f1,
    calculate_redundant_call_rate
)
import json

## 1. Invocation Accuracy

**Question**: Did the agent invoke tools when needed and refrain when not needed?

In [None]:
# Scenario 1: Perfect match
gold = [ToolCall(tool="search"), ToolCall(tool="summarize")]
trace = [ToolCall(tool="search"), ToolCall(tool="summarize")]

accuracy = calculate_invocation_accuracy(gold, trace)
print(f"Perfect match: {accuracy:.1%}")

# Scenario 2: Missing invocations
trace_missing = [ToolCall(tool="search")]  # Missing summarize
accuracy_missing = calculate_invocation_accuracy(gold, trace_missing)
print(f"Missing tool: {accuracy_missing:.1%}")

# Scenario 3: Extra invocations
trace_extra = [ToolCall(tool="search"), ToolCall(tool="summarize"), ToolCall(tool="translate")]
accuracy_extra = calculate_invocation_accuracy(gold, trace_extra)
print(f"Extra tool: {accuracy_extra:.1%}")

### When to Use Invocation Accuracy

- Detecting whether the agent **over- or under-uses** tools
- Checking that the agent knows **when** to use tools
- Benchmarking different prompting strategies
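
One way to read the question above is as a yes/no decision per task: did the task need a tool, and did the agent call one? The sketch below scores that decision across several tasks. It is a self-contained illustration under that assumption, not Toolscore's implementation; `invocation_decision_accuracy` and the sample cases are hypothetical, and `calculate_invocation_accuracy` may define its score differently.

```python
def invocation_decision_accuracy(cases):
    """Fraction of tasks where the agent's use/no-use decision matched the gold one.

    `cases` is a list of (gold_tools, trace_tools) pairs, each a list of tool names.
    This is an illustrative definition, not Toolscore's.
    """
    if not cases:
        return 0.0
    correct = 0
    for gold_tools, trace_tools in cases:
        tool_needed = len(gold_tools) > 0
        tool_used = len(trace_tools) > 0
        if tool_needed == tool_used:
            correct += 1
    return correct / len(cases)


cases = [
    (["search"], ["search"]),  # tool needed, tool used
    ([], []),                  # no tool needed, none used
    ([], ["search"]),          # over-use: tool called when none was needed
    (["search"], []),          # under-use: tool needed but never called
]
print(f"Decision accuracy: {invocation_decision_accuracy(cases):.1%}")
```

Scoring over-use and under-use separately, rather than folding them into one number, makes it easier to see which way a prompt change moved the agent.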

## 2. Selection Accuracy

**Question**: Did the agent choose the correct tools?

In [None]:
# Scenario 1: All correct
gold = [ToolCall(tool="read"), ToolCall(tool="write")]
trace = [ToolCall(tool="read"), ToolCall(tool="write")]
accuracy = calculate_selection_accuracy(gold, trace)
print(f"All correct: {accuracy:.1%}")

# Scenario 2: Half wrong
trace_half = [ToolCall(tool="read"), ToolCall(tool="delete")]  # Wrong second tool
accuracy_half = calculate_selection_accuracy(gold, trace_half)
print(f"Half wrong: {accuracy_half:.1%}")

# Scenario 3: All wrong
trace_wrong = [ToolCall(tool="copy"), ToolCall(tool="move")]
accuracy_wrong = calculate_selection_accuracy(gold, trace_wrong)
print(f"All wrong: {accuracy_wrong:.1%}")

### When to Use Selection Accuracy

- Measuring whether the agent picks **appropriate tools**
- Comparing how well different models understand the available tools
- Identifying confusing tool names
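
To find confusing tool names, it can help to tally which tool was expected and which was called whenever the two differ. The sketch below does this on plain lists of tool names. `tool_confusions` is a hypothetical helper, and comparing calls position by position is an assumption made for illustration; it is not necessarily how `calculate_selection_accuracy` aligns calls.

```python
from collections import Counter


def tool_confusions(gold_tools, trace_tools):
    """Count (expected, actual) pairs at positions where the tool names differ.

    zip() stops at the shorter list, so unmatched trailing calls are ignored.
    """
    return Counter(
        (expected, actual)
        for expected, actual in zip(gold_tools, trace_tools)
        if expected != actual
    )


gold_tools = ["read", "write", "read", "write"]
trace_tools = ["read", "delete", "read", "delete"]

confusions = tool_confusions(gold_tools, trace_tools)
for (expected, actual), count in confusions.most_common():
    print(f"expected {expected!r}, got {actual!r}: {count}x")
```

A pair that shows up repeatedly across many traces is a good candidate for renaming or for a clearer tool description.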

## 3. Sequence Edit Distance

**Question**: Did the agent call tools in the right order?

In [None]:
# Scenario 1: Perfect order
gold = [ToolCall(tool="A"), ToolCall(tool="B"), ToolCall(tool="C")]
trace = [ToolCall(tool="A"), ToolCall(tool="B"), ToolCall(tool="C")]
result = calculate_edit_distance(gold, trace)
print(f"Perfect order: distance={result['edit_distance']}, accuracy={result['sequence_accuracy']:.1%}")

# Scenario 2: Swapped order
trace_swap = [ToolCall(tool="B"), ToolCall(tool="A"), ToolCall(tool="C")]
result_swap = calculate_edit_distance(gold, trace_swap)
print(f"Swapped: distance={result_swap['edit_distance']}, accuracy={result_swap['sequence_accuracy']:.1%}")

# Scenario 3: Missing step
trace_missing = [ToolCall(tool="A"), ToolCall(tool="C")]  # Skipped B
result_missing = calculate_edit_distance(gold, trace_missing)
print(f"Missing step: distance={result_missing['edit_distance']}, accuracy={result_missing['sequence_accuracy']:.1%}")

### When to Use Sequence Metrics

- Workflows where **order matters** (e.g., authenticate → query → logout)
- Multi-step planning evaluation
- Identifying whether the agent understands dependencies between steps
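
To build intuition for the distance values, here is a minimal Levenshtein edit distance over lists of tool names. It counts the insertions, deletions, and substitutions needed to turn one sequence into the other. `levenshtein` and `normalized_accuracy` are hypothetical helpers, and normalizing by the longer sequence is an assumption; `calculate_edit_distance` may use a different edit model or normalization.

```python
def levenshtein(expected, actual):
    """Minimum number of insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of
    # `expected` and the first j items of `actual`.
    prev = list(range(len(actual) + 1))
    for i, exp_item in enumerate(expected, start=1):
        curr = [i]
        for j, act_item in enumerate(actual, start=1):
            substitution_cost = 0 if exp_item == act_item else 1
            curr.append(min(
                prev[j] + 1,                      # delete exp_item
                curr[j - 1] + 1,                  # insert act_item
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_accuracy(expected, actual):
    """1 - distance / longer length (assumed normalization); 1.0 if both are empty."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(expected, actual) / longest


gold_tools = ["A", "B", "C"]
for label, trace_tools in [("swapped", ["B", "A", "C"]), ("missing", ["A", "C"])]:
    distance = levenshtein(gold_tools, trace_tools)
    accuracy = normalized_accuracy(gold_tools, trace_tools)
    print(f"{label}: distance={distance}, accuracy={accuracy:.1%}")
```

Under plain Levenshtein distance, swapping two adjacent tools costs two edits while skipping one tool costs a single edit, so a reordering can score worse than an omission.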

## 4. Argument F1 Score

**Question**: How well did the agent's arguments match the expected arguments?

In [None]:
# Scenario 1: Perfect match
gold = [ToolCall(tool="search", args={"query": "Python", "limit": 10})]
trace = [ToolCall(tool="search", args={"query": "Python", "limit": 10})]
result = calculate_argument_f1(gold, trace)
print(f"Perfect: P={result['precision']:.1%}, R={result['recall']:.1%}, F1={result['f1']:.1%}")

# Scenario 2: Missing argument
trace_missing = [ToolCall(tool="search", args={"query": "Python"})]  # Missing limit
result_missing = calculate_argument_f1(gold, trace_missing)
print(f"Missing arg: P={result_missing['precision']:.1%}, R={result_missing['recall']:.1%}, F1={result_missing['f1']:.1%}")

# Scenario 3: Extra arguments
trace_extra = [ToolCall(tool="search", args={"query": "Python", "limit": 10, "sort": "date"})]
result_extra = calculate_argument_f1(gold, trace_extra)
print(f"Extra arg: P={result_extra['precision']:.1%}, R={result_extra['recall']:.1%}, F1={result_extra['f1']:.1%}")

### Understanding Precision vs Recall

- **Precision**: Of the arguments the agent provided, how many were correct?
- **Recall**: Of the expected arguments, how many did the agent provide?
- **F1**: Harmonic mean of precision and recall, `2PR / (P + R)`; it is high only when both are high
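
The arithmetic behind these three numbers can be worked through by hand. The sketch below treats each `(name, value)` pair as one item and counts a pair as correct only when both the name and the value match. That matching rule and the `argument_prf` helper are assumptions for illustration; `calculate_argument_f1` may match arguments differently.

```python
def argument_prf(expected_args, actual_args):
    """Precision, recall, and F1 over (name, value) argument pairs.

    Values must be hashable because the pairs are stored in sets.
    """
    expected_items = set(expected_args.items())
    actual_items = set(actual_args.items())
    true_positives = len(expected_items & actual_items)

    precision = true_positives / len(actual_items) if actual_items else 0.0
    recall = true_positives / len(expected_items) if expected_items else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


expected = {"query": "Python", "limit": 10}

missing = argument_prf(expected, {"query": "Python"})
extra = argument_prf(expected, {"query": "Python", "limit": 10, "sort": "date"})

print(f"Missing arg: {missing}")
print(f"Extra arg:   {extra}")
```

With this rule, a missing argument lowers recall and leaves precision untouched, while an extra argument does the opposite.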

## 5. Redundant Call Rate

**Question**: Did the agent make unnecessary or duplicate calls?

In [None]:
# Scenario 1: No redundant calls
gold = [ToolCall(tool="A"), ToolCall(tool="B")]
trace = [ToolCall(tool="A"), ToolCall(tool="B")]
result = calculate_redundant_call_rate(gold, trace)
print(f"No redundancy: {result['redundant_count']}/{result['total_calls']} ({result['redundant_rate']:.1%})")

# Scenario 2: Extra unnecessary calls
trace_extra = [ToolCall(tool="A"), ToolCall(tool="B"), ToolCall(tool="C"), ToolCall(tool="D")]
result_extra = calculate_redundant_call_rate(gold, trace_extra)
print(f"With redundancy: {result_extra['redundant_count']}/{result_extra['total_calls']} ({result_extra['redundant_rate']:.1%})")

### When to Use Redundancy Rate

- Optimizing **efficiency** and **cost**
- Detecting repetitive behavior
- Identifying prompt improvements
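
One simple way to think about redundancy is to count the calls in a trace that go beyond what the gold standard requires, whether they are repeats of a needed tool or calls to tools that were never needed. The sketch below does this on plain lists of tool names. `surplus_call_rate` is a hypothetical helper and this definition is an assumption; `calculate_redundant_call_rate` may count redundancy differently, for example by also comparing arguments.

```python
from collections import Counter


def surplus_call_rate(gold_tools, trace_tools):
    """Return (surplus_count, total_calls, surplus_rate) for a trace.

    Counter subtraction keeps only positive counts, so each tool
    contributes the number of calls beyond what the gold list contains.
    """
    surplus = Counter(trace_tools) - Counter(gold_tools)
    surplus_count = sum(surplus.values())
    total_calls = len(trace_tools)
    surplus_rate = surplus_count / total_calls if total_calls else 0.0
    return surplus_count, total_calls, surplus_rate


gold_tools = ["A", "B"]
for trace_tools in (["A", "B"], ["A", "A", "B"], ["A", "B", "C", "D"]):
    count, total, rate = surplus_call_rate(gold_tools, trace_tools)
    print(f"{trace_tools}: {count}/{total} surplus ({rate:.1%})")
```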

## 6. Real-World Example: Comparing Models

Let's compare two example traces, one in OpenAI format and one in Anthropic format, standing in for hypothetical GPT-4 and Claude runs:

In [None]:
# Path to the gold-standard tool calls
gold_file = "../gold_calls.json"

# Evaluate both traces
result_openai = evaluate_trace(gold_file, "../trace_openai.json", format="openai")
result_anthropic = evaluate_trace(gold_file, "../trace_anthropic.json", format="anthropic")

# Compare metrics
print("=== Model Comparison ===")
print("\nInvocation Accuracy:")
print(f"  OpenAI:    {result_openai.metrics['invocation_accuracy']:.1%}")
print(f"  Anthropic: {result_anthropic.metrics['invocation_accuracy']:.1%}")

print("\nSelection Accuracy:")
print(f"  OpenAI:    {result_openai.metrics['selection_accuracy']:.1%}")
print(f"  Anthropic: {result_anthropic.metrics['selection_accuracy']:.1%}")

print("\nSequence Accuracy:")
print(f"  OpenAI:    {result_openai.metrics['sequence_metrics']['sequence_accuracy']:.1%}")
print(f"  Anthropic: {result_anthropic.metrics['sequence_metrics']['sequence_accuracy']:.1%}")

## 7. Metric Selection Guide

| Use Case | Primary Metrics | Secondary Metrics |
|----------|----------------|------------------|
| Function calling correctness | Selection Accuracy, Argument F1 | Invocation Accuracy |
| Multi-step planning | Sequence Edit Distance | Selection Accuracy |
| Cost optimization | Redundant Call Rate | Invocation Accuracy |
| Model comparison | All metrics | Side-effect validation |
| Prompt engineering | Selection Accuracy, Argument F1 | Redundant Call Rate |

## 8. Tips for Improvement

### Low Selection Accuracy?
- Improve tool descriptions
- Reduce number of similar tools
- Add examples to tool documentation

### Low Argument F1?
- Make parameter names clearer
- Provide examples in tool schema
- Use stricter typing
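
The three tips above can be combined in a single tool definition. The sketch below describes the `search` tool from the Argument F1 section using standard JSON Schema keywords (`type`, `minimum`, `maximum`, `required`, `additionalProperties`). The outer `name` / `description` / `parameters` layout and the specific bounds are assumptions for illustration; adapt them to the format your model provider expects.

```python
import json

# Hypothetical tool definition; the limits and wording are placeholders.
search_tool_schema = {
    "name": "search",
    "description": (
        "Search for documents matching a query. "
        'Example call: {"query": "Python", "limit": 10}'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Free-text search query, e.g. 'Python'.",
            },
            "limit": {
                "type": "integer",  # stricter than a generic number
                "minimum": 1,
                "maximum": 50,
                "description": "Maximum number of results to return.",
            },
        },
        "required": ["query", "limit"],
        # Reject arguments that are not declared above.
        "additionalProperties": False,
    },
}

print(json.dumps(search_tool_schema, indent=2))
```

Declaring `required` targets missing arguments (recall), and `additionalProperties: False` targets undeclared extras (precision).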

### High Redundancy?
- Improve context management
- Add explicit stop conditions
- Review prompt for clarity

## Summary

In this notebook, you learned:

✅ Deep understanding of each metric

✅ When to use which metrics

✅ How to interpret metric scores

✅ How to compare models systematically

✅ Practical tips for improvement

## Next Steps

- Apply these metrics to your own agents
- Create custom metrics for specific needs
- Read the [metrics API documentation](https://toolscore.readthedocs.io/en/latest/api/metrics.html)