# Toolscore Quick Start Tutorial

This notebook demonstrates the basics of using Toolscore to evaluate LLM tool usage.

## What You'll Learn

1. How to load gold standards and traces
2. How to run evaluations
3. How to interpret metrics
4. How to generate reports

## Setup

First, make sure Toolscore is installed:

In [None]:
# Install Toolscore if needed (the package is published as tool-scorer)
# !pip install tool-scorer

import json
import sys

# Development only: make the local source checkout importable
sys.path.insert(0, '../..')

from toolscore import evaluate_trace

## 1. Load Example Files

Toolscore comes with example files to help you get started quickly.

In [None]:
# Paths to example files
gold_file = "../gold_calls.json"
trace_file = "../trace_openai.json"

# Let's look at the gold standard
with open(gold_file) as f:
    gold_data = json.load(f)

print("Gold Standard:")
print(json.dumps(gold_data, indent=2))

In [None]:
# Look at the trace
with open(trace_file) as f:
    trace_data = json.load(f)

print("Trace Data:")
print(json.dumps(trace_data, indent=2))

## 2. Run Evaluation

Now let's evaluate the trace against the gold standard:

In [None]:
# Run evaluation
result = evaluate_trace(
    gold_file=gold_file,
    trace_file=trace_file,
    format="openai"  # Specify format for faster processing
)

print("✅ Evaluation complete!")
print(f"   Expected calls: {len(result.gold_calls)}")
print(f"   Actual calls: {len(result.trace_calls)}")

## 3. View Metrics

Let's examine the key metrics:

In [None]:
metrics = result.metrics

print("=== Core Metrics ===")
print(f"Invocation Accuracy: {metrics['invocation_accuracy']:.1%}")
print(f"Selection Accuracy:  {metrics['selection_accuracy']:.1%}")

seq = metrics['sequence_metrics']
print(f"\nSequence Accuracy:   {seq['sequence_accuracy']:.1%}")
print(f"Edit Distance:       {seq['edit_distance']}")

args = metrics['argument_metrics']
print(f"\nArgument F1:         {args['f1']:.1%}")
print(f"Argument Precision:  {args['precision']:.1%}")
print(f"Argument Recall:     {args['recall']:.1%}")

eff = metrics['efficiency_metrics']
print(f"\nRedundant Calls:     {eff['redundant_count']}/{eff['total_calls']}")
print(f"Redundant Rate:      {eff['redundant_rate']:.1%}")

## 4. Understanding Metrics

### What do these metrics mean?

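- **Invocation Accuracy**: Did the agent invoke tools when they were needed?
- **Selection Accuracy**: Did it choose the correct tools?
- **Sequence Accuracy**: Did it call the tools in the right order? The accompanying **Edit Distance** counts the edits needed to turn the actual call sequence into the expected one, so lower is better.
- **Argument F1**: How well did the arguments match? F1 is the harmonic mean of argument precision and recall.
- **Redundant Call Rate**: What share of the calls were unnecessary duplicates?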
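
To make two of these definitions concrete, here is a minimal sketch of how a sequence edit distance and argument precision, recall, and F1 can be computed. It illustrates the standard formulas and is not Toolscore's actual implementation. The function names `edit_distance` and `argument_prf`, the sample tool names and arguments, and the rule that an argument counts as matched only when both key and value are equal are all assumptions made for this example.

```python
def edit_distance(expected, actual):
    """Levenshtein distance between two sequences of tool names (illustrative)."""
    prev = list(range(len(actual) + 1))
    for i, exp in enumerate(expected, start=1):
        curr = [i]
        for j, act in enumerate(actual, start=1):
            cost = 0 if exp == act else 1
            curr.append(min(
                prev[j] + 1,         # drop a call from the expected sequence
                curr[j - 1] + 1,     # add a call from the actual sequence
                prev[j - 1] + cost,  # keep or substitute a call
            ))
        prev = curr
    return prev[-1]


def argument_prf(expected_args, actual_args):
    """Precision, recall, and F1 over exact key/value argument matches (assumed rule)."""
    matched = sum(
        1 for key, value in actual_args.items()
        if key in expected_args and expected_args[key] == value
    )
    precision = matched / len(actual_args) if actual_args else 0.0
    recall = matched / len(expected_args) if expected_args else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: the agent skipped the "fetch" call.
distance = edit_distance(["search", "fetch", "summarize"], ["search", "summarize"])

# Hypothetical data: one of three supplied arguments matches one of two expected.
precision, recall, f1 = argument_prf(
    {"query": "weather", "limit": 5},
    {"query": "weather", "limit": 10, "lang": "en"},
)

print(f"Edit distance: {distance}")
print(f"Precision: {precision:.1%}  Recall: {recall:.1%}  F1: {f1:.1%}")
```

In this example, precision is 1/3 and recall is 1/2, so F1 is 0.4: the harmonic mean sits closer to the lower of the two values, which is why a high F1 requires both to be high.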

## 5. Generate Reports

You can generate HTML and JSON reports:

In [None]:
from toolscore.reports import generate_html_report, generate_json_report

# Generate JSON report
json_path = generate_json_report(result, "quickstart_report.json")
print(f"JSON report saved to: {json_path}")

# Generate HTML report
html_path = generate_html_report(result, "quickstart_report.html")
print(f"HTML report saved to: {html_path}")
print(f"\nOpen {html_path} in your browser to view the interactive report!")

## 6. Try Different Formats

Toolscore supports multiple trace formats:

In [None]:
# Try Anthropic format
anthropic_trace = "../trace_anthropic.json"

result_anthropic = evaluate_trace(
    gold_file=gold_file,
    trace_file=anthropic_trace,
    format="anthropic"
)

print("Anthropic Trace Evaluation:")
print(f"Selection Accuracy: {result_anthropic.metrics['selection_accuracy']:.1%}")

In [None]:
# Auto-detect format (recommended for flexibility)
result_auto = evaluate_trace(
    gold_file=gold_file,
    trace_file=trace_file,
    format="auto"  # Toolscore will detect the format
)

print("Auto-detected format evaluation:")
print(f"Selection Accuracy: {result_auto.metrics['selection_accuracy']:.1%}")
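
Because every trace is scored against the same gold standard, it can help to view the results side by side. The helper below is a small sketch for doing that: `metric_table` is a hypothetical function, not part of Toolscore, and it assumes only that each metrics object is a dict with top-level float entries such as `selection_accuracy`, as used earlier in this notebook. The numbers are placeholders, not real evaluation output.

```python
def metric_table(metrics_by_label, keys):
    """Build text rows comparing top-level metrics across labelled metric dicts."""
    header = f"{'metric':<22}" + "".join(f"{label:>12}" for label in metrics_by_label)
    rows = [header]
    for key in keys:
        cells = "".join(f"{m[key]:>12.1%}" for m in metrics_by_label.values())
        rows.append(f"{key:<22}{cells}")
    return rows


# Placeholder values for illustration only. In this notebook you would pass
# {"openai": result.metrics, "anthropic": result_anthropic.metrics} instead.
sample = {
    "openai": {"invocation_accuracy": 1.0, "selection_accuracy": 1.0},
    "anthropic": {"invocation_accuracy": 1.0, "selection_accuracy": 0.5},
}

rows = metric_table(sample, ["invocation_accuracy", "selection_accuracy"])
print("\n".join(rows))
```

Only top-level metrics are handled here. Nested groups such as `argument_metrics` would need to be flattened first.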

## Next Steps

- Check out `02_custom_formats.ipynb` to learn about custom trace formats
- See `03_advanced_metrics.ipynb` for deep dives into metrics
- Read the [documentation](https://toolscore.readthedocs.io/) for the complete API reference

## Summary

In this tutorial, you learned:

✅ How to load gold standards and traces

✅ How to run evaluations with `evaluate_trace()`

✅ How to interpret key metrics

✅ How to generate HTML/JSON reports

✅ How to work with different trace formats