# Scoring Model Outputs with Scorebook

This notebook demonstrates how to use Scorebook's `score()` function to evaluate pre-computed model outputs.

## When to use `score()`

- You already have model outputs and want to compute metrics
- You want to re-score existing results with different metrics
- You're importing evaluation results from another framework



## Setup

First, let's import the necessary modules:

In [None]:
from pprint import pprint
from scorebook import score
from scorebook.metrics import Accuracy

## Prepare Your Data

The `score()` function expects a list of items, where each item is a dictionary with the following keys:
- `input`: The input to the model (optional, for reference)
- `output`: The model's prediction
- `label`: The ground truth answer

In [None]:
# Example: Pre-computed model outputs
items = [
    {
        "input": "What is 2 + 2?",
        "output": "4",
        "label": "4"
    },
    {
        "input": "What is the capital of France?",
        "output": "Paris",
        "label": "Paris"
    },
    {
        "input": "Who wrote Romeo and Juliet?",
        "output": "William Shakespeare",
        "label": "William Shakespeare"
    },
    {
        "input": "What is 5 * 6?",
        "output": "30",
        "label": "30"
    },
    {
        "input": "What is the largest planet in our solar system?",
        "output": "Jupiter",
        "label": "Jupiter"
    },
]

print(f"Prepared {len(items)} items for scoring")
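
The items above are written by hand. When outputs and labels come from another framework, they often arrive as separate parallel lists. The sketch below zips them into the same list-of-dicts shape. `build_items` is a hypothetical helper written for this illustration and is not part of Scorebook.

```python
from typing import Optional, Sequence


def build_items(
    outputs: Sequence[str],
    labels: Sequence[str],
    inputs: Optional[Sequence[str]] = None,
) -> list[dict]:
    """Zip parallel sequences into dicts with "output", "label", and optional "input".

    Hypothetical helper for illustration; not part of Scorebook.
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    if inputs is not None and len(inputs) != len(outputs):
        raise ValueError("inputs must have the same length as outputs")

    built = []
    for i, (output, label) in enumerate(zip(outputs, labels)):
        item = {"output": output, "label": label}
        if inputs is not None:
            # "input" is optional and kept only for reference
            item["input"] = inputs[i]
        built.append(item)
    return built


example_items = build_items(
    outputs=["4", "Paris"],
    labels=["4", "Paris"],
    inputs=["What is 2 + 2?", "What is the capital of France?"],
)
print(example_items[0])
```

The length checks fail early on misaligned lists, which is easier to debug than a silently truncated `zip`.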

## Score the Results

Now call `score()` on the prepared items to compute accuracy:

In [None]:
results = score(
    items=items,
    metrics=Accuracy,
    dataset_name="basic_questions",
    model_name="example-model",
    upload_results=False,  # Set to True to upload to Trismik
)

pprint(results)

## Understanding the Results

The results dictionary contains:
- `aggregates`: Overall metrics (e.g., accuracy across all items)
- `items`: Per-item scores and predictions
- `metadata`: Information about the dataset and model

In [None]:
# View aggregate metrics
print("\nAggregate Metrics:")
print(f"Accuracy: {results['aggregates']['Accuracy']:.2%}")

# View per-item scores (first three items only)
print("\nPer-Item Scores:")
for i, item in enumerate(results['items'][:3], 1):
    print(f"\nItem {i}:")
    print(f"  Output: {item['output']}")
    print(f"  Label: {item['label']}")
    print(f"  Accuracy: {item['Accuracy']}")
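
To make the aggregate number concrete: accuracy is the fraction of items whose output matches the label. The sketch below recomputes it by hand, assuming exact string equality. Scorebook's `Accuracy` may apply different matching rules, so treat this as a sanity check and not a reimplementation. `manual_accuracy` and `sample_items` are hypothetical names used only here.

```python
def manual_accuracy(items: list[dict]) -> float:
    """Fraction of items whose "output" equals "label" exactly.

    Hypothetical helper; assumes exact string equality.
    """
    if not items:
        raise ValueError("cannot compute accuracy over an empty list")
    correct = sum(1 for item in items if item["output"] == item["label"])
    return correct / len(items)


# Illustrative data: three matches and one mismatch
sample_items = [
    {"output": "4", "label": "4"},
    {"output": "Lyon", "label": "Paris"},
    {"output": "30", "label": "30"},
    {"output": "Jupiter", "label": "Jupiter"},
]

print(f"Manual accuracy: {manual_accuracy(sample_items):.2%}")
```

If this value disagrees with the aggregate reported by `score()` on the same items, the likely cause is a difference in how outputs and labels are compared.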

## Using Multiple Metrics

You can score with multiple metrics at once:

In [None]:
from scorebook.metrics import Precision

# Binary classification example: 2 of the 4 outputs match their labels
binary_items = [
    {"output": "positive", "label": "positive"},
    {"output": "negative", "label": "negative"},
    {"output": "positive", "label": "negative"},
    {"output": "negative", "label": "positive"},
]

multi_metric_results = score(
    items=binary_items,
    metrics=[Accuracy, Precision],
    dataset_name="sentiment",
    model_name="example-classifier",
    upload_results=False,
)

print("\nMultiple Metrics Results:")
# Assumes every aggregate value is numeric; the .2% format raises an error otherwise
for metric_name, value in multi_metric_results['aggregates'].items():
    print(f"{metric_name}: {value:.2%}")
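
For intuition about what precision measures on this data, the sketch below computes it by hand as TP / (TP + FP). It assumes `"positive"` is the positive class; whether Scorebook's `Precision` uses that convention or averages across classes is not stated here, so the library's value may differ. `manual_precision` and `binary_items_copy` are hypothetical names.

```python
def manual_precision(items: list[dict], positive_label: str = "positive") -> float:
    """Precision = TP / (TP + FP) for a single positive class.

    Hypothetical helper; the choice of positive class is an assumption.
    """
    true_pos = sum(
        1 for item in items
        if item["output"] == positive_label and item["label"] == positive_label
    )
    false_pos = sum(
        1 for item in items
        if item["output"] == positive_label and item["label"] != positive_label
    )
    predicted_pos = true_pos + false_pos
    if predicted_pos == 0:
        # No positive predictions: return 0.0 by convention
        return 0.0
    return true_pos / predicted_pos


# Same four items as the binary example above
binary_items_copy = [
    {"output": "positive", "label": "positive"},
    {"output": "negative", "label": "negative"},
    {"output": "positive", "label": "negative"},
    {"output": "negative", "label": "positive"},
]

print(f"Manual precision: {manual_precision(binary_items_copy):.2%}")
```

Of the two items predicted `"positive"`, one is correct, so the hand-computed precision is one half under the stated assumption.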

## Next Steps

- Try the **Evaluate** notebook to learn how to run inference and scoring together
- See the **Upload Results** notebook to upload your scores to Trismik's dashboard
- Explore custom metrics in the Scorebook documentation