# Evaluating Models with Scorebook

This notebook demonstrates how to use Scorebook's `evaluate()` function to run inference and compute metrics in a single step.

## When to use `evaluate()`

- You want to run inference on a dataset and score the results
- You're comparing different models on the same dataset
- You want to track hyperparameters alongside results

## Prerequisites

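This example uses a local Hugging Face model, so it needs the `scorebook` and `transformers` packages, plus PyTorch and `accelerate` (required for `device_map="auto"`). For cloud models (OpenAI, Anthropic), see the examples directory.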

## Setup

Import necessary modules:

In [None]:
from pprint import pprint
from typing import Any, List
import transformers

from scorebook import EvalDataset, evaluate

## Initialize Your Model

Set up a Hugging Face `transformers` pipeline for inference:

In [None]:
model_name = "microsoft/Phi-4-mini-instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

print(f"Model loaded: {model_name}")

## Define Your Inference Function

Define a function that takes a list of dataset inputs, plus hyperparameters as keyword arguments, and returns one output per input in the same order:

In [None]:
def inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process inputs through the model.
    
    Args:
        inputs: List of input values from the dataset
        hyperparameters: Model hyperparameters (e.g., temperature, system_message)
        
    Returns:
        List of model outputs
    """
    outputs = []
    
    for input_val in inputs:
        # Build messages for the model
        messages = [
            {
                "role": "system",
                "content": hyperparameters.get("system_message", "You are a helpful assistant.")
            },
            {"role": "user", "content": str(input_val)},
        ]
        
        # Run inference
        result = pipeline(
            messages,
            max_new_tokens=hyperparameters.get("max_new_tokens", 100),
        )
        
        # Extract the answer
        output = str(result[0]["generated_text"][-1]["content"])
        outputs.append(output)
    
    return outputs
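
The extraction step `result[0]["generated_text"][-1]["content"]` depends on the shape the text-generation pipeline returns for chat-style input: a list holding one dict whose `generated_text` is the whole conversation, ending with the assistant's reply. The sketch below illustrates that indexing on a hand-built stand-in. `fake_result` and `extract_reply` are hypothetical names, and the structure is an assumption about the installed `transformers` version, so print a real `result` once to confirm it.

```python
# Hand-built stand-in for a chat-style pipeline result (not real model output).
fake_result = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "4"},
        ]
    }
]


def extract_reply(result):
    """Return the content of the last message, i.e. the assistant's reply."""
    # result[0]            -> first (and only) generation
    # ["generated_text"]   -> list of chat messages
    # [-1]["content"]      -> text of the final message
    return str(result[0]["generated_text"][-1]["content"])


reply = extract_reply(fake_result)
print(reply)
```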

## Load Your Dataset

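Write a small sample dataset to a JSON file, then load it as an `EvalDataset`, naming the input field, the label field, and the metric to compute: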

In [None]:
# Create a sample dataset
import json
from pathlib import Path

# Sample data
sample_data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Save to temporary file
temp_dataset = Path("temp_dataset.json")
with open(temp_dataset, "w") as f:
    json.dump(sample_data, f)

# Load as EvalDataset
dataset = EvalDataset.from_json(
    path=str(temp_dataset),
    metrics="accuracy",
    input="question",
    label="answer",
)

print(f"Loaded dataset with {len(dataset.items)} items")
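
Because the loader is told which fields hold the input and the label, it can help to check a JSON file for missing fields before loading it. The following is a minimal sketch using only the standard library; `required` and `bad_rows` are illustrative names, and the second record is deliberately incomplete to show what the check catches.

```python
import json

# Two example records; the second is missing its "answer" field on purpose.
raw = """
[
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?"}
]
"""

records = json.loads(raw)
required = {"question", "answer"}

# Collect the index of every record that lacks a required field.
bad_rows = [i for i, row in enumerate(records) if not required.issubset(row)]

print(f"Records missing a required field: {bad_rows}")
```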

## Run Evaluation

Use `evaluate()` to run inference and compute metrics:

In [None]:
results = evaluate(
    inference,
    dataset,
    hyperparameters={
        "system_message": "Answer the question directly and concisely.",
        "max_new_tokens": 50,
    },
    return_aggregates=True,
    return_items=True,
    return_output=True,
    upload_results=False,  # Set to True to upload to Trismik
)

pprint(results)

## Analyze Results

Examine the outputs and metrics:

In [None]:
# Overall accuracy
print(f"\nOverall Accuracy: {results['aggregates']['accuracy']:.2%}")

# Per-item results
print("\nPer-Item Results:")
for i, item in enumerate(results['items'], 1):
    print(f"\nQuestion {i}: {item['input']}")
    print(f"  Model Output: {item['output']}")
    print(f"  Expected: {item['label']}")
    print(f"  Correct: {'✓' if item['accuracy'] == 1.0 else '✗'}")
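
When an item is marked incorrect, compare the raw output against the label: a chat model often wraps the answer in a sentence or adds punctuation. This notebook does not show how the `accuracy` metric compares strings, so the sketch below is only a diagnostic aid under the assumption that the comparison is strict. `normalize` and `exact_match` are hypothetical helpers, not part of Scorebook.

```python
import string


def normalize(text: str) -> str:
    """Lowercase the text and strip surrounding whitespace and punctuation."""
    return text.strip().lower().strip(string.punctuation + " ")


def exact_match(output: str, label: str) -> bool:
    """Compare output and label after normalization."""
    return normalize(output) == normalize(label)


# (model output, expected label) pairs written by hand for illustration.
examples = [
    ("4", "4"),
    ("Paris.", "Paris"),
    ("The capital is Paris.", "Paris"),
]

matches = [exact_match(output, label) for output, label in examples]
print(matches)
```

The last pair still fails because the label is embedded in a longer sentence; a concise system message or a smaller `max_new_tokens` value is one way to nudge the model toward bare answers.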

## Hyperparameter Sweeps

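Run `evaluate()` once per candidate setting and compare the resulting accuracy. With only three items, differences between settings are not statistically meaningful, so use a larger dataset for real comparisons: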

In [None]:
# Test different system messages
system_messages = [
    "Answer briefly.",
    "Answer the question directly and concisely.",
    "Provide a detailed answer.",
]

sweep_results = []

for msg in system_messages:
    result = evaluate(
        inference,
        dataset,
        hyperparameters={"system_message": msg},
        return_aggregates=True,
        upload_results=False,
    )
    sweep_results.append({
        "system_message": msg,
        "accuracy": result['aggregates']['accuracy']
    })

print("\nHyperparameter Sweep Results:")
for r in sweep_results:
    print(f"  {r['system_message'][:30]:30s} → {r['accuracy']:.2%}")
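
The loop above varies one hyperparameter. To vary several at once, one option is to build every combination up front and pass each dict as `hyperparameters=` in the same loop. The sketch below only constructs the combinations; `grid` and `configs` are illustrative names, and the values are examples rather than recommendations.

```python
from itertools import product

# Candidate values per hyperparameter; the keys match those read by inference().
grid = {
    "system_message": ["Answer briefly.", "Provide a detailed answer."],
    "max_new_tokens": [50, 100],
}

# One dict per combination of values (the Cartesian product of the grid).
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for config in configs:
    print(config)
```

The number of runs grows multiplicatively with each added hyperparameter, so keep the grid small when every run loads a local model.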

## Cleanup

In [None]:
# Remove temporary dataset file
temp_dataset.unlink(missing_ok=True)  # no error if the cell is re-run
print("Cleanup complete")

## Next Steps

- Try the **Adaptive Evaluations** notebook for efficient testing with fewer questions
- See the **Upload Results** notebook to track results in Trismik's dashboard
- Explore batch processing for faster evaluation of large datasets
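
As a starting point for batch processing, the inputs can be split into fixed-size chunks so that each chunk is sent to the model in a single call instead of one call per item. The sketch below shows only the chunking; `batched` is a hypothetical helper, and whether a given model or pipeline accepts a list of conversations per call is an assumption to verify.

```python
from typing import Any, Iterator, List


def batched(items: List[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


questions = [f"question {i}" for i in range(5)]
batches = list(batched(questions, 2))

for batch in batches:
    print(batch)
```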