# _Classic_ Evaluations with Scorebook

Scorebook, developed by Trismik, is an open-source Python library for model evaluation. It supports both Trismik’s adaptive testing and classical evaluations. In a classical evaluation, a model runs inference on every item in a dataset, and the results are scored with Scorebook’s built-in metrics (including accuracy, precision, recall, ROUGE, BLEU, and BERTScore) to produce a complete performance report.

Custom metrics can be implemented and integrated to suit specific evaluation needs.

Scorebook also enables evaluation across a grid or list of hyperparameter configurations, streamlining model optimization.

Evaluation results can be automatically uploaded to the Scorebook dashboard, organized by project, for storing, managing, and visualizing model evaluation experiments.

## Evaluation Datasets

A Scorebook evaluation requires an evaluation dataset, represented by the `EvalDataset` class. Evaluation datasets can be constructed through several factory methods. This example creates a basic evaluation dataset from a list of evaluation items, where each item is a dictionary holding the model input and its expected label.

In [None]:
from scorebook import EvalDataset
from scorebook.metrics.accuracy import Accuracy

# Create a sample dataset from a list of question-answer items
evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Create an EvalDataset from the list
dataset = EvalDataset.from_list(
    name="sample_multiple_choice",
    metrics=Accuracy,
    items=evaluation_items,
    input="question",
    label="answer",
)

print(f"Created dataset with {len(dataset.items)} items")
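
To make the scoring step concrete, here is a minimal sketch of an accuracy computation over predictions and labels. It assumes accuracy means the fraction of exact string matches after trimming whitespace; `exact_match_accuracy` is a hypothetical helper written for illustration, and Scorebook's `Accuracy` metric may compare values differently.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of predictions equal to their label (whitespace-trimmed)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return correct / len(labels)


# Labels from the sample dataset above; the predictions are made up for illustration.
labels = ["4", "Paris", "William Shakespeare"]
predictions = ["4", "Paris", "Shakespeare"]

score = exact_match_accuracy(predictions, labels)
print(f"accuracy = {score:.2f}")
```

Under exact matching, a partially correct answer such as "Shakespeare" counts as wrong, which is worth keeping in mind when a model produces free-form text.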

## Inference Functions

To evaluate a model with Scorebook, it must be wrapped in an inference function. An inference function accepts a list of model inputs, passes them to the model for inference, and returns the generated outputs as a list in the same order.

For this example, we will use the Microsoft Phi-4 Mini Instruct model, via Hugging Face's transformers package.

In [None]:
import transformers

# Instantiate a model
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto"
)

In [None]:
from typing import Any, List

# Define an inference function
def inference_function(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process a list of inputs through a model.

    Args:
        inputs: List of input values from the dataset
        hyperparameters: Model hyperparameters (passed automatically by Scorebook)

    Returns:
        List of model outputs (predictions)
    """
    inference_outputs = []
    for model_input in inputs:

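        # Wrap the input in the model's chat message format.
        # The default system message is a placeholder used when none is supplied.
        messages = [
            {
                "role": "system",
                "content": hyperparameters.get(
                    "system_message", "Answer the question concisely."
                ),
            },
            {"role": "user", "content": model_input},
        ]

        # Run inference on the item, with sampling enabled so that
        # temperature and top_p take effect
        output = pipeline(
            messages,
            do_sample=True,
            temperature=hyperparameters.get("temperature"),
            top_p=hyperparameters.get("top_p"),
        )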

        # Extract and collect the output generated from the model's response
        inference_outputs.append(output[0]["generated_text"][-1]["content"])

    return inference_outputs
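
The contract described above (a list of inputs in, a list of outputs out, hyperparameters as keyword arguments) can be checked without loading a model. The sketch below uses a hypothetical `stub_inference_function` with canned answers; it is an illustration of the expected shape, not part of Scorebook.

```python
from typing import Any, List


def stub_inference_function(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Return one canned answer per input, ignoring the hyperparameters."""
    canned_answers = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    # One output per input, in the same order as the inputs
    return [canned_answers.get(model_input, "unknown") for model_input in inputs]


outputs = stub_inference_function(
    ["What is 2 + 2?", "Who wrote Romeo and Juliet?"],
    temperature=0.7,
)
print(outputs)
```

A stub like this can be handy for exercising the rest of an evaluation pipeline quickly before swapping in the real model.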

## Hyperparameter Sweeps

In [None]:
# Define hyperparameters with lists to create a grid search
hyperparameters = {
    "temperature": [0.6, 0.7, 0.8, 0.9],
    "top_p": [0.8, 0.9],
}
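
Assuming list-valued hyperparameters are expanded into their Cartesian product, as the grid-search comment suggests, the dictionary above would describe 4 × 2 = 8 configurations. The sketch below shows that expansion with `itertools.product`; `expand_grid` is a hypothetical helper, and Scorebook's own expansion may order or represent configurations differently.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand a dict of value lists into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


grid = {
    "temperature": [0.6, 0.7, 0.8, 0.9],
    "top_p": [0.8, 0.9],
}

configs = expand_grid(grid)
print(f"{len(configs)} configurations")
print(configs[0])
```

Because the number of configurations is the product of the list lengths, and each configuration runs inference on the full dataset, grid size directly multiplies evaluation cost.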

## Running an Evaluation

In [None]:
from scorebook import evaluate

# Run evaluation across all hyperparameter combinations
results = evaluate(
    inference_function,                # Your inference function
    dataset,                           # Your evaluation dataset
    hyperparameters=hyperparameters,   # Hyperparameter grid
)
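
Conceptually, a classical evaluation over several configurations amounts to running the inference function once per configuration and scoring each run. The sketch below illustrates that loop in plain Python under stated assumptions: `classical_evaluation` and `constant_model` are hypothetical names, scoring is exact-match accuracy, and the row format is invented for illustration. It does not reflect how `evaluate` is implemented or what it returns.

```python
from typing import Any, Callable, Dict, List


def classical_evaluation(
    inference_fn: Callable[..., List[Any]],
    items: List[Dict[str, str]],
    configs: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run inference on every item for each config and score by exact match."""
    inputs = [item["question"] for item in items]
    labels = [item["answer"] for item in items]
    rows = []
    for config in configs:
        outputs = inference_fn(inputs, **config)
        correct = sum(output == label for output, label in zip(outputs, labels))
        rows.append({**config, "accuracy": correct / len(labels)})
    return rows


def constant_model(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Toy stand-in for a model: always answers "4"."""
    return ["4" for _ in inputs]


items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
rows = classical_evaluation(constant_model, items, [{"temperature": 0.6}, {"temperature": 0.9}])
for row in rows:
    print(row)
```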

## Evaluation Results