# _Classical_ Evaluations with Scorebook

Scorebook, developed by Trismik, is an open-source Python library for model evaluation. It supports both Trismik’s adaptive testing and traditional classical evaluations. In a classical evaluation, a model runs inference on every item in a dataset, and the results are scored using Scorebook’s built-in metrics, such as accuracy, to produce evaluation results. Evaluation results can be automatically uploaded to the Scorebook dashboard, organized by project, for storing, managing, and visualizing model evaluation experiments.

## Prerequisites

- **Trismik API key**: Generate a Trismik API key from the [Trismik dashboard's settings page](https://app.trismik.com/settings).
- **Trismik project ID**: We recommend using the project ID generated in the [Getting Started Quick-Start Guide]().

### Install Scorebook


In [None]:
!pip install scorebook



### Setup Credentials

Enter your Trismik API key and project ID below.

In [None]:
# Set your credentials here
TRISMIK_API_KEY = "your-trismik-api-key-here"
TRISMIK_PROJECT_ID = "your-trismik-project-id-key-here"
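
Hard-coded secrets are easy to leak when a notebook is shared. One common alternative, sketched below, reads the values from environment variables and falls back to the placeholders. The helper `read_credential` and the use of `TRISMIK_API_KEY` and `TRISMIK_PROJECT_ID` as environment variable names are assumptions of this example, not something Scorebook is documented here to read on its own.

```python
import os


def read_credential(env_name: str, placeholder: str) -> str:
    """Return the environment variable's value, or the placeholder if it is unset or empty."""
    # Hypothetical helper for this example; not part of Scorebook.
    return os.environ.get(env_name) or placeholder


# Environment variable names are assumed for illustration.
TRISMIK_API_KEY = read_credential("TRISMIK_API_KEY", "your-trismik-api-key-here")
TRISMIK_PROJECT_ID = read_credential("TRISMIK_PROJECT_ID", "your-trismik-project-id-key-here")
```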

### Login with Trismik API Key

In [None]:
from scorebook import login

login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")



## Evaluation Datasets

A Scorebook evaluation requires an evaluation dataset, represented by the `EvalDataset` class. Evaluation datasets can be constructed through several factory methods. In this example, we create a basic evaluation dataset from a list of evaluation items.

In [None]:
from scorebook import EvalDataset
from scorebook.metrics.accuracy import Accuracy

# Create a sample dataset from a list of question-answer items
evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"}
]

# Create an EvalDataset from the list
dataset = EvalDataset.from_list(
    name = "sample_multiple_choice",
    metrics = Accuracy,
    items = evaluation_items,
    input = "question",
    label = "answer",
)

print(f"✓ Created dataset with {len(dataset.items)} items")
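
Conceptually, the `input` and `label` arguments name the field that is fed to the model and the field that outputs are scored against. The plain-Python sketch below illustrates that idea only; it is not Scorebook's implementation. The helpers `split_items` and `exact_match_accuracy` are hypothetical, and scoring by exact string match is an assumption about how an accuracy metric might behave.

```python
from typing import Any, Dict, List, Tuple


def split_items(
    items: List[Dict[str, Any]], input_key: str, label_key: str
) -> Tuple[List[Any], List[Any]]:
    """Separate each item into the model input and the expected label."""
    inputs = [item[input_key] for item in items]
    labels = [item[label_key] for item in items]
    return inputs, labels


def exact_match_accuracy(outputs: List[Any], labels: List[Any]) -> float:
    """Fraction of outputs equal to their label (assumed scoring rule)."""
    if not labels:
        return 0.0
    correct = sum(1 for output, label in zip(outputs, labels) if output == label)
    return correct / len(labels)


sample_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
inputs, labels = split_items(sample_items, "question", "answer")

# Stand-in model outputs: one right, one wrong.
outputs = ["4", "Lyon"]
print(exact_match_accuracy(outputs, labels))  # 0.5
```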

## Preparing Models for Evaluation

To evaluate a model with Scorebook, wrap it in an inference function. An inference function accepts a list of model inputs, passes them to the model for inference, and returns the collected outputs.

### Instantiate a Local Qwen Model

For this quick-start guide, we will use the lightweight Qwen2.5 0.5B Instruct model via Hugging Face's `transformers` package.

In [None]:
import transformers

# Instantiate a model
pipeline = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

print("✓ Transformers pipeline instantiated")

### Define an Inference Function

An inference function can encapsulate any model, local or cloud-hosted. How it is defined is flexible; the only requirement is the function signature. An inference function must:

Accept:

- A list of model inputs.
- Hyperparameters which can be optionally accessed via kwargs.

Return:

- A list of parsed model outputs for scoring.


In [None]:
from typing import Any, List

# Define an inference function for the Qwen model.
def qwen(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Run inference on a list of inputs using the 0.5B Qwen model."""
    inference_outputs = []

    for model_input in inputs:
        messages = [
            {"role": "system", "content": hyperparameters.get("system_message", "You are a helpful assistant.")},
            {"role": "user", "content": str(model_input)},
        ]

        output = pipeline(
            messages,
            temperature = hyperparameters.get("temperature", 0.7),
            top_p = hyperparameters.get("top_p", 0.9),
            top_k = hyperparameters.get("top_k", 50),
            max_new_tokens = 512,
            do_sample = hyperparameters.get("temperature", 0.7) > 0,
        )

        inference_outputs.append(output[0]["generated_text"][-1]["content"])

    return inference_outputs

print("✓ Inference function for Qwen2.5 0.5B defined")
print(qwen(["Hello!"]))
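
Because the contract is simply "list in, list out", the same signature can wrap something far simpler than a model, which can be handy for checking the rest of an evaluation pipeline before loading any weights. The sketch below is a hypothetical stand-in: `CANNED_ANSWERS`, `clean_output`, `canned_inference`, and the `fallback` hyperparameter are names invented for this example, and the light clean-up step is only one possible reading of "parsed" outputs.

```python
from typing import Any, Dict, List

# Hypothetical raw "model" responses, deliberately untidy.
CANNED_ANSWERS: Dict[str, str] = {
    "What is 2 + 2?": " 4. ",
    "What is the capital of France?": "Paris\n",
}


def clean_output(raw: str) -> str:
    """Strip surrounding whitespace and a single trailing period."""
    text = raw.strip()
    return text[:-1].rstrip() if text.endswith(".") else text


def canned_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Return one cleaned output per input, in the same order."""
    fallback = hyperparameters.get("fallback", "")
    return [clean_output(CANNED_ANSWERS.get(str(item), fallback)) for item in inputs]


print(canned_inference(["What is 2 + 2?", "Unknown question"], fallback="n/a"))
# ['4', 'n/a']
```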

## Running an Evaluation

Running a Scorebook evaluation with `evaluate` requires only an inference function and a dataset. When uploading results to Trismik's dashboard, an experiment ID and a project ID are also required. We can also pass hyperparameters, which are forwarded to the inference function.

In [None]:
from scorebook import evaluate

# Run evaluation
results = evaluate(
    inference = qwen,
    datasets = dataset,
    hyperparameters = {
        'temperature': 0.9,
        'top_p': 0.8,
        'top_k': 40,
        'system_message': "Answer the question directly, provide no additional context."
    },
    experiment_id = "Qwen-Classical-Evaluation",
    project_id = TRISMIK_PROJECT_ID,
)

print("Qwen2.5 0.5B Evaluation Results:")
print(f"accuracy: {results[0]['accuracy']}")

The results are returned as a list of dictionaries, one per evaluation run. The example above executes a single run because only one dataset and one hyperparameter configuration are evaluated.
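
A list of dictionaries can be handled with ordinary Python. The sketch below uses a hand-written stand-in for `results`, since the full set of keys Scorebook returns is not shown here. Only the `accuracy` key comes from the example above; the `run` key, the helper `best_run`, and all values are hypothetical placeholders rather than real evaluation output.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the list returned by evaluate(); values are made up.
example_results: List[Dict[str, Any]] = [
    {"run": "temperature=0.9", "accuracy": 0.5},
    {"run": "temperature=0.2", "accuracy": 0.75},
]


def best_run(runs: List[Dict[str, Any]], metric: str = "accuracy") -> Dict[str, Any]:
    """Return the run dictionary with the highest value for the given metric."""
    return max(runs, key=lambda run: run[metric])


# Print one line per run, then the label of the best one.
for run in example_results:
    print(f"{run['run']}: accuracy = {run['accuracy']:.2f}")
print(best_run(example_results)["run"])
```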

---

## Next Steps

- [Scorebook Docs](https://docs.trismik.com/scorebook/introduction-to-scorebook/): Scorebook's full documentation.
- [Scorebook Repository](https://github.com/trismik/scorebook): Scorebook is an open-source library; view the code and more examples.