# ScoreBook Showcase
This notebook demonstrates how to use Trismik's ScoreBook library to evaluate large language models (LLMs). ScoreBook defines clear contracts for data loading, inference, and metrics, allowing users to easily extend the library to support any dataset, inference function, or evaluation metric. Out of the box, ScoreBook supports a variety of Hugging Face datasets and pre-defined metrics, but users can customize it to fit their needs. The library facilitates intuitive and efficient LLM experimentation with features like grouping evaluations, batch inference, and hyperparameter grid sweeps, while handling orchestration and submission to the dashboard.

---
## Getting Started
This section shows how ScoreBook can evaluate a model of your choice by scoring it against a dataset. In this basic example, we use a small instruction-tuned model and a three-item example dataset.

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
from scorebook import EvalDataset, evaluate
from scorebook.types.inference_pipeline import InferencePipeline
import transformers

# Create an evaluation dataset from a list of dictionaries
data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]

# Create the evaluation dataset
test_dataset = EvalDataset.from_list(
    name="test_eval",
    label="answer",
    metrics="accuracy",
    data=data
)

# In this example we use a simple Hugging Face text-generation pipeline for inference (use any compatible model you like).
pipeline = transformers.pipeline("text-generation", model="microsoft/Phi-4-mini-instruct")

def inference_function(eval_items: list[dict], **hyperparameters) -> list:
    """Direct inference function that handles preprocessing and postprocessing internally"""
    results = []
    for item in eval_items:

        prompt = item["question"]                           # 1) Preprocessing
        output = pipeline(prompt)                           # 2) Inference
        result = output[0]["generated_text"]                # 3) Postprocessing
        results.append(result)

    return results

# Run the evaluation: ScoreBook calls your inference function, compares predictions to labels, and returns results.
evaluation_results = evaluate(
    inference_function,  # the inference function
    test_dataset,        # the evaluation dataset
    score_type="all"
)
print(evaluation_results)


{'aggregate': [{'dataset_name': 'test_eval', 'accuracy': 0.0}], 'per_sample': [{'item_id': 0, 'dataset_name': 'test_eval', 'accuracy': False}, {'item_id': 1, 'dataset_name': 'test_eval', 'accuracy': False}, {'item_id': 2, 'dataset_name': 'test_eval', 'accuracy': False}]}
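
The accuracy of 0.0 above is probably not a sign of a broken setup. By default, a Hugging Face `text-generation` pipeline called with a plain string returns the prompt followed by the continuation, so the raw `generated_text` is unlikely to equal a short label such as `"4"`. The sketch below assumes that scoring is an exact comparison and shows one possible postprocessing step. `extract_answer` is a hypothetical helper, not part of ScoreBook, and the sample output is made up for illustration.

```python
def extract_answer(prompt: str, generated_text: str) -> str:
    """Drop the echoed prompt, then keep the first non-empty line of the continuation."""
    continuation = generated_text
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    lines = [line.strip() for line in continuation.splitlines() if line.strip()]
    return lines[0] if lines else ""


prompt = "What is the capital of France?"
# Made-up model output: the prompt echoed back, followed by a continuation.
raw = prompt + "\nParis\nParis is the capital of France."
print(extract_answer(prompt, raw))  # Paris
```

A helper like this would replace step 3 of the inference function above. Whether it is enough depends on how the chosen model phrases its answers.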





---
## ScoreBook Components
When working with ScoreBook, there are five core components to consider:
- Evaluation Datasets
- Models
- Metrics
- The Evaluate Function
- Evaluation Results

The typical workflow for ScoreBook involves:
1) Creating an evaluation dataset from local files or from Hugging Face
2) Creating a callable responsible for returning a model's output for each item in the evaluation dataset
3) Assigning metrics to be used in scoring the model
4) Using the `evaluate` function with a model, dataset, and metrics to generate scores

---
### Evaluation Datasets

Evaluation datasets are the foundation of model evaluation in ScoreBook. The `EvalDataset` class provides a unified interface for loading datasets from multiple sources and associating them with evaluation metrics.

**Key Features:**
- Load from HuggingFace Hub, CSV files, JSON files, or Python lists
- Specify which field contains the ground truth labels
- Associate evaluation metrics with the dataset
- Built on top of HuggingFace datasets for compatibility

**Supported Data Sources:**

1. **HuggingFace Hub**: Load any public dataset
2. **CSV Files**: Load from local CSV files
3. **JSON Files**: Support both flat and nested JSON structures
4. **Python Lists**: Create an evaluation dataset from a list of dicts, each mapping field names to values

In [5]:
# Example 1: Load from HuggingFace Hub
mmlu_pro = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy")
print(mmlu_pro)

# Example 2: Load from CSV (using the existing data.csv file)
csv_dataset = EvalDataset.from_csv("data.csv", label="answer", metrics=["accuracy"])
print(f"\nCSV Dataset: \n{csv_dataset}")

# Example 3: Load from JSON (using the existing data.json file)  
json_dataset = EvalDataset.from_json("data.json", label="answer", metrics="accuracy")
print(f"\nJSON Dataset: \n{json_dataset}")

# Example 4: Create from a Python list
data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]
list_dataset = EvalDataset.from_list("demo", label="answer", metrics="accuracy", data=data)
print(f"\nList Dataset: \n{list_dataset}")

<scorebook.types.eval_dataset.EvalDataset object at 0x163f8ad70>

CSV Dataset: 
<scorebook.types.eval_dataset.EvalDataset object at 0x4effb6fd0>

JSON Dataset: 
<scorebook.types.eval_dataset.EvalDataset object at 0x10644ef90>

List Dataset: 
<scorebook.types.eval_dataset.EvalDataset object at 0x1640a2140>


---
### Models

To evaluate a model with ScoreBook, represent it as a single callable that accepts a list of evaluation dataset items and returns a list of parsed model outputs, ready to be scored against metrics.

Within ScoreBook there are two structures that can be used to encapsulate this process:
- **Inference Functions**: A single function which handles this entire process
- **Inference Pipelines**: A callable `InferencePipeline` instance, which separates the logic for pre-processing, inference, and post-processing.
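
The callable contract described above can be illustrated without any model at all. The following is a minimal sketch: `keyword_model` is a hypothetical stand-in that looks answers up in a fixed table, and the only properties it is meant to show are the input type (a list of item dicts), the optional hyperparameters, and the output (one parsed result per item, in order).

```python
from typing import Any


def keyword_model(eval_items: list[dict], **hyperparameters: Any) -> list[str]:
    """Toy stand-in for a model: looks answers up in a fixed table."""
    known_answers = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    # One output per input item, in the same order as the inputs.
    return [known_answers.get(item["question"], "unknown") for item in eval_items]


items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]
outputs = keyword_model(items, temperature=0.7)
print(outputs)  # ['4', 'unknown']
```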

---
### Inference Functions

An inference function is a single function that generates a list of model outputs from a list of evaluation dataset items, to then be scored. It is responsible for formatting evaluation items into a structure the model accepts, generating a prediction for each model input, and parsing the model's output into results for scoring.

In [None]:
import string
from typing import Any, List, Dict


def inference_function(eval_items: List[Dict], **hyperparameters: Any) -> List[Any]:
    """Pre-process dataset items, run inference, and post-process the results."""
    results = []
    for eval_item in eval_items:

        # For each evaluation item, a prompt is created by combining multiple fields.
        prompt = f"{eval_item['question']}\nOptions:\n" + "\n".join(
            [
                f"{letter} : {choice}"
                for letter, choice in zip(string.ascii_uppercase, eval_item["options"])
            ]
        )

        # Each prompt is wrapped in a chat message list, with a system message that carries the instructions.
        messages = [
            {
                "role": "system",
                "content": """
                    Answer the question you are given using only a single letter (for example, 'A').
                        Do not use punctuation. \
                        Do not show your reasoning. \
                        Do not provide any explanation. \
                        Follow the instructions exactly and \
                        always answer using a single uppercase letter.

                        For example, if the question is "What is the capital of France?" and the \
                        choices are "A. Paris", "B. London", "C. Rome", "D. Madrid",
                        - the answer should be "A"
                        - the answer should NOT be "Paris" or "A. Paris" or "A: Paris"

                        Please adhere strictly to the instructions.
                    """,
            },
            {"role": "user", "content": prompt},
        ]

        # For each message, an output is generated, its content extracted, and it is appended to the list of results
        output = pipeline(messages)
        output = output[0]["generated_text"][-1]["content"]
        results.append(output)

    return results

---
### Inference Pipelines

The `InferencePipeline` is ScoreBook's modular approach to model inference. It separates the inference process into three distinct, customizable stages, making it easy to work with different models and data formats.

Inference pipelines provide a structured way to reuse pre-processing, inference, and post-processing logic. For instance, when pre-processing logic is tailored to the schema of a specific evaluation dataset, a new inference pipeline with a different preprocessor can be created for the same model to evaluate a second dataset, without rewriting the inference function or post-processing steps.

**Pipeline Stages:**

1. **Preprocessor**: Converts dataset items into the format expected by a model
2. **Inference Function**: Performs the actual model inference (can be sync or async)
3. **Postprocessor**: Extracts the final prediction from the model's raw output
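
To make the three stages concrete, here is a minimal sketch of how they might compose. It is an illustration only: `run_stages` is a hypothetical function, not ScoreBook's implementation, and it covers only the synchronous case. It assumes the preprocessor and postprocessor run once per item, while the inference function receives the whole list.

```python
from typing import Any, Callable


def run_stages(
    eval_items: list[dict],
    preprocessor: Callable[[dict], Any],
    inference_function: Callable[..., list],
    postprocessor: Callable[[Any], Any],
    **hyperparameters: Any,
) -> list:
    """Illustrative composition of the three stages (synchronous only)."""
    model_inputs = [preprocessor(item) for item in eval_items]         # once per item
    raw_outputs = inference_function(model_inputs, **hyperparameters)  # whole list at once
    return [postprocessor(output) for output in raw_outputs]           # once per output


predictions = run_stages(
    [{"question": "What is 2 + 2?"}, {"question": "What is 3 + 3?"}],
    preprocessor=lambda item: f"Question: {item['question']}\nAnswer:",
    # Mock inference: canned raw outputs with stray whitespace.
    inference_function=lambda inputs, **hp: [" 4\n", " 6\n"][: len(inputs)],
    postprocessor=lambda output: output.strip(),
)
print(predictions)  # ['4', '6']
```

Because only the preprocessor depends on the dataset schema, swapping it is enough to point the same inference and postprocessing logic at a differently shaped dataset.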

**Preprocessor**

The preprocessor function within an inference pipeline converts each item within an evaluation dataset into the format expected by the model. This may include transformations such as combining dataset fields, adding context, and formatting messages within a message dict structure.

When writing a preprocessor function, it must accept a single evaluation dataset item as a parameter and return an output that can be passed into a model.

```python
import string


def preprocessor(eval_item: dict) -> list:
    """Convert evaluation item to model input format."""
    prompt = f"{eval_item['question']}\nOptions:\n" + "\n".join(
        [f"{letter} : {choice}" for letter, choice in zip(string.ascii_uppercase, eval_item["options"])])

    # The system message contains the instructions for the model. We ask the model to adhere strictly to the instructions
    messages = [
        {
            "role": "system",
            "content": """
                Answer the question you are given using only a single letter (for example, 'A').
            """,
        },
        {"role": "user", "content": prompt},
    ]
    return messages
```

**Inference Function**

The inference function is responsible for taking a list of model inputs, generated by the preprocessor function, and returning a list of model outputs.

```python
from typing import Any

import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

def inference_function(processed_items: list[list], **hyperparameters) -> list[Any]:
    """Run model inference on preprocessed items."""
    outputs = []
    for messages in processed_items:
        output = pipeline(messages)
        outputs.append(output)
    return outputs
```

**Postprocessor**

The postprocessor function converts each raw model output into a formatted response ready for scoring. This may include extracting the final generated message from a response that also contains the model input, as well as string parsing.

```python
from typing import Any


def postprocessor(model_output: Any) -> str:
    """Extract the final answer from model output."""
    return str(model_output[0]["generated_text"][-1]["content"])
```
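
The string parsing mentioned above matters when a model ignores the instruction to reply with a single letter. Below is a minimal sketch of one way to recover the letter. `extract_choice_letter` is a hypothetical helper, and the range A to J is an assumption (at most ten options per question).

```python
import re


def extract_choice_letter(text: str) -> str:
    """Return the first standalone uppercase letter A-J, or an empty string if none is found."""
    match = re.search(r"\b([A-J])\b", text)
    return match.group(1) if match else ""


print(extract_choice_letter("A"))          # A
print(extract_choice_letter("B. London"))  # B
print(extract_choice_letter("Answer: C"))  # C
print(repr(extract_choice_letter("no idea")))  # ''
```

This pattern is deliberately simple. A standalone word such as "I" would also be read as a choice, so free-form answers may need a stricter pattern.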

In [None]:
# Basic pipeline example
def simple_preprocessor(item: dict) -> str:
    """Convert evaluation item to model input format."""
    return f"Question: {item['question']}\nAnswer:"

def mock_inference_function(processed_items: list[str], **hyperparameters) -> list:
    """Mock model inference - in practice, use your actual model here."""
    # This is a placeholder - replace with actual model calls
    mock_outputs = [f"Mock answer for: {item[:50]}..." for item in processed_items]
    return mock_outputs

def simple_postprocessor(output) -> str:
    """Extract the final answer from model output."""
    return output.strip().split('\n')[0]  # Get first line of response

# Create the pipeline (named mock_pipeline so it does not shadow the transformers pipeline used above)
mock_pipeline = InferencePipeline(
    model="mock-model",
    preprocessor=simple_preprocessor,
    inference_function=mock_inference_function,
    postprocessor=simple_postprocessor,
)

print("Created inference pipeline with components:")
print(f"- Model: {mock_pipeline.model}")
print("- Preprocessor, inference function, and postprocessor configured")

---
### Metrics

Metrics in ScoreBook quantify how well your model performs by comparing predictions against ground truth labels. The framework provides built-in metrics and supports custom metric creation.

**Built-in Metrics:**

1. **Accuracy**: Measures the percentage of correct predictions
2. **Precision**: Measures the accuracy of positive predictions (Not Implemented)

**Metric Architecture:**
- All metrics inherit from the `MetricBase` abstract class
- Each metric implements a `score()` method that returns both aggregate and per-item scores
- Metrics are registered automatically via the `@MetricRegistry.register()` decorator
- Metrics can be referenced by string name or by class directly

**Usage Patterns:**
- Specify metrics when creating datasets: `metrics=["accuracy", "precision"]`
- Mix built-in and custom metrics: `metrics=[Accuracy, CustomF1Score]`
- Access via registry: `MetricRegistry.get("accuracy")`

Each metric returns both aggregate scores (summary statistics) and item-level scores (individual predictions) for detailed analysis.
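
The two-part return value can be sketched in plain Python. This is an illustration of the shape only, assuming accuracy means exact equality between output and label. `accuracy_sketch` is a hypothetical function, not ScoreBook's `Accuracy` class.

```python
from typing import Any


def accuracy_sketch(outputs: list[Any], labels: list[Any]) -> tuple[dict[str, float], list[bool]]:
    """Exact-equality accuracy returning (aggregate scores, per-item scores)."""
    if len(outputs) != len(labels):
        raise ValueError("Number of outputs must match number of labels")
    item_scores = [output == label for output, label in zip(outputs, labels)]
    aggregate = {"accuracy": sum(item_scores) / len(item_scores) if item_scores else 0.0}
    return aggregate, item_scores


aggregate, per_item = accuracy_sketch(["A", "B", "A", "C", "A"], ["A", "A", "A", "C", "B"])
print(aggregate)  # {'accuracy': 0.6}
print(per_item)   # [True, False, True, True, False]
```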

In [None]:
# Example: Using built-in metrics
from scorebook.metrics import Accuracy, Precision

# Create metric instances
accuracy_metric = Accuracy()
precision_metric = Precision()

# Example data
outputs = ["A", "B", "A", "C", "A"]
labels = ["A", "A", "A", "C", "B"]

# Calculate accuracy scores
acc_aggregate, acc_items = accuracy_metric.score(outputs, labels)
print("Accuracy Results:")
print(f"  Aggregate: {acc_aggregate}")
print(f"  Per-item: {acc_items}")

# Calculate precision scores
# Note: Precision is listed above as not implemented, so this call may raise an error.
prec_aggregate, prec_items = precision_metric.score(outputs, labels)
print("\nPrecision Results:")
print(f"  Aggregate: {prec_aggregate}")
print(f"  Per-item: {prec_items}")

---
### Creating Custom Metrics

To create a new metric, define a class that inherits from `MetricBase` and register it in the metric registry with the `@MetricRegistry.register()` decorator. The class must define a `score` method that takes a list of outputs and a list of labels and returns aggregate and item scores. The metric registry ensures that no two metrics share the same name and allows metrics to be referenced by name as strings.

In [None]:
# Example: Creating a custom exact match metric
from scorebook.metrics import MetricBase, MetricRegistry
from typing import Any, Dict, List, Tuple

@MetricRegistry.register()
class ExactMatchMetric(MetricBase):
    """Custom metric that checks for exact string matches."""
    
    @staticmethod
    def score(outputs: List[Any], labels: List[Any]) -> Tuple[Dict[str, Any], List[Any]]:
        if len(outputs) != len(labels):
            raise ValueError("Number of outputs must match number of labels")
            
        # Calculate exact matches (case-sensitive)
        item_scores = [str(output).strip() == str(label).strip() 
                      for output, label in zip(outputs, labels)]
        
        # Calculate aggregate score
        exact_matches = sum(item_scores)
        total = len(outputs)
        aggregate_scores = {"exact_match": exact_matches / total if total > 0 else 0.0}
        
        return aggregate_scores, item_scores

# Test the custom metric
custom_metric = ExactMatchMetric()
test_outputs = ["Paris", "london", "Rome", "madrid"]
test_labels = ["Paris", "London", "Rome", "Madrid"]

custom_agg, custom_items = custom_metric.score(test_outputs, test_labels)
print("Custom ExactMatch Metric Results:")
print(f"  Aggregate: {custom_agg}")
print(f"  Per-item: {custom_items}")

# Access via registry
registry_metric = MetricRegistry.get("exactmatchmetric")
print(f"\nMetric from registry: {registry_metric.name}")
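
For readers curious how a decorator-based registry can enforce unique names, here is a minimal sketch. `SketchRegistry` is hypothetical and is not ScoreBook's `MetricRegistry`. In particular, keying on the lowercased class name is an assumption suggested by the `"exactmatchmetric"` lookup above.

```python
class SketchRegistry:
    """Hypothetical name-to-class registry, for illustration only."""

    _metrics: dict[str, type] = {}

    @classmethod
    def register(cls):
        def decorator(metric_cls: type) -> type:
            key = metric_cls.__name__.lower()  # assumed naming rule
            if key in cls._metrics:
                raise ValueError(f"Metric '{key}' is already registered")
            cls._metrics[key] = metric_cls
            return metric_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> type:
        return cls._metrics[name.lower()]


@SketchRegistry.register()
class ExactMatch:
    pass


print(SketchRegistry.get("exactmatch") is ExactMatch)  # True
```

Registering a second class named `ExactMatch` with this sketch raises a `ValueError`, which is the duplicate-name guarantee described in this section.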

---
### Evaluate

The `evaluate()` function is ScoreBook's central orchestrator that brings together datasets, models, and metrics to produce evaluation results. It handles the entire evaluation workflow automatically.

**Key Features:**

1. **Multi-Dataset Support**: Evaluate on multiple datasets in one call
2. **Hyperparameter Sweeping**: Test different model configurations
3. **Flexible Scoring**: Choose what level of detail you need
   - `"aggregate"`: Overall dataset scores only
   - `"item"`: Individual prediction scores only  
   - `"all"`: Both aggregate and per-item scores
4. **Progress Tracking**: Built-in progress bars for long evaluations
5. **Async Support**: Handles both synchronous and asynchronous inference functions

**Workflow:**
1. Normalizes input datasets and expands hyperparameter grids
2. For each dataset × hyperparameter combination:
   - Preprocesses items using the pipeline
   - Runs model inference 
   - Postprocesses outputs
   - Computes metric scores
3. Formats and returns results according to specified parameters

The function returns structured results that can be easily analyzed, saved, or visualized.
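
The grid expansion in step 1 of the workflow can be pictured as a Cartesian product over the listed values. The sketch below assumes that is how the grid is expanded. `expand_grid` is a hypothetical helper, and the order in which ScoreBook visits configurations may differ.

```python
from itertools import product


def expand_grid(grid: dict[str, list]) -> list[dict]:
    """Return one dict per combination of the listed values (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


configs = expand_grid({"temperature": [0.7, 0.9], "max_tokens": [50, 100]})
for config in configs:
    print(config)
# {'temperature': 0.7, 'max_tokens': 50}
# {'temperature': 0.7, 'max_tokens': 100}
# {'temperature': 0.9, 'max_tokens': 50}
# {'temperature': 0.9, 'max_tokens': 100}
```

Two values for each of two hyperparameters give four configurations, and each configuration is evaluated against every dataset.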

In [None]:
# Basic evaluate example with a mock inference pipeline
from scorebook import evaluate

# Create a small demo dataset
demo_dataset = EvalDataset.from_list(
    "demo_eval", 
    label="answer", 
    metrics=["accuracy"],
    data=[
        {"question": "What is 2+2?", "answer": "4"},
        {"question": "What is 3+3?", "answer": "6"},
        {"question": "What is 5+5?", "answer": "10"}
    ]
)

# Create a simple mock inference pipeline for demonstration
def demo_preprocessor(item: dict) -> str:
    return item["question"]

def demo_inference(processed_items: list[str], **hyperparams) -> list[str]:
    # Mock responses that partially match the expected answers
    mock_responses = ["4", "6", "wrong_answer"]  # Third one is intentionally wrong
    return mock_responses[:len(processed_items)]

def demo_postprocessor(output: str) -> str:
    return output.strip()

demo_pipeline = InferencePipeline(
    model="demo-model",
    preprocessor=demo_preprocessor,
    inference_function=demo_inference,
    postprocessor=demo_postprocessor
)

# Run basic evaluation
results = evaluate(
    demo_pipeline,
    demo_dataset,
    score_type="aggregate"
)

print("Basic Evaluation Results:")
print(results)

---
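### Hyperparameter Sweeping

Passing a `hyperparameters` dictionary of value lists to `evaluate` runs the evaluation once per combination of values. The inference function receives each combination as keyword arguments.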

In [None]:
# Example: Hyperparameter sweeping
hyperparams = {
    "temperature": [0.7, 0.9],
    "max_tokens": [50, 100]
}

# Modified inference function that uses hyperparameters
def param_aware_inference(processed_items: list[str], **hyperparams) -> list[str]:
    temp = hyperparams.get("temperature", 0.7)
    max_tokens = hyperparams.get("max_tokens", 50)
    
    # Mock different responses based on parameters
    if temp > 0.8:
        responses = ["4", "6", "10"]  # "High temperature" gives correct answers
    else:
        responses = ["4", "7", "11"]  # "Low temperature" gives some wrong answers
        
    return responses[:len(processed_items)]

param_pipeline = InferencePipeline(
    model="param-model",
    preprocessor=demo_preprocessor,
    inference_function=param_aware_inference,
    postprocessor=demo_postprocessor
)

# Run evaluation with hyperparameter sweep
sweep_results = evaluate(
    param_pipeline,
    demo_dataset,
    hyperparameters=hyperparams,
    score_type="all"
)

print("Hyperparameter Sweep Results:")
print(f"Number of configurations tested: {len(sweep_results['aggregate'])}")
for i, result in enumerate(sweep_results['aggregate']):
    print(f"Config {i+1}: {result}")

# Show multi-dataset evaluation
datasets = [demo_dataset, list_dataset]  # Use both datasets we created
multi_results = evaluate(param_pipeline, datasets, score_type="aggregate")
print(f"\nMulti-dataset results: {len(multi_results)} results")

---
### Evaluation Results

The `evaluate` function returns results in customizable formats controlled by two optional parameters:
- `return_type`: Controls the output format (`"dict"` or `"object"`)
- `score_type`: Controls which scores to include (`"aggregate"`, `"item"`, or `"all"`)

**Default Behavior:**

By default, `evaluate` returns a dictionary containing only aggregate scores (`return_type="dict"`, `score_type="aggregate"`).

**Dictionary Format:**

Returns aggregate scores as a simple dictionary structure.

**Object Format:**

Set `return_type="object"` to receive an `EvalResult` object with:
- `.aggregate_scores`: Aggregate scores as a flat dictionary
- `.item_scores`: Individual item scores as a flat dictionary

Both score dictionaries can be converted to pandas DataFrames.

**Export Options:**

The `EvalResult` class provides convenient export methods:
- `.to_dict()`: Returns the same dictionary structure as the default format
- `.to_json()`: Saves scores to a JSON file
- `.to_csv()`: Saves scores to a CSV file
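
The dictionary format can also be processed directly with the standard library. The sketch below reuses the result dictionary printed in the Getting Started section (`score_type="all"`), recomputes the aggregate from the per-sample rows as a consistency check, and writes those rows as CSV text. It does not use the `EvalResult` export methods, whose exact signatures are not shown in this notebook.

```python
import csv
import io

# Result dictionary copied from the Getting Started output (score_type="all").
results = {
    "aggregate": [{"dataset_name": "test_eval", "accuracy": 0.0}],
    "per_sample": [
        {"item_id": 0, "dataset_name": "test_eval", "accuracy": False},
        {"item_id": 1, "dataset_name": "test_eval", "accuracy": False},
        {"item_id": 2, "dataset_name": "test_eval", "accuracy": False},
    ],
}

# Recompute the aggregate from the per-sample rows as a consistency check.
rows = results["per_sample"]
recomputed = sum(row["accuracy"] for row in rows) / len(rows)
print(recomputed == results["aggregate"][0]["accuracy"])  # True

# Write the per-sample rows as CSV text (one header line, one line per item).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```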