# ScoreBook Showcase
This notebook demonstrates how to use Trismik's ScoreBook library to evaluate large language models. ScoreBook lets you evaluate LLMs against any dataset from Hugging Face or your own, and calculate scores for metrics such as accuracy, precision, recall, or F1. It supports intuitive and efficient LLM experimentation with features such as grouped evaluations, batch inference, and sweeps across a grid of hyperparameter configurations.

---
## Getting Started
This section shows how ScoreBook can evaluate a model of your choice by scoring it against a dataset. This basic example uses a model and a dataset hosted on Hugging Face.

In [None]:
from scorebook import EvalDataset, evaluate
from scorebook.types.inference_pipeline import InferencePipeline
import transformers

# Create an evaluation dataset from any Hugging Face dataset by specifying its path, label field, and split.
mmlu_pro = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy", split="validation")

# In this example we use a simple Hugging Face text-generation pipeline for inference (use any compatible model you like).
pipeline = transformers.pipeline("text-generation", model="microsoft/Phi-4-mini-instruct")

# Define pipeline components: preprocessor, inference function, and postprocessor
def preprocessor(item: dict) -> list[dict]:
    """Convert evaluation item to a chat-style message list."""
    # Chat-format input makes the pipeline return a message list,
    # which is the shape the postprocessor below indexes into.
    return [{"role": "user", "content": item["question"]}]

def inference_function(processed_items: list[list[dict]], **hyperparameters) -> list:
    """Run model inference on preprocessed items."""
    outputs = [pipeline(messages) for messages in processed_items]
    return outputs

def postprocessor(output) -> str:
    """Extract the final answer (the last message's content) from model output."""
    return output[0]["generated_text"][-1]["content"]

# Create an inference pipeline that handles preprocessing, inference, and postprocessing
inference_pipeline = InferencePipeline(
    model="microsoft/Phi-4-mini-instruct",
    preprocessor=preprocessor,
    inference_function=inference_function,
    postprocessor=postprocessor,
)

# Run the evaluation: ScoreBook calls your inference pipeline, compares predictions to labels, and returns results.
evaluation_results = evaluate(
    inference_pipeline,  # the inference pipeline
    mmlu_pro             # the evaluation dataset
)

---
## ScoreBook Components
When working with ScoreBook, there are five core components to consider:
- Evaluation Datasets
- Inference Pipelines
- Metrics
- The Evaluate Function
- Evaluation Results

The typical ScoreBook workflow involves:
1) Creating an evaluation dataset from local files or from Hugging Face
2) Creating an inference pipeline responsible for returning a model output for each item in the evaluation dataset
3) Assigning metrics to be used in scoring the model
4) Calling the `evaluate` function with the inference pipeline and the dataset (which carries its metrics) to generate scores

---
### Evaluation Datasets

Evaluation datasets are the foundation of model evaluation in ScoreBook. The `EvalDataset` class provides a unified interface for loading datasets from multiple sources and associating them with evaluation metrics.

**Key Features:**
- Load from the Hugging Face Hub, CSV files, JSON files, or Python lists
- Specify which field contains the ground truth labels
- Associate evaluation metrics with the dataset
- Built on top of Hugging Face datasets for compatibility

**Supported Data Sources:**

1. **Hugging Face Hub**: Load any public dataset
2. **CSV Files**: Load from local CSV files
3. **JSON Files**: Support both flat and nested JSON structures
4. **Python Lists**: Create a dataset directly from an in-memory list of dictionaries
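The CSV and JSON loaders read files from disk, so it can help to create small fixture files first. The sketch below uses only the standard library; the column names, the helper `write_fixtures`, and the split-keyed JSON layout (`{"test": [...]}`) are assumptions made for illustration, not a documented ScoreBook format.

```python
import csv
import json
import tempfile
from pathlib import Path

rows = [
    {"question": "What is 2+2?", "ground_truth": "4"},
    {"question": "Capital of France?", "ground_truth": "Paris"},
]

def write_fixtures(directory: Path) -> tuple[Path, Path]:
    """Write a small CSV file and a split-keyed JSON file into `directory`."""
    csv_path = directory / "data.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["question", "ground_truth"])
        writer.writeheader()
        writer.writerows(rows)

    # Assumed nested layout: one top-level key per split.
    json_path = directory / "data.json"
    nested = {
        "test": [{"question": r["question"], "answer": r["ground_truth"]} for r in rows]
    }
    json_path.write_text(json.dumps(nested, indent=2), encoding="utf-8")
    return csv_path, json_path

# A temporary directory keeps the sketch side-effect free.
csv_path, json_path = write_fixtures(Path(tempfile.mkdtemp()))
print(csv_path.name, json_path.name)
```

Passing `Path(".")` instead of a temporary directory would place the files in the working directory under the names `data.csv` and `data.json`.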

In [None]:
from pathlib import Path

# Example 1: Load from the Hugging Face Hub
mmlu_pro = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy")
print(mmlu_pro)

# Example 2: Load from CSV (runs only if data.csv exists)
if Path("data.csv").exists():
    csv_dataset = EvalDataset.from_csv("data.csv", label="ground_truth", metrics=["accuracy", "precision"])
    print(f"\nCSV Dataset: \n{csv_dataset}")

# Example 3: Load from JSON (runs only if data.json exists)
if Path("data.json").exists():
    json_dataset = EvalDataset.from_json("data.json", label="answer", metrics="accuracy", split="test")
    print(f"\nJSON Dataset: \n{json_dataset}")

# Example 4: Create from a Python list
data = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"}
]
list_dataset = EvalDataset.from_list("demo", label="answer", metrics="accuracy", data=data)
print(f"\nList Dataset: \n{list_dataset}")

---
### Inference Pipelines

The `InferencePipeline` is ScoreBook's modular approach to model inference. It separates the inference process into three distinct, customizable stages, making it easy to work with different models and data formats.

**Pipeline Stages:**

1. **Preprocessor**: Converts dataset items into the format expected by your model
2. **Inference Function**: Performs the actual model inference (can be sync or async)
3. **Postprocessor**: Extracts the final prediction from the model's raw output

**Preprocessor**

The preprocessor function within an inference pipeline converts each item within an evaluation dataset into the format expected by the model. This may include transformations such as combining dataset fields, adding context, and formatting messages within a message dict structure.

When writing a preprocessor function, it must accept a single evaluation dataset item as a parameter and return an output that can be passed into a model.

```python
import string

def preprocessor(eval_item: dict) -> list:
    """Convert evaluation item to model input format."""
    prompt = f"{eval_item['question']}\nOptions:\n" + "\n".join(
        [f"{letter} : {choice}" for letter, choice in zip(string.ascii_uppercase, eval_item["options"])])

    # The system message contains the instructions for the model. We ask the model to adhere strictly to the instructions
    messages = [
        {
            "role": "system",
            "content": """
                Answer the question you are given using only a single letter (for example, 'A').
            """,
        },
        {"role": "user", "content": prompt},
    ]
    return messages
```

**Inference Function**

The inference function is responsible for taking a list of model inputs, generated by the preprocessor function, and returning a list of model outputs.

```python
from typing import Any

import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

def inference_function(processed_items: list[list], **hyperparameters) -> list[Any]:
    """Run model inference on preprocessed items."""
    outputs = []
    for messages in processed_items:
        output = pipeline(messages)
        outputs.append(output)
    return outputs
```
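Because the inference function may also be asynchronous, the following sketch shows one possible async variant. `fake_model_call` is a hypothetical stand-in for a real async client, and how ScoreBook awaits such a function is not shown here; the sketch only illustrates the shape (list in, list out, order preserved).

```python
import asyncio
from typing import Any

async def fake_model_call(messages: list[dict], temperature: float) -> dict[str, Any]:
    """Hypothetical stand-in for an async API client call."""
    await asyncio.sleep(0)  # yield control, as a network request would
    return {"content": f"echo: {messages[-1]['content']}", "temperature": temperature}

async def async_inference_function(
    processed_items: list[list[dict]], **hyperparameters: Any
) -> list[dict[str, Any]]:
    """Run all requests concurrently and return outputs in input order."""
    temperature = hyperparameters.get("temperature", 0.7)
    tasks = [fake_model_call(messages, temperature) for messages in processed_items]
    # asyncio.gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)

items = [
    [{"role": "user", "content": "What is 2+2?"}],
    [{"role": "user", "content": "Capital of France?"}],
]
outputs = asyncio.run(async_inference_function(items, temperature=0.2))
print(outputs)
```

Inside a notebook an event loop is usually already running, so `await async_inference_function(items, temperature=0.2)` would likely replace the `asyncio.run` call there.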

**Postprocessor**

The postprocessor converts each raw model output into the final prediction that is compared against the label. This may include extracting the last generated message from a response that also echoes the model input, as well as string parsing.

```python
from typing import Any

def postprocessor(model_output: Any) -> str:
    """Extract the final answer from model output."""
    return str(model_output[0]["generated_text"][-1]["content"])
```
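When the model does not reply with a bare letter, the string-parsing step can be sketched as below. The helper `extract_letter`, the `A` to `J` option range, and the two accepted reply shapes are assumptions chosen for illustration; a real evaluation would likely need rules tuned to the model's actual output.

```python
import re
from typing import Any

VALID_LETTERS = "ABCDEFGHIJ"  # assumption: at most ten answer options
# Matches phrases like "Answer: B" or "the answer is (D)"; the letter itself stays case-sensitive.
ANSWER_PHRASE = re.compile(r"(?i:answer)\W+(?:(?i:is)\W+)?([A-J])\b")

def extract_letter(text: str) -> str:
    """Return the answer letter found in `text`, or an empty string if none is found."""
    cleaned = text.strip().strip("'\"().")
    if len(cleaned) == 1 and cleaned in VALID_LETTERS:
        return cleaned  # the reply is a bare letter, possibly quoted or bracketed
    match = ANSWER_PHRASE.search(text)
    return match.group(1) if match else ""

def parsing_postprocessor(model_output: Any) -> str:
    """Take the last message of a chat-style output and parse the letter from it."""
    return extract_letter(str(model_output[0]["generated_text"][-1]["content"]))

example_output = [{
    "generated_text": [
        {"role": "user", "content": "Which option is correct?"},
        {"role": "assistant", "content": "The answer is (D)."},
    ]
}]
print(parsing_postprocessor(example_output))  # D
```

Returning an empty string for unparseable replies means they are simply scored as incorrect rather than raising an error mid-evaluation.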

**Key Benefits:**
- **Flexibility**: Works with any model (OpenAI API, Hugging Face, local models)
- **Reusability**: Components can be mixed and matched across different evaluations
- **Batch Processing**: Handles multiple items efficiently
- **Hyperparameter Support**: Pass model-specific parameters during evaluation
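Since the inference function receives the full list of preprocessed items, batching can be handled inside it. The sketch below is one way to do this in plain Python; `make_batched_inference`, `fake_model`, and the `batch_size` hyperparameter name are hypothetical and not part of ScoreBook.

```python
from typing import Any, Callable, Iterator

def chunked(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def make_batched_inference(
    model_call: Callable[[list], list], default_batch_size: int = 8
) -> Callable[..., list]:
    """Wrap a batch-capable model call into an inference-function-shaped callable."""
    def inference_function(processed_items: list, **hyperparameters: Any) -> list:
        # Hypothetical hyperparameter: lets a sweep vary the batch size.
        batch_size = hyperparameters.get("batch_size", default_batch_size)
        outputs: list = []
        for batch in chunked(processed_items, batch_size):
            outputs.extend(model_call(batch))  # keeps outputs aligned with inputs
        return outputs
    return inference_function

batch_sizes_seen = []

def fake_model(batch: list[str]) -> list[str]:
    """Stand-in model that records batch sizes and upper-cases its inputs."""
    batch_sizes_seen.append(len(batch))
    return [text.upper() for text in batch]

batched_inference = make_batched_inference(fake_model, default_batch_size=2)
result = batched_inference(["a", "b", "c", "d", "e"])
print(result, batch_sizes_seen)
```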

In [None]:
# Basic pipeline example
def simple_preprocessor(item: dict) -> str:
    """Convert evaluation item to model input format."""
    return f"Question: {item['question']}\nAnswer:"

def mock_inference_function(processed_items: list[str], **hyperparameters) -> list:
    """Mock model inference - in practice, use your actual model here."""
    # This is a placeholder - replace with actual model calls
    mock_outputs = [f"Mock answer for: {item[:50]}..." for item in processed_items]
    return mock_outputs

def simple_postprocessor(output) -> str:
    """Extract the final answer from model output."""
    return output.strip().split('\n')[0]  # Get first line of response

# Create the pipeline (named mock_pipeline so it does not shadow the transformers pipeline defined earlier)
mock_pipeline = InferencePipeline(
    model="mock-model",
    preprocessor=simple_preprocessor,
    inference_function=mock_inference_function,
    postprocessor=simple_postprocessor,
)

print("Created inference pipeline with components:")
print(f"- Model: {mock_pipeline.model}")
print("- Preprocessor, inference function, and postprocessor configured")

---
### Metrics

Metrics in ScoreBook quantify how well your model performs by comparing predictions against ground truth labels. The framework provides built-in metrics and supports custom metric creation.

**Built-in Metrics:**

1. **Accuracy**: Measures the percentage of correct predictions
2. **Precision**: Measures the fraction of positive predictions that are actually correct

**Metric Architecture:**
- All metrics inherit from the abstract `MetricBase` class
- Implement a `score()` method that returns both aggregate and per-item scores
- Automatically registered via `@MetricRegistry.register()` decorator
- Can be referenced by string name or class directly
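The decorator-based registration described above follows a common registry pattern. The sketch below is a generic, standalone illustration of that pattern; `SketchRegistry` and its lowercase-class-name lookup rule are assumptions and do not reproduce ScoreBook's actual `MetricRegistry` implementation.

```python
from typing import Callable, Union

class SketchRegistry:
    """Minimal registry: maps a lowercase class name to the metric class."""
    _metrics: dict[str, type] = {}

    @classmethod
    def register(cls) -> Callable[[type], type]:
        def decorator(metric_cls: type) -> type:
            # Assumed naming rule: register under the lowercase class name.
            cls._metrics[metric_cls.__name__.lower()] = metric_cls
            return metric_cls  # return the class unchanged so it stays usable directly
        return decorator

    @classmethod
    def get(cls, name_or_class: Union[str, type]) -> object:
        """Return a metric instance from either a string name or a class."""
        if isinstance(name_or_class, type):
            return name_or_class()
        try:
            return cls._metrics[name_or_class.lower()]()
        except KeyError:
            raise ValueError(f"Unknown metric: {name_or_class!r}") from None

@SketchRegistry.register()
class ExactMatch:
    name = "exactmatch"

by_name = SketchRegistry.get("ExactMatch")   # lookup by string
by_class = SketchRegistry.get(ExactMatch)    # lookup by class
print(type(by_name).__name__, type(by_class).__name__)
```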

**Usage Patterns:**
- Specify metrics when creating datasets: `metrics=["accuracy", "precision"]`
- Mix built-in and custom metrics: `metrics=[Accuracy, CustomF1Score]`
- Access via registry: `MetricRegistry.get("accuracy")`

Each metric returns both aggregate scores (summary statistics) and item-level scores (individual predictions) for detailed analysis.
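To make the aggregate versus item-level distinction concrete, here is a plain-Python sketch of an accuracy computation that returns both. `accuracy_sketch` is an illustrative function, not the library's `Accuracy` class, and the exact structure ScoreBook returns may differ.

```python
from typing import Any

def accuracy_sketch(outputs: list[Any], labels: list[Any]) -> tuple[dict[str, float], list[bool]]:
    """Return ({'accuracy': fraction correct}, per-item correctness flags)."""
    if len(outputs) != len(labels):
        raise ValueError("Number of outputs must match number of labels")
    item_scores = [output == label for output, label in zip(outputs, labels)]
    aggregate = {"accuracy": sum(item_scores) / len(item_scores) if item_scores else 0.0}
    return aggregate, item_scores

aggregate, items = accuracy_sketch(
    ["A", "B", "A", "C", "A"],
    ["A", "A", "A", "C", "B"],
)
print(aggregate)  # {'accuracy': 0.6}
print(items)      # [True, False, True, True, False]
```

Three of the five predictions match their labels, so the aggregate score is 3/5 = 0.6, while the item-level list shows exactly which predictions were wrong.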

In [None]:
# Example: Using built-in metrics
from scorebook.metrics import Accuracy, Precision

# Create metric instances
accuracy_metric = Accuracy()
precision_metric = Precision()

# Example data
outputs = ["A", "B", "A", "C", "A"]
labels = ["A", "A", "A", "C", "B"]

# Calculate accuracy scores
acc_aggregate, acc_items = accuracy_metric.score(outputs, labels)
print("Accuracy Results:")
print(f"  Aggregate: {acc_aggregate}")
print(f"  Per-item: {acc_items}")

# Calculate precision scores  
prec_aggregate, prec_items = precision_metric.score(outputs, labels)
print("\nPrecision Results:")
print(f"  Aggregate: {prec_aggregate}")
print(f"  Per-item: {prec_items}")

In [None]:
# Example: Creating a custom metric
from scorebook.metrics import MetricBase, MetricRegistry
from typing import Any, Dict, List, Tuple

@MetricRegistry.register()
class ExactMatchMetric(MetricBase):
    """Custom metric that checks for exact string matches."""
    
    @staticmethod
    def score(outputs: List[Any], labels: List[Any]) -> Tuple[Dict[str, Any], List[Any]]:
        if len(outputs) != len(labels):
            raise ValueError("Number of outputs must match number of labels")
            
        # Calculate exact matches (case-sensitive)
        item_scores = [str(output).strip() == str(label).strip() 
                      for output, label in zip(outputs, labels)]
        
        # Calculate aggregate score
        exact_matches = sum(item_scores)
        total = len(outputs)
        aggregate_scores = {"exact_match": exact_matches / total if total > 0 else 0.0}
        
        return aggregate_scores, item_scores

# Test the custom metric
custom_metric = ExactMatchMetric()
test_outputs = ["Paris", "london", "Rome", "madrid"]
test_labels = ["Paris", "London", "Rome", "Madrid"]

custom_agg, custom_items = custom_metric.score(test_outputs, test_labels)
print("Custom ExactMatch Metric Results:")
print(f"  Aggregate: {custom_agg}")
print(f"  Per-item: {custom_items}")

# Access via registry
registry_metric = MetricRegistry.get("exactmatchmetric")
print(f"\nMetric from registry: {registry_metric.name}")

---
### Evaluate

The `evaluate()` function is ScoreBook's central orchestrator that brings together datasets, inference pipelines, and metrics to produce evaluation results. It handles the entire evaluation workflow automatically.

**Key Features:**

1. **Multi-Dataset Support**: Evaluate on multiple datasets in one call
2. **Hyperparameter Sweeping**: Test different model configurations
3. **Flexible Scoring**: Choose what level of detail you need
   - `"aggregate"`: Overall dataset scores only
   - `"item"`: Individual prediction scores only  
   - `"all"`: Both aggregate and per-item scores
4. **Progress Tracking**: Built-in progress bars for long evaluations
5. **Async Support**: Handles both synchronous and asynchronous inference functions

**Workflow:**
1. Normalizes input datasets and expands hyperparameter grids
2. For each dataset × hyperparameter combination:
   - Preprocesses items using the pipeline
   - Runs model inference 
   - Postprocesses outputs
   - Computes metric scores
3. Formats and returns results according to specified parameters

The function returns structured results that can be easily analyzed, saved, or visualized.
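The workflow above can be sketched in plain Python. This is a simplified illustration under stated assumptions: `expand_grid` and `run_sweep` are hypothetical helpers, only accuracy is computed, and ScoreBook's own `evaluate()` may order configurations and structure results differently.

```python
from itertools import product
from typing import Any, Callable

def expand_grid(grid: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Expand {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}]."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]

def run_sweep(
    items: list[dict],
    label_field: str,
    preprocessor: Callable[[dict], Any],
    inference_function: Callable[..., list],
    postprocessor: Callable[[Any], str],
    grid: dict[str, list[Any]],
) -> list[dict[str, Any]]:
    """Preprocess, infer, postprocess, and score once per hyperparameter configuration."""
    labels = [item[label_field] for item in items]
    results = []
    for config in expand_grid(grid):
        inputs = [preprocessor(item) for item in items]
        raw_outputs = inference_function(inputs, **config)
        predictions = [postprocessor(output) for output in raw_outputs]
        correct = sum(pred == label for pred, label in zip(predictions, labels))
        results.append({**config, "accuracy": correct / len(labels) if labels else 0.0})
    return results

def mock_inference(inputs: list[str], **hyperparameters: Any) -> list[str]:
    """Mock model: answers correctly only when temperature is above 0.8."""
    answers = ["4", "6", "10"] if hyperparameters.get("temperature", 0.7) > 0.8 else ["4", "7", "11"]
    return answers[:len(inputs)]

data = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3+3?", "answer": "6"},
    {"question": "What is 5+5?", "answer": "10"},
]
sweep = run_sweep(
    data, "answer",
    preprocessor=lambda item: item["question"],
    inference_function=mock_inference,
    postprocessor=str.strip,
    grid={"temperature": [0.7, 0.9], "max_tokens": [50, 100]},
)
for row in sweep:
    print(row)
```

A grid with two values for each of two hyperparameters expands to four configurations, so the sweep produces four result rows for this single dataset.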

In [None]:
# Basic evaluate example using our previous components
from scorebook import evaluate

# Build a small dataset for this demonstration
demo_dataset = EvalDataset.from_list(
    "demo_eval", 
    label="answer", 
    metrics=["accuracy"],
    data=[
        {"question": "What is 2+2?", "answer": "4"},
        {"question": "What is 3+3?", "answer": "6"},
        {"question": "What is 5+5?", "answer": "10"}
    ]
)

# Create a simple mock inference pipeline for demonstration
def demo_preprocessor(item: dict) -> str:
    return item["question"]

def demo_inference(processed_items: list[str], **hyperparams) -> list[str]:
    # Mock responses that partially match the expected answers
    mock_responses = ["4", "6", "wrong_answer"]  # Third one is intentionally wrong
    return mock_responses[:len(processed_items)]

def demo_postprocessor(output: str) -> str:
    return output.strip()

demo_pipeline = InferencePipeline(
    model="demo-model",
    preprocessor=demo_preprocessor,
    inference_function=demo_inference,
    postprocessor=demo_postprocessor
)

# Run basic evaluation
results = evaluate(
    demo_pipeline,
    demo_dataset,
    score_type="aggregate"
)

print("Basic Evaluation Results:")
print(results)

In [None]:
# Example: Hyperparameter sweeping
hyperparams = {
    "temperature": [0.7, 0.9],
    "max_tokens": [50, 100]
}

# Modified inference function that uses hyperparameters
def param_aware_inference(processed_items: list[str], **hyperparams) -> list[str]:
    temp = hyperparams.get("temperature", 0.7)
    max_tokens = hyperparams.get("max_tokens", 50)  # read for illustration; the mock output ignores it
    
    # Mock different responses based on parameters
    if temp > 0.8:
        responses = ["4", "6", "10"]  # "High temperature" gives correct answers
    else:
        responses = ["4", "7", "11"]  # "Low temperature" gives some wrong answers
        
    return responses[:len(processed_items)]

param_pipeline = InferencePipeline(
    model="param-model",
    preprocessor=demo_preprocessor,
    inference_function=param_aware_inference,
    postprocessor=demo_postprocessor
)

# Run evaluation with hyperparameter sweep
sweep_results = evaluate(
    param_pipeline,
    demo_dataset,
    hyperparameters=hyperparams,
    score_type="all"
)

print("Hyperparameter Sweep Results:")
print(f"Number of configurations tested: {len(sweep_results['aggregate'])}")
for i, result in enumerate(sweep_results['aggregate']):
    print(f"Config {i+1}: {result}")

# Show multi-dataset evaluation
datasets = [demo_dataset, list_dataset]  # Use both datasets we created
multi_results = evaluate(param_pipeline, datasets, score_type="aggregate")
print(f"\nMulti-dataset results: {len(multi_results)} results")

---
### Evaluation Results

ScoreBook's evaluation results are structured data containers that organize and present your model's performance in multiple formats. The `EvalResult` class provides comprehensive access to both aggregate metrics and individual prediction details.

**Key Properties:**

1. **Aggregate Scores**: Summary statistics across the entire dataset
2. **Item Scores**: Per-prediction detailed results
3. **Export Capabilities**: Multiple output formats for analysis

**Multi-Format Support:**
- **Dictionary Format**: Easy programmatic access and JSON serialization
- **CSV Format**: Spreadsheet-compatible for statistical analysis
- **Structured Objects**: Rich Python objects with methods and properties

**Analysis Workflow:**
1. Run evaluation to get `EvalResult` objects
2. Access aggregate scores for high-level performance overview  
3. Drill into item scores to understand model behavior on specific examples
4. Export results for visualization, reporting, or further analysis
5. Compare results across different models, datasets, or hyperparameters
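The export and drill-down steps can be illustrated with the standard library alone. The dictionary below is a hypothetical result layout (the `aggregate` and `per_sample` keys and all field names are assumptions for this sketch), so treat it as a pattern rather than the library's export API.

```python
import csv
import io
import json

# Hypothetical results, shaped as a dict with aggregate and per-sample rows.
hypothetical_results = {
    "aggregate": [{"dataset": "demo_eval", "accuracy": 0.5}],
    "per_sample": [
        {"dataset": "demo_eval", "item_id": 0, "output": "4", "label": "4", "accuracy": True},
        {"dataset": "demo_eval", "item_id": 1, "output": "7", "label": "6", "accuracy": False},
    ],
}

def rows_to_csv(rows: list[dict]) -> str:
    """Serialize a list of flat dictionaries to CSV text (header taken from the first row)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

per_sample_csv = rows_to_csv(hypothetical_results["per_sample"])
results_json = json.dumps(hypothetical_results, indent=2)

# Drill into item scores: keep only the examples the model got wrong.
failures = [row for row in hypothetical_results["per_sample"] if not row["accuracy"]]

print(per_sample_csv)
print(failures)
```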

In [None]:
# Working with evaluation results
# Run evaluation with detailed scoring
detailed_results = evaluate(
    demo_pipeline,
    demo_dataset, 
    score_type="all"
)

print("Result Structure:")
print(f"Keys: {list(detailed_results.keys())}")
print(f"Number of aggregate results: {len(detailed_results['aggregate'])}")
print(f"Number of per-sample results: {len(detailed_results['per_sample'])}")

print("\nAggregate Scores:")
for result in detailed_results['aggregate']:
    print(f"  {result}")

print("\nPer-Sample Results:")
for i, item in enumerate(detailed_results['per_sample'][:3]):  # Show first 3
    print(f"  Item {i}: {item}")

# Demonstrate accessing specific scores
if detailed_results['aggregate']:
    first_result = detailed_results['aggregate'][0]
    accuracy_score = first_result.get('accuracy', 'N/A')
    print(f"\nOverall Accuracy: {accuracy_score}")

# Show different score types
aggregate_only = evaluate(demo_pipeline, demo_dataset, score_type="aggregate")
items_only = evaluate(demo_pipeline, demo_dataset, score_type="item")

print(f"\nAggregate-only results: {len(aggregate_only)} items")
print(f"Items-only results: {len(items_only)} items")