# ScoreBook Showcase
This notebook demonstrates how to use Trismik's ScoreBook library to evaluate large language models. ScoreBook lets you evaluate LLMs on any dataset from Hugging Face or on your own data, and calculate scores for metrics such as accuracy, precision, recall, or F1. It supports intuitive and efficient LLM experimentation with features such as grouped evaluations, batch inference, and sweeps across a grid of hyperparameter configurations.

---
## Getting Started
This section shows how ScoreBook can be used to evaluate a model of your choice by scoring it against a dataset. In this basic example, we use a model and a dataset hosted on Hugging Face.

In [None]:
from scorebook import EvalDataset, evaluate
from scorebook.types.inference_pipeline import InferencePipeline
import transformers

# Create an evaluation dataset from any Hugging Face dataset by specifying its path, label field, metrics, and split.
mmlu_pro = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy", split="validation")

# In this example we use a simple Hugging Face text-generation pipeline for inference (use any compatible model you like).
pipeline = transformers.pipeline("text-generation", model="microsoft/Phi-4-mini-instruct")

# Define pipeline components: preprocessor, inference function, and postprocessor
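def preprocessor(item: dict) -> list:
    """Convert an evaluation item to chat-style model input."""
    # Chat-format input makes the pipeline return a list of messages,
    # which the postprocessor below indexes with [-1]["content"].
    return [{"role": "user", "content": item["question"]}]

def inference_function(processed_items: list, hyperparameters: dict) -> list:
    """Run model inference on preprocessed items."""
    outputs = [pipeline(item) for item in processed_items]
    return outputs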

def postprocessor(output) -> str:
    """Extract the final answer from model output."""
    return output[0]["generated_text"][-1]["content"]

# Create an inference pipeline that handles preprocessing, inference, and postprocessing
inference_pipeline = InferencePipeline(
    model="microsoft/Phi-4-mini-instruct",
    preprocessor=preprocessor,
    inference_function=inference_function,
    postprocessor=postprocessor,
)

# Run the evaluation: ScoreBook calls your inference pipeline, compares predictions to labels, and returns results.
evaluation_results = evaluate(
    inference_pipeline,  # the inference pipeline
    mmlu_pro             # the evaluation dataset
)

---
## ScoreBook Components
When working with ScoreBook, there are five core components to consider:
- Evaluation Datasets
- Inference Pipelines
- Metrics
- The Evaluate Function
- Evaluation Results

The typical ScoreBook workflow involves:
1) Creating an evaluation dataset from local files or from Hugging Face
2) Creating an inference pipeline responsible for returning a model output for each item in the evaluation dataset
3) Assigning metrics to be used in scoring the model
4) Using the `evaluate` function with an inference pipeline, dataset, and metrics to generate scores
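
The data flow behind these steps can be sketched in plain Python, without ScoreBook itself. The toy dataset, the `fake_model` stub, and the exact-match scoring below are illustrative assumptions rather than ScoreBook's implementation; they only show how dataset items become predictions and then scores.

```python
# Toy stand-ins for a dataset and a model (hypothetical, for illustration only).
toy_dataset = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "3 * 3 = ?", "answer": "9"},
]

def fake_model(prompt: str) -> str:
    """Stub model: answers from a lookup table and gets the last item wrong."""
    canned = {"2 + 2 = ?": " 4\n", "Capital of France?": " Paris\n", "3 * 3 = ?": " 6\n"}
    return canned[prompt]

def preprocessor(item: dict) -> str:
    """Turn a dataset item into a model input."""
    return item["question"]

def inference_function(processed_items: list, hyperparameters: dict) -> list:
    """Produce one raw output per model input."""
    return [fake_model(prompt) for prompt in processed_items]

def postprocessor(output: str) -> str:
    """Clean a raw output into a prediction."""
    return output.strip()

# Dataset items -> model inputs -> raw outputs -> predictions
inputs = [preprocessor(item) for item in toy_dataset]
raw_outputs = inference_function(inputs, {})
predictions = [postprocessor(output) for output in raw_outputs]

# Compare predictions with labels to get per-item and aggregate scores
labels = [item["answer"] for item in toy_dataset]
item_scores = [pred == label for pred, label in zip(predictions, labels)]
accuracy = sum(item_scores) / len(item_scores)
print(f"accuracy = {accuracy:.2f}")
```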

---
### Evaluation Datasets

Evaluation datasets are the foundation of model evaluation in ScoreBook. The `EvalDataset` class provides a unified interface for loading datasets from multiple sources and associating them with evaluation metrics.

**Key Features:**
- Load from the Hugging Face Hub, CSV files, JSON files, or Python lists
- Specify which field contains the ground truth labels
- Associate evaluation metrics with the dataset
- Built on top of Hugging Face datasets for compatibility

**Supported Data Sources:**

1. **Hugging Face Hub**: Load any public dataset
```python
dataset = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy")
```
2. **CSV Files**: Load from local CSV files
```python
dataset = EvalDataset.from_csv("data.csv", label="ground_truth", metrics=["accuracy", "precision"])
```
3. **JSON Files**: Supports both flat and nested JSON structures
```python
dataset = EvalDataset.from_json("data.json", label="answer", metrics="accuracy", split="test")
```
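
For local files, it can help to see what a compatible file might look like. The sketch below writes a small CSV with Python's standard `csv` module and reads it back. The column names `question` and `ground_truth` are illustrative assumptions chosen to match the `label="ground_truth"` argument above; the exact columns ScoreBook expects are not specified in this notebook.

```python
import csv
import os
import tempfile

rows = [
    {"question": "2 + 2 = ?", "ground_truth": "4"},
    {"question": "Capital of France?", "ground_truth": "Paris"},
]

# The temporary directory is removed automatically when the `with` block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")

    # Write a header row followed by one row per evaluation item.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "ground_truth"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to confirm every row carries the label column.
    with open(csv_path, newline="", encoding="utf-8") as f:
        loaded = [dict(row) for row in csv.DictReader(f)]

print(loaded[0]["ground_truth"])
```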

In [None]:
from scorebook import EvalDataset
from scorebook.types.inference_pipeline import InferencePipeline

# Create an evaluation dataset from any Hugging Face dataset by specifying its path, label field, metrics, and split.
mmlu_pro = EvalDataset.from_huggingface("TIGER-Lab/MMLU-Pro", label="answer", metrics="accuracy", split="validation")
print(mmlu_pro)

---
### Inference Pipelines

The `InferencePipeline` is ScoreBook's modular approach to model inference. It separates the inference process into three distinct, customizable stages, making it easy to work with different models and data formats.

**Pipeline Stages:**

1. **Pre-processor**: Converts dataset items into the format expected by your model
2. **Inference Function**: Performs the actual model inference (can be sync or async)
3. **Post-processor**: Extracts the final prediction from the model's raw output

**Pre-processor**

The pre-processor function converts each item in an evaluation dataset into the format expected by the model. This may include transformations such as combining dataset fields, adding context, and formatting messages within a message dict structure.

A pre-processor function must accept a single evaluation dataset item as a parameter and return an output that can be passed to the model.

Example Pre-Processor Function:
```python
import string

def preprocessor(eval_item: dict) -> list:
    """Convert evaluation item to model input format."""
    prompt = f"{eval_item['question']}\nOptions:\n" + "\n".join(
        [f"{letter} : {choice}" for letter, choice in zip(string.ascii_uppercase, eval_item["options"])]
    )

    messages = [
        {
            "role": "system",
            "content": """
                Answer the question you are given using only a single letter (for example, 'A').
                Please adhere strictly to the instructions.
            """,
        },
        {"role": "user", "content": prompt},
    ]

    return messages
```

**Inference Function**

The inference function is responsible for taking a list of model inputs, generated by the pre-processor function, and returning a list of model outputs.
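
Since the inference function may be synchronous or asynchronous, an async variant might look like the sketch below. `call_model` is a hypothetical stand-in for a real API client, and the `(processed_items, hyperparameters)` signature follows the examples in this notebook; how ScoreBook itself schedules the coroutine is not shown here.

```python
import asyncio

async def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical async model call; a real client would do network I/O here."""
    await asyncio.sleep(0)  # yield control to the event loop, as a real request would
    return f"echo: {prompt}"

async def inference_function(processed_items: list, hyperparameters: dict) -> list:
    """Send all items concurrently and return outputs in input order."""
    tasks = [call_model(item, **hyperparameters) for item in processed_items]
    return await asyncio.gather(*tasks)

outputs = asyncio.run(inference_function(["a", "b"], {"temperature": 0.7}))
print(outputs)
```

Inside a Jupyter cell an event loop is already running, so call `await inference_function(...)` there instead of `asyncio.run(...)`.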

**Post-processor**

The post-processor function converts each raw model output into a final prediction. This may include extracting the final generated message from a response that also contains the model input, as well as string parsing.
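
As an illustration of the string parsing step, a post-processor for multiple-choice answers might pull out a single option letter. This is a sketch that assumes the raw output is a plain string and that options are labeled with uppercase letters A to J; `extract_choice` is a hypothetical helper name, and the pattern should be adapted to your own output format.

```python
import re
from typing import Optional

def extract_choice(output: str) -> Optional[str]:
    """Return the first standalone uppercase option letter (A-J), or None."""
    match = re.search(r"\b([A-J])\b", output.strip())
    return match.group(1) if match else None

print(extract_choice("The answer is B."))
print(extract_choice("  D\n"))
print(extract_choice("no letter here"))
```

The pattern is deliberately simple: a reply that begins with the word "A" or "I" would be read as an option, so stricter prompts or patterns may be needed for free-form outputs.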

**Running an Inference Pipeline**

**Example Implementation:**

In [None]:
def preprocessor(item: dict) -> str:
    """Convert evaluation item to model input format."""
    return f"Question: {item['question']}\nAnswer:"

def inference_function(processed_items: list[str], hyperparameters: dict) -> list:
    """Run model inference on preprocessed items."""
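    # Use any model or API here - OpenAI, Hugging Face, local models, etc.
    # `model` is a placeholder for your own model or client object; it is not defined in this notebook.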
    outputs = [model.generate(item, **hyperparameters) for item in processed_items]
    return outputs

def postprocessor(output) -> str:
    """Extract the final answer from model output."""
    return output.strip().split('\n')[0]  # Get first line of response

# Create the pipeline
pipeline = InferencePipeline(
    model="gpt-4",
    preprocessor=preprocessor,
    inference_function=inference_function,
    postprocessor=postprocessor,
)

---
### Metrics

Metrics in ScoreBook quantify how well your model performs by comparing predictions against ground truth labels. The framework provides built-in metrics and supports custom metric creation.

**Built-in Metrics:**

1. **Accuracy**: Measures the percentage of correct predictions
   ```python
   from scorebook.metrics import Accuracy
   metric = Accuracy()
   aggregate_scores, item_scores = metric.score(outputs, labels)
   ```

2. **Precision**: Measures the accuracy of positive predictions
   ```python
   from scorebook.metrics import Precision
   ```

**Metric Architecture:**
- All metrics inherit from the `MetricBase` abstract class
- Implement a `score()` method that returns both aggregate and per-item scores
- Automatically registered via `@MetricRegistry.register()` decorator
- Can be referenced by string name or class directly
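
The decorator-based registration described above is a common Python pattern. The sketch below is a generic, minimal registry written for illustration; the `Registry` and `ExactMatch` names are hypothetical, and this is not ScoreBook's actual `MetricRegistry` code.

```python
class Registry:
    """Minimal name -> class registry (illustrative only)."""

    _items: dict = {}

    @classmethod
    def register(cls):
        """Return a decorator that stores the class under its lowercased name."""
        def decorator(metric_cls):
            cls._items[metric_cls.__name__.lower()] = metric_cls
            return metric_cls
        return decorator

    @classmethod
    def get(cls, name_or_cls):
        """Resolve a metric from a string name, or pass a class through unchanged."""
        if isinstance(name_or_cls, str):
            return cls._items[name_or_cls.lower()]
        return name_or_cls

@Registry.register()
class ExactMatch:
    @staticmethod
    def score(outputs, labels):
        # Per-item booleans plus one aggregate value
        item_scores = [out == lab for out, lab in zip(outputs, labels)]
        return {"exact_match": sum(item_scores) / len(item_scores)}, item_scores

metric = Registry.get("exactmatch")
print(metric.score(["a", "b"], ["a", "c"]))
```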

**Custom Metrics:**
```python
from scorebook.metrics import MetricBase, MetricRegistry

@MetricRegistry.register()
class CustomF1Score(MetricBase):
    @staticmethod
    def score(outputs, labels):
        # Calculate F1 score logic here; `calculate_f1` is a placeholder for your own per-item function
        item_scores = [calculate_f1(out, lab) for out, lab in zip(outputs, labels)]
        aggregate_score = {"f1": sum(item_scores) / len(item_scores)}
        return aggregate_score, item_scores
```

**Usage Patterns:**
- Specify metrics when creating datasets: `metrics=["accuracy", "precision"]`
- Mix built-in and custom metrics: `metrics=[Accuracy, CustomF1Score]`
- Access via registry: `MetricRegistry.get("accuracy")`

Each metric returns both aggregate scores (summary statistics) and item-level scores (individual predictions) for detailed analysis.
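
To make the aggregate versus item-level distinction concrete, here is a plain-Python sketch of accuracy and binary precision. It mirrors the `(aggregate_scores, item_scores)` return shape used in the examples above but does not use ScoreBook's classes; the function names and the `positive` parameter are illustrative assumptions.

```python
def accuracy_score(outputs: list, labels: list) -> tuple:
    """Return ({"accuracy": fraction correct}, per-item booleans)."""
    item_scores = [out == lab for out, lab in zip(outputs, labels)]
    value = sum(item_scores) / len(item_scores) if item_scores else 0.0
    return {"accuracy": value}, item_scores

def precision_score(outputs: list, labels: list, positive: str = "yes") -> float:
    """Binary precision: true positives divided by all predicted positives."""
    labels_of_predicted_pos = [lab for out, lab in zip(outputs, labels) if out == positive]
    if not labels_of_predicted_pos:
        return 0.0  # no positive predictions, so precision is reported as 0 here
    true_pos = sum(lab == positive for lab in labels_of_predicted_pos)
    return true_pos / len(labels_of_predicted_pos)

outputs = ["yes", "no", "yes", "yes"]
labels = ["yes", "no", "no", "yes"]

aggregate, item_scores = accuracy_score(outputs, labels)
precision = precision_score(outputs, labels)
print(aggregate, item_scores)
print(round(precision, 3))
```

The aggregate dictionary summarizes the whole run, while the item list shows which individual predictions were wrong (here, the third one).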

---
### Evaluate

The `evaluate()` function is ScoreBook's central orchestrator that brings together datasets, inference pipelines, and metrics to produce evaluation results. It handles the entire evaluation workflow automatically.

**Core Functionality:**

```python
from scorebook import evaluate

results = evaluate(
    inference_pipeline,      # Your inference pipeline
    eval_datasets,           # Single dataset or list of datasets
    hyperparameters=None,    # Optional parameter sweep
    item_limit=None,         # Limit evaluation to N items
    score_type="aggregate",  # "aggregate", "item", or "all"
    return_type="dict",      # Format of returned results
)
```

**Key Features:**

1. **Multi-Dataset Support**: Evaluate on multiple datasets in one call
   ```python
   datasets = [mmlu_dataset, hellaswag_dataset, arc_dataset]
   results = evaluate(pipeline, datasets)
   ```

2. **Hyperparameter Sweeping**: Test different model configurations
   ```python
   hyperparams = {
       "temperature": [0.7, 0.9],
       "max_tokens": [50, 100],
       "top_p": [0.9, 0.95]
   }
   results = evaluate(pipeline, dataset, hyperparameters=hyperparams)
   ```

3. **Flexible Scoring**: Choose what level of detail you need
   - `"aggregate"`: Overall dataset scores only
   - `"item"`: Individual prediction scores only  
   - `"all"`: Both aggregate and per-item scores

4. **Progress Tracking**: Built-in progress bars for long evaluations

5. **Async Support**: Handles both synchronous and asynchronous inference functions
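
A hyperparameter grid like the one in item 2 expands to the Cartesian product of its value lists. The sketch below shows that expansion with `itertools.product`; it illustrates what a grid sweep means in general and is not a copy of ScoreBook's internal code. `expand_grid` is a hypothetical helper name.

```python
from itertools import product

def expand_grid(grid: dict) -> list:
    """Expand {"name": [values, ...]} into one dict per combination."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

hyperparams = {
    "temperature": [0.7, 0.9],
    "max_tokens": [50, 100],
    "top_p": [0.9, 0.95],
}

configs = expand_grid(hyperparams)
print(len(configs))
print(configs[0])
```

With two values for each of three parameters, the grid yields 2 × 2 × 2 = 8 configurations, so the number of evaluation runs per dataset grows multiplicatively with each added parameter.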

**Workflow:**
1. Normalizes input datasets and expands hyperparameter grids
2. For each dataset × hyperparameter combination:
   - Preprocesses items using the pipeline
   - Runs model inference 
   - Postprocesses outputs
   - Computes metric scores
3. Formats and returns results according to specified parameters

The function returns structured results that can be easily analyzed, saved, or visualized.

---
### Evaluation Results

ScoreBook's evaluation results are structured data containers that organize and present your model's performance in multiple formats. The `EvalResult` class provides comprehensive access to both aggregate metrics and individual prediction details.

**Result Structure:**

```python
# Results contain both aggregate and per-item information
results = evaluate(pipeline, dataset)

# Access aggregate scores (overall performance)
aggregate = results[0]["accuracy"]  # Overall accuracy score

# Access individual item results for detailed analysis
# (`eval_result` stands for a single `EvalResult` object)
for result in eval_result.item_scores:
    print(f"Item {result['item_id']}: {result['accuracy']}")
```

**Key Properties:**

1. **Aggregate Scores**: Summary statistics across the entire dataset
   ```python
   eval_result.aggregate_scores
   # Returns: {"dataset_name": "mmlu_pro", "accuracy": 0.85, ...}
   ```

2. **Item Scores**: Per-prediction detailed results
   ```python
   eval_result.item_scores
   # Returns list of: {"item_id": 0, "dataset_name": "mmlu_pro", "accuracy": True, ...}
   ```

3. **Export Capabilities**: Multiple output formats for analysis
   ```python
   # Save as structured JSON
   eval_result.to_json("results.json")
   
   # Export to CSV for spreadsheet analysis
   eval_result.to_csv("results.csv")
   
   # Get as dictionary for further processing
   data = eval_result.to_dict()
   ```

**Hyperparameter Integration:**
When using parameter sweeps, results automatically include hyperparameter information:
```python
{
    "dataset_name": "mmlu_pro",
    "accuracy": 0.87,
    "temperature": 0.7,
    "max_tokens": 100
}
```
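
Because each sweep result is a flat dictionary like the one above, standard Python tools are enough to rank configurations. The accuracy values below are made up for illustration, and the list-of-dicts shape is an assumption based on the example record.

```python
# Hypothetical sweep results; the numbers are placeholders, not measured scores.
sweep_results = [
    {"dataset_name": "mmlu_pro", "accuracy": 0.87, "temperature": 0.7, "max_tokens": 100},
    {"dataset_name": "mmlu_pro", "accuracy": 0.84, "temperature": 0.9, "max_tokens": 100},
    {"dataset_name": "mmlu_pro", "accuracy": 0.85, "temperature": 0.7, "max_tokens": 50},
]

# Rank configurations from best to worst accuracy.
ranked = sorted(sweep_results, key=lambda record: record["accuracy"], reverse=True)
best = ranked[0]

# Keep only the hyperparameters, dropping the bookkeeping and score fields.
best_config = {k: v for k, v in best.items() if k not in ("dataset_name", "accuracy")}
print(best_config)
```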

**Multi-Format Support:**
- **Dictionary Format**: Easy programmatic access and JSON serialization
- **CSV Format**: Spreadsheet-compatible for statistical analysis
- **Structured Objects**: Rich Python objects with methods and properties

**Analysis Workflow:**
1. Run evaluation to get `EvalResult` objects
2. Access aggregate scores for high-level performance overview  
3. Drill into item scores to understand model behavior on specific examples
4. Export results for visualization, reporting, or further analysis
5. Compare results across different models, datasets, or hyperparameters
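
Steps 4 and 5 can also be carried out with the standard library once results are held as dictionaries. The sketch below serializes hypothetical aggregate records for two models to CSV text and computes their accuracy difference; the model names and scores are placeholders, not measured results.

```python
import csv
import io

# Hypothetical aggregate records, one per model.
records = [
    {"model": "model_a", "dataset_name": "mmlu_pro", "accuracy": 0.80},
    {"model": "model_b", "dataset_name": "mmlu_pro", "accuracy": 0.85},
]

# Serialize to CSV in memory; use open("results.csv", "w", newline="") to write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "dataset_name", "accuracy"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)

# Compare the two models on the shared dataset.
by_model = {record["model"]: record["accuracy"] for record in records}
delta = by_model["model_b"] - by_model["model_a"]
print(f"model_b - model_a = {delta:+.2f}")
```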