Scorers

sarmakska edited this page May 31, 2026 · 2 revisions

A scorer is any callable that returns a float in the range 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer's name.
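To make the "every scorer on every prediction" behaviour concrete, here is a minimal sketch of such a loop. `apply_scorers` is a hypothetical stand-in written for illustration, not the library's runner, and passing names through a dict is an assumption about how names are attached.

```python
from typing import Callable, Dict, List

# A scorer, in its simplest form: (prediction, expected) -> float in [0.0, 1.0].
Scorer = Callable[[str, str], float]


def apply_scorers(
    predictions: List[str],
    expected: List[str],
    scorers: Dict[str, Scorer],
) -> List[Dict[str, float]]:
    """Hypothetical stand-in for the runner's scoring loop."""
    rows = []
    for pred, exp in zip(predictions, expected):
        # One result per scorer, stored under the scorer's name.
        rows.append({name: float(fn(pred, exp)) for name, fn in scorers.items()})
    return rows


def has_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0


rows = apply_scorers(
    ["We issued a refund.", "Please wait."],
    ["refund issued", "refund issued"],
    {"has_refund": has_refund},
)
print(rows)
```

Each row holds one entry per scorer, so adding a scorer adds a column to the results rather than another pass over the dataset.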

Writing a scorer

The simplest scorer takes the prediction and the expected answer:

from aieval import scorer


@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0

Give a scorer an explicit metric name with the decorator argument:

@scorer(name="refund_mention")
def _refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0
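Scores do not have to be binary; any value between 0.0 and 1.0 is allowed. The sketch below gives partial credit for the fraction of expected words found in the prediction. `keyword_coverage` is a hypothetical example, shown as a plain function so it runs on its own; in a project you would presumably decorate it with `@scorer` as in the examples above.

```python
def keyword_coverage(prediction: str, expected: str) -> float:
    """Fraction of the expected whitespace tokens that appear in the prediction."""
    wanted = set(expected.lower().split())
    if not wanted:
        # Nothing to look for: return 0.0 rather than divide by zero.
        return 0.0
    found = set(prediction.lower().split())
    return len(wanted & found) / len(wanted)


print(keyword_coverage("refund pending", "refund issued"))
```

Because the result is a ratio of set sizes, it always stays inside the 0.0 to 1.0 range a scorer is expected to return.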

Scorer context

A scorer can ask for more than the prediction and the expected answer. Declare any of example, model, provider or prompt as keyword-only parameters and the runner fills them in:

@scorer
def grounded_in_prompt(prediction: str, _expected: str, *, prompt: str | None = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0

The common two-argument case is unchanged; the extra parameters are supplied only when you declare them.
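One way a runner could decide which context values to pass is to inspect the scorer's signature. The sketch below is an assumption about the mechanism, not the library's actual code; `call_with_context` and `CONTEXT_KEYS` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, Optional

# Context names a scorer may request, as listed in this section.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def call_with_context(
    fn: Callable[..., float],
    prediction: str,
    expected: str,
    context: Dict[str, Any],
) -> float:
    """Call fn, passing only the keyword-only context parameters it declares."""
    params = inspect.signature(fn).parameters
    extra = {
        key: context.get(key)
        for key in CONTEXT_KEYS
        if key in params and params[key].kind is inspect.Parameter.KEYWORD_ONLY
    }
    return fn(prediction, expected, **extra)


def plain(prediction: str, expected: str) -> float:
    return 1.0 if prediction == expected else 0.0


def grounded(prediction: str, _expected: str, *, prompt: Optional[str] = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0


ctx = {"prompt": "Refunds take five days", "model": "smart"}
print(call_with_context(plain, "a", "a", ctx))
print(call_with_context(grounded, "Refunds are slow", "", ctx))
```

The two-argument scorer receives no extra arguments, while the second receives only `prompt`, even though `model` is also available in the context.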

Built-in scorers

from aieval.scorers import exact_match, json_valid, rouge_l

| Scorer | What it measures |
| --- | --- |
| exact_match | 1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0 |
| json_valid | 1.0 when the prediction parses as JSON, else 0.0 |
| rouge_l | ROUGE-L F1 between prediction and expected, on whitespace tokens |
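To show what these measurements mean in practice, here are illustrative reimplementations of the three behaviours described above. The `_sketch` functions are hypothetical and are not the library's code; details such as case sensitivity and the handling of empty input are assumptions.

```python
import json


def exact_match_sketch(prediction: str, expected: str) -> float:
    # Compare after trimming surrounding whitespace.
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def json_valid_sketch(prediction: str, _expected: str = "") -> float:
    try:
        json.loads(prediction)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return 0.0
    return 1.0


def rouge_l_sketch(prediction: str, expected: str) -> float:
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0
    # Longest common subsequence length by dynamic programming, one row at a time.
    prev = [0] * (len(ref) + 1)
    for p in pred:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if p == r else max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_sketch("the cat", "the cat sat"))
```

In the printed example the longest common subsequence has length 2, so precision is 1.0, recall is 2/3, and the F1 works out to about 0.8.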

LLM as judge

llm_judge builds a scorer that asks a grading model to score a prediction against a rubric. The model is prompted to answer with a single JSON object {"score": 0-10, "reason": "..."}, which is parsed strictly and normalised to 0.0 to 1.0. A malformed verdict scores 0.0 rather than crashing the run.

from aieval.scorers import llm_judge

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)

The grading provider and model are fixed when you build the judge, independent of the model under evaluation, so you can grade a cheap model's output with a stronger judge. The judge sees the original prompt, the prediction and the expected answer.
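The verdict handling described in this section (strict parsing, normalising to 0.0 to 1.0, and scoring a malformed verdict as 0.0) might look roughly like the sketch below. `parse_verdict` is a hypothetical helper, not part of the library, and treating an out-of-range score as malformed rather than clamping it is an assumption.

```python
import json


def parse_verdict(raw: str) -> float:
    """Turn a judge reply like {"score": 8, "reason": "..."} into a 0.0-1.0 score."""
    try:
        verdict = json.loads(raw)
    except (TypeError, ValueError):
        return 0.0  # not JSON at all
    if not isinstance(verdict, dict):
        return 0.0  # JSON, but not an object
    score = verdict.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    if not 0 <= score <= 10:
        return 0.0  # assumption: out-of-range counts as malformed (also catches NaN)
    return score / 10.0


print(parse_verdict('{"score": 8, "reason": "mostly faithful"}'))
print(parse_verdict("Score: 8"))
```

Returning 0.0 on every failure path means one bad judge reply lowers a single score instead of aborting the whole run.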

Pass the built judge into run like any other scorer:

run(name="qa", dataset=..., scorers=[faithful], provider="sarmalink", model="smart")