Scorers

sarmakska edited this page Jun 7, 2026 · 2 revisions

A scorer is any callable that returns a float in the range 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer's name.

Writing a scorer

The simplest scorer takes the prediction and the expected answer:

from aieval import scorer


@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0

Give a scorer an explicit metric name with the decorator argument:

@scorer(name="refund_mention")
def _refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0

Scorer context

A scorer can ask for more than the prediction and the expected answer. Declare any of example, model, provider or prompt as keyword-only parameters and the runner fills them in:

@scorer
def grounded_in_prompt(prediction: str, _expected: str, *, prompt: str | None = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0

The common two-argument case is unchanged: the extra parameters are supplied only when you declare them.
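One way a runner could decide which context values to pass is to inspect the scorer's signature. The sketch below is an illustration only, not aieval's implementation: call_with_context and CONTEXT_KEYS are hypothetical names, and the context is assumed to arrive as a plain dict.

```python
import inspect

# Hypothetical: the four context names the page says a scorer may declare.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def call_with_context(fn, prediction, expected, context):
    """Call fn, passing only the context keys it declares as keyword-only."""
    params = inspect.signature(fn).parameters
    extra = {
        key: context[key]
        for key in CONTEXT_KEYS
        if key in context
        and key in params
        and params[key].kind is inspect.Parameter.KEYWORD_ONLY
    }
    return float(fn(prediction, expected, **extra))


def plain(prediction, expected):
    # Two-argument scorer: receives no context at all.
    return 1.0 if prediction == expected else 0.0


def uses_prompt(prediction, expected, *, prompt=None):
    # Declares prompt, so the helper fills it in from the context.
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0


ctx = {"prompt": "Paris is where?", "model": "smart"}
plain_score = call_with_context(plain, "a", "a", ctx)
prompt_score = call_with_context(uses_prompt, "It is in paris.", "Paris", ctx)
```

Filtering on the declared parameters is what lets a two-argument scorer keep working untouched while a richer scorer opts in to exactly the values it needs.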

Built-in scorers

from aieval.scorers import exact_match, json_valid, rouge_l, token_f1

| Scorer | What it measures |
| --- | --- |
| exact_match | 1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0 |
| json_valid | 1.0 when the prediction parses as JSON, else 0.0 |
| rouge_l | ROUGE-L F1 between prediction and expected, on whitespace tokens |
| token_f1 | Order-independent token-level F1 overlap, the harmonic mean of token precision and recall |
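To make the rouge_l row concrete, here is a minimal sketch of ROUGE-L F1 on whitespace tokens, built on the longest common subsequence (LCS). It is not the library's code: rouge_l_sketch and lcs_length are hypothetical names, and the unweighted F1, the absence of case folding and the 0.0 result for empty input are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(prediction, expected):
    """Unweighted ROUGE-L F1 on whitespace tokens (assumed definition)."""
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0  # assumption: empty input scores 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


same_order = rouge_l_sketch("the cat sat", "the cat sat")
reordered = rouge_l_sketch("paris is the capital", "the capital is paris")
```

Because the LCS respects token order, the reordered pair shares only "the capital" and lands at 0.5 under this definition, whereas an order-independent metric would treat the two strings as identical.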

token_f1

token_f1 measures how much the prediction and the reference share at the word level, independent of order. Shared tokens are counted up to the smaller of their two multiplicities, so a prediction cannot inflate its score by repeating a correct word. Unlike rouge_l it ignores word order entirely, so a reordered but correct answer scores 1.0; unlike exact_match it gives partial credit for partial overlap. It is the same span-overlap metric used by SQuAD-style QA evaluations.

from aieval.scorers import token_f1

token_f1("the capital is paris", "paris is the capital")  # 1.0, order ignored
token_f1("the cat extra", "the cat")                       # 0.8, partial overlap

The score is 0.0 when either side is empty and 1.0 only when the two token multisets match exactly. Comparison is lower-cased by default; pass case_sensitive=True to keep casing significant.
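The behaviour described above can be reproduced in a few lines with collections.Counter, whose & operator keeps the smaller of the two counts for each token. This is a sketch for intuition, not the library's source: token_f1_sketch is a hypothetical name and whitespace tokenisation is an assumption.

```python
from collections import Counter


def token_f1_sketch(prediction, expected, case_sensitive=False):
    """Order-independent token F1 (assumes whitespace tokenisation)."""
    if not case_sensitive:
        prediction, expected = prediction.lower(), expected.lower()
    pred, ref = Counter(prediction.split()), Counter(expected.split())
    # Multiset intersection: each shared token counts min(pred, ref) times.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0  # also covers the case where either side is empty
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reordered = token_f1_sketch("the capital is paris", "paris is the capital")
partial = token_f1_sketch("the cat extra", "the cat")
repeated = token_f1_sketch("paris paris paris", "paris")
```

For "the cat extra" against "the cat", precision is 2/3 and recall is 1, which gives the 0.8 shown earlier. Repeating "paris" three times against a single "paris" drops precision to 1/3, so the score falls to 0.5 instead of rising.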

LLM as judge

llm_judge builds a scorer that asks a grading model to score a prediction against a rubric. The model is prompted to answer with a single JSON object {"score": 0-10, "reason": "..."}, which is parsed strictly and normalised to 0.0 to 1.0. A malformed verdict scores 0.0 rather than crashing the run.

from aieval.scorers import llm_judge

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)

The grading provider and model are fixed when you build the judge, independent of the model under evaluation, so you can grade a cheap model's output with a stronger judge. The judge sees the original prompt, the prediction and the expected answer.
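The strict parse-and-normalise step for the judge's verdict might look like the sketch below. It is an assumption about the shape of that logic, not llm_judge's actual code: parse_verdict is a hypothetical name, and treating an out-of-range or non-numeric score as malformed is a guess at what "parsed strictly" means.

```python
import json


def parse_verdict(raw):
    """Return a score in 0.0..1.0, or 0.0 for any malformed verdict."""
    try:
        verdict = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return 0.0
    if not isinstance(verdict, dict):
        return 0.0
    score = verdict.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    # Assumption: scores outside 0..10 (and NaN) count as malformed.
    if not 0 <= score <= 10:
        return 0.0
    return score / 10


good = parse_verdict('{"score": 7, "reason": "covers the key points"}')
bad = parse_verdict("I would give this a 7 out of 10")
```

Returning 0.0 on every failure path keeps one bad judge response from aborting a long evaluation run, at the cost of scoring that example as a miss.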

Self-consistency

A single LLM grade carries real variance: ask the same judge the same question twice and you can get two different scores. Pass samples to query the judge several times and take the median verdict, so one noisy grade cannot swing the result. The samples run concurrently, so a judge with three samples takes roughly the wall-clock time of one call, although it still makes three grading calls.

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    samples=3,            # query three times, take the median verdict
    name="faithfulness",
)

samples defaults to 1, a single call, which matches the previous behaviour. It must be a positive integer; a value below 1 raises a ValueError at build time.
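The median-of-samples idea can be sketched with the standard library alone. Everything here is illustrative: with_self_consistency is a hypothetical wrapper, grade_once stands in for one call to the grading model, and the use of threads is an assumption about how the concurrency could be achieved.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def with_self_consistency(grade_once, samples=1):
    """Wrap a zero-argument grader so it returns the median of several calls."""
    # Validate when the wrapper is built, mirroring the build-time ValueError.
    if isinstance(samples, bool) or not isinstance(samples, int) or samples < 1:
        raise ValueError("samples must be a positive integer")

    def grade():
        # Submit every sample before waiting, so the calls overlap in time.
        with ThreadPoolExecutor(max_workers=samples) as pool:
            futures = [pool.submit(grade_once) for _ in range(samples)]
            return median(future.result() for future in futures)

    return grade


# Stand-in for a noisy judge: three canned scores served from a thread-safe queue.
canned = queue.Queue()
for score in (0.9, 0.1, 0.8):
    canned.put(score)

grade = with_self_consistency(canned.get_nowait, samples=3)
result = grade()
```

With the canned scores 0.9, 0.1 and 0.8, the median is 0.8, so the single outlier at 0.1 has no effect on the result; a mean would have been pulled down to 0.6.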

Pass the built judge into run like any other scorer:

run(name="qa", dataset=..., scorers=[faithful], provider="sarmalink", model="smart")
