Scorers

A scorer is any callable that returns a float in the range 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer's name.
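To make the contract above concrete, here is a minimal sketch of how a runner might apply scorers and key the results by name. The `apply_scorers` helper is hypothetical and not part of aieval. It assumes the metric name defaults to the function's `__name__`, and that out-of-range values are rejected. Neither behaviour is confirmed by this page.

```python
from typing import Callable, Sequence

# A scorer maps (prediction, expected) to a float in [0.0, 1.0].
Scorer = Callable[[str, str], float]


def apply_scorers(
    predictions: Sequence[str],
    expected: Sequence[str],
    scorers: Sequence[Scorer],
) -> list[dict[str, float]]:
    """Hypothetical stand-in for the runner's scoring loop."""
    rows = []
    for pred, exp in zip(predictions, expected):
        row = {}
        for fn in scorers:
            value = float(fn(pred, exp))
            # Assumption: values outside the documented range are an error.
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{fn.__name__} returned {value}")
            # Assumption: the metric name is the function name.
            row[fn.__name__] = value
        rows.append(row)
    return rows


def has_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0


rows = apply_scorers(["Refund issued", "No"], ["", ""], [has_refund])
```

Each row holds one entry per scorer, so adding a scorer adds a column to the results without changing the others.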

Writing a scorer

The simplest scorer takes the prediction and the expected answer:

from aieval import scorer


@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0

Give a scorer an explicit metric name with the decorator argument:

@scorer(name="refund_mention")
def _refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0

Scorer context

A scorer can ask for more than the prediction and the expected answer. Declare any of example, model, provider or prompt as keyword-only parameters and the runner fills them in:

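@scorer
def grounded_in_prompt(prediction: str, _expected: str, *, prompt: str | None = None) -> float:
    # 1.0 when the first word of the prompt appears in the prediction.
    # An empty or whitespace-only prompt scores 0.0 instead of raising IndexError.
    words = (prompt or "").split()
    return 1.0 if words and words[0].lower() in prediction.lower() else 0.0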

The common two-argument case is unchanged; the extra parameters are supplied only when you declare them.
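One way a runner could decide which context values to pass is to inspect the scorer's signature. The sketch below is an illustration only, not aieval's implementation: `call_scorer` and the `context` dictionary are hypothetical names, and only the four parameter names come from this page.

```python
import inspect

# The context names listed in this section.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def call_scorer(fn, prediction, expected, context):
    """Call fn, passing only the context values it declares (hypothetical helper)."""
    params = inspect.signature(fn).parameters
    extras = {
        key: context.get(key)
        for key in CONTEXT_KEYS
        if key in params and params[key].kind is inspect.Parameter.KEYWORD_ONLY
    }
    return fn(prediction, expected, **extras)


def length_ok(prediction, _expected):
    # Plain two-argument scorer: receives no context.
    return 1.0 if len(prediction) < 20 else 0.0


def echoes_model(prediction, _expected, *, model=None):
    # Declares `model`, so the helper fills it in.
    return 1.0 if model and model in prediction else 0.0


ctx = {"example": None, "model": "smart", "provider": "sarmalink", "prompt": "Say hi"}
plain = call_scorer(length_ok, "hello", "hello", ctx)
with_ctx = call_scorer(echoes_model, "smart reply", "", ctx)
```

Because the helper passes only declared keyword-only names, existing two-argument scorers keep working without modification.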

Built-in scorers

from aieval.scorers import exact_match, json_valid, rouge_l

Scorer	What it measures
`exact_match`	1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0
`json_valid`	1.0 when the prediction parses as JSON, else 0.0
`rouge_l`	ROUGE-L F1 between prediction and expected, on whitespace tokens
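The table's descriptions can be approximated in plain Python. The functions below are sketches under stated assumptions, not the library's code. The `_sketch` names are hypothetical, and ROUGE-L is computed as an unweighted F1 over the longest common subsequence of whitespace tokens, which may differ in detail from what `rouge_l` does.

```python
import json


def exact_match_sketch(prediction: str, expected: str) -> float:
    # Compare after trimming surrounding whitespace.
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def json_valid_sketch(prediction: str, _expected: str = "") -> float:
    try:
        json.loads(prediction)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return 0.0
    return 1.0


def lcs_length(a: list[str], b: list[str]) -> int:
    # Longest common subsequence length, keeping one DP row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(prediction: str, expected: str) -> float:
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring "the cat" against "the cat sat on" gives precision 1.0 and recall 0.5, so the F1 is about 0.67.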

LLM as judge

llm_judge builds a scorer that asks a grading model to score a prediction against a rubric. The model is prompted to answer with a single JSON object {"score": 0-10, "reason": "..."}, which is parsed strictly and normalised to the range 0.0 to 1.0. A malformed verdict scores 0.0 rather than crashing the run.

from aieval.scorers import llm_judge

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)

The grading provider and model are fixed when you build the judge, independent of the model under evaluation, so you can grade a cheap model's output with a stronger judge. The judge sees the original prompt, the prediction and the expected answer.
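The strict verdict handling described in this section might look roughly like the sketch below. `parse_verdict` is a hypothetical helper, not the library's function. Treating an out-of-range or non-numeric score as malformed, rather than clamping it, is an assumption.

```python
import json


def parse_verdict(raw: str) -> float:
    """Turn a judge reply into a 0.0 to 1.0 score; malformed replies score 0.0."""
    try:
        verdict = json.loads(raw)
    except (TypeError, ValueError):
        return 0.0  # not JSON at all
    if not isinstance(verdict, dict):
        return 0.0  # JSON, but not a single object
    score = verdict.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    # Assumption: scores outside 0 to 10 count as malformed, not clamped.
    if not 0 <= score <= 10:
        return 0.0
    return score / 10.0


good = parse_verdict('{"score": 7, "reason": "mostly faithful"}')
bad = parse_verdict("Score: 7 out of 10")
```

Returning 0.0 on any parse failure means one bad reply from the judge lowers that example's score but does not stop the run.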

Pass the built judge into run like any other scorer:

run(name="qa", dataset=..., scorers=[faithful], provider="sarmalink", model="smart")
