Scorers
A scorer is any callable that returns a float in the range 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer's name.
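As a rough mental model of that loop, here is a minimal sketch in plain Python. It is not aieval's runner: `apply_scorers` is a hypothetical helper, and keying results by the function's `__name__` is an assumption made for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def apply_scorers(
    scorers: list[Scorer],
    pairs: list[tuple[str, str]],
) -> list[dict[str, float]]:
    """Score every (prediction, expected) pair with every scorer.

    Hypothetical helper: each result is stored under the function's __name__.
    """
    rows = []
    for prediction, expected in pairs:
        row = {}
        for fn in scorers:
            value = float(fn(prediction, expected))
            # Enforce the 0.0 to 1.0 contract so a bad scorer fails loudly.
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{fn.__name__} returned {value}, outside 0.0 to 1.0")
            row[fn.__name__] = value
        rows.append(row)
    return rows


def is_non_empty(prediction: str, _expected: str) -> float:
    return 1.0 if prediction.strip() else 0.0


def same_length(prediction: str, expected: str) -> float:
    return 1.0 if len(prediction) == len(expected) else 0.0


results = apply_scorers([is_non_empty, same_length], [("paris", "Paris"), ("", "rome")])
```

Each row in `results` is a dict with one entry per scorer, which is the shape the paragraph above describes.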
The simplest scorer takes the prediction and the expected answer:
```python
from aieval import scorer

@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0
```

Give a scorer an explicit metric name with the decorator argument:
```python
@scorer(name="refund_mention")
def _refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0
```

A scorer can ask for more than the prediction and the expected answer. Declare any of example, model, provider or prompt as keyword-only parameters and the runner fills them in:
```python
@scorer
def grounded_in_prompt(prediction: str, _expected: str, *, prompt: str | None = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0
```

The common two-argument case is unchanged; the extra parameters are only supplied when you declare them.
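One way a runner could supply only the declared parameters is to inspect the scorer's signature. The sketch below assumes that approach; `call_scorer` and `CONTEXT_NAMES` are hypothetical names, and aieval may do this differently.

```python
import inspect
from typing import Any, Callable, Optional

# The four names come from the text above; the rest of this sketch is hypothetical.
CONTEXT_NAMES = ("example", "model", "provider", "prompt")


def call_scorer(fn: Callable[..., float], prediction: str, expected: str, **context: Any) -> float:
    """Call fn, passing only the keyword-only context parameters it declares."""
    params = inspect.signature(fn).parameters
    wanted = {
        name: context.get(name)
        for name, param in params.items()
        if param.kind is inspect.Parameter.KEYWORD_ONLY and name in CONTEXT_NAMES
    }
    return fn(prediction, expected, **wanted)


def plain(prediction: str, expected: str) -> float:
    return 1.0 if prediction == expected else 0.0


def echoes_prompt(prediction: str, _expected: str, *, prompt: Optional[str] = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0


context = {"prompt": "Refunds are issued within 5 days", "model": "smart"}
plain_score = call_scorer(plain, "yes", "yes", **context)  # receives no extras
prompt_score = call_scorer(echoes_prompt, "Refunds take five days", "", **context)
```

Because the two-argument scorer declares no keyword-only parameters, it is called exactly as before and never sees the context.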
```python
from aieval.scorers import exact_match, json_valid, rouge_l, token_f1
```

| Scorer | What it measures |
|---|---|
| exact_match | 1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0 |
| json_valid | 1.0 when the prediction parses as JSON, else 0.0 |
| rouge_l | ROUGE-L F1 between prediction and expected, on whitespace tokens |
| token_f1 | Order-independent token-level F1 overlap, the harmonic mean of token precision and recall |

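For orientation, ROUGE-L F1 on whitespace tokens can be sketched from the longest common subsequence (LCS). This is a standalone illustration, not the library's code: `lcs_length` and `rouge_l_f1` are hypothetical names, and the case-sensitive comparison and the 0.0 score for empty input are assumptions.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, expected: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0  # assumption: empty input scores zero
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reordered = rouge_l_f1("the capital is paris", "paris is the capital")
```

Here `reordered` is 0.5, because the longest in-order overlap is only two of the four tokens. That is how an order-sensitive metric penalises a reordered answer.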
token_f1 measures how much the prediction and the reference share at the word level, independent of order. Shared tokens are counted up to the smaller of their two multiplicities, so a prediction cannot inflate its score by repeating a correct word. Unlike rouge_l it ignores word order entirely, so a reordered but correct answer scores 1.0; unlike exact_match it gives partial credit for partial overlap. It is the same span-overlap metric used by SQuAD-style QA evaluations.
```python
from aieval.scorers import token_f1

token_f1("the capital is paris", "paris is the capital")  # 1.0, order ignored
token_f1("the cat extra", "the cat")                      # 0.8, partial overlap
```

The score is 0.0 when either side is empty and 1.0 only when the two token multisets match exactly. Comparison is lower-cased by default; pass case_sensitive=True to keep casing significant.
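The multiset arithmetic behind those numbers can be sketched with `collections.Counter`. This is an illustrative reimplementation under stated assumptions, not the library source: `token_f1_sketch` is a hypothetical name and splitting on whitespace is assumed.

```python
from collections import Counter


def token_f1_sketch(prediction: str, expected: str, case_sensitive: bool = False) -> float:
    """Order-independent token F1 with multiplicity-capped overlap."""
    if not case_sensitive:
        prediction, expected = prediction.lower(), expected.lower()
    pred, ref = Counter(prediction.split()), Counter(expected.split())
    if not pred or not ref:
        return 0.0
    # Counter & Counter keeps the smaller count per token, so repeating
    # a correct word cannot raise the overlap.
    shared = sum((pred & ref).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(pred.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


partial = token_f1_sketch("the cat extra", "the cat")
repeated = token_f1_sketch("paris paris paris", "paris")
```

For `partial`, precision is 2/3 and recall is 1, giving roughly 0.8. For `repeated`, only one of the three tokens counts as shared, so the score drops to about 0.5 instead of rising.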
llm_judge builds a scorer that asks a grading model to score a prediction against a rubric. The model is prompted to answer with a single JSON object {"score": 0-10, "reason": "..."}, which is parsed strictly and normalised to 0.0 to 1.0. A malformed verdict scores 0.0 rather than crashing the run.
```python
from aieval.scorers import llm_judge

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)
```

The grading provider and model are fixed when you build the judge, independent of the model under evaluation, so you can grade a cheap model's output with a stronger judge. The judge sees the original prompt, the prediction and the expected answer.
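The strict parsing of the verdict described above might look like the following sketch. `parse_verdict` is a hypothetical helper. Treating an out-of-range or non-numeric score as malformed is an assumption, since the text does not say how aieval handles those cases.

```python
import json


def parse_verdict(raw: str) -> float:
    """Map a judge reply to the 0.0 to 1.0 range; anything malformed scores 0.0."""
    try:
        verdict = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(verdict, dict):
        return 0.0
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    # Assumption: scores outside 0 to 10 are rejected rather than clamped.
    if not 0 <= score <= 10:
        return 0.0
    return score / 10.0


good = parse_verdict('{"score": 7, "reason": "covers the main points"}')
bad = parse_verdict("Score: 7 out of 10")
```

Returning 0.0 instead of raising keeps one bad reply from aborting a long run, at the cost of hiding judge failures inside the metric. Counting malformed verdicts separately would make that visible.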
A single LLM grade carries real variance: ask the same judge the same question twice and you can get two different scores. Pass samples to query the judge several times and take the median verdict, so one noisy grade cannot swing the result. The samples run concurrently, so a judge with three samples takes roughly the wall-clock time of one call.
```python
faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    samples=3,  # query three times, take the median verdict
    name="faithfulness",
)
```

samples defaults to 1, which is a single call and the previous behaviour. It must be a positive integer; a value below 1 raises a ValueError at build time.
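The sample-and-take-the-median idea can be sketched with the standard library. This assumes a thread pool for the concurrency, which may not match what aieval does internally. `median_verdict` and the `ask_judge` callable are hypothetical stand-ins for a real judge call.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable


def median_verdict(ask_judge: Callable[[], float], samples: int = 1) -> float:
    """Query the judge `samples` times concurrently and return the median score."""
    if isinstance(samples, bool) or not isinstance(samples, int) or samples < 1:
        raise ValueError("samples must be a positive integer")
    if samples == 1:
        return ask_judge()  # single call, no pool needed
    with ThreadPoolExecutor(max_workers=samples) as pool:
        futures = [pool.submit(ask_judge) for _ in range(samples)]
        scores = [future.result() for future in futures]
    return median(scores)


replies = [0.2, 0.9, 0.7]  # three noisy grades from a stand-in judge
score = median_verdict(replies.pop, samples=3)
```

With the three stand-in replies the median is 0.7, so the 0.2 outlier does not move the result. Note that `statistics.median` averages the two middle values when the sample count is even, so an odd count keeps the result equal to one of the actual verdicts.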
Pass the built judge into run like any other scorer:
```python
run(name="qa", dataset=..., scorers=[faithful], provider="sarmalink", model="smart")
```