Scorers
A scorer is any callable that returns a float in the range 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer's name.
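
To make the contract concrete, here is a minimal sketch of the loop described above, written in plain Python with no dependency on aieval. `apply_scorers` and the `non_empty` scorer are hypothetical names used only for illustration, and the range check is an assumption about how a runner could enforce the 0.0 to 1.0 contract; the real runner may differ.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def apply_scorers(
    predictions: list[str],
    expected: list[str],
    scorers: dict[str, Scorer],
) -> list[dict[str, float]]:
    """Apply every scorer to every prediction (hypothetical helper, not aieval's API)."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    rows = []
    for pred, exp in zip(predictions, expected):
        row = {}
        for name, fn in scorers.items():
            score = float(fn(pred, exp))
            # Assumed check: reject scores outside the documented range.
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{name} returned {score}, outside 0.0 to 1.0")
            row[name] = score  # stored under the scorer's name
        rows.append(row)
    return rows


scorers = {
    "mentions_refund": lambda p, _e: 1.0 if "refund" in p.lower() else 0.0,
    "non_empty": lambda p, _e: 1.0 if p.strip() else 0.0,
}
rows = apply_scorers(["Refund issued.", ""], ["ok", "ok"], scorers)
```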
The simplest scorer takes the prediction and the expected answer:
```python
from aieval import scorer

@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0
```

Give a scorer an explicit metric name with the decorator argument:
```python
@scorer(name="refund_mention")
def _refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0
```

A scorer can ask for more than the prediction and the expected answer. Declare any of `example`, `model`, `provider` or `prompt` as keyword-only parameters and the runner fills them in:
```python
@scorer
def grounded_in_prompt(prediction: str, _expected: str, *, prompt: str | None = None) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0
```

The common two-argument case is unchanged; the extra parameters are only supplied when you declare them.
```python
from aieval.scorers import exact_match, json_valid, rouge_l
```

| Scorer | What it measures |
|---|---|
| `exact_match` | 1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0 |
| `json_valid` | 1.0 when the prediction parses as JSON, else 0.0 |
| `rouge_l` | ROUGE-L F1 between prediction and expected, on whitespace tokens |
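
For readers who want to see what these measurements amount to, the following is a plain-Python sketch of each one. The `*_sketch` functions and `lcs_length` are illustrative names, not the library's implementations. The sketch assumes "trimmed" means `str.strip()`, that ROUGE-L F1 is the harmonic mean of LCS-based precision and recall, and that an empty prediction or reference scores 0.0.

```python
import json


def exact_match_sketch(prediction: str, expected: str) -> float:
    # Assumes "trimmed" means stripping leading and trailing whitespace.
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def json_valid_sketch(prediction: str, _expected: str = "") -> float:
    try:
        json.loads(prediction)
    except (ValueError, TypeError):
        return 0.0
    return 1.0


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(prediction: str, expected: str) -> float:
    pred, ref = prediction.split(), expected.split()  # whitespace tokens
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Five of six tokens are shared in order, so precision and recall are both 5/6.
partial = rouge_l_sketch("the cat sat on the mat", "the cat lay on the mat")
```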
llm_judge builds a scorer that asks a grading model to score a prediction against a rubric. The model is prompted to answer with a single JSON object {"score": 0-10, "reason": "..."}, which is parsed strictly and normalised to 0.0 to 1.0. A malformed verdict scores 0.0 rather than crashing the run.
```python
from aieval.scorers import llm_judge

faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)
```

The grading provider and model are fixed when you build the judge, independent of the model under evaluation, so you can grade a cheap model's output with a stronger judge. The judge sees the original prompt, the prediction and the expected answer.
Pass the built judge into run like any other scorer:
```python
run(name="qa", dataset=..., scorers=[faithful], provider="sarmalink", model="smart")
```
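
The strict verdict parsing described for `llm_judge` can be pictured roughly as follows. `parse_verdict` is a hypothetical helper, and what counts as "strict" here (the whole reply must be a JSON object with a numeric `score` between 0 and 10) is an assumption rather than the library's documented behaviour.

```python
import json


def parse_verdict(raw: str) -> float:
    """Return score / 10 for a well-formed verdict, else 0.0 (illustrative only)."""
    try:
        verdict = json.loads(raw)
    except (ValueError, TypeError):
        return 0.0  # not JSON at all, e.g. free-text commentary
    if not isinstance(verdict, dict):
        return 0.0
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    if not 0 <= score <= 10:
        return 0.0  # also catches NaN, which fails every comparison
    return score / 10.0


good = parse_verdict('{"score": 7, "reason": "covers the key points"}')
chatty = parse_verdict("Score: 7 out of 10")
out_of_range = parse_verdict('{"score": 11, "reason": "great"}')
```

Returning 0.0 on every failure path mirrors the behaviour stated above: a malformed verdict lowers the score but never interrupts the run.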