Comparing Runs
Once you have two runs you can compare them three ways: a regression diff, a CI gate, and a paired A/B with a confidence interval. All three pair examples by index and only compare scorers present in both runs.
uv run aieval diff <run_a> <run_b>

This prints per-scorer mean deltas (candidate minus baseline) and, for each scorer, the examples that regressed most. The same view is available in the viewer at /diff/<run_a>/<run_b>, with movers in both directions colour coded.
In code:
from aieval import compare
from aieval.backends import get_backend
backend = get_backend()
report = compare(backend.get_results(run_a), backend.get_results(run_b))
for d in report.scorer_deltas:
    print(d.scorer, d.delta)

uv run aieval ci <run_id> --threshold 0.05

The gate compares the candidate run against a baseline and exits non-zero when any scorer's mean drops by more than the threshold. By default the baseline is the previous run of the same name; pass --baseline <run_id> to pin it. If there is no baseline the gate passes, so the first run on a new eval never blocks.
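The gate rule described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the aieval implementation: the `gate` function and the shape of its inputs (a dict mapping scorer name to a list of per-example scores) are assumptions made for the example.

```python
from statistics import mean


def gate(baseline, candidate, threshold=0.05):
    """Hypothetical gate: return (exit_code, failures).

    `baseline` and `candidate` are assumed to map scorer name -> list of
    per-example scores. `baseline` may be None when no earlier run exists.
    """
    if baseline is None:
        # No baseline: the gate passes, so a first run never blocks.
        return 0, []

    failures = []
    # Only scorers present in both runs are compared.
    for scorer in sorted(set(baseline) & set(candidate)):
        # Examples are paired by index, so compare the overlapping prefix.
        n = min(len(baseline[scorer]), len(candidate[scorer]))
        if n == 0:
            continue
        delta = mean(candidate[scorer][:n]) - mean(baseline[scorer][:n])
        if delta < -threshold:  # mean dropped by more than the threshold
            failures.append((scorer, delta))

    return (1 if failures else 0), failures


baseline = {"accuracy": [1.0, 1.0, 0.0, 1.0], "fluency": [0.8, 0.9, 0.7, 0.8]}
candidate = {"accuracy": [1.0, 0.0, 0.0, 1.0], "fluency": [0.8, 0.9, 0.8, 0.8]}

exit_code, failures = gate(baseline, candidate, threshold=0.05)
print(exit_code, failures)
```

Here accuracy falls from 0.75 to 0.5, which exceeds the 0.05 threshold, so the sketch returns a non-zero exit code; fluency improves slightly and does not trip the gate.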
Wire it into a workflow after your eval:
- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05

uv run aieval pairwise <run_a> <run_b> --confidence 0.95 --iterations 2000

A single mean difference does not tell you whether a result is real. The pairwise command runs a paired bootstrap: it resamples the per-example differences with replacement many times, recomputes the mean difference on each resample, and reports a percentile confidence interval. The winner column reads:
- b when the whole interval sits above zero (B genuinely beats A)
- a when the whole interval sits below zero (A genuinely beats B)
- tie when the interval straddles zero (the data cannot support a winner)
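For readers who want to see the mechanics, a paired bootstrap with a percentile interval can be sketched as follows. This is a minimal standalone illustration under stated assumptions, not the code behind the pairwise command: the function name `paired_bootstrap`, its parameters, the default seed, and the index rule used to pick the percentiles are all choices made for this example.

```python
import random


def paired_bootstrap(scores_a, scores_b, confidence=0.95, iterations=2000, seed=0):
    """Hypothetical helper: return (mean_diff, ci_low, ci_high, winner)."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")

    # Per-example differences, paired by index (B minus A).
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Resample the differences with replacement and record each resample's mean.
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(iterations)
    )

    # Percentile interval: cut (1 - confidence) / 2 off each tail.
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * iterations)
    hi_idx = min(iterations - 1, int((1 - alpha / 2) * iterations))
    ci_low, ci_high = boot_means[lo_idx], boot_means[hi_idx]

    if ci_low > 0:
        winner = "b"    # whole interval above zero
    elif ci_high < 0:
        winner = "a"    # whole interval below zero
    else:
        winner = "tie"  # interval straddles (or touches) zero

    return sum(diffs) / n, ci_low, ci_high, winner


a = [0.5, 0.6, 0.4, 0.5, 0.7, 0.5, 0.6, 0.4]
b = [0.7, 0.8, 0.6, 0.6, 0.9, 0.7, 0.8, 0.5]
print(paired_bootstrap(a, b))
```

Because every per-example difference in this toy data is positive, every resampled mean is positive too, so the interval sits entirely above zero and the winner is b. Passing identical score lists yields an interval of exactly zero and therefore a tie.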
The bootstrap is seeded, so results are reproducible. In code:
from aieval import pairwise
from aieval.backends import get_backend
backend = get_backend()
for r in pairwise(backend.get_results(run_a), backend.get_results(run_b)):
    print(r.scorer, r.mean_diff, (r.ci_low, r.ci_high), r.winner)

Each run is tagged with the git SHA and the dataset content version. Comparing runs over different datasets is possible but rarely meaningful, since the examples no longer line up by index. Keep the dataset stable across a comparison, or register dataset versions so drift is at least auditable.