Comparing Runs

sarmakska edited this page May 31, 2026 · 1 revision

Once you have two runs, you can compare them in three ways: a regression diff, a CI gate, and a paired A/B test with a confidence interval. All three pair examples by index and compare only the scorers present in both runs.

Regression diff

uv run aieval diff <run_a> <run_b>

This prints the per-scorer mean deltas (candidate minus baseline) and, for each scorer, the examples that regressed most. The same view is available in the viewer at /diff/<run_a>/<run_b>, where movers in both directions are colour-coded.

In code:

from aieval import compare
from aieval.backends import get_backend

# run_a and run_b are the IDs of two stored runs.
backend = get_backend()
report = compare(backend.get_results(run_a), backend.get_results(run_b))
# One entry per scorer present in both runs.
for d in report.scorer_deltas:
    print(d.scorer, d.delta)
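
To make the arithmetic behind the diff concrete, here is a minimal standalone sketch of index-paired mean deltas. It is not aieval's implementation: the data shape (one dict of scorer name to score per example) and the helper names `scorer_deltas` and `worst_regressions` are assumptions made for illustration.

```python
from statistics import mean


def scorer_deltas(baseline, candidate):
    """Per-scorer mean delta (candidate minus baseline), pairing examples by index.

    Assumed shape: each run is a list with one {scorer_name: score} dict per example.
    """
    n = min(len(baseline), len(candidate))  # assumption: extra trailing examples are ignored
    if n == 0:
        return {}
    # Keep only scorers that appear on every paired example in both runs.
    shared = set.intersection(*(set(row) for row in baseline[:n] + candidate[:n]))
    return {
        s: mean(candidate[i][s] - baseline[i][s] for i in range(n))
        for s in sorted(shared)
    }


def worst_regressions(baseline, candidate, scorer, k=3):
    """Indices of the k examples whose score on `scorer` dropped the most."""
    n = min(len(baseline), len(candidate))
    diffs = [(candidate[i][scorer] - baseline[i][scorer], i) for i in range(n)]
    return [i for d, i in sorted(diffs) if d < 0][:k]


baseline = [{"acc": 1.0, "f1": 0.5}, {"acc": 1.0, "f1": 0.5}]
candidate = [{"acc": 0.0, "f1": 0.75, "extra": 1.0}, {"acc": 1.0, "f1": 0.75}]
deltas = scorer_deltas(baseline, candidate)
regressed = worst_regressions(baseline, candidate, "acc")
```

In this toy input the `extra` scorer exists only in the candidate run, so it is left out of the result, which mirrors the rule that only scorers present in both runs are compared.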

CI gate

uv run aieval ci <run_id> --threshold 0.05

The gate compares the candidate run against a baseline and exits non-zero when any scorer's mean drops by more than the threshold. By default the baseline is the previous run of the same name; pass --baseline <run_id> to pin it. If there is no baseline the gate passes, so the first run on a new eval never blocks.
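
The pass/fail rule described above can be sketched as a small function. This is an illustration only, not the aieval source: the function name `gate`, the dict-of-means inputs, and the reading of the threshold as an absolute drop in the mean are all assumptions.

```python
def gate(baseline_means, candidate_means, threshold=0.05):
    """Return a process exit code: 0 to pass, 1 to fail.

    Assumed inputs: {scorer_name: mean_score} dicts, or None when no baseline exists.
    """
    if baseline_means is None:
        # No baseline yet: the first run on a new eval never blocks.
        return 0
    # Only scorers present in both runs are compared.
    for scorer in baseline_means.keys() & candidate_means.keys():
        drop = baseline_means[scorer] - candidate_means[scorer]
        if drop > threshold:
            return 1
    return 0


first_run = gate(None, {"acc": 0.1})
big_drop = gate({"acc": 0.9}, {"acc": 0.8})
small_drop = gate({"acc": 0.9}, {"acc": 0.88})
improved = gate({"acc": 0.5}, {"acc": 0.9})
```

Improvements never trip the gate in this sketch, because only a positive drop is compared against the threshold.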

Wire it into a workflow after your eval:

- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05

Pairwise A/B

uv run aieval pairwise <run_a> <run_b> --confidence 0.95 --iterations 2000

A single mean difference does not tell you whether a result is real or just noise. The pairwise command runs a paired bootstrap: it resamples the per-example differences (B minus A) with replacement many times, recomputes the mean difference on each resample, and reports a percentile confidence interval. The winner column reads:

  • b when the whole interval sits above zero (B genuinely beats A)
  • a when the whole interval sits below zero (A genuinely beats B)
  • tie when the interval straddles zero (the data cannot support a winner)

The bootstrap is seeded, so results are reproducible. In code:

from aieval import pairwise
from aieval.backends import get_backend

backend = get_backend()
for r in pairwise(backend.get_results(run_a), backend.get_results(run_b)):
    print(r.scorer, r.mean_diff, (r.ci_low, r.ci_high), r.winner)
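
For readers who want to see the paired bootstrap itself, the following is a minimal standard-library sketch for a single scorer. It is an assumption-laden illustration rather than aieval's code: the function name `paired_bootstrap`, the default seed, and the exact percentile indexing are choices made here.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, confidence=0.95, iterations=2000, seed=0):
    """Percentile bootstrap CI for the mean of per-example differences (B minus A)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must be non-empty and the same length to pair by index")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(iterations))
    alpha = 1.0 - confidence
    ci_low = means[int((alpha / 2) * (iterations - 1))]
    ci_high = means[int((1 - alpha / 2) * (iterations - 1))]
    if ci_low > 0:
        winner = "b"
    elif ci_high < 0:
        winner = "a"
    else:
        winner = "tie"
    return mean(diffs), ci_low, ci_high, winner


b_wins = paired_bootstrap([0.0] * 10, [1.0] * 10)
a_wins = paired_bootstrap([1.0] * 10, [0.0] * 10)
no_winner = paired_bootstrap([0.5] * 10, [0.5] * 10)
```

Resampling the differences, rather than each run separately, is what makes the test paired: every resample keeps each example's A and B scores together, so per-example difficulty cancels out.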

What makes runs comparable

Each run is tagged with the git SHA and the dataset content version. Comparing runs over different datasets is possible but rarely meaningful, since the examples no longer line up by index. Keep the dataset stable across a comparison, or register dataset versions so drift is at least auditable.
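
One way to picture a dataset content version is as a hash of the examples, checked before any comparison. The sketch below is hypothetical: `dataset_version`, `comparable`, and the `dataset_version` metadata key are illustrative names, and aieval may compute and store its version differently.

```python
import hashlib
import json


def dataset_version(examples):
    """Short content hash: unchanged examples in the same order give the same version."""
    payload = json.dumps(examples, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def comparable(meta_a, meta_b):
    """True when both runs were scored against the same dataset content."""
    return meta_a["dataset_version"] == meta_b["dataset_version"]


examples = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
v_same = dataset_version(list(examples))
v_reordered = dataset_version(list(reversed(examples)))
ok = comparable({"dataset_version": v_same}, {"dataset_version": dataset_version(examples)})
drifted = comparable({"dataset_version": v_same}, {"dataset_version": v_reordered})
```

The hash is order-sensitive on purpose in this sketch: because comparisons pair examples by index, a reordered dataset breaks the pairing just as much as an edited one.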
