sarmakska edited this page May 31, 2026 · 4 revisions

ai-eval-runner

Evals as code. Datasets, scorers, judges, regression gates and bootstrapped A/B in one Python CLI.

Built by Sarma Linux. MIT licence.


What this is

ai-eval-runner treats LLM evaluation the way you treat backend tests. You write datasets as JSONL and scorers as plain Python functions, run them with one command, and every result is persisted keyed by git SHA and dataset version. A built-in viewer gives you per-example traces and a regression diff, a CI mode gates pull requests on score deltas, and a paired bootstrap tells you whether one run genuinely beats another. It is self-hosted, has no third-party platform dependency, and runs on local SQLite or DuckDB with zero infrastructure.

Who this is for

  • ML engineers shipping prompt and model changes who want a repeatable eval suite instead of eyeballing samples.
  • Backend teams that want an eval gate in CI without buying into a hosted platform.
  • Anyone choosing between two models who wants a confidence interval, not a single number.

Architecture

graph LR
  D[(Dataset<br/>JSONL or list)] --> R[Runner]
  S[Scorers<br/>plain Python + judge] --> R
  R -->|parallel provider calls| L[LLM provider<br/>SarmaLink / OpenAI]
  R -->|spans| OT[OpenTelemetry]
  R --> B[(Backend<br/>SQLite / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CMP[Compare: diff / CI gate / pairwise]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L,OT ext

The runner is the centre of the system. It loads a dataset, fans out provider calls in parallel, applies each scorer to every prediction, emits OpenTelemetry spans, and writes results to the configured backend tagged with the git SHA and dataset version. Everything else hangs off that store:

| Component | File | Responsibility |
| --- | --- | --- |
| CLI | aieval/cli.py | Typer commands: run, list, view, diff, ci, pairwise |
| Runner | aieval/core/runner.py | Parallel execution with retry, scoring, telemetry and storage |
| Dataset | aieval/core/dataset.py | jsonl() and from_list() loaders, version() and DatasetRegistry |
| Scorer | aieval/core/scorer.py | The @scorer decorator and context injection |
| Regression | aieval/core/regression.py | Per-scorer deltas and per-example diffs |
| Pairwise | aieval/core/pairwise.py | Paired bootstrap with confidence intervals |
| Telemetry | aieval/core/telemetry.py | OpenTelemetry attribute capture with a no-op fallback |
| Built-in scorers | aieval/scorers/ | exact_match, json_valid, rouge_l, llm_judge |
| Backends | aieval/backends/ | SQLite and DuckDB result stores |
| Providers | aieval/providers/ | SarmaLink and OpenAI-compatible adapters |
| Viewer | aieval/viewer/app.py | FastAPI plus HTMX run browser and diff view |

Why each piece

Plain Python scorers rather than a custom DSL, because Python is enough. You get full IDE support, real debuggers, and no leaky abstraction between what you want to measure and how you express it. Scorers that need more context (the LLM judge, for example) declare extra keyword parameters and the runner fills them.
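
To make the idea of "declare extra keyword parameters and the runner fills them" concrete, here is a minimal sketch of signature-based context injection. It is not the actual aieval implementation: `call_scorer`, `mentions_source` and the `context` keys are hypothetical names chosen for illustration, and the sketch assumes the first two parameters of every scorer are the prediction and the expected value.

```python
import inspect
from typing import Any, Callable


def call_scorer(fn: Callable[..., float], prediction: str, expected: str,
                context: dict[str, Any]) -> float:
    """Call a scorer, passing only the context keys it declares by name."""
    params = list(inspect.signature(fn).parameters.values())
    # Assumption: the first two parameters are (prediction, expected).
    extra_names = [p.name for p in params[2:]]
    extras = {name: context[name] for name in extra_names if name in context}
    return fn(prediction, expected, **extras)


def exact(prediction: str, expected: str) -> float:
    # A scorer that needs no extra context.
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def mentions_source(prediction: str, expected: str, *, source: str) -> float:
    # Hypothetical scorer that asks for the source document by name.
    return 1.0 if source.split()[0].lower() in prediction.lower() else 0.0


context = {"source": "Bootstrap resampling explained", "model": "smart"}
plain_score = call_scorer(exact, "yes", "yes ", context)
context_score = call_scorer(mentions_source, "A note on bootstrap.", "", context)
```

The point of the design is that a simple scorer never sees context it did not ask for, while a richer scorer opts in just by naming a parameter.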

Parallel provider calls because eval runs are I/O-bound on the model endpoint. The runner issues calls concurrently and retries transient failures, so a run of a few hundred examples finishes in the time of the slowest batch rather than the sum of every call.
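
The fan-out-with-retry pattern described above can be sketched with the standard library alone. This is an illustration under assumptions, not the runner's real code: `call_with_retry`, `run_parallel` and `flaky_provider` are hypothetical names, and treating `ConnectionError` as the transient failure is a simplification.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def call_with_retry(fn: Callable[[str], str], prompt: str,
                    attempts: int = 3, base_delay: float = 0.01) -> str:
    """Retry fn(prompt) on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn(prompt)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


def run_parallel(fn: Callable[[str], str], prompts: list[str],
                 max_workers: int = 8) -> list[str]:
    """Issue calls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: call_with_retry(fn, p), prompts))


_seen: set[str] = set()
_lock = threading.Lock()


def flaky_provider(prompt: str) -> str:
    # Stand-in for a model endpoint: fails the first time it sees each prompt.
    with _lock:
        first_time = prompt not in _seen
        _seen.add(prompt)
    if first_time:
        raise ConnectionError("transient failure")
    return prompt.upper()


predictions = run_parallel(flaky_provider, ["a", "b", "c"])
```

Threads suit this workload because each worker spends most of its time waiting on the network rather than holding the interpreter.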

Versioning by git SHA plus dataset version so two runs are only comparable when the code and the data both match. A new prompt changes the SHA, a changed dataset changes the version, and the viewer, diff, CI gate and pairwise comparison all rely on that to avoid comparing apples to oranges.
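
One plausible way to derive the two identifiers is a content hash of the dataset plus `git rev-parse HEAD`. The sketch below is an assumption about how such tagging could work, not a description of `version()` in aieval; the function names, the 12-character truncation and the row field names are all hypothetical.

```python
import hashlib
import json
import subprocess


def dataset_version(rows: list[dict]) -> str:
    """Short content hash that changes whenever any row changes."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def git_sha() -> str:
    """Current commit, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (subprocess.CalledProcessError, OSError):
        return "unknown"


# Hypothetical rows; the real dataset schema may differ.
rows = [{"input": "doc one", "expected": "summary one"}]
edited = [{"input": "doc one", "expected": "summary 1"}]

version_before = dataset_version(rows)
version_after = dataset_version(edited)
run_tag = {"git_sha": git_sha(), "dataset_version": version_before}
```

Because the hash is computed from the row contents, renaming the file leaves the version unchanged while editing a single expected value produces a new one.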

A paired bootstrap for A/B because a single mean difference does not tell you whether a result is real. Resampling the per-example differences yields a percentile interval, and a winner is declared only when that interval excludes zero.
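
For readers unfamiliar with the technique, a paired percentile bootstrap fits in a few lines. This is a generic sketch, not the code in aieval/core/pairwise.py: the function name, the default of 2000 resamples and the fixed seed are assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> tuple[float, float, float]:
    """Return (observed mean difference, CI low, CI high) for b minus a."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must score the same non-empty set of examples")
    # Pairing: each difference comes from the same example in both runs.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    return mean(diffs), low, high


# Made-up per-example scores for two runs over the same eight examples.
scores_a = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52]
scores_b = [0.68, 0.60, 0.71, 0.57, 0.70, 0.66, 0.75, 0.61]
observed, ci_low, ci_high = paired_bootstrap(scores_a, scores_b)
```

Pairing matters because it removes per-example difficulty from the comparison: a hard example drags both runs down, so only the difference between them is resampled.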

SQLite by default, DuckDB when you want analytics, both file-based so there is no server to stand up. Set AIEVAL_BACKEND and AIEVAL_DB_PATH to switch.

Real-world examples

Score with a built-in metric and an LLM judge

This is the bundled examples/summarisation/eval.py, scoring each summary three ways:

from aieval import dataset, run, scorer
from aieval.core.dataset import DatasetRegistry
from aieval.scorers import llm_judge, rouge_l


@scorer
def length_under_120_words(prediction: str, _expected: str) -> float:
    return 1.0 if len(prediction.split()) <= 120 else 0.0


faithful = llm_judge(
    rubric="Reward summaries that are faithful to the source and omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)


if __name__ == "__main__":
    rows = list(dataset.jsonl("examples/summarisation/dataset.jsonl"))
    DatasetRegistry().register("summarisation", rows)
    run(
        name="summarisation",
        dataset=rows,
        scorers=[rouge_l, length_under_120_words, faithful],
        provider="sarmalink",
        model="smart",
    )

Gate a pull request on regression

Add an eval workflow alongside your normal CI so a prompt change cannot merge if a scorer drops:

name: evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync
      - run: uv run aieval run examples/summarisation/eval.py
      - run: |
          RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
          uv run aieval ci "$RUN_ID" --threshold 0.05

The ci command compares the latest run against the previous run of the same name and exits non-zero when any scorer regresses past the threshold.
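
The gating rule can be pictured as a comparison of per-scorer means. The sketch below is one reading of "regresses past the threshold" and is not taken from aieval/core/regression.py: `find_regressions`, `exit_code_for` and the choice of a strict inequality are assumptions, and the scores are made up.

```python
def find_regressions(previous: dict[str, float], current: dict[str, float],
                     threshold: float) -> dict[str, float]:
    """Return {scorer: delta} for scorers that dropped by more than threshold."""
    failed = {}
    for name, old_mean in previous.items():
        if name not in current:
            continue  # scorer absent from the new run: nothing to compare
        delta = current[name] - old_mean
        if delta < -threshold:
            failed[name] = delta
    return failed


def exit_code_for(failed: dict[str, float]) -> int:
    """Non-zero when anything regressed, so CI marks the job as failed."""
    return 1 if failed else 0


previous = {"rouge_l": 0.62, "faithfulness": 0.90}
current = {"rouge_l": 0.61, "faithfulness": 0.80}

failed = find_regressions(previous, current, threshold=0.05)
exit_code = exit_code_for(failed)
```

Here the small rouge_l drop stays inside the tolerance while the faithfulness drop of about 0.10 exceeds it, so the gate would fail the pull request.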

Decide whether a new model wins

uv run aieval pairwise <old_run_id> <new_run_id>

You get a per-scorer table with the observed mean difference and a 95% confidence interval. The winner column reads b only when the whole interval sits above zero, so you never call a difference the data cannot support.
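
The decision rule for the winner column reduces to checking where the interval sits relative to zero. In this sketch, `pick_winner` is a hypothetical helper, the interval is assumed to describe run b minus run a, and the "a" label for the opposite case is an assumption; the source only confirms the b and tie outcomes.

```python
def pick_winner(ci_low: float, ci_high: float) -> str:
    """Map a confidence interval on (b minus a) to a winner label."""
    if ci_low > 0:
        return "b"    # whole interval above zero
    if ci_high < 0:
        return "a"    # whole interval below zero (assumed label)
    return "tie"      # interval straddles zero


clear_win = pick_winner(0.02, 0.09)
unclear = pick_winner(-0.01, 0.06)
```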

Troubleshooting

aieval run printed nothing and aieval list shows no run. Your eval file must call run(...) when executed. The bundled example does this under if __name__ == "__main__":, and the CLI runs the file as the main module so that guard fires. If you put the run(...) call inside a function that is never called, nothing happens.
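
The effect of running a file "as the main module" can be reproduced with the standard library's `runpy`. Whether the aieval CLI uses `runpy` internally is an assumption; the sketch only demonstrates why the `__main__` guard fires in one case and not the other, using a throwaway file with a stand-in `run` function.

```python
import runpy
import tempfile
from pathlib import Path

# A tiny eval file with a stand-in run() that just records its argument.
eval_source = '''
RESULTS = []

def run(name):
    RESULTS.append(name)

if __name__ == "__main__":
    run("summarisation")
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval.py"
    path.write_text(eval_source)
    # Executed as the main module: the guard is true and run() is called.
    as_main = runpy.run_path(str(path), run_name="__main__")
    # Executed under another name: the guard is false and nothing runs.
    as_module = runpy.run_path(str(path), run_name="eval")

ran_as_main = as_main["RESULTS"]
ran_as_module = as_module["RESULTS"]
```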

uv sync fails to determine wheel contents. The distribution name ai-eval-runner does not match the package directory src/aieval, so hatchling needs an explicit target. This is already configured under [tool.hatch.build.targets.wheel] in pyproject.toml.

aieval command not found. The console script is installed into the project virtualenv, so run it through uv: uv run aieval ....

Viewer starts but shows no runs. The viewer reads from the same backend the runner writes to. Confirm AIEVAL_BACKEND and AIEVAL_DB_PATH point at the same database in both shells. By default this is SQLite at ./aieval.db.

Results land in the wrong database during tests. The backend is cached as a singleton per process. In tests, set AIEVAL_DB_PATH before the first call and reset the singleton (aieval.backends._singleton = None), as the test fixtures do.
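
The caching behaviour behind this symptom is easy to reproduce in isolation. The snippet below is a stand-in, not the aieval code: its `get_backend` and `_singleton` only mimic the shape of a module-level cache, and the dictionary stands in for a real backend object.

```python
import os

_singleton = None  # stand-in for a module-level cache such as aieval.backends._singleton


def get_backend() -> dict:
    """Hypothetical accessor: builds the backend once, then reuses it."""
    global _singleton
    if _singleton is None:
        _singleton = {"path": os.environ.get("AIEVAL_DB_PATH", "./aieval.db")}
    return _singleton


os.environ["AIEVAL_DB_PATH"] = "/tmp/first.db"
first = get_backend()["path"]

os.environ["AIEVAL_DB_PATH"] = "/tmp/second.db"
stale = get_backend()["path"]   # still the first path: the cache wins

_singleton = None               # reset the cache
fresh = get_backend()["path"]   # now the new path is picked up
```

Changing the environment variable after the first call has no effect until the cache is cleared, which is why the order of operations in a test fixture matters.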

Provider calls fail with an auth error. Copy .env.example to .env and set SARMALINK_API_KEY or OPENAI_API_KEY for whichever provider you pass to run.

No telemetry spans appear. OpenTelemetry capture is behind the optional otel extra. Install it with uv sync --extra otel and configure an exporter (for example OTEL_EXPORTER_OTLP_ENDPOINT). Without the extra the runner records attributes internally but exports nothing.
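
A no-op fallback of this kind is commonly built around a guarded import. The following is a sketch of the pattern rather than the contents of aieval/core/telemetry.py: the `span` helper and the tracer name are hypothetical, and only `trace.get_tracer`, `start_as_current_span` and `set_attribute` come from the OpenTelemetry API.

```python
from contextlib import contextmanager

try:
    from opentelemetry import trace  # present only when the optional extra is installed
    _tracer = trace.get_tracer("aieval-sketch")
except ImportError:
    _tracer = None  # no-op mode: attributes are kept locally, nothing is exported


@contextmanager
def span(name: str, **attributes):
    """Record attributes locally and mirror them onto a real span if available."""
    recorded = dict(attributes)
    if _tracer is None:
        yield recorded
        return
    with _tracer.start_as_current_span(name) as otel_span:
        for key, value in recorded.items():
            otel_span.set_attribute(key, value)
        yield recorded


with span("provider_call", model="smart", example_index=3) as attrs:
    captured = dict(attrs)
```

The calling code is identical in both modes, so installing the extra later changes what is exported without touching the runner.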

Pairwise reports a tie even though one run looks better. A tie means the confidence interval straddles zero, so the data does not support a clear winner. Add more examples to tighten the interval, and do not trust the difference until the interval excludes zero.

Wiki pages

  • Architecture - component breakdown, run lifecycle and design rationale
  • Quick-Start - install, first eval run and CI integration
  • Scorers - built-in scorers, writing your own, and the LLM judge
  • Comparing-Runs - diff, the CI gate and pairwise A/B
  • Telemetry - OpenTelemetry attribute capture
  • Roadmap - what is shipped and what is next

Repository

github.com/sarmakska/ai-eval-runner
