Home
Evals as code. Datasets, scorers, traces and CI regression gates in one Python CLI.
Built by Sarma Linux. MIT licence.
ai-eval-runner treats LLM evaluation the way you treat backend tests. You write datasets as JSONL and scorers as plain Python functions, run them with one command, and every result is persisted keyed by git SHA and dataset hash. A built-in viewer gives you per-example traces, and a CI mode gates pull requests on regression deltas. It is self-hosted, has no third-party platform dependency, and runs on local SQLite or DuckDB with zero infrastructure.
- ML engineers shipping prompt and model changes who want a repeatable eval suite instead of eyeballing samples.
- Backend teams that want an eval gate in CI without buying into a hosted platform.
- Anyone who wants runs, scores and traces in one self-hosted tool with no vendor lock-in.
```mermaid
graph LR
    D[(Dataset<br/>JSONL or list)] --> R[Runner]
    S[Scorers<br/>plain Python] --> R
    R -->|parallel provider calls| L[LLM provider<br/>SarmaLink / OpenAI]
    R --> B[(Backend<br/>SQLite / DuckDB)]
    B --> V[FastAPI + HTMX viewer]
    B --> CI[CI regression gate]
    classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
    class L ext
```
The runner is the centre of the system. It loads a dataset, fans out provider calls in parallel, applies each scorer to every prediction, and writes results to the configured backend. Everything else hangs off that:
| Component | File | Responsibility |
|---|---|---|
| CLI | aieval/cli.py | Typer commands: run, list, view, diff, ci |
| Runner | aieval/core/runner.py | Parallel execution with retry, then scoring and storage |
| Dataset | aieval/core/dataset.py | jsonl() file loader and from_list() in-process loader |
| Scorer | aieval/core/scorer.py | The @scorer decorator and scorer protocol |
| Built-in scorers | aieval/scorers/ | exact_match, json_valid, rouge_l |
| Backends | aieval/backends/ | SQLite and DuckDB result stores |
| Providers | aieval/providers/ | SarmaLink and OpenAI-compatible adapters |
| Viewer | aieval/viewer/app.py | FastAPI plus HTMX run browser |
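The run lifecycle described above (load, predict in parallel, score, store) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in aieval/core/runner.py: `run_sketch`, `fake_provider`, the stand-in `exact_match` and the `results` table layout are all hypothetical.

```python
import json
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Example = dict[str, str]
Scorer = Callable[[str, str], float]


def exact_match(prediction: str, expected: str) -> float:
    """Stand-in scorer: 1.0 when the strings match after trimming."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def fake_provider(prompt: str) -> str:
    """Hypothetical provider stub; a real one would call the model endpoint."""
    return prompt.upper()


def run_sketch(
    examples: list[Example],
    scorers: list[Scorer],
    predict: Callable[[str], str],
    conn: sqlite3.Connection,
    max_workers: int = 4,
) -> int:
    """Predict in parallel, score every prediction, store one row per score."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(example_idx INTEGER, scorer TEXT, score REAL, prediction TEXT)"
    )
    # Fan out provider calls; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        predictions = list(pool.map(predict, [ex["prompt"] for ex in examples]))
    # Apply each scorer to every prediction.
    rows = [
        (idx, fn.__name__, fn(pred, ex["expected"]), pred)
        for idx, (ex, pred) in enumerate(zip(examples, predictions))
        for fn in scorers
    ]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    jsonl_text = '{"prompt": "hi", "expected": "HI"}\n{"prompt": "yo", "expected": "no"}\n'
    examples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    conn = sqlite3.connect(":memory:")
    written = run_sketch(examples, [exact_match], fake_provider, conn)
    print(written, conn.execute("SELECT example_idx, score FROM results").fetchall())
```

Keeping the database writes on the calling thread, after the pool has finished, avoids sharing a SQLite connection across worker threads.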
Plain Python scorers rather than a custom DSL, because Python is enough. You get full IDE support, real debuggers, and no leaky abstraction between what you want to measure and how you express it.
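Because a scorer is an ordinary function, it can be exercised with plain assertions before it ever touches a model. The sketch below assumes the `(prediction, expected) -> float` shape used by the custom scorer example on this page; `keyword_coverage` is a hypothetical scorer, and the @scorer decorator is left off so the snippet runs without the package installed.

```python
def keyword_coverage(prediction: str, expected: str) -> float:
    """Fraction of comma-separated expected keywords found in the prediction."""
    keywords = [k.strip().lower() for k in expected.split(",") if k.strip()]
    if not keywords:
        return 0.0
    text = prediction.lower()
    return sum(k in text for k in keywords) / len(keywords)


if __name__ == "__main__":
    # A scorer can be checked like any other function.
    assert keyword_coverage("Refund issued to card", "refund, card") == 1.0
    assert keyword_coverage("Refund issued", "refund, card") == 0.5
    print("scorer checks passed")
```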
Parallel provider calls because eval runs are I/O-bound on the model endpoint. The runner issues calls concurrently and retries transient failures, so a run of a few hundred examples finishes in the time of the slowest batch rather than the sum of every call.
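One way to combine concurrency with retries is a thread pool wrapped around a retrying call. This is a minimal sketch, not the runner's actual implementation: `TransientError`, `with_retry`, `run_parallel` and `flaky_provider` are hypothetical names, and the exponential backoff schedule is an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(
    call: Callable[[str], str], attempts: int = 3, base_delay: float = 0.01
) -> Callable[[str], str]:
    """Wrap a provider call so transient failures are retried with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapped(prompt: str) -> str:
        for attempt in range(1, attempts + 1):
            try:
                return call(prompt)
            except TransientError:
                if attempt == attempts:
                    raise  # out of attempts: surface the failure
                time.sleep(base_delay * 2 ** (attempt - 1))
        raise RuntimeError("unreachable")

    return wrapped


def run_parallel(
    prompts: list[str], call: Callable[[str], str], concurrency: int = 8
) -> list[str]:
    """Issue calls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(with_retry(call), prompts))


_seen: set[str] = set()
_lock = threading.Lock()


def flaky_provider(prompt: str) -> str:
    """Stub that fails the first time it sees a prompt, then succeeds."""
    with _lock:
        first_time = prompt not in _seen
        _seen.add(prompt)
    if first_time:
        raise TransientError(prompt)
    time.sleep(0.05)  # simulated network latency
    return f"echo:{prompt}"


if __name__ == "__main__":
    print(run_parallel(["a", "b", "c"], flaky_provider, concurrency=3))
```

Threads suit this workload because each call spends its time waiting on the network rather than holding the interpreter.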
Versioning by git SHA plus dataset hash so two runs are only comparable when the code and the data both match. A new prompt changes the SHA, a changed dataset changes the hash, and the viewer and CI mode use that to avoid comparing apples to oranges.
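A run key of this kind could be derived as follows. This is a sketch under assumptions: the helper names are hypothetical, SHA-256 over the raw dataset bytes is one plausible hashing choice, and the tool may well hash a normalised form instead.

```python
import hashlib
import subprocess
from pathlib import Path


def dataset_hash(data: bytes) -> str:
    """Content hash of the dataset; any byte-level change yields a new hash."""
    return hashlib.sha256(data).hexdigest()


def git_sha(repo_dir: Path = Path(".")) -> str:
    """Current commit SHA, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def run_key(sha: str, data: bytes) -> tuple[str, str]:
    """Identify a run by the code version and the dataset content."""
    return (sha, dataset_hash(data))


if __name__ == "__main__":
    sample = b'{"prompt": "Customer wants their money back", "expected": "refund"}\n'
    print(run_key(git_sha(), sample))
```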
SQLite by default, DuckDB when you want analytics, both file-based so there is no server to stand up. Set AIEVAL_BACKEND and AIEVAL_DB_PATH in your environment to switch.
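Reading those two variables might look like the sketch below. The variable names come from this page and the default path ./aieval.db from the troubleshooting notes; the accepted values "sqlite" and "duckdb", the `BackendConfig` type and `load_backend_config` are assumptions for illustration.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendConfig:
    kind: str
    db_path: str


def load_backend_config(env: Optional[Mapping[str, str]] = None) -> BackendConfig:
    """Resolve the backend from the environment, defaulting to local SQLite."""
    env = os.environ if env is None else env
    kind = env.get("AIEVAL_BACKEND", "sqlite").lower()
    if kind not in {"sqlite", "duckdb"}:  # assumed set of valid values
        raise ValueError(f"unsupported backend: {kind!r}")
    return BackendConfig(kind=kind, db_path=env.get("AIEVAL_DB_PATH", "./aieval.db"))


if __name__ == "__main__":
    print(load_backend_config({}))
    print(load_backend_config({"AIEVAL_BACKEND": "duckdb", "AIEVAL_DB_PATH": "evals.duckdb"}))
```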
Add an eval workflow alongside your normal CI so a prompt change cannot merge if a scorer drops:
```yaml
name: evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync
      - run: uv run aieval run examples/summarisation/eval.py
      - run: uv run aieval ci --baseline main --threshold 0.05
```

The ci command compares the current run against the baseline on main and exits non-zero when any scorer regresses past the threshold.
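The gate logic can be pictured as a comparison of per-scorer means. The sketch below assumes the threshold is an absolute drop in mean score, which this page does not specify; `regressions` and `ci_exit_code` are hypothetical helpers, not the CLI's internals.

```python
import sys


def regressions(
    baseline: dict[str, float], current: dict[str, float], threshold: float
) -> dict[str, float]:
    """Return each scorer whose mean score dropped by more than the threshold."""
    drops: dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue  # scorer not present in the current run; nothing to compare
        drop = base_score - current[name]
        if drop > threshold:
            drops[name] = drop
    return drops


def ci_exit_code(
    baseline: dict[str, float], current: dict[str, float], threshold: float
) -> int:
    """Non-zero when at least one scorer regressed, so the CI job fails."""
    return 1 if regressions(baseline, current, threshold) else 0


if __name__ == "__main__":
    baseline = {"json_valid": 0.98, "rouge_l": 0.61}
    current = {"json_valid": 0.97, "rouge_l": 0.50}
    for name, drop in regressions(baseline, current, 0.05).items():
        print(f"regression in {name}: -{drop:.3f}")
    sys.exit(ci_exit_code(baseline, current, 0.05))
```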
```python
# JSON extraction eval: scores each model output with the built-in json_valid scorer.
from aieval import dataset, run
from aieval.scorers import json_valid

if __name__ == "__main__":
    run(
        name="extraction-json",
        dataset=dataset.jsonl("data/extraction.jsonl"),
        scorers=[json_valid],
        provider="openai",
        model="gpt-4o-mini",
    )
```

```python
# Smoke test with a custom scorer and an in-process dataset.
from aieval import dataset, run, scorer


@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    # 1.0 when the prediction mentions a refund, otherwise 0.0.
    return 1.0 if "refund" in prediction.lower() else 0.0


if __name__ == "__main__":
    run(
        name="support-smoke",
        dataset=dataset.from_list([
            {"prompt": "Customer wants their money back", "expected": "refund"},
        ]),
        scorers=[mentions_refund],
        provider="sarmalink",
        model="smart",
    )
```

uv sync fails to determine wheel contents. The distribution name ai-eval-runner does not match the package directory src/aieval, so hatchling needs an explicit target. This is already configured under [tool.hatch.build.targets.wheel] in pyproject.toml. If you forked an older revision, add packages = ["src/aieval"] there.
aieval command not found. The console script is installed into the project virtualenv, so run it through uv: uv run aieval .... The bare aieval command will not resolve outside that environment.
Viewer starts but shows no runs. The viewer reads from the same backend the runner writes to. Confirm AIEVAL_BACKEND and AIEVAL_DB_PATH point at the same database in both shells. By default this is SQLite at ./aieval.db.
Results land in the wrong database during tests. The backend is cached as a singleton for the process. In tests, set AIEVAL_DB_PATH before the first call and reset the singleton (aieval.backends._singleton = None) so a fresh path is honoured, as the smoke tests do.
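The caching behaviour behind this symptom can be reproduced with a stand-in module-level cache. This is a sketch of the general pattern, not the code in aieval/backends: `Backend`, `get_backend` and `reset_backend` are hypothetical, with `reset_backend` playing the role of assigning None to the real singleton.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    db_path: str


_singleton: Optional[Backend] = None


def get_backend() -> Backend:
    """Create the backend on first use, then keep returning the cached one."""
    global _singleton
    if _singleton is None:
        _singleton = Backend(os.environ.get("AIEVAL_DB_PATH", "./aieval.db"))
    return _singleton


def reset_backend() -> None:
    """Drop the cached backend so the next call re-reads the environment."""
    global _singleton
    _singleton = None


if __name__ == "__main__":
    os.environ["AIEVAL_DB_PATH"] = "first.db"
    reset_backend()
    print(get_backend().db_path)  # path read when the backend is created

    os.environ["AIEVAL_DB_PATH"] = "second.db"
    print(get_backend().db_path)  # still the cached path: the change is ignored

    reset_backend()
    print(get_backend().db_path)  # new path honoured after the reset
```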
Provider calls fail with an auth error. Copy .env.example to .env and set SARMALINK_API_KEY or OPENAI_API_KEY for whichever provider you pass to run.
A run looks slower than expected. The runner is parallel but bounded by provider rate limits. Raise the concurrency argument on run if your endpoint allows more in-flight requests.
- Architecture - component breakdown, run lifecycle and design rationale
- Quick-Start - install, first eval run and CI integration
- Roadmap - what is shipped and what is next