Home

ai-eval-runner

Evals as code. Datasets, scorers, traces and CI regression gates in one Python CLI.

Built by Sarma Linux. MIT licence.

What this is

ai-eval-runner treats LLM evaluation the way you treat backend tests. You write datasets as JSONL and scorers as plain Python functions, run them with one command, and every result is persisted, keyed by git SHA and dataset hash. A built-in viewer gives you per-example traces, and a CI mode gates pull requests on regression deltas. It is self-hosted, has no third-party platform dependency, and runs on local SQLite or DuckDB with zero infrastructure.

Who this is for

ML engineers shipping prompt and model changes who want a repeatable eval suite instead of eyeballing samples.
Backend teams that want an eval gate in CI without buying into a hosted platform.
Anyone who wants runs, scores and traces in one self-hosted tool with no vendor lock-in.

Architecture

graph LR
  D[(Dataset<br/>JSONL or list)] --> R[Runner]
  S[Scorers<br/>plain Python] --> R
  R -->|parallel provider calls| L[LLM provider<br/>SarmaLink / OpenAI]
  R --> B[(Backend<br/>SQLite / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CI[CI regression gate]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L ext

The runner is the centre of the system. It loads a dataset, fans out provider calls in parallel, applies each scorer to every prediction, and writes results to the configured backend. Everything else hangs off that:

| Component | File | Responsibility |
| --- | --- | --- |
| CLI | `aieval/cli.py` | Typer commands: `run`, `list`, `view`, `diff`, `ci` |
| Runner | `aieval/core/runner.py` | Parallel execution with retry, then scoring and storage |
| Dataset | `aieval/core/dataset.py` | `jsonl()` file loader and `from_list()` in-process loader |
| Scorer | `aieval/core/scorer.py` | The `@scorer` decorator and scorer protocol |
| Built-in scorers | `aieval/scorers/` | `exact_match`, `json_valid`, `rouge_l` |
| Backends | `aieval/backends/` | SQLite and DuckDB result stores |
| Providers | `aieval/providers/` | SarmaLink and OpenAI-compatible adapters |
| Viewer | `aieval/viewer/app.py` | FastAPI plus HTMX run browser |

Why each piece

Plain Python scorers rather than a custom DSL, because Python is enough. You get full IDE support, real debuggers, and no leaky abstraction between what you want to measure and how you express it.

Parallel provider calls because eval runs are I/O-bound on the model endpoint. The runner issues calls concurrently and retries transient failures, so a run of a few hundred examples takes roughly as long as its slowest batch of calls rather than the sum of every call.

Versioning by git SHA plus dataset hash so two runs are only comparable when the code and the data both match. A new prompt changes the SHA, a changed dataset changes the hash, and the viewer and CI mode use that to avoid comparing apples to oranges.
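One way to picture the version key is as a pair of values that must both match. The sketch below is an assumption about the shape of that key, not the project's storage schema: `RunKey` and `hash_dataset` are hypothetical names, and SHA-256 over the raw JSONL bytes is one plausible choice of dataset hash.

```python
import hashlib
from typing import NamedTuple


class RunKey(NamedTuple):
    """Hypothetical version key: the code revision plus the dataset contents."""
    git_sha: str
    dataset_hash: str


def hash_dataset(jsonl_bytes: bytes) -> str:
    """Hash the raw dataset bytes so any edit to the data changes the key."""
    return hashlib.sha256(jsonl_bytes).hexdigest()


data_v1 = b'{"prompt": "hi", "expected": "hello"}\n'
data_v2 = data_v1 + b'{"prompt": "bye", "expected": "goodbye"}\n'

# Same commit, different data: the keys differ, so the runs are not comparable.
key_a = RunKey("abc123", hash_dataset(data_v1))
key_b = RunKey("abc123", hash_dataset(data_v2))

# Same commit and same data: the keys match.
key_c = RunKey("abc123", hash_dataset(data_v1))
```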

SQLite by default, DuckDB when you want analytics, both file-based so there is no server to stand up. Set AIEVAL_BACKEND and AIEVAL_DB_PATH in your environment to switch.
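As a rough illustration of how the two variables could be resolved, the sketch below reads them with fallbacks. The function name `backend_settings` is hypothetical, the accepted values `sqlite` and `duckdb` are assumptions, and the default path `./aieval.db` is the SQLite default mentioned under Troubleshooting.

```python
import os


def backend_settings(env=None):
    """Resolve (backend, path) from environment variables, with defaults."""
    env = os.environ if env is None else env
    backend = env.get("AIEVAL_BACKEND", "sqlite").lower()  # assumed value names
    path = env.get("AIEVAL_DB_PATH", "./aieval.db")
    if backend not in {"sqlite", "duckdb"}:
        raise ValueError(f"unsupported backend: {backend!r}")
    return backend, path


# Nothing set: fall back to the SQLite default.
default_settings = backend_settings({})

# Both set: switch to DuckDB at a custom (illustrative) path.
duckdb_settings = backend_settings(
    {"AIEVAL_BACKEND": "duckdb", "AIEVAL_DB_PATH": "runs.duckdb"}
)
```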

Real-world examples

Gate a pull request on regression

Add an eval workflow alongside your normal CI so a prompt change cannot merge if a scorer drops:

name: evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync
      - run: uv run aieval run examples/summarisation/eval.py
      - run: uv run aieval ci --baseline main --threshold 0.05

The ci command compares the current run against the baseline on main and exits non-zero when any scorer regresses past the threshold.
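The gate logic can be sketched in a few lines. This is an illustration, not the `ci` command's implementation: it assumes the threshold is an absolute drop in a scorer's mean score, and the function name `regressions` and the score values are made up for the example.

```python
def regressions(baseline, current, threshold):
    """Return the scorers whose mean score dropped by more than `threshold`."""
    return {
        name: baseline[name] - current[name]
        for name in baseline
        if name in current and baseline[name] - current[name] > threshold
    }


# Illustrative mean scores per scorer, not real results.
baseline = {"rouge_l": 0.62, "json_valid": 0.98}
current = {"rouge_l": 0.55, "json_valid": 0.97}

failed = regressions(baseline, current, threshold=0.05)

# A CI wrapper would pass this to sys.exit so the job fails on regression.
exit_code = 1 if failed else 0
```

Here `rouge_l` drops by about 0.07, which is past the 0.05 threshold, while `json_valid` drops by about 0.01, so only `rouge_l` trips the gate.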

Score structured output for validity

from aieval import dataset, run
from aieval.scorers import json_valid

if __name__ == "__main__":
    run(
        name="extraction-json",
        dataset=dataset.jsonl("data/extraction.jsonl"),
        scorers=[json_valid],
        provider="openai",
        model="gpt-4o-mini",
    )

Build a dataset in process for a quick check

from aieval import dataset, run, scorer


@scorer
def mentions_refund(prediction: str, _expected: str) -> float:
    return 1.0 if "refund" in prediction.lower() else 0.0


if __name__ == "__main__":
    run(
        name="support-smoke",
        dataset=dataset.from_list([
            {"prompt": "Customer wants their money back", "expected": "refund"},
        ]),
        scorers=[mentions_refund],
        provider="sarmalink",
        model="smart",
    )

Troubleshooting

uv sync fails to determine wheel contents. The distribution name ai-eval-runner does not match the package directory src/aieval, so hatchling needs an explicit target. This is already configured under [tool.hatch.build.targets.wheel] in pyproject.toml. If you forked an older revision, add packages = ["src/aieval"] there.

aieval command not found. The console script is installed into the project virtualenv, so run it through uv: uv run aieval .... Running the bare aieval outside the environment will not resolve.

Viewer starts but shows no runs. The viewer reads from the same backend the runner writes to. Confirm AIEVAL_BACKEND and AIEVAL_DB_PATH point at the same database in both shells. By default this is SQLite at ./aieval.db.

Results land in the wrong database during tests. The backend is cached as a singleton for the process. In tests, set AIEVAL_DB_PATH before the first call and reset the singleton (aieval.backends._singleton = None) so a fresh path is honoured, as the smoke tests do.
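The caching behaviour is easier to see in isolation. The sketch below uses a stand-in class rather than the real `aieval.backends` module, so `fake_backends` and `get_backend` are hypothetical; only the `_singleton` attribute and the `AIEVAL_DB_PATH` variable come from the text above.

```python
import os


class fake_backends:
    """Stand-in for the aieval.backends module (internals are assumed)."""
    _singleton = None

    @classmethod
    def get_backend(cls):
        # The path is read from the environment only on the first call.
        if cls._singleton is None:
            cls._singleton = os.environ.get("AIEVAL_DB_PATH", "./aieval.db")
        return cls._singleton


os.environ["AIEVAL_DB_PATH"] = "first.db"
first = fake_backends.get_backend()

os.environ["AIEVAL_DB_PATH"] = "second.db"
stale = fake_backends.get_backend()  # still the cached first path

fake_backends._singleton = None      # the reset described above
fresh = fake_backends.get_backend()  # now picks up the new path
```

In a test suite the same reset would typically live in a fixture that runs before each test, so no test inherits a path cached by an earlier one.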

Provider calls fail with an auth error. Copy .env.example to .env and set SARMALINK_API_KEY or OPENAI_API_KEY for whichever provider you pass to run.

A run looks slower than expected. The runner is parallel but bounded by provider rate limits. Raise the concurrency argument on run if your endpoint allows more in-flight requests.

Wiki pages

Architecture - component breakdown, run lifecycle and design rationale
Quick-Start - install, first eval run and CI integration
Roadmap - what is shipped and what is next

Repository

github.com/sarmakska/ai-eval-runner
