Home

ai-eval-runner

Evals as code. Datasets, scorers, traces, regressions, all in one CLI.

Built by Sarma Linux. MIT licence.

What this is

LLM evaluation in 2026 is non-negotiable. If you ship a prompt change without an eval suite, you are guessing. Most eval tools are either heavyweight platforms with vendor lock-in or thin wrappers that score one example at a time. This runner is the middle path.

Datasets as JSONL or HuggingFace.
Scorers as plain Python functions: built-in LLM-as-judge, exact-match, BLEU, ROUGE, JSON schema validity, rubric grading.
Runs versioned by git SHA plus dataset hash.
Outputs to local SQLite, Postgres or DuckDB.
Traces visualised in a built-in HTMX viewer.
CI integration that fails the build on regression deltas above threshold.
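
To make "scorers as plain Python functions" concrete, here is a minimal sketch of an exact-match scorer averaged over a small JSONL dataset. It does not use the runner's real API: the `Scorer` signature, the helper names and the `output` / `expected` field names are all assumptions chosen for illustration.

```python
import json
from typing import Callable

# Hypothetical scorer signature (an assumption, not the runner's API):
# (model output, expected answer) -> score between 0.0 and 1.0.
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the normalised strings are identical, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def mean_score(rows: list[dict], scorer: Scorer) -> float:
    """Average a scorer over rows; 'output' and 'expected' are assumed field names."""
    if not rows:
        return 0.0
    return sum(scorer(r["output"], r["expected"]) for r in rows) / len(rows)


# Two illustrative rows: the first matches after normalisation, the second does not.
SAMPLE = '''{"output": "Paris", "expected": "paris"}
{"output": "Berlin", "expected": "Madrid"}
'''

rows = load_jsonl(SAMPLE)
score = mean_score(rows, exact_match)
```

Because a scorer here is just a function from two strings to a float, swapping in a different metric means passing a different function to `mean_score`.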

Treat prompts as software with the same rigour as backend code.
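
Two of the features listed above, runs versioned by git SHA plus dataset hash and a CI gate on regression deltas, can be sketched in a few lines. This is an illustration under stated assumptions, not the runner's actual implementation: the function names, the run identifier format and the metric dictionaries are hypothetical.

```python
import hashlib


def dataset_hash(jsonl_bytes: bytes) -> str:
    """SHA-256 hex digest of the raw dataset bytes."""
    return hashlib.sha256(jsonl_bytes).hexdigest()


def run_id(git_sha: str, data_hash: str) -> str:
    """Hypothetical run identifier: short git SHA plus short dataset hash."""
    return f"{git_sha[:12]}-{data_hash[:12]}"


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float,
) -> dict[str, float]:
    """Return each metric whose score dropped by more than `threshold`."""
    drops: dict[str, float] = {}
    for name, base in baseline.items():
        if name in current:
            delta = base - current[name]
            if delta > threshold:
                drops[name] = delta
    return drops


# Illustrative values only; a real git SHA would come from `git rev-parse HEAD`.
example_id = run_id("a" * 40, dataset_hash(b'{"output": "x", "expected": "x"}\n'))
failed = regressions(
    baseline={"exact_match": 0.90, "rouge": 0.70},
    current={"exact_match": 0.80, "rouge": 0.69},
    threshold=0.05,
)
```

In a CI job, a non-empty result from `regressions` would translate into a non-zero exit code, which is what fails the build. Keying the run on both the code revision and the dataset contents means a changed dataset never gets compared silently against an old baseline.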

Who this is for

ML engineers shipping prompt changes to production.
Backend teams that want eval gates in their CI pipeline without a third-party platform.
Anyone who wants regression detection and traces in one self-hosted tool.

Stack

Python 3.12, uv, Typer CLI, DuckDB, FastAPI, HTMX.

Wiki pages

Architecture — component table, run lifecycle state machine, why each piece
Quick-Start — install, first eval run, CI integration
Roadmap — what is shipped and what is next

Repository

github.com/sarmakska/ai-eval-runner


