sarmakska edited this page May 3, 2026 · 4 revisions

ai-eval-runner

Evals as code. Datasets, scorers, traces, regressions, all in one CLI.

Built by Sarma Linux. MIT licence.


What this is

LLM evaluation in 2026 is non-negotiable. If you ship a prompt change without an eval suite, you are guessing. Most eval tools are either heavyweight platforms with vendor lock-in or thin wrappers that score one example at a time. This runner is the middle path.

  • Datasets as JSONL or HuggingFace.
  • Scorers as plain Python functions: built-in LLM-as-judge, exact-match, BLEU, ROUGE, JSON schema validity, rubric grading.
  • Runs versioned by git SHA plus dataset hash.
  • Outputs to local SQLite, DuckDB, or Postgres.
  • Traces visualised in a built-in HTMX viewer.
  • CI integration that fails the build on regression deltas above threshold.

Treat prompts as software with the same rigour as backend code.
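
The ideas above (a scorer as a plain Python function, a run keyed by git SHA plus dataset hash, and a CI gate on the regression delta) can be sketched in a few lines. This is a minimal illustration only, not the runner's actual API: the names `exact_match`, `dataset_hash`, `run_id`, `mean_score` and `gate`, the `expected` field in each JSONL row, and the 0.1 threshold are all assumptions made for the example.

```python
import hashlib
import json
from typing import Callable

# Hypothetical scorer signature: (model output, expected answer) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the whitespace-stripped strings are identical, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def dataset_hash(jsonl_text: str) -> str:
    """SHA-256 of the raw JSONL text, truncated to 12 hex characters."""
    return hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()[:12]


def run_id(git_sha: str, jsonl_text: str) -> str:
    """Key a run by short git SHA plus dataset hash (format is an assumption)."""
    return f"{git_sha[:7]}-{dataset_hash(jsonl_text)}"


def mean_score(jsonl_text: str, outputs: list[str], scorer: Scorer) -> float:
    """Average the scorer over every dataset row, paired with outputs by position."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    scores = [scorer(out, row["expected"]) for out, row in zip(outputs, rows)]
    return sum(scores) / len(scores)


def gate(baseline: float, candidate: float, threshold: float) -> int:
    """Return a process exit code: 1 if the score dropped by more than threshold."""
    return 1 if (baseline - candidate) > threshold else 0


# Two-row toy dataset; the "expected" field name is hypothetical.
DATASET = (
    '{"input": "2+2", "expected": "4"}\n'
    '{"input": "capital of France", "expected": "Paris"}\n'
)

baseline = mean_score(DATASET, ["4", "Paris"], exact_match)
candidate = mean_score(DATASET, ["4", "Lyon"], exact_match)
exit_code = gate(baseline, candidate, threshold=0.1)
```

In a CI job, a script along these lines would pass `exit_code` to `sys.exit` so that a drop larger than the threshold fails the build. Hashing the dataset text alongside the commit means a changed score can be attributed to either a code change or a data change, not an ambiguous mix of both.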

Who this is for

  • ML engineers shipping prompt changes to production.
  • Backend teams that want eval gates in their CI pipeline without a third-party platform.
  • Anyone who wants regression detection and traces in one self-hosted tool.

Stack

Python 3.12, uv, Typer CLI, DuckDB, FastAPI, HTMX.


Wiki pages

  • Architecture — component table, run lifecycle state machine, rationale for each piece
  • Quick-Start — install, first eval run, CI integration
  • Roadmap — what is shipped and what is next

Repository

github.com/sarmakska/ai-eval-runner
