Home
sarmakska edited this page May 3, 2026 · 4 revisions
Evals as code. Datasets, scorers, traces, regressions, all in one CLI.
Built by Sarma Linux. MIT licence.
LLM evaluation in 2026 is non-negotiable. If you ship a prompt change without an eval suite, you are guessing. Most eval tools are either heavyweight platforms with vendor lock-in or thin wrappers that score one example at a time. This runner is the middle path.
- Datasets as JSONL or HuggingFace.
- Scorers as plain Python functions: built-in LLM-as-judge, exact-match, BLEU, ROUGE, JSON schema validity, rubric grading.
- Runs versioned by git SHA plus dataset hash.
- Outputs to local SQLite, Postgres, or DuckDB.
- Traces visualised in a built-in HTMX viewer.
- CI integration that fails the build on regression deltas above threshold.
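As a rough illustration of the list above, the sketch below shows what a plain-function scorer, a JSONL dataset, and a run identifier built from a git SHA plus a dataset hash might look like. Every name here (`exact_match`, `load_jsonl`, `dataset_hash`, `run_id`) and the identifier format are assumptions made for this example, not the runner's actual API.

```python
import hashlib
import json


def exact_match(expected: str, actual: str) -> float:
    """Hypothetical scorer: 1.0 if the whitespace-trimmed strings match, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def dataset_hash(rows: list[dict]) -> str:
    """SHA-256 over a canonical form of the rows, so key order does not matter."""
    canonical = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_id(git_sha: str, rows: list[dict]) -> str:
    """Assumed format: short git SHA, a dash, then a short dataset hash.

    In practice the SHA could come from `git rev-parse HEAD`.
    """
    return f"{git_sha[:7]}-{dataset_hash(rows)[:12]}"


# A two-row dataset in JSONL form (illustrative data only).
SAMPLE = (
    '{"input": "2+2", "expected": "4"}\n'
    '{"input": "capital of France", "expected": "Paris"}\n'
)
rows = load_jsonl(SAMPLE)

# Stand-in model outputs; a real run would call the model under test.
outputs = ["4", "Lyon"]

scores = [exact_match(row["expected"], out) for row, out in zip(rows, outputs)]
mean_score = sum(scores) / len(scores)
```

Tying the identifier to both the code revision and the dataset contents means two runs are only compared as like-for-like when neither the prompt code nor the data has changed.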
Treat prompts as software with the same rigour as backend code.
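One way to picture an eval gate in CI is a check that compares a candidate run's scores against a baseline and reports failure when any score drops by more than a threshold. The sketch below is a minimal version of that idea; the function names, the score dictionaries, and the default threshold of 0.02 are all hypothetical and not taken from the runner itself.

```python
def regression_delta(baseline: float, candidate: float) -> float:
    """How far the candidate fell below the baseline (0.0 if it did not fall)."""
    return max(0.0, baseline - candidate)


def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float,
) -> list[str]:
    """Return the scorer names whose drop exceeds the threshold.

    A scorer missing from the candidate run is treated as scoring 0.0,
    which is an assumption of this sketch.
    """
    return [
        name
        for name, base in sorted(baseline.items())
        if regression_delta(base, candidate.get(name, 0.0)) > threshold
    ]


def main(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.02,
) -> int:
    """Print each regression and return a process exit code (1 means fail)."""
    failed = gate(baseline, candidate, threshold)
    for name in failed:
        print(
            f"regression in {name}: "
            f"{baseline[name]:.3f} -> {candidate.get(name, 0.0):.3f}"
        )
    return 1 if failed else 0


# Illustrative scores only: exact_match drops by about 0.05, rouge by about 0.01.
exit_code = main(
    baseline={"exact_match": 0.90, "rouge": 0.80},
    candidate={"exact_match": 0.85, "rouge": 0.79},
)
```

A CI step would pass the returned code to `sys.exit`, so a drop above the threshold fails the build the same way a failing unit test would.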
Who it is for:
- ML engineers shipping prompt changes to production.
- Backend teams that want eval gates in their CI pipeline without a third-party platform.
- Anyone who wants regression detection and traces in one self-hosted tool.
Built with Python 3.12, uv, Typer CLI, DuckDB, FastAPI, and HTMX.
- Architecture — component table, run lifecycle state machine, why each piece
- Quick-Start — install, first eval run, CI integration
- Roadmap — what is shipped and what is next