Roadmap
sarmakska edited this page May 31, 2026 · 2 revisions
Shipped
- Parallel async runner with retry, git SHA and dataset version tagging
- JSONL and in-process dataset loaders
- Dataset versioning with an order-independent content version and a registry
- Scorer decorator with optional naming and context injection
- Built-in scorers: exact match, JSON validity, ROUGE-L
- LLM-as-judge scorers graded against a rubric
- SQLite and DuckDB backends
- SarmaLink and OpenAI-compatible providers
- Regression diff: `aieval diff` and the viewer diff route
- CI regression gate: `aieval ci` with a per-scorer threshold
- Pairwise A/B with bootstrapped confidence intervals: `aieval pairwise`
- OpenTelemetry attribute capture behind an optional extra
- FastAPI + HTMX viewer
- Typer CLI
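The "order-independent content version" item in the list above can be illustrated with a minimal sketch. It assumes records are JSON-serialisable dicts and that the version is derived by hashing each record, sorting the digests, and hashing the result. The function names `record_digest` and `content_version` and the 12-character truncation are hypothetical, chosen for illustration; the project's actual scheme may differ.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Hash one record via canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def content_version(records: list[dict]) -> str:
    """Return a short version string that ignores record order.

    Per-record digests are sorted before the final hash, so shuffling
    the dataset leaves the version unchanged, while editing, adding,
    or removing a record changes it.
    """
    digests = sorted(record_digest(r) for r in records)
    outer = hashlib.sha256()
    for d in digests:
        outer.update(d.encode("ascii"))
    # Truncation length is an arbitrary choice for readability.
    return outer.hexdigest()[:12]


rows = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Same records in a different order give the same version.
assert content_version(rows) == content_version(list(reversed(rows)))
# Dropping a record gives a different version.
assert content_version(rows) != content_version(rows[:1])
```

Sorting digests rather than the records themselves avoids having to define an ordering over arbitrary record shapes.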
Planned
- BLEU and BERTScore scorers
- Calibration sets for judge scorers, with agreement metrics against human labels
- HuggingFace dataset hub integration
- Postgres backend
- Streaming view of in-progress runs
- Cost tracking per run
Not planned
- A vendor-locked, hosted eval platform. This is open source on purpose.
- Real-time online evaluation. Use APM tools for that.
- Auto-fix-the-prompt. Humans should review prompt changes.
PRs welcome. Pick from "Planned", open an issue, fork, branch, push, PR. Small, focused.
I will not merge:
- Framework swaps. Typer and FastAPI stay.
- Sync runners. Everything is async.
- Adapters for providers without a free tier path.
Releases: see GitHub Releases.