Architecture
sarmakska edited this page May 3, 2026 · 3 revisions
```mermaid
graph LR
    D[(Dataset<br/>JSONL or HF)] --> R[Runner]
    S[Scorers<br/>plain Python] --> R
    R -->|N parallel calls| L[LLM provider]
    R --> B[(Backend<br/>SQLite / Postgres / DuckDB)]
    B --> V[FastAPI + HTMX viewer]
    B --> CI[CI integration<br/>PR comments]
    classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
    class L ext
```
| File | Responsibility |
|---|---|
| `aieval/cli.py` | Typer CLI: `run`, `view`, `diff`, `ci` |
| `aieval/core/runner.py` | Parallel execution with retry + partial resumption |
| `aieval/core/dataset.py` | JSONL + HF dataset loaders |
| `aieval/core/scorer.py` | Scorer decorator + protocol |
| `aieval/scorers/*.py` | Built-in scorers (exact match, JSON schema, ROUGE, BLEU, LLM-as-judge, rubric) |
| `aieval/backends/*.py` | SQLite, DuckDB, Postgres backends |
| `aieval/providers/*.py` | LLM provider adapters (SarmaLink, OpenAI, Ollama) |
| `aieval/viewer/app.py` | FastAPI + HTMX viewer |
| `aieval/ci/github_action.py` | PR comment with regression delta |
```mermaid
stateDiagram-v2
    [*] --> Pending
    Pending --> Running: created in DB
    Running --> Persisting: all examples scored
    Persisting --> Done: results stored
    Running --> Failed: provider errors after retries
    Done --> [*]
    Failed --> [*]
```
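The run lifecycle in the state diagram can be encoded as a small transition table. This is a minimal sketch for illustration only: `RunState`, `TRANSITIONS` and `advance` are hypothetical names, not identifiers taken from aieval.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PERSISTING = "persisting"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions, copied edge by edge from the state diagram.
# DONE and FAILED are terminal, so they have no outgoing edges.
TRANSITIONS = {
    RunState.PENDING: {RunState.RUNNING},
    RunState.RUNNING: {RunState.PERSISTING, RunState.FAILED},
    RunState.PERSISTING: {RunState.DONE},
    RunState.DONE: set(),
    RunState.FAILED: set(),
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return `target` if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting illegal transitions early means a finished or failed run can never be moved back to `Running` by accident.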
Plain Python scorers rather than DSLs because Python is already expressive enough for scoring logic. You get full IDE support and no leaky abstraction.
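To make the "plain Python" point concrete, here is a rough sketch of what a decorator-registered scorer could look like. The `scorer` decorator and `REGISTRY` below are local, hypothetical stand-ins; the real decorator in `aieval/core/scorer.py` may have a different signature.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the decorator in aieval/core/scorer.py.
# Assumption: a scorer takes (output, expected) and returns a float.
ScorerFn = Callable[[str, str], float]
REGISTRY: Dict[str, ScorerFn] = {}


def scorer(name: str) -> Callable[[ScorerFn], ScorerFn]:
    """Register a plain function under `name` and return it unchanged."""
    def register(fn: ScorerFn) -> ScorerFn:
        REGISTRY[name] = fn
        return fn
    return register


@scorer("exact_match")
def exact_match(output: str, expected: str) -> float:
    """1.0 when the texts match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


@scorer("contains")
def contains(output: str, expected: str) -> float:
    """1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in output else 0.0
```

Because the decorator returns the function unchanged, a scorer stays directly callable and unit-testable without any framework setup.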
Async parallel execution because eval runs are I/O-bound on provider calls. Eight concurrent calls is the default; raise it if your provider's rate limit allows.
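Bounded concurrency of this kind is commonly built from `asyncio.Semaphore` plus `asyncio.gather`. The sketch below assumes that approach; it is not the actual code in `aieval/core/runner.py`. `run_all`, `fake_provider` and the backoff values are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_all(
    prompts: List[str],
    call: Callable[[str], Awaitable[str]],
    concurrency: int = 8,
    retries: int = 2,
) -> List[str]:
    """Run `call` on every prompt with at most `concurrency` in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def one(prompt: str) -> str:
        async with limit:
            for attempt in range(retries + 1):
                try:
                    return await call(prompt)
                except Exception:
                    if attempt == retries:
                        raise  # retries exhausted: surface the error
                    # Short exponential backoff before the next attempt.
                    await asyncio.sleep(0.01 * 2 ** attempt)
            raise AssertionError("unreachable")

    # gather returns results in the same order as the input prompts.
    return list(await asyncio.gather(*(one(p) for p in prompts)))


async def fake_provider(prompt: str) -> str:
    """Stand-in for a real provider call (assumption for the example)."""
    await asyncio.sleep(0.001)
    return prompt.upper()
```

A call such as `asyncio.run(run_all(["a", "b"], fake_provider))` exercises the whole path without a network; swapping `fake_provider` for a real adapter leaves the concurrency logic unchanged.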
Versioning by git SHA + dataset hash so two runs with the same model and the same dataset are comparable. Different datasets get different hashes; different code (a new prompt) gets a different SHA.
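One way to derive such a version key is sketched below. It assumes the dataset hash is a SHA-256 over the raw file bytes and that the key is a simple concatenation of truncated identifiers; both are illustrative assumptions, and `dataset_hash`, `git_sha` and `run_version` are hypothetical names.

```python
import hashlib
import subprocess
from pathlib import Path


def dataset_hash(path: Path) -> str:
    """SHA-256 of the dataset file's bytes, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_sha() -> str:
    """Commit SHA of the current checkout (needs git and a repository)."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


def run_version(code_sha: str, data_hash: str) -> str:
    """Combine both identifiers into one key (hypothetical format)."""
    return f"{code_sha[:12]}+{data_hash[:12]}"
```

Two runs are then comparable exactly when their keys match: editing a prompt changes the left half, editing the dataset changes the right half.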
SQLite default backend because it needs zero infrastructure. Promote to Postgres when more than one machine runs evals.
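"Zero infrastructure" here means the standard-library `sqlite3` module is enough to persist results. The sketch below uses a hypothetical schema and helper names (`open_backend`, `save_score`, `mean_score`); the real tables in `aieval/backends` may look different.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per (run, example, scorer).
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    run_id     TEXT NOT NULL,
    example_id TEXT NOT NULL,
    scorer     TEXT NOT NULL,
    score      REAL NOT NULL,
    PRIMARY KEY (run_id, example_id, scorer)
)
"""


def open_backend(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database file and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_score(conn: sqlite3.Connection, run_id: str, example_id: str,
               scorer: str, score: float) -> None:
    """Upsert one score; re-scoring an example overwrites the old row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (run_id, example_id, scorer, score),
        )


def mean_score(conn: sqlite3.Connection, run_id: str,
               scorer: str) -> Optional[float]:
    """Average score for one scorer in one run, or None if no rows exist."""
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE run_id = ? AND scorer = ?",
        (run_id, scorer),
    ).fetchone()
    return row[0]
```

Keying rows on run, example and scorer makes writes idempotent, which is one plausible way to support resuming a partially completed run.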
HTMX viewer rather than React because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.