Architecture
sarmakska edited this page May 31, 2026 · 3 revisions
```mermaid
graph LR
    D[(Dataset<br/>JSONL or list)] --> R[Runner]
    S[Scorers<br/>plain Python] --> R
    R -->|N parallel calls| L[LLM provider]
    R --> B[(Backend<br/>SQLite / DuckDB)]
    B --> V[FastAPI + HTMX viewer]
    B --> CI[CI regression gate]
    classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
    class L ext
```
| File | Responsibility |
|---|---|
| `aieval/cli.py` | Typer CLI: `run`, `list`, `view`, `diff`, `ci` |
| `aieval/core/runner.py` | Parallel execution with retry |
| `aieval/core/dataset.py` | JSONL file loader and in-process list loader |
| `aieval/core/scorer.py` | Scorer decorator and protocol |
| `aieval/scorers/*.py` | Built-in scorers: exact match, JSON validity, ROUGE |
| `aieval/backends/*.py` | SQLite and DuckDB backends |
| `aieval/providers/*.py` | LLM provider adapters: SarmaLink, OpenAI |
| `aieval/viewer/app.py` | FastAPI + HTMX viewer |
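
As a rough illustration of the two dataset entry points listed for `aieval/core/dataset.py`, the sketch below loads examples from a JSONL file or from an in-process list. The names `load_jsonl` and `load_list` and the dict-per-example shape are assumptions made for this example, not the project's actual API.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterable

Example = dict[str, Any]  # assumed shape: one JSON object per example


def load_jsonl(path: str | Path) -> list[Example]:
    """Read one JSON object per non-empty line (hypothetical helper)."""
    examples: list[Example] = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                examples.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON: {exc}") from exc
    return examples


def load_list(items: Iterable[Example]) -> list[Example]:
    """Shallow-copy in-process examples so later edits do not leak into a run."""
    return [dict(item) for item in items]
```

Both loaders return the same list-of-dicts shape, so whatever consumes the dataset does not need to know where it came from.
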
```mermaid
stateDiagram-v2
    [*] --> Pending
    Pending --> Running: created in DB
    Running --> Persisting: all examples scored
    Persisting --> Done: results stored
    Running --> Failed: provider errors after retries
    Done --> [*]
    Failed --> [*]
```
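
The run lifecycle in the state diagram can be sketched as a small transition table. This is an illustration only: the `RunState` enum, the `TRANSITIONS` mapping, and the `advance` helper are hypothetical names, and the real runner may track state differently.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PERSISTING = "persisting"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    RunState.PENDING: frozenset({RunState.RUNNING}),
    RunState.RUNNING: frozenset({RunState.PERSISTING, RunState.FAILED}),
    RunState.PERSISTING: frozenset({RunState.DONE}),
    RunState.DONE: frozenset(),    # terminal
    RunState.FAILED: frozenset(),  # terminal
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Encoding the edges as data makes it cheap to reject an impossible jump, such as `Pending` straight to `Done`.
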
Plain Python scorers rather than a DSL, because scoring logic is ordinary code and Python already expresses it well. You get full IDE support and no leaky abstraction.
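
A minimal sketch of what a plain Python scorer could look like. The `scorer` decorator and `REGISTRY` below are local stand-ins written for this example. The real decorator and protocol live in `aieval/core/scorer.py` and may differ in name and signature, and the `(output, expected) -> float` shape is an assumption.

```python
import json
from typing import Callable, Dict

ScorerFn = Callable[[str, str], float]

# Hypothetical registry; the real project may track scorers differently.
REGISTRY: Dict[str, ScorerFn] = {}


def scorer(name: str) -> Callable[[ScorerFn], ScorerFn]:
    """Register a plain function under a name (stand-in decorator)."""
    def register(fn: ScorerFn) -> ScorerFn:
        REGISTRY[name] = fn
        return fn  # the function stays directly callable and testable
    return register


@scorer("exact_match")
def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the whitespace-stripped strings are identical."""
    return 1.0 if output.strip() == expected.strip() else 0.0


@scorer("json_valid")
def json_valid(output: str, expected: str) -> float:
    """Return 1.0 when the output parses as JSON; `expected` is unused."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0
```

Because each scorer is an ordinary function, it can be unit-tested and stepped through in a debugger without any framework in the way.
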
Async parallel execution, because eval runs are I/O-bound on provider calls. If your provider allows a higher rate limit, raise the concurrency argument on `run`.
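
One common way to get bounded parallelism with retry is an `asyncio.Semaphore` around each provider call. The sketch below assumes a provider exposed as an async callable taking a prompt string. `run_all`, `call_with_retry`, and their parameters are hypothetical names, and the exponential backoff is an illustrative choice, not necessarily what `aieval/core/runner.py` does.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

ProviderCall = Callable[[str], Awaitable[str]]


async def call_with_retry(call: ProviderCall, prompt: str,
                          retries: int = 3, base_delay: float = 0.5) -> str:
    """Try the call, backing off exponentially between failed attempts."""
    for attempt in range(retries + 1):
        try:
            return await call(prompt)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the provider error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


async def run_all(prompts: Sequence[str], call: ProviderCall,
                  concurrency: int = 8, retries: int = 3,
                  base_delay: float = 0.5) -> List[str]:
    """Run every prompt with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> str:
        async with semaphore:  # the slot is held across retries
            return await call_with_retry(call, prompt, retries, base_delay)

    # gather returns results in the same order as the prompts
    return await asyncio.gather(*(bounded(p) for p in prompts))
```

Holding the semaphore slot across retries keeps the number of in-flight requests at or below `concurrency` even while some calls are backing off.
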
Versioning by git SHA plus dataset hash, so that two runs with the same model and the same dataset are comparable. A different dataset produces a different hash; different code (for example, a new prompt) produces a different SHA.
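
A sketch of how such a version key could be derived. The canonical-JSON hashing scheme, the 12-character truncation, and the function names are assumptions for illustration; the project may compute its dataset hash differently.

```python
import hashlib
import json
import subprocess
from typing import Any, Dict, List


def dataset_hash(examples: List[Dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON encoding of each example, in order."""
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys makes the hash independent of dict key order
        line = json.dumps(example, sort_keys=True, separators=(",", ":"),
                          ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def current_git_sha() -> str:
    """Ask git for the commit currently checked out (requires a git repo)."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def run_version(git_sha: str, examples: List[Dict[str, Any]]) -> str:
    """Combine code version and data version into one comparable key."""
    return f"{git_sha[:12]}+{dataset_hash(examples)[:12]}"
```

Two runs are then comparable when their keys match, for example `run_version(current_git_sha(), examples)` evaluated in both checkouts.
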
SQLite as the default backend, because it needs zero infrastructure. Switch to DuckDB by setting `AIEVAL_BACKEND` when you want columnar analytics over many runs.
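
A sketch of environment-driven backend selection. Only the variable name `AIEVAL_BACKEND` comes from the text above. The accepted values `sqlite` and `duckdb`, the `resolve_backend` and `open_sqlite` helpers, and the `results` table layout are assumptions made for this example.

```python
import os
import sqlite3
from typing import Mapping, Optional

SUPPORTED_BACKENDS = ("sqlite", "duckdb")  # assumed value spellings


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the backend name, defaulting to SQLite when the variable is unset."""
    env = os.environ if env is None else env
    name = (env.get("AIEVAL_BACKEND") or "sqlite").strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown AIEVAL_BACKEND {name!r}; expected one of "
                         f"{SUPPORTED_BACKENDS}")
    return name


def open_sqlite(path: str = ":memory:") -> sqlite3.Connection:
    """Open a SQLite store with the standard library alone (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "run_id TEXT, example_id TEXT, scorer TEXT, score REAL)"
    )
    return conn
```

The SQLite path needs nothing beyond the Python standard library, which is what makes it a reasonable zero-infrastructure default.
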
HTMX viewer rather than React, because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.
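
To make "HTML over the wire" concrete, the sketch below renders a table-row fragment that a server endpoint could return and HTMX could swap into the page. It uses only the standard library. The `/runs/{id}` route, the `#detail` target, and the `id`/`model`/`score` fields are hypothetical, while `hx-get`, `hx-target`, and `hx-swap` are standard HTMX attributes.

```python
from html import escape
from typing import Any, Dict, List


def render_run_rows(runs: List[Dict[str, Any]]) -> str:
    """Render one <tr> per run; clicking a row asks the server for its detail."""
    rows = []
    for run in runs:
        run_id = escape(str(run["id"]))      # escape everything user-visible
        model = escape(str(run["model"]))
        score = f"{float(run['score']):.3f}"
        rows.append(
            f'<tr hx-get="/runs/{run_id}" hx-target="#detail" hx-swap="innerHTML">'
            f"<td>{run_id}</td><td>{model}</td><td>{score}</td></tr>"
        )
    return "\n".join(rows)
```

The server stays the single source of rendered state, so a read-only viewer like this needs no client-side data model.
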