Architecture
graph LR
D[(Dataset<br/>JSONL or list)] --> R[Runner]
S[Scorers<br/>plain Python + judge] --> R
R -->|parallel provider calls| L[LLM provider]
R -->|spans| OT[OpenTelemetry]
R --> B[(Backend<br/>SQLite / DuckDB)]
B --> V[FastAPI + HTMX viewer]
B --> CMP[Compare: diff / CI gate / pairwise]
classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
class L,OT ext
| File | Responsibility |
|---|---|
| `aieval/cli.py` | Typer CLI: `run`, `list`, `view`, `diff`, `ci`, `pairwise` |
| `aieval/core/runner.py` | Parallel execution with retry, scoring, telemetry, storage; tags runs with git SHA and dataset version |
| `aieval/core/dataset.py` | JSONL and in-process loaders, content `version()`, `DatasetRegistry` |
| `aieval/core/scorer.py` | Scorer decorator, naming, and context injection via `invoke_scorer` |
| `aieval/core/regression.py` | `compare()` for per-scorer deltas and per-example diffs |
| `aieval/core/pairwise.py` | Paired bootstrap A/B with percentile confidence intervals |
| `aieval/core/telemetry.py` | OpenTelemetry span and attribute capture, no-op fallback |
| `aieval/scorers/*.py` | Built-in scorers: exact match, JSON validity, ROUGE-L, LLM judge |
| `aieval/backends/*.py` | SQLite and DuckDB backends |
| `aieval/providers/*.py` | LLM provider adapters: SarmaLink, OpenAI |
| `aieval/viewer/app.py` | FastAPI + HTMX viewer and diff view |
stateDiagram-v2
[*] --> Created: run row written with git SHA + dataset version
Created --> Running: examples fanned out under a concurrency semaphore
Running --> Scoring: each prediction scored by every scorer
Scoring --> Persisting: results written to the backend
Persisting --> Done: run closed, summary printed and emitted as a span
Running --> Failed: provider errors after retries
Done --> [*]
Failed --> [*]
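The run lifecycle in the state diagram can be read as a small transition table. The sketch below is only an illustration of that diagram: `RunState`, `ALLOWED`, and `advance` are hypothetical names, not part of the package.

```python
from enum import Enum


class RunState(Enum):
    CREATED = "Created"
    RUNNING = "Running"
    SCORING = "Scoring"
    PERSISTING = "Persisting"
    DONE = "Done"
    FAILED = "Failed"


# Transitions copied from the state diagram; Done and Failed are terminal.
ALLOWED = {
    RunState.CREATED: {RunState.RUNNING},
    RunState.RUNNING: {RunState.SCORING, RunState.FAILED},
    RunState.SCORING: {RunState.PERSISTING},
    RunState.PERSISTING: {RunState.DONE},
    RunState.DONE: set(),
    RunState.FAILED: set(),
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return `target` if the diagram allows the move, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

In this reading, `Failed` is only reachable from `Running`, which matches the diagram's "provider errors after retries" edge.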
Plain Python scorers rather than a DSL because ordinary functions are expressive enough for scoring, and you get full IDE support with no leaky abstraction. A scorer is any callable returning a float between 0.0 and 1.0. A scorer can declare `example`, `model`, `provider`, or `prompt` keyword parameters and the runner supplies them, which is what lets the LLM judge see the full request.
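As a rough illustration of context injection, the sketch below inspects a scorer's signature and passes only the context keywords it declares. It is not the package's `invoke_scorer`: the helper name `call_with_context` and the positional `(output, expected)` arguments are assumptions made for this example.

```python
import inspect
from typing import Any, Callable

# Context names the runner can supply, as listed above.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def call_with_context(
    scorer: Callable[..., float], output: str, expected: str, **context: Any
) -> float:
    """Call `scorer`, injecting only the context keywords it declares."""
    declared = inspect.signature(scorer).parameters
    injected = {
        key: context[key]
        for key in CONTEXT_KEYS
        if key in declared and key in context
    }
    score = float(scorer(output, expected, **injected))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"scorer returned {score}, expected a value in [0.0, 1.0]")
    return score


def exact_match(output: str, expected: str) -> float:
    """Context-free scorer: needs nothing beyond output and expected."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def not_prompt_echo(output: str, expected: str, *, prompt: str) -> float:
    """Toy context-aware scorer: 0.0 if the model merely repeated the prompt."""
    return 0.0 if output.strip() == prompt.strip() else 1.0
```

Both scorers go through the same call, for example `call_with_context(exact_match, "42", "42", prompt="What is 6*7?")`; `exact_match` never sees `prompt` because it does not declare it.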
Async parallel execution because eval runs are I/O-bound on provider calls. Raise the `concurrency` argument on `run` if your provider's rate limits allow it.
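A minimal sketch of the fan-out pattern, assuming an `asyncio.Semaphore` caps in-flight provider calls. `run_all` and `fake_provider` are hypothetical names for this example; the real runner also handles retries, scoring, and storage.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_all(
    examples: Sequence[str],
    call_provider: Callable[[str], Awaitable[str]],
    concurrency: int = 8,
) -> List[str]:
    """Run `call_provider` over all examples, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(example: str) -> str:
        async with semaphore:  # blocks while `concurrency` calls are in flight
            return await call_provider(example)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(e) for e in examples))


async def fake_provider(example: str) -> str:
    """Stand-in for a network call to an LLM provider."""
    await asyncio.sleep(0.01)
    return example.upper()
```

Calling `asyncio.run(run_all(["a", "b", "c"], fake_provider, concurrency=2))` keeps two calls in flight at once and still returns results in input order.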
Versioning by git SHA plus dataset version so two runs with the same code and data are comparable. The dataset version is an order-independent SHA-256 over the canonical JSON of every row, so reordering rows does not change it but editing one does.
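One way to get an order-independent hash is to digest each row's canonical JSON, sort the digests, and hash the sorted list. The sketch below takes that approach; the sorting step and the name `dataset_version` are assumptions, and the package's `version()` may combine rows differently.

```python
import hashlib
import json
from typing import Any, Iterable, Mapping


def dataset_version(rows: Iterable[Mapping[str, Any]]) -> str:
    """SHA-256 over per-row canonical JSON digests, independent of row order."""
    row_digests = sorted(
        hashlib.sha256(
            # sort_keys plus fixed separators gives one canonical form per row
            json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    outer = hashlib.sha256()
    for digest in row_digests:
        outer.update(digest.encode("ascii"))
    return outer.hexdigest()
```

Under this scheme, swapping two rows or reordering keys within a row leaves the version unchanged, while editing any value produces a different one.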
A paired bootstrap for A/B so a difference is only called a win when its confidence interval clears zero. It is pure Python with a seedable generator, so it is reproducible and adds no numeric dependency.
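The idea can be sketched in pure Python as follows. This is an illustration under assumptions, not the code in `aieval/core/pairwise.py`: the function name, the default resample count, and the verdict strings are made up for the example.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float, str]:
    """Return (mean difference B - A, CI low, CI high, verdict)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing: each difference compares both systems on the same example.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded generator makes the result reproducible
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    # Percentile interval read directly off the sorted resampled means.
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    if low > 0:
        verdict = "B wins"
    elif high < 0:
        verdict = "A wins"
    else:
        verdict = "no significant difference"
    return fmean(diffs), low, high, verdict
```

A win is only declared when the whole interval sits on one side of zero, so a small mean difference with a wide interval is reported as no significant difference.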
OpenTelemetry behind an optional extra so the runner has no hard telemetry dependency. With the API installed it emits spans with GenAI semantic attributes; without it the same call sites record attributes internally and export nothing.
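A minimal sketch of the optional-dependency pattern, assuming a guarded import of `opentelemetry.trace`. `SpanRecorder`, `start_span`, and the tracer name are hypothetical; the actual telemetry module may be organized differently.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional

try:
    from opentelemetry import trace  # present only with the optional extra

    _tracer: Optional[Any] = trace.get_tracer("aieval.sketch")
except ImportError:
    _tracer = None  # fall back to recording attributes internally


class SpanRecorder:
    """Keeps attributes locally and forwards them to a real span if one exists."""

    def __init__(self, name: str, otel_span: Optional[Any] = None) -> None:
        self.name = name
        self.attributes: Dict[str, Any] = {}
        self._otel_span = otel_span

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value
        if self._otel_span is not None:
            self._otel_span.set_attribute(key, value)


@contextmanager
def start_span(name: str) -> Iterator[SpanRecorder]:
    """Same call site whether or not OpenTelemetry is installed."""
    if _tracer is None:
        yield SpanRecorder(name)
    else:
        with _tracer.start_as_current_span(name) as otel_span:
            yield SpanRecorder(name, otel_span)
```

Call sites then look identical in both modes, for example `with start_span("provider.call") as span: span.set_attribute("gen_ai.request.model", "some-model")`.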
SQLite as the default backend because it needs zero infrastructure. Switch to DuckDB by setting `AIEVAL_BACKEND` when you want columnar analytics over many runs.
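Backend selection from the environment might look like the sketch below. Only the `AIEVAL_BACKEND` variable name comes from this document; the accepted values `sqlite` and `duckdb` and the helper `resolve_backend` are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed spellings of the two backends; check the package for the real values.
SUPPORTED_BACKENDS = ("sqlite", "duckdb")


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the backend name from AIEVAL_BACKEND, defaulting to SQLite."""
    env = os.environ if env is None else env
    name = env.get("AIEVAL_BACKEND", "sqlite").strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return name
```

Failing loudly on an unknown value avoids silently writing runs to the default database when the variable is misspelled.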
HTMX viewer rather than a single-page app because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.