Architecture

graph LR
  D[(Dataset<br/>JSONL or list)] --> R[Runner]
  S[Scorers<br/>plain Python + judge] --> R
  R -->|parallel provider calls| L[LLM provider]
  R -->|spans| OT[OpenTelemetry]
  R --> B[(Backend<br/>SQLite / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CMP[Compare: diff / CI gate / pairwise]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L,OT ext

Components

File	Responsibility
`aieval/cli.py`	Typer CLI: run, list, view, diff, ci, pairwise
`aieval/core/runner.py`	Parallel execution with retry, scoring, telemetry, storage; tags runs with git SHA and dataset version
`aieval/core/dataset.py`	JSONL and in-process loaders, content `version()`, `DatasetRegistry`
`aieval/core/scorer.py`	Scorer decorator, naming, and context injection via `invoke_scorer`
`aieval/core/regression.py`	`compare()` for per-scorer deltas and per-example diffs
`aieval/core/pairwise.py`	Paired bootstrap A/B with percentile confidence intervals
`aieval/core/telemetry.py`	OpenTelemetry span and attribute capture, no-op fallback
`aieval/scorers/*.py`	Built-in scorers: exact match, JSON validity, ROUGE-L, LLM judge
`aieval/backends/*.py`	SQLite and DuckDB backends
`aieval/providers/*.py`	LLM provider adapters: SarmaLink, OpenAI
`aieval/viewer/app.py`	FastAPI + HTMX viewer and diff view

Run lifecycle

stateDiagram-v2
  [*] --> Created: run row written with git SHA + dataset version
  Created --> Running: examples fanned out under a concurrency semaphore
  Running --> Scoring: each prediction scored by every scorer
  Scoring --> Persisting: results written to the backend
  Persisting --> Done: run closed, summary printed and emitted as a span
  Running --> Failed: provider errors after retries
  Done --> [*]
  Failed --> [*]
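
The transitions in the diagram above can be sketched as a small state table. This is only an illustration: `RunState`, `TRANSITIONS`, and `advance` are hypothetical names, and the runner may track state quite differently.

```python
from enum import Enum


class RunState(Enum):
    """States named in the lifecycle diagram (enum itself is hypothetical)."""
    CREATED = "created"
    RUNNING = "running"
    SCORING = "scoring"
    PERSISTING = "persisting"
    DONE = "done"
    FAILED = "failed"


# Allowed moves, copied edge by edge from the diagram.
TRANSITIONS = {
    RunState.CREATED: {RunState.RUNNING},
    RunState.RUNNING: {RunState.SCORING, RunState.FAILED},
    RunState.SCORING: {RunState.PERSISTING},
    RunState.PERSISTING: {RunState.DONE},
    RunState.DONE: set(),    # terminal
    RunState.FAILED: set(),  # terminal
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return target if the diagram allows the move, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the edges in one table makes it easy to check that `Failed` is reachable only from `Running`, as the diagram shows.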

Why each piece

Plain Python scorers rather than a DSL because ordinary Python is expressive enough here: you get full IDE support and no leaky abstraction. A scorer is any callable that returns a float between 0.0 and 1.0. A scorer can also declare `example`, `model`, `provider`, or `prompt` keyword parameters, and the runner supplies whichever ones it declares; this is what lets the LLM judge see the full request.
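
A minimal sketch of how that kind of context injection can work, using `inspect.signature` to pass only the keywords a scorer asks for. The helper name `invoke_scorer_sketch`, the `(prediction, expected)` positional arguments, and both toy scorers are assumptions made for illustration; they are not the actual `invoke_scorer` signature.

```python
import inspect
from typing import Any, Callable

# Context names come from the paragraph above; the rest is hypothetical.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def invoke_scorer_sketch(
    scorer: Callable[..., float], prediction: str, expected: str, **context: Any
) -> float:
    """Call the scorer, supplying only the context keywords it declares."""
    params = inspect.signature(scorer).parameters
    wanted = {k: context[k] for k in CONTEXT_KEYS if k in params and k in context}
    score = float(scorer(prediction, expected, **wanted))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"scorer returned {score}, expected a value in [0.0, 1.0]")
    return score


def exact_match(prediction: str, expected: str) -> float:
    """Toy scorer that needs no context."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def model_aware(prediction: str, expected: str, *, model: str) -> float:
    """Toy scorer that declares `model` and therefore receives it."""
    return 1.0 if model and prediction == expected else 0.0
```

With this shape, `invoke_scorer_sketch(exact_match, "a", "a", model="m")` ignores the extra context, while `model_aware` receives `model` because it declares it.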

Async parallel execution because eval runs are I/O-bound on provider calls. Raise the concurrency argument on `run` when your provider's rate limits allow it.
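
The fan-out under a concurrency limit can be sketched with `asyncio.Semaphore`. Here `fake_provider_call` stands in for a real provider adapter and `run_all` is a hypothetical helper, not the runner's API.

```python
import asyncio


async def fake_provider_call(prompt: str) -> str:
    """Stand-in for a network call to an LLM provider."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_all(prompts: list[str], concurrency: int = 4) -> list[str]:
    """Run every prompt, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> str:
        async with semaphore:  # waits here while the limit is reached
            return await fake_provider_call(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(run_all(["a", "b", "c"], concurrency=2)))
```

Because the work is I/O-bound, raising `concurrency` shortens wall-clock time roughly in proportion until the provider's rate limit becomes the bottleneck.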

Versioning by git SHA plus dataset version so two runs with the same code and data are comparable. The dataset version is an order-independent SHA-256 over the canonical JSON of every row, so reordering rows does not change it but editing one does.
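
One way to get an order-independent digest is to hash each row's canonical JSON, sort the row digests, and hash the result. The sketch below assumes that construction; the exact canonicalization and combination used by `version()` may differ, and `dataset_version` is a hypothetical name.

```python
import hashlib
import json


def dataset_version(rows: list[dict]) -> str:
    """SHA-256 over sorted per-row digests of canonical JSON (assumed scheme)."""
    row_digests = sorted(
        hashlib.sha256(
            # sort_keys + fixed separators give one canonical form per row
            json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
            .encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


rows = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
same_rows_reordered = list(reversed(rows))
edited = [{"input": "2+2", "expected": "5"}, {"input": "3+3", "expected": "6"}]
```

Sorting the per-row digests is what removes the dependence on row order, while any change to a row's content changes its digest and therefore the final hash.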

A paired bootstrap for A/B so a difference is only called a win when its confidence interval clears zero. It is pure Python with a seedable generator, so it is reproducible and adds no numeric dependency.
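
A minimal sketch of a paired bootstrap with a percentile interval, in pure Python with a seeded `random.Random`. The function name, defaults, and verdict strings are assumptions for illustration, not the contents of `aieval/core/pairwise.py`.

```python
import random


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float, str]:
    """Return (mean difference B - A, CI low, CI high, verdict)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing: each example contributes one difference.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # seedable, so results are reproducible
    # Resample the differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    alpha = 1.0 - confidence
    low = means[int((alpha / 2) * (n_resamples - 1))]
    high = means[int((1 - alpha / 2) * (n_resamples - 1))]
    if low > 0:
        verdict = "B wins"
    elif high < 0:
        verdict = "A wins"
    else:
        verdict = "no significant difference"
    return sum(diffs) / n, low, high, verdict
```

Resampling the per-example differences, rather than the two score lists independently, is what makes the test paired: example-level difficulty cancels out of each difference.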

OpenTelemetry behind an optional extra so the runner has no hard telemetry dependency. With the API installed it emits spans with GenAI semantic attributes; without it the same call sites record attributes internally and export nothing.
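
The optional-dependency pattern can be sketched as a guarded import plus a thin span wrapper. `RecordingSpan`, `start_span`, and the tracer name are hypothetical; the `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions, and the model name is a placeholder.

```python
from contextlib import contextmanager

try:
    # Present only when the optional telemetry extra is installed.
    from opentelemetry import trace
    _tracer = trace.get_tracer("aieval.sketch")
except ImportError:
    _tracer = None


class RecordingSpan:
    """Keeps attributes locally and forwards them to a real span if one exists."""

    def __init__(self, name, otel_span=None):
        self.name = name
        self.attributes = {}
        self._otel_span = otel_span

    def set_attribute(self, key, value):
        self.attributes[key] = value
        if self._otel_span is not None:
            self._otel_span.set_attribute(key, value)


@contextmanager
def start_span(name):
    """Same call site either way; only the export behaviour differs."""
    if _tracer is None:
        yield RecordingSpan(name)  # no-op fallback: nothing is exported
    else:
        with _tracer.start_as_current_span(name) as otel_span:
            yield RecordingSpan(name, otel_span)


with start_span("provider.call") as span:
    span.set_attribute("gen_ai.request.model", "example-model")
    span.set_attribute("gen_ai.usage.input_tokens", 12)
```

Callers never branch on whether OpenTelemetry is installed, which keeps the runner code identical in both configurations.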

SQLite as the default backend because it needs zero infrastructure. Switch to DuckDB by setting `AIEVAL_BACKEND` when you want columnar analytics over many runs.
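
A rough sketch of environment-driven backend selection. Only the variable name `AIEVAL_BACKEND` comes from the text; the accepted values `sqlite` and `duckdb`, the helper names, and the lazy import are assumptions.

```python
import os
import sqlite3


def resolve_backend_name(env=None) -> str:
    """Read AIEVAL_BACKEND, defaulting to SQLite (value names are assumed)."""
    env = os.environ if env is None else env
    name = env.get("AIEVAL_BACKEND", "sqlite").strip().lower()
    if name not in {"sqlite", "duckdb"}:
        raise ValueError(f"unknown backend: {name!r}")
    return name


def open_backend(name: str, path: str):
    """Open a connection for the chosen backend."""
    if name == "sqlite":
        return sqlite3.connect(path)  # standard library, no extra install
    # Imported lazily so SQLite users never need DuckDB installed.
    import duckdb
    return duckdb.connect(path)


conn = open_backend(resolve_backend_name({}), ":memory:")
conn.execute("CREATE TABLE runs (id TEXT, git_sha TEXT, dataset_version TEXT)")
conn.close()
```

Importing `duckdb` only inside the DuckDB branch keeps the default install free of extra dependencies.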

HTMX viewer rather than a single-page app because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.
