
Architecture

sarmakska edited this page May 31, 2026 · 3 revisions


graph LR
  D[(Dataset<br/>JSONL or list)] --> R[Runner]
  S[Scorers<br/>plain Python + judge] --> R
  R -->|parallel provider calls| L[LLM provider]
  R -->|spans| OT[OpenTelemetry]
  R --> B[(Backend<br/>SQLite / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CMP[Compare: diff / CI gate / pairwise]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L,OT ext

Components

| File | Responsibility |
| --- | --- |
| aieval/cli.py | Typer CLI: run, list, view, diff, ci, pairwise |
| aieval/core/runner.py | Parallel execution with retry, scoring, telemetry, storage; tags runs with git SHA and dataset version |
| aieval/core/dataset.py | JSONL and in-process loaders, content version(), DatasetRegistry |
| aieval/core/scorer.py | Scorer decorator, naming, and context injection via invoke_scorer |
| aieval/core/regression.py | compare() for per-scorer deltas and per-example diffs |
| aieval/core/pairwise.py | Paired bootstrap A/B with percentile confidence intervals |
| aieval/core/telemetry.py | OpenTelemetry span and attribute capture, no-op fallback |
| aieval/scorers/*.py | Built-in scorers: exact match, JSON validity, ROUGE-L, LLM judge |
| aieval/backends/*.py | SQLite and DuckDB backends |
| aieval/providers/*.py | LLM provider adapters: SarmaLink, OpenAI |
| aieval/viewer/app.py | FastAPI + HTMX viewer and diff view |

Run lifecycle

stateDiagram-v2
  [*] --> Created: run row written with git SHA + dataset version
  Created --> Running: examples fanned out under a concurrency semaphore
  Running --> Scoring: each prediction scored by every scorer
  Scoring --> Persisting: results written to the backend
  Persisting --> Done: run closed, summary printed and emitted as a span
  Running --> Failed: provider errors after retries
  Done --> [*]
  Failed --> [*]

Why each piece

Plain Python scorers rather than a DSL because plain Python is expressive enough for scoring logic. You get full IDE support and no leaky abstraction. A scorer is any callable that returns a float between 0.0 and 1.0. A scorer can also declare example, model, provider, or prompt as keyword parameters, and the runner supplies whichever ones it declares, which is what lets the LLM judge see the full request.
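As a rough illustration of how that keyword injection could work, the sketch below inspects a scorer's signature and passes only the context it asks for. The helper name `invoke_scorer_sketch`, the positional `(prediction, expected)` arguments, and both example scorers are assumptions made for this example; the real invoke_scorer in aieval/core/scorer.py may differ.

```python
import inspect

# Context names the page says a scorer may declare as keyword parameters.
CONTEXT_KEYS = ("example", "model", "provider", "prompt")


def invoke_scorer_sketch(scorer, prediction, expected, **context):
    """Call a scorer, passing only the context keywords it declares.

    Hypothetical stand-in for invoke_scorer; the positional arguments
    (prediction, expected) are an assumption of this sketch.
    """
    params = inspect.signature(scorer).parameters
    wanted = {
        key: value
        for key, value in context.items()
        if key in CONTEXT_KEYS and key in params
    }
    score = float(scorer(prediction, expected, **wanted))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"scorer returned {score}, expected 0.0 to 1.0")
    return score


def exact_match(prediction, expected):
    # Declares no context parameters, so none are passed.
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def echoes_prompt(prediction, expected, prompt):
    # Declares `prompt`, so the runner-side helper injects it.
    # Scores 0.0 when the prediction merely repeats the prompt back.
    return 0.0 if prediction.strip() == prompt.strip() else 1.0


context = {"model": "some-model", "prompt": "Say hello"}
plain_score = invoke_scorer_sketch(exact_match, "hello ", "hello", **context)
prompt_score = invoke_scorer_sketch(echoes_prompt, "Say hello", "hello", **context)
```

Because the helper filters on the signature, a simple scorer stays a two-argument function while a judge-style scorer can opt in to the prompt or model without any registration step.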

Async parallel execution because eval runs are I/O-bound on provider calls. Raise the concurrency argument on run when your provider's rate limits allow more simultaneous requests.
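The fan-out pattern described here can be sketched with a semaphore that caps in-flight calls plus a small retry loop. The names `run_all`, `call_provider`, and `retries`, along with the backoff values, are illustrative assumptions and not the runner's actual API.

```python
import asyncio


async def run_all(examples, call_provider, concurrency=8, retries=2):
    """Run call_provider over examples with at most `concurrency` in flight.

    Hypothetical sketch: retries each call with exponential backoff and
    re-raises the last error once the retries are exhausted.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(example):
        async with semaphore:
            for attempt in range(retries + 1):
                try:
                    return await call_provider(example)
                except Exception:
                    if attempt == retries:
                        raise
                    # Short backoff before the next attempt.
                    await asyncio.sleep(0.01 * 2 ** attempt)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run_one(e) for e in examples))


# Usage with a fake provider that records peak concurrency.
state = {"in_flight": 0, "peak": 0}


async def fake_provider(example):
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.001)
    state["in_flight"] -= 1
    return example.upper()


outputs = asyncio.run(run_all(["a", "b", "c", "d", "e"], fake_provider, concurrency=2))
```

Since the work is waiting on network responses, a higher cap mostly trades provider rate-limit pressure for wall-clock time.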

Versioning by git SHA plus dataset version so two runs with the same code and data are comparable. The dataset version is an order-independent SHA-256 over the canonical JSON of every row, so reordering rows does not change it but editing one does.
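One way to get an order-independent content hash is to hash each row's canonical JSON, sort the per-row digests, and hash the sorted list. The function below is a sketch under that assumption; the actual version() in aieval/core/dataset.py may combine rows differently.

```python
import hashlib
import json


def dataset_version(rows):
    """Order-independent SHA-256 over the canonical JSON of every row.

    Assumed scheme for illustration: per-row digests are sorted before
    being combined, so row order cannot affect the result.
    """
    row_digests = sorted(
        hashlib.sha256(
            # sort_keys and fixed separators make the JSON canonical.
            json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


rows = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
original = dataset_version(rows)
reordered = dataset_version(list(reversed(rows)))
edited = dataset_version([rows[0], {"input": "capital of France", "expected": "Lyon"}])
```

Here `original` and `reordered` are equal, while `edited` differs because one row's content changed.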

A paired bootstrap for A/B so a difference is only called a win when its confidence interval clears zero. It is pure Python with a seedable generator, so it is reproducible and adds no numeric dependency.
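A minimal pure-Python sketch of a paired bootstrap with a percentile interval follows. The function name, defaults, and verdict strings are assumptions for illustration and are not taken from aieval/core/pairwise.py.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of per-example (B - A)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing: each difference comes from the same example under A and B.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # seedable, so results are reproducible
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    if low > 0:
        verdict = "B wins"
    elif high < 0:
        verdict = "A wins"
    else:
        verdict = "no clear winner"
    return {"mean_diff": sum(diffs) / n, "ci": (low, high), "verdict": verdict}


run_a = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6]
run_b = [0.7, 0.8, 0.6, 0.9, 0.7, 0.8]
result = paired_bootstrap(run_a, run_b, seed=42)
tie = paired_bootstrap(run_a, run_a, seed=42)
```

Resampling the per-example differences, not the two runs independently, is what makes the test paired: example difficulty cancels out and only the A-versus-B gap is left to vary.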

OpenTelemetry behind an optional extra so the runner has no hard telemetry dependency. With the API installed it emits spans with GenAI semantic attributes; without it the same call sites record attributes internally and export nothing.
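The optional-dependency pattern can be sketched as a guarded import with a recording wrapper. `RecordedSpan`, `span`, and the tracer name are hypothetical; only `opentelemetry.trace.get_tracer`, `start_as_current_span`, and `set_attribute` are real OpenTelemetry API calls, and the real aieval/core/telemetry.py may be organised differently.

```python
from contextlib import contextmanager

try:
    # Present only when the optional telemetry extra is installed.
    from opentelemetry import trace

    _tracer = trace.get_tracer("aieval.sketch")  # tracer name is an assumption
except ImportError:
    _tracer = None


class RecordedSpan:
    """Records attributes locally and forwards them to a real span if any."""

    def __init__(self, name, otel_span=None):
        self.name = name
        self.attributes = {}
        self._otel_span = otel_span

    def set_attribute(self, key, value):
        self.attributes[key] = value
        if self._otel_span is not None:
            self._otel_span.set_attribute(key, value)


@contextmanager
def span(name):
    """Same call site whether or not OpenTelemetry is importable."""
    if _tracer is None:
        # No-op path: attributes are kept in memory, nothing is exported.
        yield RecordedSpan(name)
        return
    with _tracer.start_as_current_span(name) as otel_span:
        yield RecordedSpan(name, otel_span)


with span("provider.call") as current:
    # gen_ai.request.model is a GenAI semantic-convention attribute name.
    current.set_attribute("gen_ai.request.model", "example-model")
```

The calling code never branches on whether telemetry is installed, which keeps the runner free of a hard dependency.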

SQLite default backend because it needs zero infrastructure. Switch to DuckDB by setting AIEVAL_BACKEND when you want columnar analytics over many runs.
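Environment-driven backend selection might look like the sketch below. The accepted values `sqlite` and `duckdb`, the `open_backend` name, and the `path` parameter are assumptions; this page only states that AIEVAL_BACKEND selects the backend.

```python
import os
import sqlite3


def open_backend(path=":memory:"):
    """Open a connection chosen by AIEVAL_BACKEND (hypothetical helper).

    The value names "sqlite" and "duckdb" are assumed for illustration.
    """
    name = os.environ.get("AIEVAL_BACKEND", "sqlite").lower()
    if name == "sqlite":
        return sqlite3.connect(path)
    if name == "duckdb":
        # Imported lazily so the default path needs no extra package.
        import duckdb

        return duckdb.connect(path)
    raise ValueError(f"unknown AIEVAL_BACKEND value: {name!r}")


os.environ.pop("AIEVAL_BACKEND", None)  # fall back to the SQLite default
connection = open_backend()
row = connection.execute("SELECT 1").fetchone()
connection.close()
```

Importing DuckDB only inside its branch keeps the default install dependency-free, matching the zero-infrastructure goal.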

HTMX viewer rather than a single-page app because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.
