Architecture

graph LR
  D[(Dataset<br/>JSONL or HF)] --> R[Runner]
  S[Scorers<br/>plain Python] --> R
  R -->|N parallel calls| L[LLM provider]
  R --> B[(Backend<br/>SQLite / Postgres / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CI[CI integration<br/>PR comments]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L ext

Components

File	Responsibility
`aieval/cli.py`	Typer CLI: run, view, diff, ci
`aieval/core/runner.py`	Parallel execution with retry + partial resumption
`aieval/core/dataset.py`	JSONL + HF dataset loaders
`aieval/core/scorer.py`	Scorer decorator + protocol
`aieval/scorers/*.py`	Built-in scorers (exact match, JSON schema, ROUGE, BLEU, LLM-as-judge, rubric)
`aieval/backends/*.py`	SQLite, DuckDB, Postgres backends
`aieval/providers/*.py`	LLM provider adapters (SarmaLink, OpenAI, Ollama)
`aieval/viewer/app.py`	FastAPI + HTMX viewer
`aieval/ci/github_action.py`	PR comment with regression delta

Run lifecycle

stateDiagram-v2
  [*] --> Pending
  Pending --> Running: created in DB
  Running --> Persisting: all examples scored
  Persisting --> Done: results stored
  Running --> Failed: provider errors after retries
  Done --> [*]
  Failed --> [*]
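
The lifecycle above can be sketched as a small transition table. The state names and edges are copied from the diagram; the `RunState` enum and `advance` helper are hypothetical names used for illustration, not the runner's actual API.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    PERSISTING = "Persisting"
    DONE = "Done"
    FAILED = "Failed"


# Allowed transitions, mirroring the state diagram.
TRANSITIONS = {
    RunState.PENDING: {RunState.RUNNING},
    RunState.RUNNING: {RunState.PERSISTING, RunState.FAILED},
    RunState.PERSISTING: {RunState.DONE},
    RunState.DONE: set(),    # terminal
    RunState.FAILED: set(),  # terminal
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

For example, `advance(RunState.PENDING, RunState.RUNNING)` succeeds, while `advance(RunState.DONE, RunState.RUNNING)` raises `ValueError` because `Done` is terminal.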

Why each piece

Plain Python scorers rather than a DSL because ordinary functions are expressive enough for scoring. You get full IDE support and no leaky abstraction.
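
To make this concrete, here is a rough sketch of what a plain-Python scorer could look like. The `scorer` decorator and `SCORERS` registry below are stand-ins written for this example; the real decorator and protocol in `aieval/core/scorer.py` may have a different signature.

```python
from typing import Callable, Dict

# Hypothetical registry; not the actual aieval implementation.
SCORERS: Dict[str, Callable[[str, str], float]] = {}


def scorer(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
    """Register a scoring function under its own name and return it unchanged."""
    SCORERS[fn.__name__] = fn
    return fn


@scorer
def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


@scorer
def token_overlap(output: str, expected: str) -> float:
    """Fraction of expected tokens that also appear in the output."""
    want = set(expected.lower().split())
    got = set(output.lower().split())
    return len(want & got) / len(want) if want else 0.0
```

Because each scorer is an ordinary function, it can be unit-tested, stepped through in a debugger, and type-checked like any other code.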

Async parallel execution because eval runs are I/O-bound on provider calls. The default is 8 concurrent calls; raise it if your provider's rate limits allow.
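
A minimal sketch of bounded-concurrency execution with retries, using `asyncio.Semaphore`. The names `run_parallel` and `fake_provider`, the `retries` parameter, and the backoff values are assumptions made for this example, not the interface of `aieval/core/runner.py`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_parallel(
    inputs: List[str],
    call: Callable[[str], Awaitable[str]],
    concurrency: int = 8,
    retries: int = 2,
) -> List[str]:
    """Run `call` over all inputs with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(item: str) -> str:
        async with sem:  # blocks while `concurrency` calls are already running
            for attempt in range(retries + 1):
                try:
                    return await call(item)
                except Exception:
                    if attempt == retries:
                        raise  # retries exhausted; surface the error
                    # Short exponential backoff; values are illustrative only.
                    await asyncio.sleep(0.01 * 2 ** attempt)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(i) for i in inputs))


async def fake_provider(prompt: str) -> str:
    """Stand-in for a real provider call."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

Usage would look like `asyncio.run(run_parallel(prompts, fake_provider, concurrency=8))`. Since the work is waiting on network responses, a single event loop with a semaphore is enough; no threads or processes are needed.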

Versioning by git SHA + dataset hash so two runs with the same model and the same dataset are comparable. A changed dataset produces a different hash; changed code (for example, a new prompt) produces a different SHA.
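
One way such a version key could be computed is sketched below. The choice of SHA-256 over canonical JSON, the 12-character truncation, and the helper names `dataset_hash`, `git_sha`, and `run_version` are all assumptions for illustration; the project may hash the dataset differently.

```python
import hashlib
import json
import subprocess
from typing import Iterable


def dataset_hash(rows: Iterable[dict]) -> str:
    """SHA-256 over canonical JSON of each row; row order affects the result."""
    h = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of key order within a row.
        h.update(json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def git_sha() -> str:
    """Commit SHA of the current checkout (requires git and a repository)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def run_version(code_sha: str, data_hash: str) -> str:
    """Combine both identifiers into a short key for comparing runs."""
    return f"{code_sha[:12]}+{data_hash[:12]}"
```

Two runs with equal `run_version(git_sha(), dataset_hash(rows))` keys used the same code and the same data, so a score difference between them comes from something else, such as the model.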

SQLite default backend because it needs zero infrastructure. Promote to Postgres when more than one machine is running evals.
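
To show how little setup SQLite needs, here is a sketch with the standard-library `sqlite3` module. The `results` table, its columns, and the helper functions are hypothetical; they are not the schema used by `aieval/backends`.

```python
import sqlite3
from typing import Optional

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    run_id     TEXT NOT NULL,
    example_id TEXT NOT NULL,
    scorer     TEXT NOT NULL,
    score      REAL NOT NULL,
    PRIMARY KEY (run_id, example_id, scorer)
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_score(conn: sqlite3.Connection, run_id: str, example_id: str,
               scorer: str, score: float) -> None:
    """Upsert one score; rewriting the same key is harmless on a resumed run."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (run_id, example_id, scorer, score),
        )


def mean_score(conn: sqlite3.Connection, run_id: str, scorer: str) -> Optional[float]:
    """Average score for a run and scorer, or None if nothing is stored."""
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE run_id = ? AND scorer = ?",
        (run_id, scorer),
    ).fetchone()
    return row[0]
```

The whole backend is a single file on disk (or `:memory:` for tests), which is why it works as a default. A shared server such as Postgres only becomes necessary once several machines need to write to the same store.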

HTMX viewer rather than React because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.
