Architecture

sarmakska edited this page May 3, 2026 · 3 revisions

graph LR
  D[(Dataset<br/>JSONL or HF)] --> R[Runner]
  S[Scorers<br/>plain Python] --> R
  R -->|N parallel calls| L[LLM provider]
  R --> B[(Backend<br/>SQLite / Postgres / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CI[CI integration<br/>PR comments]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L ext

Components

File                          Responsibility
aieval/cli.py                 Typer CLI: run, view, diff, ci
aieval/core/runner.py         Parallel execution with retry + partial resumption
aieval/core/dataset.py        JSONL + HF dataset loaders
aieval/core/scorer.py         Scorer decorator + protocol
aieval/scorers/*.py           Built-in scorers (exact match, JSON schema, ROUGE, BLEU, LLM-as-judge, rubric)
aieval/backends/*.py          SQLite, DuckDB, Postgres backends
aieval/providers/*.py         LLM provider adapters (SarmaLink, OpenAI, Ollama)
aieval/viewer/app.py          FastAPI + HTMX viewer
aieval/ci/github_action.py    PR comment with regression delta

Run lifecycle

stateDiagram-v2
  [*] --> Pending
  Pending --> Running: created in DB
  Running --> Persisting: all examples scored
  Persisting --> Done: results stored
  Running --> Failed: provider errors after retries
  Done --> [*]
  Failed --> [*]
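
The lifecycle in the diagram can be read as a small transition table. The sketch below is illustrative only: `RunState`, `ALLOWED`, and `advance` are hypothetical names, not taken from `aieval/core/runner.py`, and the real runner may track state differently.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    PERSISTING = "Persisting"
    DONE = "Done"
    FAILED = "Failed"


# Transitions copied from the state diagram; Done and Failed are terminal.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING},
    RunState.RUNNING: {RunState.PERSISTING, RunState.FAILED},
    RunState.PERSISTING: {RunState.DONE},
    RunState.DONE: set(),
    RunState.FAILED: set(),
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return `target` if the diagram allows the move, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Note that in this reading a run can only fail while it is `Running`; the diagram shows no failure edge out of `Persisting`.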

Why each piece

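Plain Python scorers rather than a DSL because scoring logic is ordinary code, and Python already expresses it well. You get full IDE support (completion, type checking, debugging) and no leaky abstraction to work around.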
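
As a rough illustration of what a plain Python scorer could look like, here is a minimal sketch. The `scorer` decorator, the `SCORERS` registry, and the `(output, expected) -> float` signature are assumptions made for this example; the actual decorator and protocol in `aieval/core/scorer.py` may differ.

```python
from typing import Callable, Dict

# Hypothetical registry mapping scorer names to functions.
SCORERS: Dict[str, Callable[[str, str], float]] = {}


def scorer(name: str):
    """Register a plain function as a scorer under `name` (assumed API)."""
    def register(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
        SCORERS[name] = fn
        return fn  # the function stays directly callable and testable
    return register


@scorer("exact_match")
def exact_match(output: str, expected: str) -> float:
    # 1.0 when the strings match after trimming whitespace, else 0.0.
    return 1.0 if output.strip() == expected.strip() else 0.0


@scorer("contains")
def contains(output: str, expected: str) -> float:
    # Case-insensitive substring check.
    return 1.0 if expected.lower() in output.lower() else 0.0
```

Because each scorer is an ordinary function, it can be unit-tested and stepped through in a debugger without any framework involved.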

Async parallel execution because eval runs are I/O-bound on provider calls. The default is 8 concurrent calls; raise it if your provider's rate limits allow.
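
One common way to bound concurrency on I/O-bound calls is an `asyncio.Semaphore`. The sketch below assumes that approach; `run_all`, `call_with_retry`, `fake_provider`, and the backoff parameters are hypothetical and stand in for whatever the real runner does.

```python
import asyncio


async def call_with_retry(call, example, retries=3, base_delay=0.5):
    """Await `call(example)`, retrying with exponential backoff on any error."""
    for attempt in range(retries):
        try:
            return await call(example)
        except Exception:
            if attempt == retries - 1:
                raise  # retries exhausted; let the caller mark the run as failed
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_all(call, examples, concurrency=8):
    """Run `call` over all examples with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(example):
        async with sem:
            return await call_with_retry(call, example)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(e) for e in examples))


async def fake_provider(example: str) -> str:
    # Stand-in for a real LLM provider call.
    await asyncio.sleep(0)
    return example.upper()
```

With this shape, tuning for a higher rate limit is a single argument change, for example `asyncio.run(run_all(fake_provider, prompts, concurrency=32))`.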

Versioning by git SHA + dataset hash so that two runs with the same model and the same dataset are comparable, and any difference is visible: a changed dataset gets a different hash, and changed code (such as a new prompt) gets a different SHA.
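
A version key of this kind might be derived as follows. This is a sketch under assumptions: the choice of SHA-256, the 12-character truncation, and the function names `dataset_hash`, `git_sha`, and `run_version` are all illustrative, not taken from the project.

```python
import hashlib
import subprocess


def dataset_hash(raw: bytes) -> str:
    """Content hash of the dataset bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(raw).hexdigest()


def git_sha() -> str:
    """Current commit SHA; requires running inside a git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_version(code_sha: str, data_hash: str) -> str:
    """Combine both identifiers into one short, comparable key."""
    return f"{code_sha[:12]}:{data_hash[:12]}"
```

Two runs whose keys match were produced by the same code against the same data, so a score difference between them points at the model or at nondeterminism rather than at the setup.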

SQLite default backend because it needs zero infrastructure. Promote to Postgres when more than one machine runs evals.
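
To show how little setup a SQLite-backed store needs, here is a minimal sketch using the standard-library `sqlite3` module. The `results` table, its columns, and the helper names are hypothetical; the schema used in `aieval/backends` is not documented on this page.

```python
import sqlite3

# Hypothetical schema: one row per (run, example, scorer).
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    run_id     TEXT NOT NULL,
    example_id TEXT NOT NULL,
    scorer     TEXT NOT NULL,
    score      REAL NOT NULL,
    PRIMARY KEY (run_id, example_id, scorer)
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; a file path is all the setup required."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_score(conn, run_id, example_id, scorer, score):
    # INSERT OR REPLACE makes a re-run of the same example idempotent.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (run_id, example_id, scorer, score),
        )


def mean_score(conn, run_id, scorer):
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE run_id = ? AND scorer = ?",
        (run_id, scorer),
    ).fetchone()
    return row[0]
```

A single file on disk works well for one machine; SQLite serializes writers, which is one reason to move to Postgres once several machines write results concurrently.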

HTMX viewer rather than React because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.