
Architecture

sarmakska edited this page May 31, 2026 · 3 revisions

graph LR
  D[(Dataset<br/>JSONL or list)] --> R[Runner]
  S[Scorers<br/>plain Python] --> R
  R -->|N parallel calls| L[LLM provider]
  R --> B[(Backend<br/>SQLite / DuckDB)]
  B --> V[FastAPI + HTMX viewer]
  B --> CI[CI regression gate]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class L ext

Components

| File | Responsibility |
| --- | --- |
| `aieval/cli.py` | Typer CLI: `run`, `list`, `view`, `diff`, `ci` |
| `aieval/core/runner.py` | Parallel execution with retry |
| `aieval/core/dataset.py` | JSONL file loader and in-process list loader |
| `aieval/core/scorer.py` | Scorer decorator and protocol |
| `aieval/scorers/*.py` | Built-in scorers: exact match, JSON validity, ROUGE |
| `aieval/backends/*.py` | SQLite and DuckDB backends |
| `aieval/providers/*.py` | LLM provider adapters: SarmaLink, OpenAI |
| `aieval/viewer/app.py` | FastAPI + HTMX viewer |

Run lifecycle

stateDiagram-v2
  [*] --> Pending
  Pending --> Running: created in DB
  Running --> Persisting: all examples scored
  Persisting --> Done: results stored
  Running --> Failed: provider errors after retries
  Done --> [*]
  Failed --> [*]
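
The lifecycle above can be read as a small transition table. The following is a minimal sketch, not the project's actual code: `RunState`, `TRANSITIONS`, and `advance` are hypothetical names chosen for illustration, and only the states and edges come from the diagram.

```python
from enum import Enum


class RunState(Enum):
    """States taken from the run lifecycle diagram (names are illustrative)."""
    PENDING = "pending"
    RUNNING = "running"
    PERSISTING = "persisting"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions, copied edge by edge from the diagram.
TRANSITIONS = {
    RunState.PENDING: {RunState.RUNNING},                      # created in DB
    RunState.RUNNING: {RunState.PERSISTING, RunState.FAILED},  # scored, or provider errors after retries
    RunState.PERSISTING: {RunState.DONE},                      # results stored
    RunState.DONE: set(),                                      # terminal
    RunState.FAILED: set(),                                    # terminal
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

In this sketch `Failed` is reachable only from `Running`, matching the diagram: a run fails on provider errors, not while persisting.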

Why each piece

Plain Python scorers rather than a DSL, because Python is already expressive enough for scoring logic. You get full IDE support and no leaky abstraction.
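
To make "plain Python" concrete, here is a rough sketch of what two scorers might look like. The `(output, expected) -> float` signature is an assumption made for illustration; the real scorer decorator and protocol live in `aieval/core/scorer.py` and are not shown on this page, so the decorator is left out here.

```python
import json


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def json_valid(output: str, expected: str = "") -> float:
    """Score 1.0 when the output parses as JSON; `expected` is unused here."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0
```

Because these are ordinary functions, they can be unit-tested and stepped through in a debugger like any other code.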

Async parallel execution because eval runs are I/O-bound on provider calls. Raise the concurrency argument on run if your provider's rate limits allow it.
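
A bounded-concurrency runner with retry can be sketched with `asyncio` alone. This is an illustration under stated assumptions, not the code in `aieval/core/runner.py`: `call_with_retry`, `run_all`, and their parameters are hypothetical, and `call` stands in for any async provider function.

```python
import asyncio


async def call_with_retry(call, prompt, retries=3, base_delay=0.5):
    """Await call(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return await call(prompt)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller mark the run as failed
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_all(call, prompts, concurrency=8):
    """Run every prompt through `call`, with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(prompt):
        async with sem:  # blocks while `concurrency` calls are already running
            return await call_with_retry(call, prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(one(p) for p in prompts))
```

The semaphore is what a concurrency setting would control in this sketch: a larger value means more simultaneous provider calls and a higher chance of hitting rate limits.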

Versioning by git SHA + dataset hash, so that two runs with the same model and the same dataset are comparable. A different dataset gets a different hash; different code (for example, a new prompt) gets a different SHA.
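
One way such a version key could be computed is sketched below. The helper names, the choice of SHA-256, and the `sha+hash` format are assumptions for illustration; the page does not specify how the project derives either value.

```python
import hashlib
import json
import subprocess


def dataset_hash(examples: list[dict]) -> str:
    """Hash the examples in order; keys are sorted so dict ordering does not matter."""
    h = hashlib.sha256()
    for example in examples:
        h.update(json.dumps(example, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def git_sha() -> str:
    """Return the current commit SHA; must be called inside a git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_version(sha: str, examples: list[dict]) -> str:
    """Combine a code SHA and a dataset hash into one comparable key (hypothetical format)."""
    return f"{sha[:12]}+{dataset_hash(examples)[:12]}"
```

Two runs with equal keys used the same code and the same data, so a score difference between them cannot be blamed on either.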

SQLite default backend because it needs zero infrastructure. Switch to DuckDB by setting AIEVAL_BACKEND when you want columnar analytics over many runs.
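
Backend selection from the environment might look roughly like this. Only the `AIEVAL_BACKEND` variable name comes from this page; the accepted values `sqlite` and `duckdb`, the case-insensitive matching, and the `resolve_backend` helper are assumptions for illustration.

```python
import os

# Assumed value names; the page only names the AIEVAL_BACKEND variable itself.
SUPPORTED_BACKENDS = ("sqlite", "duckdb")


def resolve_backend(env=None) -> str:
    """Pick the backend name from AIEVAL_BACKEND, defaulting to SQLite."""
    env = os.environ if env is None else env
    name = env.get("AIEVAL_BACKEND", "sqlite").strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}")
    return name
```

Failing loudly on an unrecognized value avoids silently writing results to the default backend when the variable is misspelled.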

HTMX viewer rather than React because the viewer is a read-only inspection tool. HTML over the wire is faster to build and easier to embed.
