Quick Start

sarmakska edited this page May 31, 2026 · 2 revisions

git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env

At minimum, set the following in .env:

SARMALINK_API_KEY=...

Run the example

uv run aieval run examples/summarisation/eval.py

You should see:

Run f6f08131-...: summarisation on 3 examples with model=smart (sha=a79690f, dataset=7358a450b57d)
Done. Pass rate: 66.7%, avg latency: 1240ms

Each run is tagged with the git SHA and the dataset content version.
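
To illustrate what a dataset content version can look like, here is a minimal sketch that hashes the examples themselves, so the tag changes whenever the data changes. The helper name dataset_version, the choice of SHA-256 over canonical JSON, and the 12-character length are assumptions made for illustration; the hashing scheme aieval actually uses is not described on this page.

```python
import hashlib
import json


def dataset_version(examples, length=12):
    """Hypothetical helper: short content hash of a list of example dicts.

    Assumption: the version is derived from the data alone, so sorting keys
    and fixing separators makes the hash independent of dict key order.
    """
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


examples = [{"prompt": "Summarise: ...", "expected": "..."}]
reordered_keys = [{"expected": "...", "prompt": "Summarise: ..."}]
edited = [{"prompt": "Summarise: ...", "expected": "a different reference"}]

# Same content gives the same tag; any edit to the data gives a new one.
print(dataset_version(examples) == dataset_version(reordered_keys))
print(dataset_version(examples) == dataset_version(edited))
```

Tagging by content rather than by file name means two runs are only treated as comparable when they saw the same examples.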

View results

uv run aieval view

Open http://localhost:8000 to see the list of runs. Click a run for the per-example view, or visit /diff/<run_a>/<run_b> for a regression diff.

Write your first eval

Create my_eval.py:

from aieval import dataset, run, scorer
from aieval.scorers import rouge_l


@scorer
def length_under_120_words(prediction, _expected):
    return 1.0 if len(prediction.split()) <= 120 else 0.0


if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run it with uv run aieval run my_eval.py. The CLI executes the file as the main module, so the if __name__ == "__main__": guard fires and run(...) is called.
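
Before wiring a custom scorer into a run, it can help to check its logic on hand-made inputs. The sketch below copies the body of length_under_120_words without the @scorer decorator, because this page does not say whether a decorated scorer stays directly callable. The sample strings are invented for the check.

```python
def length_under_120_words(prediction, _expected):
    # Same logic as the scorer above: 1.0 if the prediction has at most
    # 120 whitespace-separated words, else 0.0. The expected value is unused.
    return 1.0 if len(prediction.split()) <= 120 else 0.0


at_limit = "word " * 120      # exactly 120 words
over_limit = "word " * 121    # one word too many

print(length_under_120_words(at_limit, ""))
print(length_under_120_words(over_limit, ""))
```

Checking the boundary case (exactly 120 words) confirms that the comparison is inclusive, which is easy to get wrong in a threshold scorer.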

Compare two runs

uv run aieval list                       # find run ids
uv run aieval diff <run_a> <run_b>       # per-scorer deltas
uv run aieval pairwise <run_a> <run_b>   # bootstrapped confidence interval
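
For intuition about what a bootstrapped confidence interval over two runs means, here is a minimal sketch of a paired percentile bootstrap on per-example scores. It is not the aieval implementation: the function name paired_bootstrap_ci, its parameters, and the sample scores are all hypothetical, and the resampling scheme the pairwise command really uses may differ.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Hypothetical helper: mean difference (b - a) with a percentile CI.

    Examples are paired by position, so both runs must cover the same
    examples in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must have the same, non-zero number of examples")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample example indices with replacement and average the differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Made-up per-example scores for two runs on the same five examples.
run_a = [0.5, 0.6, 0.4, 0.7, 0.5]
run_b = [0.6, 0.8, 0.5, 1.0, 0.7]
mean_diff, lower, upper = paired_bootstrap_ci(run_a, run_b)
print(f"mean diff={mean_diff:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference between the runs is unlikely to be resampling noise. With only a handful of examples the interval is wide and should be read cautiously.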

CI integration

In .github/workflows/evals.yml:

- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05

The ci command compares the run to the previous run of the same name and exits non-zero if any scorer drops by more than the threshold (0.05 in the example above), which fails the workflow step.
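
The gate logic described above can be sketched as a comparison of per-scorer means. This assumes the threshold is an absolute drop in the mean score, which this page does not confirm. The function name regressed_scorers and the score values are hypothetical.

```python
def regressed_scorers(previous, current, threshold=0.05):
    """Hypothetical helper: scorers whose mean fell by more than `threshold`.

    `previous` and `current` map scorer name to mean score. Scorers missing
    from either run are skipped.
    """
    drops = {}
    for name, old in previous.items():
        if name not in current:
            continue
        drop = old - current[name]
        if drop > threshold:
            drops[name] = drop
    return drops


# Made-up means for the previous and the current run of the same eval.
previous = {"rouge_l": 0.62, "length_under_120_words": 1.0}
current = {"rouge_l": 0.55, "length_under_120_words": 1.0}

regressed = regressed_scorers(previous, current, threshold=0.05)
exit_code = 1 if regressed else 0  # a non-zero exit code fails the CI step
print(regressed, exit_code)
```

Gating per scorer rather than on a single aggregate keeps a gain in one metric from hiding a regression in another.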

See Comparing-Runs for the full diff, gate and pairwise reference, and Scorers for the built-in scorers and the LLM judge.
