Quick Start

git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env

In .env, set at minimum:

SARMALINK_API_KEY=...

Run the example

uv run aieval run examples/summarisation/eval.py

You should see:

Run f6f08131-...: summarisation on 3 examples with model=smart (sha=a79690f, dataset=7358a450b57d)
Done. Pass rate: 66.7%, avg latency: 1240ms

Each run is tagged with the git SHA and the dataset content version.
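As a rough illustration of what such tags could look like, the sketch below derives a short content hash from a list of examples and reads the short SHA of the current commit. This is not taken from the ai-eval-runner source: the function names, the use of SHA-256 over canonical JSON, and the 12-character truncation are all assumptions chosen to resemble the sample output above.

```python
import hashlib
import json
import subprocess


def dataset_version(examples: list[dict]) -> str:
    """Hypothetical helper: hash the dataset content into a short version tag."""
    # Canonical JSON (sorted keys, fixed separators) so key order does not
    # change the hash; only the content does.
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    # Assumption: a 12-character prefix of the SHA-256 hex digest.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def git_short_sha() -> str:
    """Hypothetical helper: short SHA of HEAD, or "unknown" outside a checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return out.stdout.strip()
```

With a content hash like this, editing any example changes the dataset version, so two runs with the same version were scored against identical data.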

View results

uv run aieval view

Open http://localhost:8000 to see the list of runs. Click a run for its per-example view, or visit /diff/<run_a>/<run_b> for a regression diff between two runs.

Write your first eval

Create my_eval.py:

from aieval import dataset, run, scorer
from aieval.scorers import rouge_l


@scorer
def length_under_120_words(prediction, _expected):
    return 1.0 if len(prediction.split()) <= 120 else 0.0


if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run with uv run aieval run my_eval.py. The CLI executes the file as the main module, so the if __name__ == "__main__": guard fires.
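To see why the guard fires, here is a minimal sketch of running a file as the main module with the standard library's runpy. How the aieval CLI does this internally is not shown in this page, so treat run_eval_file and the demo script as hypothetical stand-ins.

```python
import runpy
import tempfile
from pathlib import Path


def run_eval_file(path: str) -> dict:
    """Hypothetical helper: execute a Python file as if it were run directly."""
    # run_name="__main__" sets the file's __name__ to "__main__",
    # so its `if __name__ == "__main__":` block executes.
    return runpy.run_path(path, run_name="__main__")


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_eval.py"
    script.write_text(
        "fired = False\n"
        'if __name__ == "__main__":\n'
        "    fired = True\n"
    )
    # run_path returns the executed module's globals.
    namespace = run_eval_file(str(script))
    guard_fired = namespace["fired"]
```

Importing the same file as a normal module would leave the guard untouched, which is why an eval file can still be imported by tests without starting a run.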

Compare two runs

uv run aieval list                       # find run ids
uv run aieval diff <run_a> <run_b>       # per-scorer deltas
uv run aieval pairwise <run_a> <run_b>   # bootstrapped confidence interval
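For intuition about the bootstrapped confidence interval, the sketch below computes a percentile bootstrap over paired per-example score differences. It is a generic version of the technique, not the pairwise command's actual implementation; the function name, resample count, and 95% level are assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap CI for mean(b - a).

    scores_a and scores_b hold one score per example, in the same order.
    Returns (mean difference, lower bound, upper bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    # Pair the runs example by example so per-example difficulty cancels out.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper
```

If the interval excludes zero, the difference between the two runs is unlikely to be resampling noise alone. With only a handful of examples the interval will be wide, which is itself useful information.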

CI integration

In .github/workflows/evals.yml:

- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05

The ci command compares the run to the previous run of the same name and exits non-zero if any scorer drops past the threshold.
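The gate logic described above can be sketched as follows. This assumes the threshold is an absolute drop in each scorer's mean score and that scores arrive as plain name-to-mean mappings; both are assumptions, and the function names are hypothetical rather than part of aieval.

```python
def regressions(
    previous: dict[str, float], current: dict[str, float], threshold: float
) -> dict[str, float]:
    """Hypothetical helper: scorers whose mean dropped by more than threshold."""
    drops = {}
    for name, prev_score in previous.items():
        if name not in current:
            continue  # scorer not present in the new run; nothing to compare
        drop = prev_score - current[name]
        if drop > threshold:
            drops[name] = drop
    return drops


def gate_exit_code(
    previous: dict[str, float], current: dict[str, float], threshold: float
) -> int:
    """Return 1 (fail the CI step) if any scorer regressed, else 0."""
    return 1 if regressions(previous, current, threshold) else 0
```

A non-zero exit code is what makes the GitHub Actions step fail, so a gate like this turns a score regression into a failed check on the pull request.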

See Comparing-Runs for the full diff, gate and pairwise reference, and Scorers for the built-in scorers and the LLM judge.
