Quick Start

sarmakska edited this page May 3, 2026 · 2 revisions

git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env

In .env, set at minimum:

SARMALINK_API_KEY=...

Run the example

uv run aieval run examples/summarisation/eval.py

You should see:

Run abc12345: summarisation on 3 examples with model=smart
Done. Pass rate: 66.7%, avg latency: 1240ms
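
The summary line can be read as a simple aggregate over the per-example results. The sketch below is only an illustration of that arithmetic, not the tool's actual implementation: the result fields (passed, latency_ms) and the per-example values are hypothetical, chosen so that the totals match the sample output above.

```python
# Hypothetical per-example results; the field names and values are assumptions,
# picked only to reproduce the sample summary line.
results = [
    {"passed": True, "latency_ms": 1100},
    {"passed": True, "latency_ms": 1200},
    {"passed": False, "latency_ms": 1420},
]

# Pass rate: share of examples that passed, as a percentage.
pass_rate = 100 * sum(r["passed"] for r in results) / len(results)

# Average latency: arithmetic mean over all examples.
avg_latency = sum(r["latency_ms"] for r in results) / len(results)

summary = f"Done. Pass rate: {pass_rate:.1f}%, avg latency: {avg_latency:.0f}ms"
print(summary)
```

With 3 examples, a pass rate of 66.7% means 2 of the 3 passed.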

View results

uv run uvicorn aieval.viewer.app:app --reload

Open http://localhost:8000 to see a list of runs. Click a run to open its per-example view.

Write your first eval

Create my_eval.py:

from aieval import dataset, scorer, run
from aieval.scorers import rouge_l

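# Custom scorer: 1.0 if the prediction has at most 120 words, else 0.0.
# The expected output is not used, hence the leading underscore.
@scorer
def length_under_120_words(prediction, _expected):
    return 1.0 if len(prediction.split()) <= 120 else 0.0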

if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run with uv run python my_eval.py.
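
Before wiring a custom scorer into a run, it can help to check its logic in isolation. The sketch below assumes a scorer is just a plain function of (prediction, expected) that returns a float. The @scorer decorator is left out so the snippet runs without aieval installed; whether a decorated function stays directly callable is not confirmed by this page.

```python
def length_under_120_words(prediction, _expected):
    """Return 1.0 when the prediction has at most 120 whitespace-separated words."""
    return 1.0 if len(prediction.split()) <= 120 else 0.0


# Boundary check: exactly 120 words passes, 121 words fails.
at_limit = "word " * 120
over_limit = "word " * 121

assert length_under_120_words(at_limit, "") == 1.0
assert length_under_120_words(over_limit, "") == 0.0
print("scorer boundary checks passed")
```

Testing the boundary (120 vs. 121 words) catches off-by-one mistakes such as writing < instead of <=.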

CI integration

In .github/workflows/evals.yml:

- run: uv run aieval ci --baseline main

The CLI compares the current run to the baseline run on main and posts a PR comment with the per-scorer regression delta. The build fails if any scorer drops below its configured threshold.
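
The comparison described above can be pictured roughly as follows. This is a hedged sketch of the idea, not the aieval ci implementation: the function names, the score dictionaries, and the threshold values are all hypothetical, and the real CLI may store and compare runs differently.

```python
def regression_deltas(baseline, current):
    """Per-scorer change (current minus baseline) for scorers present in both runs."""
    return {name: current[name] - baseline[name] for name in baseline if name in current}


def failing_scorers(current, thresholds):
    """Names of scorers whose current score is below their configured threshold."""
    return sorted(
        name for name, minimum in thresholds.items()
        if current.get(name, 0.0) < minimum
    )


# Hypothetical mean scores per scorer for the baseline (main) and the current run.
baseline = {"rouge_l": 0.50, "length_under_120_words": 1.00}
current = {"rouge_l": 0.25, "length_under_120_words": 1.00}
# Hypothetical minimum acceptable scores.
thresholds = {"rouge_l": 0.40, "length_under_120_words": 0.90}

deltas = regression_deltas(baseline, current)
failed = failing_scorers(current, thresholds)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("failed:", failed)
```

In a CI job, a non-empty failed list would translate into a non-zero exit code, which is what makes the build fail.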
