Quick Start

git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env

At minimum, set the following in .env:

SARMALINK_API_KEY=...
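Before the first run, it can help to confirm that the key is actually present in .env. The following is a minimal sketch, not part of ai-eval-runner: `missing_keys` is a hypothetical helper that parses simple KEY=VALUE lines and reports required keys that are absent or empty. It assumes a plain .env format with no `export` prefixes or multi-line values.

```python
from typing import Iterable


def missing_keys(env_text: str, required: Iterable[str]) -> list[str]:
    """Return the required keys that are absent or empty in .env-style text.

    Hypothetical helper for illustration; ai-eval-runner may load .env differently.
    """
    values: dict[str, str] = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return [key for key in required if not values.get(key)]


# Example with inline text; in practice, pass the contents of your .env file.
sample = "# local settings\nSARMALINK_API_KEY=\n"
print(missing_keys(sample, ["SARMALINK_API_KEY"]))
```

An empty list means every required key has a non-empty value.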

Run the example

uv run aieval run examples/summarisation/eval.py

You should see output like this (the run ID and latency will differ):

Run abc12345: summarisation on 3 examples with model=smart
Done. Pass rate: 66.7%, avg latency: 1240ms
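With 3 examples, a pass rate of 66.7% corresponds to 2 of 3 examples passing. The sketch below shows one way such a summary line could be computed. It is an illustration only: how ai-eval-runner decides that an example "passed" is not stated here, and the per-example latencies are made-up values chosen so that they average to the figure shown above.

```python
from statistics import mean


def summarise(passed: list[bool], latencies_ms: list[float]) -> str:
    """Format a pass rate and mean latency in the style of the output above."""
    pass_rate = 100.0 * sum(passed) / len(passed)
    avg_latency = mean(latencies_ms)
    return f"Pass rate: {pass_rate:.1f}%, avg latency: {avg_latency:.0f}ms"


# Hypothetical per-example results: two passes, one failure.
print(summarise([True, True, False], [1100, 1300, 1320]))
# prints: Pass rate: 66.7%, avg latency: 1240ms
```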

View results

uv run uvicorn aieval.viewer.app:app --reload

Open http://localhost:8000 to see the list of runs. Click a run to open its per-example view.

Write your first eval

Create my_eval.py:

from aieval import dataset, scorer, run
from aieval.scorers import rouge_l

@scorer
def length_under_120_words(prediction, _expected):
    """Score 1.0 if the prediction has at most 120 words, else 0.0."""
    return 1.0 if len(prediction.split()) <= 120 else 0.0

if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run with uv run python my_eval.py.
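A custom scorer is ordinary Python, so its logic can be checked in isolation before spending API calls. The sketch below copies the scorer body without the @scorer decorator, on the assumption that the decorator only registers the function and does not change how it is called. Note that, despite the name, a prediction of exactly 120 words passes.

```python
def length_under_120_words(prediction: str, _expected: str) -> float:
    """Same logic as the scorer above, without the aieval decorator."""
    return 1.0 if len(prediction.split()) <= 120 else 0.0


# Boundary check: exactly 120 words passes, 121 words fails.
at_limit = "word " * 120
over_limit = "word " * 121

assert length_under_120_words(at_limit, "") == 1.0
assert length_under_120_words(over_limit, "") == 0.0
print("scorer boundary checks passed")
```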

CI integration

In .github/workflows/evals.yml:

- run: uv run aieval ci --baseline main

The CLI compares the current run to the baseline run on main and posts a PR comment with the regression delta per scorer. The build fails if any scorer drops below the configured threshold.
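To make the comparison concrete, here is a minimal sketch of the kind of check described above. It is not the aieval implementation: `compare_runs` and `failing_scorers` are hypothetical names, the scores are illustrative, and the threshold is assumed to be an absolute minimum score per scorer rather than a maximum allowed drop.

```python
def compare_runs(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Per-scorer delta (current minus baseline) for scorers present in both runs."""
    return {name: current[name] - baseline[name] for name in baseline if name in current}


def failing_scorers(current: dict[str, float], threshold: float) -> list[str]:
    """Names of scorers whose current score is below the threshold (assumed absolute)."""
    return sorted(name for name, score in current.items() if score < threshold)


# Illustrative mean scores per scorer for the baseline and the current run.
baseline = {"rouge_l": 0.5, "length_under_120_words": 1.0}
current = {"rouge_l": 0.25, "length_under_120_words": 1.0}

for name, delta in compare_runs(baseline, current).items():
    print(f"{name}: {delta:+.2f}")

failed = failing_scorers(current, threshold=0.4)
print("build should fail" if failed else "build should pass", failed)
```

In a CI job, a non-empty `failed` list would translate into a non-zero exit code so that the workflow step fails.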
