Quick Start
sarmakska edited this page May 31, 2026
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env
Set at minimum:
SARMALINK_API_KEY=...
uv run aieval run examples/summarisation/eval.py
You should see:
Run f6f08131-...: summarisation on 3 examples with model=smart (sha=a79690f, dataset=7358a450b57d)
Done. Pass rate: 66.7%, avg latency: 1240ms
Each run is tagged with the git SHA and the dataset content version.
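
The run line above shows a 7-character git SHA and a 12-character dataset version. This page does not say how aieval derives the dataset version. One common approach, sketched here purely as an assumption, is to hash a canonical serialisation of the examples so that any edit to the data changes the version. The names `dataset_version` and `git_short_sha` are hypothetical and are not part of aieval.

```python
import hashlib
import json
import subprocess


def dataset_version(examples, length=12):
    """Return a short content hash of a list of JSON-serialisable examples.

    Assumption: sha256 over canonical JSON, truncated to `length` hex
    characters. aieval's actual scheme may differ.
    """
    # sort_keys and fixed separators make the result independent of dict key order.
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def git_short_sha(default="unknown"):
    """Return the short SHA of HEAD, or `default` when git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Not inside a git checkout, or git is not installed.
        return default


if __name__ == "__main__":
    examples = [{"prompt": "Summarise: ...", "expected": "..."}]
    print(f"sha={git_short_sha()}, dataset={dataset_version(examples)}")
```

Tagging by content rather than by filename means two runs are comparable only when their dataset versions match, which is the property a regression diff relies on.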
uv run aieval view
Open http://localhost:8000 to see the list of runs. Click into a run for the per-example view, or visit /diff/<run_a>/<run_b> for a regression diff.
Create my_eval.py:
from aieval import dataset, run, scorer
from aieval.scorers import rouge_l


# Custom scorer: 1.0 when the prediction is at most 120 words, else 0.0.
@scorer
def length_under_120_words(prediction, _expected):
    return 1.0 if len(prediction.split()) <= 120 else 0.0


if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run with uv run aieval run my_eval.py. The CLI executes the file as the main module, so the if __name__ == "__main__": guard fires.
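
The example imports the built-in `rouge_l` scorer alongside the custom one. As a rough guide to what such a scorer measures, here is a minimal, dependency-free sketch of ROUGE-L computed as an F1 over the longest common subsequence of whitespace tokens. It is not aieval's implementation (its tokenisation and F-measure weighting may differ), and the names `lcs_length` and `rouge_l_f1` are hypothetical. It keeps the same `(prediction, expected)` to float shape as `length_under_120_words`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, expected):
    """ROUGE-L style F1 on lowercased whitespace tokens, in [0.0, 1.0]."""
    pred = prediction.lower().split()
    ref = expected.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the score rewards in-order overlap and ignores length limits, pairing it with a hard constraint such as the 120-word check catches summaries that score well only by being long.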
uv run aieval list # find run ids
uv run aieval diff <run_a> <run_b> # per-scorer deltas
uv run aieval pairwise <run_a> <run_b> # bootstrapped confidence interval
In .github/workflows/evals.yml:
- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05
The ci command compares the run to the previous run of the same name and exits non-zero if any scorer drops past the threshold.
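
The gate can be pictured as a comparison of per-scorer mean scores between two runs. The sketch below is an assumption about that logic, not aieval's code: it treats the threshold as an absolute drop in the mean, and this page does not say whether aieval uses an absolute or a relative drop. The names `regressions` and `gate_exit_code`, and the score values in the example, are hypothetical.

```python
def regressions(previous, current, threshold=0.05):
    """Return {scorer: drop} for scorers whose mean fell by more than `threshold`.

    `previous` and `current` map scorer names to mean scores.
    Scorers missing from either run are skipped rather than failed.
    """
    failed = {}
    for name, old in previous.items():
        if name not in current:
            continue
        drop = old - current[name]
        if drop > threshold:
            failed[name] = drop
    return failed


def gate_exit_code(previous, current, threshold=0.05):
    """Return 1 when any scorer regressed past the threshold, else 0."""
    return 1 if regressions(previous, current, threshold) else 0


if __name__ == "__main__":
    # Made-up scores for illustration only.
    previous = {"rouge_l": 0.62, "length_under_120_words": 1.00}
    current = {"rouge_l": 0.55, "length_under_120_words": 1.00}
    for name, drop in regressions(previous, current).items():
        print(f"{name} dropped by {drop:.3f}")
    print("exit code:", gate_exit_code(previous, current))
```

A CI step would pass the returned code to `sys.exit`, so the job fails on a regression while improvements and small fluctuations pass.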
See Comparing-Runs for the full diff, gate and pairwise reference, and Scorers for the built-in scorers and the LLM judge.
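
For readers unfamiliar with the bootstrapped confidence interval that `aieval pairwise` reports, the sketch below shows one standard way to compute such an interval: a paired percentile bootstrap over per-example score differences. It is an illustration under assumptions, not aieval's implementation. The function name `paired_bootstrap_ci`, the 95% level, the resample count and the fixed seed are all hypothetical choices, and the sketch assumes both runs scored the same examples in the same order.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_diff, lower, upper) for the mean of b - a.

    Resamples the per-example differences with replacement and takes the
    alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Made-up per-example scores for two runs over the same five examples.
    run_a = [0.40, 0.55, 0.61, 0.30, 0.72]
    run_b = [0.45, 0.50, 0.70, 0.41, 0.75]
    mean_diff, lower, upper = paired_bootstrap_ci(run_a, run_b)
    print(f"mean diff={mean_diff:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference between the two runs is unlikely to be resampling noise. With only a handful of examples the interval will usually be wide, which is itself a useful signal.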