Quick Start
sarmakska edited this page May 3, 2026 · 2 revisions
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env

Set at minimum:
SARMALINK_API_KEY=...
uv run aieval run examples/summarisation/eval.py

You should see:
Run abc12345: summarisation on 3 examples with model=smart
Done. Pass rate: 66.7%, avg latency: 1240ms
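The summary line above can be reproduced by hand. The following is a minimal sketch, not aieval's own code: it assumes each example yields a pass/fail flag and a latency, and the `ExampleResult` type, the `summarise` helper, and the three latency values are hypothetical, chosen only so the arithmetic matches the sample output.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    # Hypothetical per-example record; aieval's real result type may differ.
    passed: bool
    latency_ms: float


def summarise(results):
    """Return (pass rate in percent, mean latency in ms) for a list of results."""
    if not results:
        raise ValueError("no results to summarise")
    pass_rate = 100.0 * sum(r.passed for r in results) / len(results)
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    return pass_rate, avg_latency


# Illustrative values: 2 of 3 examples pass, latencies average to 1240 ms.
results = [
    ExampleResult(True, 1100.0),
    ExampleResult(True, 1300.0),
    ExampleResult(False, 1320.0),
]
pass_rate, avg_latency = summarise(results)
print(f"Done. Pass rate: {pass_rate:.1f}%, avg latency: {avg_latency:.0f}ms")
```

With three examples, a pass rate of 66.7% means two of them passed.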
uv run uvicorn aieval.viewer.app:app --reload

Open http://localhost:8000 to see the list of runs. Click into one for the per-example view.
Create my_eval.py:
from aieval import dataset, scorer, run
from aieval.scorers import rouge_l


@scorer
def length_under_120_words(prediction, _expected):
    # Score 1.0 when the prediction is at most 120 words, otherwise 0.0.
    return 1.0 if len(prediction.split()) <= 120 else 0.0


if __name__ == "__main__":
    run(
        name="my-eval",
        dataset=dataset.from_list([
            {"prompt": "Summarise: ...", "expected": "..."},
        ]),
        scorers=[rouge_l, length_under_120_words],
        provider="sarmalink",
        model="smart",
    )

Run it with uv run python my_eval.py.
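Judging by the example above, a scorer appears to be a function of (prediction, expected) that returns a float, so its logic can be sanity-checked before it is wired into a run. The sketch below needs no aieval import. `lcs_length` and `rouge_l_f1` are hypothetical helpers showing one common whitespace-token formulation of ROUGE-L as an F1 score; the library's own `rouge_l` may tokenise or weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, expected):
    """Illustrative ROUGE-L F1 over whitespace tokens (not aieval's code)."""
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_under_120_words(prediction, _expected):
    # Same logic as the custom scorer, without the @scorer decorator.
    return 1.0 if len(prediction.split()) <= 120 else 0.0


print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
print(length_under_120_words("word " * 121, ""))
```

Keeping the scoring logic in a plain function like this makes it easy to unit-test separately from any provider calls.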
In .github/workflows/evals.yml:
- run: uv run aieval ci --baseline main

The CLI compares the current run to the baseline run on main and posts a PR comment with the per-scorer regression delta. The build fails if any scorer drops below the configured threshold.
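The comparison described here can be pictured roughly as follows. This is a sketch under stated assumptions, not the CLI's implementation: `find_regressions`, the score dictionaries, and the 0.40 threshold are all hypothetical, and it assumes the threshold is an absolute floor on each scorer's mean score rather than a limit on the delta.

```python
def find_regressions(baseline, current, threshold):
    """Return per-scorer deltas and the scorers whose current score is below threshold."""
    deltas = {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current
    }
    failing = sorted(name for name, score in current.items() if score < threshold)
    return deltas, failing


# Illustrative mean scores per scorer for the baseline and the current run.
baseline = {"rouge_l": 0.50, "length_under_120_words": 1.00}
current = {"rouge_l": 0.25, "length_under_120_words": 1.00}

deltas, failing = find_regressions(baseline, current, threshold=0.40)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")

# A CI step would exit non-zero when any scorer is below the threshold.
exit_code = 1 if failing else 0
print("exit code:", exit_code)
```

In this example rouge_l falls from 0.50 to 0.25, which is below the assumed 0.40 floor, so the step would fail.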