LLM features can regress without a code change — a model update, a prompt tweak, a retrieval shift. Without evals in CI, you learn about it from customer complaints.
llm-eval-runner is a CI-native eval framework. It grades your LLM outputs against golden test cases, computes accuracy / groundedness / latency / cost, and fails the build on regression. The default heuristic mode makes zero LLM calls — so it's free to run on every PR.
By Ruchit Suthar — Software Architect & Technical Leader. 📖 Method: Evals for LLM Features: Building the Regression Net
pip install llm-eval-runner # Python (the grader)
npm install @ruchit07/llm-eval-runner # TypeScript client (builds test cases)

| Metric | What it measures | Cost |
|---|---|---|
| accuracy | Token F1 overlap vs golden answers (SQuAD metric) + must/must-not constraints | $0 |
| groundedness | Fraction of answer sentences supported by the retrieved context | $0 |
| faithfulness | Answer doesn't contradict the context (heuristic; upgraded by LLM judge) | $0 |
| latency | p50 / p95 / p99 against a threshold | $0 |
| cost | Average cost per query against a budget | $0 |
Heuristic mode runs in < 1s with no API calls. That's the point: so cheap there's no excuse not to run it on every PR.
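To make the zero-cost claim concrete, here is a minimal sketch of how the accuracy and groundedness heuristics in the table could be computed with plain string operations. It is not the package's actual implementation: the function names, the tokenizer, the case-insensitive constraint matching, and the `min_overlap` cutoff are all assumptions made for illustration, and the golden answer in the usage note is hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a simplified normalisation;
    the official SQuAD script also strips articles)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, golden: str) -> float:
    """SQuAD-style token F1: harmonic mean of token precision and recall."""
    pred, gold = tokenize(prediction), tokenize(golden)
    if not pred or not gold:
        return float(pred == gold)  # 1.0 only when both are empty
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def meets_constraints(answer: str, must_contain=(), must_not_contain=()) -> bool:
    """Check must / must-not substrings (case-insensitive here, by assumption)."""
    lowered = answer.lower()
    return (all(s.lower() in lowered for s in must_contain)
            and not any(s.lower() in lowered for s in must_not_contain))


def groundedness(answer: str, context: str, min_overlap: float = 0.5) -> float:
    """Fraction of answer sentences whose tokens mostly appear in the context."""
    context_tokens = set(tokenize(context))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if tokenize(s)]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        hits = sum(1 for t in tokens if t in context_tokens)
        if hits / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

For example, `token_f1("Click the Forgot Password link.", "Click the Forgot Password link on the login page.")` gives 10/14 (precision 1.0, recall 5/9), and `groundedness("Click the Forgot Password link. Then call support.", "Click Forgot Password...")` gives 0.5 because only the first sentence overlaps the context. Nothing here touches the network, which is why this tier can run on every PR.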
- uses: ruchit07/llm-eval-runner@v1
  with:
    test-cases: ./evals/test-cases.json
    thresholds: ./evals/eval-criteria.json
    fail-on-regression: true

The action writes a Markdown summary to the job and fails the build if any metric drops below threshold.
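The fail-on-regression behaviour amounts to comparing each metric with its minimum and returning a non-zero exit code when any falls short. The sketch below illustrates that idea only. The flat `{"metric": minimum}` threshold shape and the function names are assumptions, not the documented format of `eval-criteria.json`. The scores and minimums are taken from the sample report in this README.

```python
def find_regressions(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        score = metrics.get(name)
        if score is None:
            failures.append(f"{name}: no score reported")
        elif score < minimum:
            failures.append(f"{name}: {score:.1%} < {minimum:.1%}")
    return failures


def exit_code(failures: list[str]) -> int:
    """CI treats any non-zero exit code as a failed step."""
    return 1 if failures else 0


# Scores and minimums from the sample report (accuracy 82% vs 50%, groundedness 91% vs 70%).
metrics = {"accuracy": 0.82, "groundedness": 0.91}
thresholds = {"accuracy": 0.50, "groundedness": 0.70}

failures = find_regressions(metrics, thresholds)
for line in failures:
    print(f"REGRESSION {line}")
print("exit code:", exit_code(failures))
```

In a real CI step the value from `exit_code` would be passed to `sys.exit`, so that a single failing metric is enough to block the merge.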
Your app produces answers; the grader scores them. This separation means the grader never needs your API keys or your model.
- Your app runs the test cases and writes each answer into an `actual` block:
[
  {
    "id": "tc-001",
    "input": { "question": "How do I reset my password?", "context": "Click Forgot Password..." },
    "expected": { "mustContain": ["Forgot Password"] },
    "actual": { "answer": "Click the Forgot Password link.", "latencyMs": 820, "costUsd": 0.0012 }
  }
]

- The grader scores it:
llm-eval run test-cases.json --thresholds eval-criteria.json

============================================================
EVAL REPORT — support-qa
============================================================
Result: ✅ PASSED
Cases: 3/3 passed (100%)
Metrics:
✓ accuracy ████████████████░░░░ 82.0% (≥ 50%)
✓ groundedness ██████████████████░░ 91.0% (≥ 70%)
✓ latency ████████████████████ 100.0% (≥ 0%)
Latency: p50=820ms p95=910ms p99=910ms
Cost: $0.003400 total ($0.001133/query)
============================================================
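The Latency and Cost lines of the report can be derived from the `latencyMs` and `costUsd` fields of each `actual` block. The sketch below uses the nearest-rank percentile method, which is an assumption: the README does not say which method the grader uses, although with three cases nearest-rank does yield p95 = p99 = the slowest case, as in the sample report. Only `tc-001` comes from the example above. The other two cases are hypothetical values chosen to match the report's totals.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(cases: list[dict]) -> dict:
    """Aggregate latency percentiles and cost from each case's `actual` block."""
    latencies = [c["actual"]["latencyMs"] for c in cases]
    costs = [c["actual"]["costUsd"] for c in cases]
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "total_cost": sum(costs),
        "cost_per_query": sum(costs) / len(costs),
    }


cases = [
    {"id": "tc-001", "actual": {"latencyMs": 820, "costUsd": 0.0012}},
    {"id": "tc-002", "actual": {"latencyMs": 640, "costUsd": 0.0011}},  # hypothetical
    {"id": "tc-003", "actual": {"latencyMs": 910, "costUsd": 0.0011}},  # hypothetical
]

summary = summarize(cases)
print(f"Latency: p50={summary['p50']}ms p95={summary['p95']}ms p99={summary['p99']}ms")
print(f"Cost: ${summary['total_cost']:.6f} total (${summary['cost_per_query']:.6f}/query)")
```

With small suites the upper percentiles collapse onto the slowest case, so a p95 threshold only becomes meaningful once the suite has a few dozen cases.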
import { EvalSuiteBuilder } from '@ruchit07/llm-eval-runner';
import { writeFileSync } from 'fs';

const suite = new EvalSuiteBuilder('support-qa').addCases([
  { id: 'tc-1', input: { question: 'How do I reset my password?' }, expected: { mustContain: ['Forgot Password'] } },
]);

// Run each case against your actual feature
await suite.run(async (question, context) => {
  const result = await myAIFeature(question, context);
  return { answer: result.answer, context: result.context, costUsd: result.cost };
});

writeFileSync('test-cases.json', suite.toJSON());
// Now: llm-eval run test-cases.json --thresholds eval-criteria.json

from llm_eval_runner import EvalRunner, Thresholds, TestCase, TestCaseInput, Expected, RunResult, print_report

cases = [
    TestCase(id="tc-1", input=TestCaseInput(question="How do I reset my password?"),
             expected=Expected(must_contain=["Forgot Password"])),
]

def run_fn(question, context):
    answer = my_ai_feature(question)  # your model call
    return RunResult(answer=answer, latency_ms=820, cost_usd=0.0012)

runner = EvalRunner(Thresholds(accuracy=0.5, latency_p95_ms=2000))
report = runner.run("support-qa", cases, run_fn)
print_report(report)

- Tier 1 (every PR): heuristic mode (this package, default). Free, fast, catches the common regressions.
- Tier 2 (weekly / pre-release): LLM-as-judge (`pip install llm-eval-runner[judge]`) for subtle hallucination and contradiction detection.
Why two tiers: evals only run on every PR if they cost nothing, but the depth of an LLM judge is too expensive to pay for on each PR. Splitting the work gives you both. Full rationale.
- ai-spec: generates the `eval-criteria.json` and seed `test-cases.json` this tool consumes.
- ai-native-app-blueprint: the production reference whose `packages/evals` mirrors these metrics.
MIT