llm-eval-runner

LLM features can regress without a code change — a model update, a prompt tweak, a retrieval shift. Without evals in CI, you learn about it from customer complaints.

llm-eval-runner is a CI-native eval framework. It grades your LLM outputs against golden test cases, computes accuracy / groundedness / latency / cost, and fails the build on regression. The default heuristic mode makes zero LLM calls — so it's free to run on every PR.

By Ruchit Suthar — Software Architect & Technical Leader. 📖 Method: Evals for LLM Features: Building the Regression Net

Install

pip install llm-eval-runner                 # Python (the grader)
npm install @ruchit07/llm-eval-runner       # TypeScript client (builds test cases)

The metrics

Metric	What it measures	Cost
accuracy	Token F1 overlap vs golden answers (SQuAD metric) + must/must-not constraints	$0
groundedness	Fraction of answer sentences supported by the retrieved context	$0
faithfulness	Answer doesn't contradict the context (heuristic; upgraded by LLM judge)	$0
latency	p50 / p95 / p99 against a threshold	$0
cost	Average cost per query against a budget	$0
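
As a rough illustration of the accuracy row, token F1 in the SQuAD style can be sketched in a few lines. This is a minimal sketch, not the package's actual implementation: the tokenizer is a simplification, and the helper names (`tokenize`, `token_f1`, `meets_constraints`) and the `must_not_contain` parameter are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (simplified normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, golden: str) -> float:
    """Token-level F1 between a predicted answer and a golden answer."""
    pred, gold = tokenize(prediction), tokenize(golden)
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def meets_constraints(answer: str, must_contain=(), must_not_contain=()) -> bool:
    """Case-insensitive substring checks for must / must-not phrases."""
    lowered = answer.lower()
    return (all(p.lower() in lowered for p in must_contain)
            and not any(p.lower() in lowered for p in must_not_contain))
```

With a hypothetical golden answer "Click Forgot Password", the answer "Click the Forgot Password link." shares 3 of its 5 tokens with all 3 golden tokens, giving precision 0.6, recall 1.0, and F1 0.75.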

Heuristic mode runs in < 1s with no API calls. That's the point: so cheap there's no excuse not to run it on every PR.
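
To show why a heuristic like this needs no API calls, here is one possible way to approximate groundedness with plain token overlap. It is a sketch under stated assumptions: the sentence splitter, the `min_overlap` cutoff of 0.6, and the function names are all hypothetical, and the package may define "supported" differently.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def groundedness(answer: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose tokens mostly appear in the context."""
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        if not tokens:
            continue  # punctuation-only fragments are never counted as supported
        hits = sum(1 for t in tokens if t in context_tokens)
        if hits / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

Everything here is string processing on data already in the test-case file, which is what keeps a check of this kind fast and free.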

Use it in GitHub Actions

- uses: ruchit07/llm-eval-runner@v1
  with:
    test-cases: ./evals/test-cases.json
    thresholds: ./evals/eval-criteria.json
    fail-on-regression: true

The action writes a Markdown summary to the job and fails the build if any metric drops below threshold.
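
The gating logic described above can be pictured as a comparison of each metric against its minimum, ending in a nonzero exit code when anything falls short (a nonzero exit code is what makes a CI step fail). The sketch below is illustrative only: the flat `{metric: minimum}` shape is an assumption and may not match the real eval-criteria.json schema, and `find_regressions` and `gate` are hypothetical names.

```python
def find_regressions(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        score = metrics.get(name)
        if score is None:
            failures.append(f"{name}: missing (required >= {minimum:.2f})")
        elif score < minimum:
            failures.append(f"{name}: {score:.2f} < {minimum:.2f}")
    return failures


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 passes the build, 1 fails it."""
    failures = find_regressions(metrics, thresholds)
    for line in failures:
        print(f"FAIL {line}")
    return 1 if failures else 0
```

A script would typically finish with `sys.exit(gate(metrics, thresholds))` so the CI job inherits the result.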

How it works: the replay pattern

Your app produces answers; the grader scores them. This separation means the grader never needs your API keys or your model.

Your app runs the test cases and writes each answer, along with its latency and cost, into that case's "actual" block:

[
  {
    "id": "tc-001",
    "input": { "question": "How do I reset my password?", "context": "Click Forgot Password..." },
    "expected": { "mustContain": ["Forgot Password"] },
    "actual": { "answer": "Click the Forgot Password link.", "latencyMs": 820, "costUsd": 0.0012 }
  }
]
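
For apps written in Python, the app-side half of the replay pattern might look like the sketch below. The field names (`input`, `actual`, `latencyMs`, `costUsd`) come from the JSON above; `fill_actuals` and `fake_feature` are hypothetical helpers, and `fake_feature` stands in for your real model call.

```python
import json
import time


def fill_actuals(cases: list[dict], answer_fn) -> list[dict]:
    """Run answer_fn on every case and record the result under "actual"."""
    filled = []
    for case in cases:
        question = case["input"]["question"]
        context = case["input"].get("context", "")
        start = time.perf_counter()
        answer, cost_usd = answer_fn(question, context)
        latency_ms = round((time.perf_counter() - start) * 1000)
        # Copy the case rather than mutating it, then attach the observed output.
        filled.append({**case, "actual": {
            "answer": answer, "latencyMs": latency_ms, "costUsd": cost_usd}})
    return filled


def fake_feature(question: str, context: str):
    """Stand-in for a real LLM feature: returns (answer, cost in USD)."""
    return "Click the Forgot Password link.", 0.0012


cases = [{
    "id": "tc-001",
    "input": {"question": "How do I reset my password?",
              "context": "Click Forgot Password..."},
    "expected": {"mustContain": ["Forgot Password"]},
}]
filled = fill_actuals(cases, fake_feature)
serialized = json.dumps(filled, indent=2)  # write this string to test-cases.json
```

Because only this step touches the model, the resulting file can be graded anywhere without credentials.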

The grader scores it:

llm-eval run test-cases.json --thresholds eval-criteria.json

============================================================
EVAL REPORT — support-qa
============================================================
Result: ✅ PASSED
Cases: 3/3 passed (100%)

Metrics:
  ✓ accuracy         ████████████████░░░░  82.0% (≥ 50%)
  ✓ groundedness     ██████████████████░░  91.0% (≥ 70%)
  ✓ latency          ████████████████████  100.0% (≥ 0%)

Latency: p50=820ms  p95=910ms  p99=910ms
Cost:    $0.003400 total ($0.001133/query)
============================================================
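
The latency and cost lines in a report like this can be derived from the per-case `latencyMs` and `costUsd` values. The sketch below uses the nearest-rank percentile method, which is an assumption (the tool may interpolate instead), and the sample values are invented to be consistent with the report above rather than taken from a real run.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative per-case measurements for a three-case suite.
latencies_ms = [790, 820, 910]
costs_usd = [0.0010, 0.0012, 0.0012]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
total_cost = sum(costs_usd)
cost_per_query = total_cost / len(costs_usd)

print(f"Latency: p50={p50}ms  p95={p95}ms  p99={p99}ms")
print(f"Cost:    ${total_cost:.6f} total (${cost_per_query:.6f}/query)")
```

With only three samples, p95 and p99 both collapse to the slowest case, so tail-latency thresholds become meaningful only once the suite has many more cases.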

Building test cases in TypeScript

import { EvalSuiteBuilder } from '@ruchit07/llm-eval-runner';
import { writeFileSync } from 'fs';

const suite = new EvalSuiteBuilder('support-qa').addCases([
  { id: 'tc-1', input: { question: 'How do I reset my password?' }, expected: { mustContain: ['Forgot Password'] } },
]);

// Run each case against your actual feature
await suite.run(async (question, context) => {
  const result = await myAIFeature(question, context);
  return { answer: result.answer, context: result.context, costUsd: result.cost };
});

writeFileSync('test-cases.json', suite.toJSON());
// Now: llm-eval run test-cases.json --thresholds eval-criteria.json

Using the Python API directly (live grading)

import time

from llm_eval_runner import EvalRunner, Thresholds, TestCase, TestCaseInput, Expected, RunResult, print_report

cases = [
    TestCase(id="tc-1", input=TestCaseInput(question="How do I reset my password?"),
             expected=Expected(must_contain=["Forgot Password"])),
]

def run_fn(question, context):
    start = time.perf_counter()
    answer = my_ai_feature(question)            # your model call
    latency_ms = round((time.perf_counter() - start) * 1000)   # measured per call, not hard-coded
    # cost_usd is a placeholder here; pass the real per-call cost from your provider
    return RunResult(answer=answer, latency_ms=latency_ms, cost_usd=0.0012)

runner = EvalRunner(Thresholds(accuracy=0.5, latency_p95_ms=2000))
report = runner.run("support-qa", cases, run_fn)
print_report(report)

Two-tier strategy (recommended)

Tier 1 — every PR: heuristic mode (this package, default). Free, fast, catches the common regressions.
Tier 2 — weekly / pre-release: LLM-as-judge (pip install llm-eval-runner[judge]) for subtle hallucination and contradiction detection.

Why two tiers: you want evals on every PR, which only works if they cost next to nothing, and you also want the depth of an LLM judge, which is too expensive to run on every PR. Full rationale.

Pairs with

ai-spec — generates the eval-criteria.json and seed test-cases.json this tool consumes.
ai-native-app-blueprint — the production reference whose packages/evals mirrors these metrics.

License

MIT
