llm-eval-runner

LLM features can regress without a code change — a model update, a prompt tweak, a retrieval shift. Without evals in CI, you learn about it from customer complaints.

llm-eval-runner is a CI-native eval framework. It grades your LLM outputs against golden test cases, computes accuracy / groundedness / latency / cost, and fails the build on regression. The default heuristic mode makes zero LLM calls — so it's free to run on every PR.

By Ruchit Suthar — Software Architect & Technical Leader. 📖 Method: Evals for LLM Features: Building the Regression Net


Install

pip install llm-eval-runner                 # Python (the grader)
npm install @ruchit07/llm-eval-runner       # TypeScript client (builds test cases)

The metrics

| Metric | What it measures | Cost |
| --- | --- | --- |
| accuracy | Token F1 overlap vs golden answers (SQuAD metric) + must/must-not constraints | $0 |
| groundedness | Fraction of answer sentences supported by the retrieved context | $0 |
| faithfulness | Answer doesn't contradict the context (heuristic; upgraded by LLM judge) | $0 |
| latency | p50 / p95 / p99 against a threshold | $0 |
| cost | Average cost per query against a budget | $0 |

Heuristic mode runs in < 1s with no API calls. That's the point: so cheap there's no excuse not to run it on every PR.
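To make the accuracy row concrete, here is a minimal sketch of token F1 plus must/must-not checks. It is not the package's implementation: the tokenizer is simplified (the official SQuAD script also strips articles), and `passes_constraints`, `must_not_contain`, and the case-insensitive substring matching are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, golden: str) -> float:
    """Harmonic mean of token precision and recall, counting duplicates."""
    pred_tokens = tokenize(prediction)
    gold_tokens = tokenize(golden)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def passes_constraints(answer: str, must_contain=(), must_not_contain=()) -> bool:
    """Case-insensitive substring checks (hypothetical semantics)."""
    lowered = answer.lower()
    return (all(s.lower() in lowered for s in must_contain)
            and not any(s.lower() in lowered for s in must_not_contain))


answer = "Click the Forgot Password link."
score = token_f1(answer, "Click Forgot Password on the login page.")
ok = passes_constraints(answer, must_contain=["Forgot Password"])
print(f"f1={score:.2f} constraints_ok={ok}")  # f1=0.67 constraints_ok=True
```

Four of the five answer tokens appear among the seven golden tokens, so precision is 4/5, recall is 4/7, and F1 is 8/12. Nothing here calls a model, which is why a metric of this kind costs $0.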


Use it in GitHub Actions

- uses: ruchit07/llm-eval-runner@v1
  with:
    test-cases: ./evals/test-cases.json
    thresholds: ./evals/eval-criteria.json
    fail-on-regression: true

The action writes a Markdown summary to the job and fails the build if any metric drops below threshold.
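The gating step can be pictured as a plain comparison followed by a nonzero exit code. The sketch below is an illustration, not the action's source: `failing_metrics` is a hypothetical helper, the flat name-to-minimum dictionaries are an assumed shape rather than the real `eval-criteria.json` schema, and the numbers are the ones shown in the sample report in this README.

```python
import sys


def failing_metrics(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their minimum threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        # A metric missing from the results is treated as a failure (an assumption).
        if value is None or value < minimum:
            failures.append(name)
    return failures


metrics = {"accuracy": 0.82, "groundedness": 0.91}
thresholds = {"accuracy": 0.50, "groundedness": 0.70}

failed = failing_metrics(metrics, thresholds)
if failed:
    print("Regression in: " + ", ".join(failed))
    sys.exit(1)  # a nonzero exit code is what fails a CI step
print("All metrics at or above threshold")
```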


How it works: the replay pattern

Your app produces answers; the grader scores them. This separation means the grader never needs your API keys or your model.

  1. Your app runs the test cases and writes each answer into an actual block:
[
  {
    "id": "tc-001",
    "input": { "question": "How do I reset my password?", "context": "Click Forgot Password..." },
    "expected": { "mustContain": ["Forgot Password"] },
    "actual": { "answer": "Click the Forgot Password link.", "latencyMs": 820, "costUsd": 0.0012 }
  }
]
  2. The grader scores it:
llm-eval run test-cases.json --thresholds eval-criteria.json
============================================================
EVAL REPORT — support-qa
============================================================
Result: ✅ PASSED
Cases: 3/3 passed (100%)

Metrics:
  ✓ accuracy         ████████████████░░░░  82.0% (≥ 50%)
  ✓ groundedness     ██████████████████░░  91.0% (≥ 70%)
  ✓ latency          ████████████████████  100.0% (≥ 0%)

Latency: p50=820ms  p95=910ms  p99=910ms
Cost:    $0.003400 total ($0.001133/query)
============================================================
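Step 1 of the replay pattern can be sketched in Python as follows. The JSON field names come from the example above; `my_ai_feature` and `fill_actuals` are hypothetical stand-ins for your own code, and the cost value is a fixed placeholder rather than a measured one.

```python
import json
import time


def my_ai_feature(question: str, context: str) -> tuple[str, float]:
    """Hypothetical stand-in for your app; returns (answer, cost in USD)."""
    return "Click the Forgot Password link.", 0.0012


def fill_actuals(cases: list[dict]) -> list[dict]:
    """Run every case through the app and attach an `actual` block."""
    for case in cases:
        question = case["input"]["question"]
        context = case["input"].get("context", "")
        start = time.perf_counter()
        answer, cost_usd = my_ai_feature(question, context)
        latency_ms = round((time.perf_counter() - start) * 1000)
        case["actual"] = {"answer": answer, "latencyMs": latency_ms, "costUsd": cost_usd}
    return cases


cases = [{
    "id": "tc-001",
    "input": {"question": "How do I reset my password?", "context": "Click Forgot Password..."},
    "expected": {"mustContain": ["Forgot Password"]},
}]

# The grader only ever sees this file, never the model or its API keys.
with open("test-cases.json", "w", encoding="utf-8") as f:
    json.dump(fill_actuals(cases), f, indent=2)
```

Because the answers are recorded before grading, the same file can be re-scored later against different thresholds without calling the model again.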

Building test cases in TypeScript

import { EvalSuiteBuilder } from '@ruchit07/llm-eval-runner';
import { writeFileSync } from 'fs';

const suite = new EvalSuiteBuilder('support-qa').addCases([
  { id: 'tc-1', input: { question: 'How do I reset my password?' }, expected: { mustContain: ['Forgot Password'] } },
]);

// Run each case against your actual feature
await suite.run(async (question, context) => {
  const result = await myAIFeature(question, context);
  return { answer: result.answer, context: result.context, costUsd: result.cost };
});

writeFileSync('test-cases.json', suite.toJSON());
// Now: llm-eval run test-cases.json --thresholds eval-criteria.json

Using the Python API directly (live grading)

from llm_eval_runner import EvalRunner, Thresholds, TestCase, TestCaseInput, Expected, RunResult, print_report

cases = [
    TestCase(id="tc-1", input=TestCaseInput(question="How do I reset my password?"),
             expected=Expected(must_contain=["Forgot Password"])),
]

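def run_fn(question, context):
    answer = my_ai_feature(question)            # your model call (context is unused here)
    # latency_ms and cost_usd are hard-coded placeholders; report measured values in real runs
    return RunResult(answer=answer, latency_ms=820, cost_usd=0.0012)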

runner = EvalRunner(Thresholds(accuracy=0.5, latency_p95_ms=2000))
report = runner.run("support-qa", cases, run_fn)
print_report(report)
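The example above passes a fixed `latency_ms`. In live grading you would more likely measure it, for instance with a small wrapper like the one sketched here. `timed` and this `my_ai_feature` stub are hypothetical helpers, not part of the package, and whether `RunResult` accepts a float for `latency_ms` is an assumption to verify.

```python
import time
from typing import Callable


def timed(fn: Callable[..., str]) -> Callable[..., tuple[str, float]]:
    """Wrap fn so it returns (answer, elapsed milliseconds)."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        answer = fn(*args, **kwargs)
        return answer, (time.perf_counter() - start) * 1000.0
    return wrapper


def my_ai_feature(question: str) -> str:
    """Hypothetical stand-in for the model call."""
    time.sleep(0.01)  # simulate some work
    return "Click the Forgot Password link."


answer, latency_ms = timed(my_ai_feature)("How do I reset my password?")
print(f"{answer!r} took {latency_ms:.0f} ms")
```

Inside `run_fn`, the measured value would replace the hard-coded 820.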

Two-tier strategy (recommended)

  • Tier 1 — every PR: heuristic mode (this package, default). Free, fast, catches the common regressions.
  • Tier 2 — weekly / pre-release: LLM-as-judge (pip install llm-eval-runner[judge]) for subtle hallucination and contradiction detection.

Why two tiers: heuristic evals are cheap enough to run on every PR, while an LLM judge adds depth but costs too much to run per PR. Full rationale.


Pairs with

  • ai-spec — generates the eval-criteria.json and seed test-cases.json this tool consumes.
  • ai-native-app-blueprint — the production reference whose packages/evals mirrors these metrics.

License

MIT
