Pre-deploy LLM regression testing for CI pipelines. Trace your LLM calls, then `llmgate diff baseline current` fails your PR if output quality dropped. No server, no account, SQLite only.

    import llmgate

    @llmgate.trace
    def answer(question: str) -> str:
        return my_llm_call(question)

That's it. Every call is logged locally. When you change your prompt or swap models, run:

    llmgate diff main feature-branch

If outputs degraded, the command exits 1 and your PR fails. No server, no account, no config.
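To make "every call is logged locally" concrete, here is a minimal sketch of how a tracing decorator could record calls in SQLite. It is not llmgate's actual implementation: `make_tracer`, the `calls` table, and its columns are hypothetical names chosen for illustration, and the sketch uses an in-memory database rather than a file on disk.

```python
import functools
import os
import sqlite3
import time


def make_tracer(db_path: str = ":memory:"):
    """Return (trace, conn): a logging decorator and its SQLite connection."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema; llmgate's real table layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS calls ("
        "run_id TEXT, fn TEXT, input TEXT, output TEXT, ts REAL)"
    )

    def trace(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            # Group calls by run ID. The variable name comes from this README;
            # how llmgate actually reads it is an assumption.
            run_id = os.environ.get("LLMGATE_RUN_ID", "default")
            conn.execute(
                "INSERT INTO calls VALUES (?, ?, ?, ?, ?)",
                (run_id, fn.__name__, repr((args, kwargs)), str(result), time.time()),
            )
            conn.commit()
            return result

        return wrapper

    return trace, conn


trace, conn = make_tracer()


@trace
def answer(question: str) -> str:
    return f"echo: {question}"  # stand-in for a real LLM call


answer("What is the capital of France?")
rows = conn.execute("SELECT fn, output FROM calls").fetchall()
```

After the call, `rows` holds one record with the function name and its output, which is the kind of data a later diff step would need.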
Install:

    pip install llmgate

Tag the run and trace your pipeline:

    import llmgate
    import os

    os.environ["LLMGATE_RUN_ID"] = "v1.0"  # set per run, or use the git SHA in CI

    @llmgate.trace
    def my_pipeline(query: str) -> str:
        context = retrieve(query)
        return llm.complete(f"{context}\n\n{query}")

    output = my_pipeline("What is the capital of France?")

Then assert on the output:

    llmgate.assert_contains(output, "Paris")
    llmgate.assert_output(output, lambda s: len(s) < 500, "response too long")
    # baseline: your reference output to compare against
    llmgate.assert_similarity(output, baseline, threshold=0.85)

Inspect and compare runs from the command line:

    # See all recorded runs
    llmgate runs

    # Compare two runs; exits 1 if regressions are found
    llmgate diff v1.0 v1.1

    # Inspect a specific run
    llmgate show abc123

In GitHub Actions:

    - name: Run LLM eval suite
      env:
        LLMGATE_RUN_ID: ${{ github.sha }}
      run: python examples/eval_suite.py
    # github.base_ref is only set on pull_request events
    - name: Check for regressions
      run: llmgate diff ${{ github.base_ref }} ${{ github.sha }}

How it works:

- All traces are stored in `.llmgate.db` (SQLite; commit it or cache it as a CI artifact)
- `@llmgate.trace` works with any function that returns a string, or with OpenAI/Anthropic response objects
- `llmgate diff` computes token-level similarity between baseline and current outputs
- Nothing leaves your machine unless you choose to push the `.db` file
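For intuition about what a token-level similarity check can look like, here is a rough sketch built on the standard library's `difflib.SequenceMatcher` over whitespace-separated tokens. This is an assumption for illustration only: the README does not say which tokenizer or metric llmgate uses, and `token_similarity` and `is_regression` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def token_similarity(baseline: str, current: str) -> float:
    """Return a ratio in [0, 1] based on matching whitespace tokens."""
    a, b = baseline.split(), current.split()
    # ratio() is 2 * matches / (len(a) + len(b)); autojunk is disabled so
    # frequent tokens in long outputs are not silently ignored.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_regression(baseline: str, current: str, threshold: float = 0.8) -> bool:
    """Flag the pair when similarity falls below the threshold."""
    return token_similarity(baseline, current) < threshold


same = token_similarity("Paris is the capital", "Paris is the capital")
partial = token_similarity("a b c d", "a b c e")
unrelated = token_similarity("The capital of France is Paris", "I do not know")
```

A metric like this is cheap and deterministic, but it only measures surface overlap: a correct answer phrased differently can score low, which is one reason a threshold is configurable.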
CLI reference:

    llmgate runs                        # list all runs with stats
    llmgate show <run-id>               # inspect calls in a run
    llmgate diff <baseline> <current>   # compare runs, exit 1 on regression
        --threshold FLOAT               # similarity threshold (default: 0.8)
        --no-fail                       # report only, don't exit 1
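One way to read how `--threshold` and `--no-fail` interact is the sketch below, which mirrors the documented flags with `argparse` and turns a list of per-call similarity scores into an exit code. The parser name `gate-sketch`, the `exit_code` helper, and the idea of one score per call are hypothetical; treating a score strictly below the threshold as a regression is also an assumption, since the README does not specify the boundary case.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the flags listed in the CLI reference; not llmgate's own parser.
    parser = argparse.ArgumentParser(prog="gate-sketch")
    parser.add_argument("baseline")
    parser.add_argument("current")
    parser.add_argument("--threshold", type=float, default=0.8)
    parser.add_argument("--no-fail", action="store_true")
    return parser


def exit_code(scores: list[float], threshold: float, no_fail: bool) -> int:
    """Return 1 if any score is below the threshold, unless no_fail is set."""
    regressions = [s for s in scores if s < threshold]
    for s in regressions:
        print(f"regression: similarity {s:.2f} < {threshold:.2f}")
    return 0 if no_fail or not regressions else 1


args = build_parser().parse_args(["v1.0", "v1.1", "--threshold", "0.9"])
code = exit_code([0.95, 0.85], args.threshold, args.no_fail)
```

In a real command-line tool the result would be passed to `sys.exit`, which is what lets CI treat a regression as a failed step.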