veltiq/groundtruth

groundtruth — catch when your AI coding assistant claims work it didn't do

groundtruth

TL;DR — Your AI says "Done! I added X, fixed Y, wrote tests." groundtruth checks each claim against the real diff and flags the ones that never happened. One command: npx @veltiq/groundtruth install.

Catch when your AI coding assistant claims work it didn't do.

Your agent ends a turn with "Done! I added a rateLimiter middleware to src/server.ts, fixed the timeout bug, and added tests." You trust the summary, commit, and move on. Two weeks later production breaks — the rate limiter was never written. The summary lied (or hallucinated), and nothing checked it against the actual diff.

groundtruth reads the assistant's end-of-turn summary, extracts each concrete claim, and verifies it against what actually changed — the ground truth. It runs automatically as a Claude Code hook, or on demand from the CLI.

groundtruth — claim check

  ❌ unsupported  symbol `rateLimiter`
     Claimed `rateLimiter`, but it does not appear anywhere in this turn's changes.
     from: "I added a `rateLimiter` middleware to `src/server.ts`, ... and added tests."
  ❌ unsupported  file src/server.ts
     Claimed a change to `src/server.ts`, but it is not among the files changed (README.md).
  ❌ unsupported  tests
     Claimed test work, but no test file changed and no test command ran this turn.

  3 claims · 0 verified · 3 unsupported

The entire change in this turn was a single README edit. groundtruth caught all three false claims.


Why this exists

Research on agentic pull requests found that "phantom changes" — work the description claims but never implements — are the single most common kind of inconsistency. Tests and CI catch code that's wrong; nothing catches code that was simply never written but confidently reported as done. That's the gap groundtruth fills.

It is built on one principle: the diff doesn't lie. Natural-language summaries are graded against deterministic facts (which files changed, which symbols appear in the added lines, whether a test file changed or a test or install command actually ran), never against another model's opinion.

Try it in 30 seconds

No install, no config — see it catch a phantom change against a canned transcript:

npx @veltiq/groundtruth verify --transcript examples/phantom-change.jsonl --no-git

Install

Requires Node ≥ 20. No global install needed — the hook runs through npx.

# Wire it into Claude Code as a Stop hook for this project (./.claude/settings.json)
npx @veltiq/groundtruth install

# …or for every project (~/.claude/settings.json)
npx @veltiq/groundtruth install --global

Restart Claude Code (or run /hooks) and groundtruth checks every turn automatically. Want a faster, always-on binary? Run npm i -g @veltiq/groundtruth first (it installs the groundtruth command) and install auto-detects it. To check the current session without installing anything:

npx @veltiq/groundtruth verify

Prefer plugins? Add the marketplace and install in one step:

/plugin marketplace add veltiq/groundtruth
/plugin install groundtruth

How it works

transcript ─▶ Turn ─▶ ( Evidence + Claims ) ─▶ Verdicts ─▶ Report
            summary      diff       prose      per-claim
            + tools    ground truth  parse      check
  1. Read the turn. Parse the Claude Code JSONL transcript for the latest turn: the assistant's final summary plus every tool it called (Write, Edit, MultiEdit, Bash, …).
  2. Collect ground truth. Build evidence from those tool calls (precise, turn-scoped) plus the git working-tree diff (corroborating). This is the set of files touched, text added/removed, and commands run.
  3. Extract claims. Pull concrete assertions out of the prose, anchored on strong signals — backticked identifiers, real file paths, test/dependency keywords. Statements of intent ("I'll add…") are ignored.
  4. Verify. Check each claim against the evidence and assign a verdict.

| Verdict | Meaning |
| --- | --- |
| verified | Concrete evidence backs the claim. |
| unsupported | The claim is concretely checkable and has zero matching evidence — a phantom change. |
| ⚠️ review | Semantic or ambiguous (e.g. "fixed the bug") — shown for your attention, never counted as a failure. |
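
Step 3 can be pictured with a small sketch. This is not groundtruth's actual extractor: the names `SketchClaim` and `extractClaimsSketch`, the regular expressions, and the path heuristic are all assumptions made for illustration. It pulls backticked tokens out of a summary, treats tokens that look like paths as file claims and the rest as symbol claims, picks up "added tests"-style wording, and skips sentences that only state intent.

```typescript
// Hypothetical sketch of claim extraction; not the library's real implementation.
type SketchClaimKind = "file" | "symbol" | "test";

interface SketchClaim {
  kind: SketchClaimKind;
  target: string;   // the file path, the identifier, or the literal "tests"
  sentence: string; // the sentence the claim was found in
}

// Assumed intent markers: sentences such as "I'll add…" are plans, not claims.
const INTENT_PATTERN = /\b(I'll|I will|going to|plan to)\b/i;
// Assumed heuristic: a token containing a slash or ending in an extension is a path.
const PATH_PATTERN = /\/|\.[a-z]{1,5}$/i;
// Assumed test wording: a past-tense verb followed by "test(s)" in the same clause.
const TEST_PATTERN = /\b(added|wrote|updated)\b[^.]*\btests?\b/i;

function extractClaimsSketch(summary: string): SketchClaim[] {
  const claims: SketchClaim[] = [];
  // Naive sentence split: break after ., ! or ? only when whitespace follows,
  // so the dot inside "src/server.ts" does not end a sentence.
  const sentences = summary.replace(/([.!?])\s+/g, "$1\n").split("\n");
  for (const sentence of sentences) {
    if (INTENT_PATTERN.test(sentence)) continue;
    const backticked = /`([^`]+)`/g;
    let match: RegExpExecArray | null = backticked.exec(sentence);
    while (match !== null) {
      const target = match[1] ?? "";
      if (target !== "") {
        const kind: SketchClaimKind = PATH_PATTERN.test(target) ? "file" : "symbol";
        claims.push({ kind, target, sentence });
      }
      match = backticked.exec(sentence);
    }
    if (TEST_PATTERN.test(sentence)) {
      claims.push({ kind: "test", target: "tests", sentence });
    }
  }
  return claims;
}
```

Anchoring on backticks and past-tense wording is what keeps a sketch like this quiet: a sentence with neither produces no claim at all.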

A deliberate bias toward silence

False alarms are what get a tool like this uninstalled, so the rules are conservative by design: a claim is only marked unsupported when it is unambiguously checkable and nothing supports it. Anything vague becomes review, not a failure. groundtruth would rather miss a questionable claim than wrongly accuse a correct one. See docs/design.md.
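
The rule above reduces to a three-way decision. The sketch below is an assumed reading of it, not code from the project; `CheckedClaimSketch` and `decideVerdictSketch` are hypothetical names.

```typescript
// Hypothetical sketch of the conservative verdict rule described above.
type SketchVerdict = "verified" | "unsupported" | "review";

interface CheckedClaimSketch {
  checkable: boolean;    // true only when the claim names something concrete (a path, an identifier)
  evidenceCount: number; // how many matching facts were found in the turn's evidence
}

function decideVerdictSketch(claim: CheckedClaimSketch): SketchVerdict {
  // Anything vague goes to review, regardless of evidence, and never fails a run.
  if (!claim.checkable) return "review";
  // A single matching fact is enough to verify.
  if (claim.evidenceCount > 0) return "verified";
  // Only a concrete claim with zero evidence is reported as a phantom change.
  return "unsupported";
}
```

Checking `checkable` first is the design choice that matters: an ambiguous claim can never reach the `unsupported` branch.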

Usage

groundtruth verify                       # check the latest session for this project
groundtruth verify --transcript x.jsonl  # check a specific transcript
groundtruth verify --markdown            # emit markdown (great as a PR comment)
groundtruth verify --json                # machine-readable output
groundtruth verify --strict              # exit non-zero if anything is unsupported

groundtruth install [--global] [--npx] [--strict] [--print]

By default the hook is non-blocking: it prints its report and gets out of the way. Pass --strict (or set GROUNDTRUTH_STRICT=1) to make it block the turn when unsupported claims are found.

What it checks

| Claim type | Example | Verified when… |
| --- | --- | --- |
| file | "updated src/auth.ts" | that file was touched this turn |
| symbol | "added a validateInput function" | the identifier appears in the added (or removed) code |
| test | "added tests" | a test file changed or a test command ran |
| dependency | "installed zod" | a manifest changed or an install command ran |
| command | "ran the build" | a matching command ran via the Bash tool (advisory) |
| action | "fixed the timeout bug" | — not machine-checkable; flagged for review |

Full details in docs/claim-types.md.
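
To make the symbol and test rows concrete, here is a minimal sketch of how such checks could run against a unified diff. It is an illustration under stated assumptions, not the project's code: the function names are hypothetical, and the test-file naming convention (`tests/`, `__tests__/`, `.test.`, `.spec.`) is an assumption.

```typescript
// Hypothetical sketch: per-kind checks against a unified diff.

// Collect the text of added and removed lines, skipping file headers.
// Limitation of the sketch: a removed line whose text begins with "--" is
// indistinguishable from a header here and is skipped.
function changedLinesSketch(unifiedDiff: string): string[] {
  const lines: string[] = [];
  for (const line of unifiedDiff.split("\n")) {
    if (line.indexOf("+++") === 0 || line.indexOf("---") === 0) continue;
    const marker = line.charAt(0);
    if (marker === "+" || marker === "-") lines.push(line.slice(1));
  }
  return lines;
}

// Escape regular-expression metacharacters in a literal string.
function escapeRegExpSketch(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when the identifier appears as a whole token in an added or removed line,
// so `rateLimiter` does not match inside `rateLimiterConfig`.
function symbolAppearsSketch(identifier: string, unifiedDiff: string): boolean {
  const pattern = new RegExp(
    "(^|[^A-Za-z0-9_$])" + escapeRegExpSketch(identifier) + "($|[^A-Za-z0-9_$])"
  );
  return changedLinesSketch(unifiedDiff).some((line) => pattern.test(line));
}

// Assumed convention for recognizing a test file by its path.
function looksLikeTestFileSketch(path: string): boolean {
  return /(^|\/)(tests?|__tests__)\//.test(path) || /\.(test|spec)\.[a-z]+$/i.test(path);
}
```

Context lines and file headers are excluded on purpose: a symbol that merely sits near a change is not evidence that the turn wrote it.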

Use in CI (GitHub Action)

Post claim verdicts as a sticky comment on every PR — grading the PR description against the diff, so it works on any PR with zero agent setup. (groundtruth runs this on its own PRs.)

# .github/workflows/groundtruth.yml
name: groundtruth
on: pull_request
permissions:
  contents: read
  pull-requests: write
jobs:
  claim-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with: { fetch-depth: 0 }
      - uses: veltiq/groundtruth@v0.3.0

Add with: { strict: true } to turn it into a merge gate. Full options in docs/github-action.md.

Locally, against your commit message

--staged checks a message against what's actually staged — drop this in .git/hooks/commit-msg (or a lefthook/husky commit-msg hook):

#!/bin/sh
# .git/hooks/commit-msg — verify the commit message against the staged diff
npx @veltiq/groundtruth verify --summary "$1" --staged
# add --strict to abort the commit when a claim is unsupported

Prefer pre-commit? Add this to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/veltiq/groundtruth
    rev: v0.5.0
    hooks:
      - id: groundtruth
        verbose: true          # show the report even when nothing fails
        # args: ["--strict"]   # and abort the commit on an unsupported claim

Then pre-commit install --hook-type commit-msg.

Configuration

Optional — drop a .groundtruthrc.json in your project (or a "groundtruth" key in package.json):

{
  "strict": false,
  "failOn": ["unsupported"],
  "shadow": false,
  "ignore": ["CHANGELOG.md", "*.generated.ts"],
  "ignoreKinds": ["command"],
  "output": "terminal"
}
  • ignore — claim targets to skip (substring or * glob). Your escape hatch for any false positive.
  • ignoreKinds — whole claim kinds to skip (file, symbol, test, dependency, command, action).
  • strict / output — defaults for blocking and output format.
  • failOn — which verdict levels count as a failure in strict mode (default ["unsupported"]).
  • shadow — record to the ledger but never print or block (for gradual rollout).
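
One way to read how `ignore`, `failOn`, and `strict` combine is sketched below. The matching semantics are assumptions (a pattern without `*` is a substring test; with `*` it is a glob anchored to the whole target, where `*` also crosses `/`), and every name in the block is hypothetical rather than part of the package's API.

```typescript
// Hypothetical sketch of how the config options could interact.
interface ConfigSketch {
  strict: boolean;
  failOn: string[];
  ignore: string[];
}

interface VerdictRowSketch {
  target: string;  // e.g. a file path or identifier
  verdict: string; // "verified" | "unsupported" | "review"
}

// Escape regular-expression metacharacters in one literal piece of a glob.
function escapeGlobPartSketch(part: string): string {
  return part.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
}

function matchesIgnoreSketch(target: string, pattern: string): boolean {
  // No wildcard: plain substring match.
  if (pattern.indexOf("*") === -1) return target.indexOf(pattern) !== -1;
  // Wildcard: each * matches any run of characters (assumed to include "/").
  const source = pattern.split("*").map(escapeGlobPartSketch).join(".*");
  return new RegExp("^" + source + "$").test(target);
}

// A run fails only in strict mode, and only for a non-ignored claim whose
// verdict is listed in failOn.
function shouldFailSketch(config: ConfigSketch, rows: VerdictRowSketch[]): boolean {
  if (!config.strict) return false;
  return rows.some((row) => {
    const ignored = config.ignore.some((pattern) => matchesIgnoreSketch(row.target, pattern));
    return !ignored && config.failOn.indexOf(row.verdict) !== -1;
  });
}
```

Under these assumptions, `"*.generated.ts"` covers `src/api.generated.ts` and `"CHANGELOG.md"` covers `docs/CHANGELOG.md`.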

Install into more hook events for multi-agent workflows:

npx @veltiq/groundtruth install --events Stop,SubagentStop,SessionEnd

SubagentStop checks each subagent's turn; SessionEnd prints a per-session digest.

Other agents

The Stop hook is Claude Code-specific, but verify reads other agents' transcripts too — the claim engine is agent-neutral:

groundtruth verify --agent codex     # OpenAI Codex CLI
groundtruth verify --agent gemini    # Gemini CLI
groundtruth verify --agent cursor    # Cursor (agent-transcripts, or state.vscdb on older builds)
groundtruth verify --agent opencode  # OpenCode
groundtruth verify --agent aider     # Aider (best-effort)
groundtruth verify --agent auto      # pick the most recent across all

Each adapter normalizes the agent's transcript into the same {summary, toolUses} shape. New adapters are a great contribution — see CONTRIBUTING.md.
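
As a rough picture of what an adapter does, the sketch below normalizes a made-up JSONL format into a `{summary, toolUses}` object. The line format (`role`, `text`, `tools`) is invented for illustration and is not any real agent's transcript schema; the type and function names are hypothetical, and the real adapter interface is described in CONTRIBUTING.md.

```typescript
// Hypothetical adapter sketch; the input line format is invented for illustration.
interface ToolUseSketch {
  name: string;   // e.g. "Edit" or "Bash"
  input: unknown; // tool arguments, passed through untouched
}

interface TurnSketch {
  summary: string;
  toolUses: ToolUseSketch[];
}

interface InventedLineSketch {
  role?: string;
  text?: string;
  tools?: ToolUseSketch[];
}

// Parse one JSONL line; return null for blank, corrupt, or non-object lines.
function parseLineSketch(raw: string): InventedLineSketch | null {
  const line = raw.trim();
  if (line === "") return null;
  try {
    const parsed = JSON.parse(line);
    return parsed !== null && typeof parsed === "object" ? (parsed as InventedLineSketch) : null;
  } catch (err) {
    return null;
  }
}

// Keep only the latest turn: its final assistant text plus every tool it called.
function normalizeSketch(jsonl: string): TurnSketch {
  let turn: TurnSketch = { summary: "", toolUses: [] };
  for (const raw of jsonl.split("\n")) {
    const entry = parseLineSketch(raw);
    if (entry === null) continue;
    if (entry.role === "user") {
      // A user message starts a new turn, so earlier evidence is dropped.
      turn = { summary: "", toolUses: [] };
    } else if (entry.role === "assistant") {
      if (typeof entry.text === "string" && entry.text !== "") turn.summary = entry.text;
      if (Array.isArray(entry.tools)) turn.toolUses = turn.toolUses.concat(entry.tools);
    }
  }
  return turn;
}
```

Resetting at each user message is what keeps the evidence turn-scoped: tool calls from an earlier turn cannot vouch for a later summary.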

Older Cursor builds keep sessions in a SQLite store (globalStorage/state.vscdb) instead of JSONL. groundtruth reads it automatically via node:sqlite (Node 24+, or Node 22 with --experimental-sqlite); on older Node it falls back to the JSONL transcripts.

Stats & status bar

The hook keeps a privacy-safe local tally (counts only — never code or prompts, in ~/.groundtruth/ledger.jsonl):

groundtruth stats          # this project: turns, verified, unsupported, to-review (7d/30d/all)
groundtruth stats --all    # across every project

Show a live count in the Claude Code status bar (🔎 gt 3❌ ·7d):

npx @veltiq/groundtruth install --statusline

Honest limitations

  • It verifies that claimed work exists in the diff, not that it is correct. "Fixed the bug" can be confirmed to touch the right code; it cannot be confirmed to actually fix anything. That's what tests are for.
  • Extraction favors precision over recall — it will miss vaguely-worded claims rather than risk a false accusation.
  • The automatic Stop hook targets Claude Code. Other agents are covered only through verify --agent (see Other agents), and the Aider adapter is best-effort. The core (extractClaims, verifyClaims) is format-agnostic; new adapters are welcome (see Contributing).

Use as a library

import { runPipeline, renderMarkdown } from "@veltiq/groundtruth";

const report = runPipeline({ transcriptPath: "session.jsonl", cwd: process.cwd() });
console.log(renderMarkdown(report));

FAQ

Does it send my code anywhere? No. It runs entirely locally: it reads your transcript and git, and writes nothing except the hook's counts-only ledger (~/.groundtruth/ledger.jsonl) and the Claude Code settings file that install edits. Zero network calls, zero runtime dependencies.

Will it block my commits or get in the way? No. By default it just prints a report and exits cleanly. Blocking is strictly opt-in (--strict).

Isn't this what tests are for? Tests catch code that's wrong. groundtruth catches code that was never written but reported as done — there's nothing for a test to run. They're complementary.

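Does it work with Cursor / other agents? Yes, on demand: verify --agent reads transcripts from Codex CLI, Gemini CLI, Cursor, OpenCode, and (best-effort) Aider. The automatic Stop hook is Claude Code-specific. The engine is format-agnostic, and new adapters are a great first contribution; see CONTRIBUTING.md.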

Will it falsely accuse me? It's tuned hard against that. A claim is only unsupported when it's concretely checkable and nothing supports it; everything fuzzy is shown as advisory, never a failure.

Contributing

Issues and PRs welcome — especially new claim patterns, agent adapters, and false-positive reports (those are gold). See CONTRIBUTING.md and the Code of Conduct.

Star history

If groundtruth ever catches your agent in a lie, a ⭐ helps other people find it.

License

MIT © Veltiq
