Design-surface calibration for Claude Code guardrail hooks. Measures, per hook, catch-rate (recall) on real violations and false-positive rate on benign events — on the hooks' actual operating surface (real Claude Code event payloads), not a synthetic chat corpus. Standalone, plus a claudemax drop-in.
The gap it closes:
llm-dark-patterns/evaluationscores hooks against the DarkBench chat distribution, which its ownRESULTS.mdadmits "differs from the hooks' design surface (Claude Code closeout text)." hookbench scores them on that real surface — where the false positives that interrupt your day actually live.
Per 2026 guardrail-eval practice (focused review, not exhaustive): you need two balanced sets — adversarial (violations, scored by recall) and a benign set sampled from real traffic (scored by false-positive rate) — because a single F1 hides which failure you have, and FP cost is operational: a few-percent FPR on a busy hook means constant spurious blocks (guardrail metrics, provider benchmark). hookbench foregrounds FPR = FP/(FP+TN) next to recall.
Against claudemax's live dp.sh no-vibes, hookbench reproduced a genuine false
positive: echo 'reminder: avoid git push --force …' is blocked — the matcher
fires on the substring regardless of echo/quote context (PreToolUse FPR 33%, recall
100%). Full data + the honest payload-fidelity caveat in RESULTS.md.
python -m hookbench validate data/seed_corpus.jsonl
python -m hookbench score --corpus data/seed_corpus.jsonl \
--hook "bash /path/to/.claude/hooks/dp.sh no-vibes.sh" --event PreToolUse
scripts/probe.sh # reproduce the RESULTS.md probe
python -m hookbench score ... --gate --max-fpr 0.05 --min-recall 0.9 # CI gateThe daily loop (capture → label → score → gate) and the claudemax wiring are in
integrations/claudemax/.
- Black-box hook accuracy is only trustworthy on real captured payloads.
Synthesized payloads exercise the engine but are not real-hook accuracy — the
scorer prints a NOTE when a corpus has no
real_captureevents, andRESULTS.mdexplicitly does not report the synthesized Stop numbers as a hook failure. - The runner reads the verdict from the exit code (2=block, else allow), exactly as Claude Code enforces — no stdout-decision guessing.
python -m pytest -q → 7 passed (scoring math exact, runner vs a controllable mock,
gate pass/fail, capture roundtrip, adapter capture, seed loads). Real-hook numbers
come only from real captures; closeout/Stop calibration is pending real-payload
capture (that's what the daily loop is for).
License: Apache-2.0.