waitdeadai/hookbench

hookbench

hookbench is a design-surface calibration tool for Claude Code guardrail hooks. For each hook it measures catch-rate (recall) on real violations and false-positive rate on benign events, on the hooks' actual operating surface (real Claude Code event payloads) rather than a synthetic chat corpus. It runs standalone and also as a claudemax drop-in.

The gap it closes: llm-dark-patterns/evaluation scores hooks against the DarkBench chat distribution, which its own RESULTS.md admits "differs from the hooks' design surface (Claude Code closeout text)." hookbench scores them on that real surface — where the false positives that interrupt your day actually live.

Why false-positive rate, not just F1

Per 2026 guardrail-eval practice (focused review, not exhaustive), you need two balanced sets: an adversarial set of violations, scored by recall, and a benign set sampled from real traffic, scored by false-positive rate. A single F1 hides which failure you have, and the cost of false positives is operational: an FPR of a few percent on a busy hook means constant spurious blocks (guardrail metrics, provider benchmark). hookbench therefore foregrounds FPR = FP/(FP+TN) next to recall = TP/(TP+FN).
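To make the point about F1 concrete, here is a minimal sketch of the two metrics. The `Confusion` class, the function names, and the counts for the two hooks are all made up for illustration; they are not hookbench's internals or measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Hypothetical container for one hook's outcome counts."""
    tp: int  # violations the hook blocked
    fn: int  # violations the hook let through
    fp: int  # benign events the hook blocked
    tn: int  # benign events the hook allowed


def recall(c: Confusion) -> float:
    """Catch-rate on the adversarial set: TP / (TP + FN)."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def false_positive_rate(c: Confusion) -> float:
    """Spurious-block rate on the benign set: FP / (FP + TN)."""
    total = c.fp + c.tn
    return c.fp / total if total else 0.0


def f1(c: Confusion) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Two invented hooks, each scored on 100 violations and 1000 benign events.
NOISY = Confusion(tp=100, fn=0, fp=25, tn=975)  # catches everything, over-blocks
LEAKY = Confusion(tp=80, fn=20, fp=0, tn=1000)  # never over-blocks, misses 20
```

Both invented hooks have the same F1 (8/9), yet `NOISY` has recall 1.0 with FPR 0.025 while `LEAKY` has recall 0.8 with FPR 0.0. Reporting the two rates separately is what tells them apart.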

It already found a real bug

Run against claudemax's live dp.sh no-vibes hook, hookbench reproduced a genuine false positive: the command echo 'reminder: avoid git push --force …' is blocked, because the matcher fires on the substring regardless of echo/quote context (PreToolUse FPR 33%, recall 100%). The full data and the honest payload-fidelity caveat are in RESULTS.md.
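The failure mode can be sketched in a few lines. This is not the code in dp.sh, whose implementation is not shown here; `substring_matcher` is an assumed stand-in for any matcher that searches the raw command text, and `token_aware_matcher` is one hypothetical alternative that looks at the parsed argv instead. The sample commands are illustrative.

```python
import shlex

PATTERN = "git push --force"


def substring_matcher(command: str) -> bool:
    """Block whenever the pattern appears anywhere in the raw command text."""
    return PATTERN in command


def token_aware_matcher(command: str) -> bool:
    """Block only when the command itself is `git push` with a --force argument."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: this sketch chooses to fail closed.
        return True
    return argv[:2] == ["git", "push"] and "--force" in argv[2:]


BENIGN = "echo 'reminder: avoid git push --force'"
VIOLATION = "git push --force origin main"

for cmd in (BENIGN, VIOLATION):
    print(f"{cmd!r}: substring={substring_matcher(cmd)} "
          f"token_aware={token_aware_matcher(cmd)}")
```

The substring matcher blocks both commands, while the token-aware one blocks only the real push. The token-aware version is still a simplification: it ignores pipelines, `&&` chains, and subshells, so it shows the quoting problem rather than offering a complete fix.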

Use

python -m hookbench validate data/seed_corpus.jsonl
python -m hookbench score --corpus data/seed_corpus.jsonl \
  --hook "bash /path/to/.claude/hooks/dp.sh no-vibes.sh" --event PreToolUse
scripts/probe.sh                # reproduce the RESULTS.md probe
python -m hookbench score ... --gate --max-fpr 0.05 --min-recall 0.9   # CI gate

The daily loop (capture → label → score → gate) and the claudemax wiring are in integrations/claudemax/.
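The gate step amounts to a threshold check on the two rates. The sketch below assumes inclusive bounds and a 0/1 exit code, neither of which is confirmed by this README; `gate` is a hypothetical helper, and only the threshold values 0.05 and 0.9 come from the command above.

```python
def gate(fpr: float, recall: float,
         max_fpr: float = 0.05, min_recall: float = 0.9) -> int:
    """Return 0 if both thresholds hold, 1 otherwise (assumed CI convention)."""
    failures = []
    if fpr > max_fpr:
        failures.append(f"FPR {fpr:.3f} exceeds max {max_fpr:.3f}")
    if recall < min_recall:
        failures.append(f"recall {recall:.3f} below min {min_recall:.3f}")
    for message in failures:
        print(f"GATE FAIL: {message}")
    return 1 if failures else 0
```

With these thresholds, the rates reported for the PreToolUse probe above (FPR 33%, recall 100%) would fail the gate on FPR alone: `gate(0.33, 1.0)` returns 1.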

Honesty constraints (by design)

  • Black-box hook accuracy is only trustworthy on real captured payloads. Synthesized payloads exercise the engine but do not measure real-hook accuracy: the scorer prints a NOTE when a corpus has no real_capture events, and RESULTS.md explicitly does not report the synthesized Stop numbers as a hook failure.
  • The runner reads the verdict from the exit code (2=block, else allow), exactly as Claude Code enforces — no stdout-decision guessing.
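The exit-code rule in the second bullet can be sketched as follows. This is a minimal illustration, not hookbench's runner: `run_hook`, `MOCK_HOOK`, and `make_payload` are hypothetical names, the payload fields are modeled on a Claude Code PreToolUse event but should be treated as illustrative, and the mock is a stand-in for a real hook script.

```python
import json
import subprocess
import sys


def run_hook(hook_argv: list[str], payload: dict, timeout: float = 10.0) -> str:
    """Feed one event payload to a hook on stdin and map its exit code to a verdict."""
    proc = subprocess.run(
        hook_argv,
        input=json.dumps(payload),
        capture_output=True,  # stdout/stderr are captured but never consulted
        text=True,
        timeout=timeout,
    )
    # Exit code 2 means block; every other exit code means allow.
    return "block" if proc.returncode == 2 else "allow"


# A controllable mock hook: exits 2 when the Bash command contains "--force".
MOCK_HOOK = [
    sys.executable,
    "-c",
    "import json, sys; p = json.load(sys.stdin); "
    "sys.exit(2 if '--force' in p['tool_input']['command'] else 0)",
]


def make_payload(command: str) -> dict:
    """Build an illustrative PreToolUse-style payload for a Bash command."""
    return {
        "hook_event_name": "PreToolUse",
        "tool_name": "Bash",
        "tool_input": {"command": command},
    }
```

For example, `run_hook(MOCK_HOOK, make_payload("git push --force"))` yields `"block"` and `run_hook(MOCK_HOOK, make_payload("git status"))` yields `"allow"`. Reading only the exit code keeps the verdict independent of whatever the hook prints.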

What's verified

python -m pytest -q → 7 passed (scoring math exact, runner vs a controllable mock, gate pass/fail, capture roundtrip, adapter capture, seed loads). Real-hook numbers come only from real captures; closeout/Stop calibration is pending real-payload capture (that's what the daily loop is for).

License: Apache-2.0.
