hookbench

Design-surface calibration for Claude Code guardrail hooks. Measures, per hook, catch-rate (recall) on real violations and false-positive rate on benign events — on the hooks' actual operating surface (real Claude Code event payloads), not a synthetic chat corpus. Standalone, plus a claudemax drop-in.

The gap it closes: llm-dark-patterns/evaluation scores hooks against the DarkBench chat distribution, which its own RESULTS.md admits "differs from the hooks' design surface (Claude Code closeout text)." hookbench scores them on that real surface — where the false positives that interrupt your day actually live.

Why false-positive rate, not just F1

Per 2026 guardrail-eval practice (focused review, not exhaustive): you need two balanced sets — adversarial (violations, scored by recall) and a benign set sampled from real traffic (scored by false-positive rate) — because a single F1 hides which failure you have, and FP cost is operational: a few-percent FPR on a busy hook means constant spurious blocks (guardrail metrics, provider benchmark). hookbench foregrounds FPR = FP/(FP+TN) next to recall.

It already found a real bug

Against claudemax's live dp.sh no-vibes, hookbench reproduced a genuine false positive: echo 'reminder: avoid git push --force …' is blocked — the matcher fires on the substring regardless of echo/quote context (PreToolUse FPR 33%, recall 100%). Full data + the honest payload-fidelity caveat in RESULTS.md.

Use

python -m hookbench validate data/seed_corpus.jsonl
python -m hookbench score --corpus data/seed_corpus.jsonl \
  --hook "bash /path/to/.claude/hooks/dp.sh no-vibes.sh" --event PreToolUse
scripts/probe.sh                # reproduce the RESULTS.md probe
python -m hookbench score ... --gate --max-fpr 0.05 --min-recall 0.9   # CI gate

The daily loop (capture → label → score → gate) and the claudemax wiring are in integrations/claudemax/.

Honesty constraints (by design)

Black-box hook accuracy is only trustworthy on real captured payloads. Synthesized payloads exercise the engine but are not real-hook accuracy — the scorer prints a NOTE when a corpus has no real_capture events, and RESULTS.md explicitly does not report the synthesized Stop numbers as a hook failure.
The runner reads the verdict from the exit code (2=block, else allow), exactly as Claude Code enforces — no stdout-decision guessing.

What's verified

python -m pytest -q → 7 passed (scoring math exact, runner vs a controllable mock, gate pass/fail, capture roundtrip, adapter capture, seed loads). Real-hook numbers come only from real captures; closeout/Stop calibration is pending real-payload capture (that's what the daily loop is for).

License: Apache-2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
hookbench		hookbench
integrations/claudemax		integrations/claudemax
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
RESULTS.md		RESULTS.md
SPEC.md		SPEC.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hookbench

Why false-positive rate, not just F1

It already found a real bug

Use

Honesty constraints (by design)

What's verified

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

hookbench

Why false-positive rate, not just F1

It already found a real bug

Use

Honesty constraints (by design)

What's verified

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages