What problem does this solve
cladding gates everything on ~37 drift detectors, and SECURITY.md itself elevates a detector false-negative to "security-adjacent, because it erodes the falsifiability claim." Yet there is no measured precision / recall / false-positive rate for the detectors, and no regression suite that proves they don't drift over releases toward either noise (false-positives train users to ignore findings) or blindness (the documented security risk). clad benchmark today measures token/compression behavior, not gate accuracy. This is the meta-gap: the thing that judges everything is itself unjudged.
Proposed shape
A labeled detector-accuracy corpus + a scorer, run as a contributor self-audit (sibling to npm run conformance):
- Fixtures tagged
{detector, expect: drift|clean} — extend the existing conformance-fixture idea (GOVERNANCE.md §4.1 explicitly welcomes new fail-case fixtures).
- A scorer computes per-detector TP/FP/FN → precision/recall/FPR, aggregated across the corpus.
- A tracked baseline (like
perf/baseline.json) so a PR that pushes a detector's FPR up, or recall down, is visible in review.
This turns "we have 37 detectors" into "here is each detector's measured accuracy, and it isn't regressing."
Versioning scope (GOVERNANCE.md §2)
In-scope check (GOVERNANCE.md §4.1 / §4.2)
Alternatives considered
- Rely on the existing 26 conformance fixtures — those verify the contract (does the suite match), not per-detector statistical accuracy on adversarial near-miss cases.
Willing to implement?
Part of a competitive-gap analysis; flagged as the highest-leverage "integrity of the integrity layer" item.
What problem does this solve
cladding gates everything on ~37 drift detectors, and
SECURITY.mditself elevates a detector false-negative to "security-adjacent, because it erodes the falsifiability claim." Yet there is no measured precision / recall / false-positive rate for the detectors, and no regression suite that proves they don't drift over releases toward either noise (false-positives train users to ignore findings) or blindness (the documented security risk).clad benchmarktoday measures token/compression behavior, not gate accuracy. This is the meta-gap: the thing that judges everything is itself unjudged.Proposed shape
A labeled detector-accuracy corpus + a scorer, run as a contributor self-audit (sibling to
npm run conformance):{detector, expect: drift|clean}— extend the existing conformance-fixture idea (GOVERNANCE.md §4.1 explicitly welcomes new fail-case fixtures).perf/baseline.json) so a PR that pushes a detector's FPR up, or recall down, is visible in review.This turns "we have 37 detectors" into "here is each detector's measured accuracy, and it isn't regressing."
Versioning scope (GOVERNANCE.md §2)
In-scope check (GOVERNANCE.md §4.1 / §4.2)
Alternatives considered
Willing to implement?
Part of a competitive-gap analysis; flagged as the highest-leverage "integrity of the integrity layer" item.