[feature] Detector self-evaluation corpus + measured FPR/FNR #197

Description

@yuyu04

What problem does this solve

cladding gates everything on ~37 drift detectors, and SECURITY.md itself elevates a detector false negative to "security-adjacent, because it erodes the falsifiability claim." Yet the detectors have no measured precision, recall, or false-positive rate, and no regression suite proves that they aren't drifting across releases toward either noise (false positives train users to ignore findings) or blindness (the documented security risk). clad benchmark today measures token/compression behavior, not gate accuracy. This is the meta-gap: the thing that judges everything is itself unjudged.

Proposed shape

A labeled detector-accuracy corpus + a scorer, run as a contributor self-audit (sibling to npm run conformance):

  • Fixtures tagged {detector, expect: drift|clean} — extend the existing conformance-fixture idea (GOVERNANCE.md §4.1 explicitly welcomes new fail-case fixtures).
  • A scorer computes per-detector TP/FP/FN/TN → precision/recall/FPR/FNR (FPR needs the true-negative count from the clean fixtures), aggregated across the corpus.
  • A tracked baseline (like perf/baseline.json) so a PR that pushes a detector's FPR up, or recall down, is visible in review.
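
To make the scorer bullet concrete, here is a minimal sketch of the per-detector tally. Everything beyond the {detector, expect: drift|clean} tag is an assumption for illustration: the `flagged` field, the `Observation` and `Accuracy` shapes, and the `scoreDetectors` name are hypothetical, not an existing cladding API. A rate whose denominator is zero is reported as null rather than guessed.

```typescript
// Hypothetical shapes: the proposal only specifies {detector, expect: drift|clean}.
type Expectation = "drift" | "clean";

interface Observation {
  detector: string;
  expect: Expectation; // label from the fixture
  flagged: boolean;    // assumed field: true if the detector reported drift
}

interface Confusion { tp: number; fp: number; fn: number; tn: number; }

interface Accuracy extends Confusion {
  precision: number | null;
  recall: number | null;
  fpr: number | null;
  fnr: number | null;
}

// Undefined rates (empty denominator) are reported as null, not 0.
const ratio = (num: number, den: number): number | null =>
  den === 0 ? null : num / den;

function scoreDetectors(observations: Observation[]): Record<string, Accuracy> {
  const counts: Record<string, Confusion> = {};
  for (const o of observations) {
    const c = counts[o.detector] ?? { tp: 0, fp: 0, fn: 0, tn: 0 };
    if (o.expect === "drift") {
      if (o.flagged) c.tp++; else c.fn++; // missed drift is a false negative
    } else {
      if (o.flagged) c.fp++; else c.tn++; // flagged clean input is a false positive
    }
    counts[o.detector] = c;
  }

  const result: Record<string, Accuracy> = {};
  for (const detector of Object.keys(counts)) {
    const c = counts[detector];
    if (c === undefined) continue;
    result[detector] = {
      tp: c.tp, fp: c.fp, fn: c.fn, tn: c.tn,
      precision: ratio(c.tp, c.tp + c.fp),
      recall: ratio(c.tp, c.tp + c.fn),
      fpr: ratio(c.fp, c.fp + c.tn),
      fnr: ratio(c.fn, c.fn + c.tp),
    };
  }
  return result;
}
```

Note that FPR is only defined for a detector that has at least one clean fixture, which is one argument for requiring both drift and clean cases per detector in the labeling convention.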

This turns "we have 37 detectors" into "here is each detector's measured accuracy, and it isn't regressing."
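
The "isn't regressing" half could be checked by diffing a run against the tracked baseline. The sketch below assumes a hypothetical baseline shape (detector name mapped to recall and FPR) and a hypothetical `findRegressions` helper; the actual file layout and any tolerance policy are open questions for scope guidance.

```typescript
// Hypothetical baseline entry: the proposal only says "a tracked baseline (like perf/baseline.json)".
interface DetectorRates { recall: number; fpr: number; }
type Baseline = Record<string, DetectorRates>;

interface Regression {
  detector: string;
  metric: "recall" | "fpr";
  baseline: number;
  current: number;
}

// Report every detector whose recall dropped or whose FPR rose beyond `tolerance`.
function findRegressions(
  baseline: Baseline,
  current: Baseline,
  tolerance: number = 0,
): Regression[] {
  const regressions: Regression[] = [];
  for (const detector of Object.keys(baseline)) {
    const before = baseline[detector];
    const after = current[detector];
    // Detectors missing from the current run are skipped here; a real check
    // would probably want to report them separately.
    if (before === undefined || after === undefined) continue;
    if (after.recall < before.recall - tolerance) {
      regressions.push({ detector, metric: "recall", baseline: before.recall, current: after.recall });
    }
    if (after.fpr > before.fpr + tolerance) {
      regressions.push({ detector, metric: "fpr", baseline: before.fpr, current: after.fpr });
    }
  }
  return regressions;
}
```

A nonzero tolerance trades sensitivity for fewer spurious review flags on small corpora, where a single fixture can move a rate by several points.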

Versioning scope (GOVERNANCE.md §2)

  • Minor — new fixtures + a self-audit scorer (no change to shipped detector behavior)

In-scope check (GOVERNANCE.md §4.1 / §4.2)

  • Not regressing Iron Law conformance — strengthens its evidentiary basis
  • Not bypassing the anti-self-cert guard
  • Not forking the Ironclad spec
  • Not cosmetic-only — the deliverable is fixtures + a scorer with tests

Alternatives considered

  • Rely on the existing 26 conformance fixtures — those verify the contract (does the suite match), not per-detector statistical accuracy on adversarial near-miss cases.

Willing to implement?

  • Yes — I'd open a PR (would value scope guidance on the corpus size / labeling convention first)

Part of a competitive-gap analysis; flagged as the highest-leverage "integrity of the integrity layer" item.
