Luis's port of wikiharness code is too conservative on non-hallucinated quotes #25

Description

@tieguy

TL;DR

The citation-verification grounding gate conflates two different model capabilities:

  • (A) transcription — can the model emit a long, character-perfect verbatim quote?
  • (B) assessment — can the model correctly judge whether the source supports the claim?

The gate uses success at (A) as a hard precondition for trusting (B): a Supported/Partial
verdict whose quoted span doesn't string-locate in the fetched source is silently downgraded to
Not supported. So failures at (A) (paraphrase, re-casing, non-contiguous spans joined with "…",
dash/ellipsis drift, truncation) masquerade as failures at (B), producing false Not supported
verdicts on citations that are genuinely supported.

We want to give the model room to "fix" (A) — normalize, retry with a shorter span, or point instead
of transcribe — without weakening the anti-hallucination guarantee.

This is the wikiharness→SP42 citation port; the issue is that the port is too conservative, not
that it's wrong. Both stricter-than-reference layers are deliberate anti-fabrication discipline
(ADR-0007); the problem is they over-fire on transcription noise.

Evidence (measured 2026-06-09)

Benchmark: alex-cite-checker's labeled corpus (189 rows — claim + cached source_text + human
ground truth). Method: serve each case's cached source text over localhost so SP42 verifies the
exact labeled bytes (no live-fetch / extraction drift). Compare SP42 to alex's own per-case
verdicts, identical model(s). (alex's benchmark = the reference implementation this was ported from.)

Single model (mistralai/mistral-small-3.2-24b-instruct, the same model both ran):

  • SP42-vs-alex exact agreement: 68.5%
  • accuracy vs ground truth: SP42 54.4% / alex 55.1% (≈ parity — the port reproduces the
    reference's accuracy)
  • source_unavailable: SP42 38 / alex 26 (SP42 more aggressive)

Panel (alex's PANEL_FAST = mistral + granite + gemma; SP42's voting is a verbatim port of alex's
voting.js — 4-class plurality + skeptical tiebreaker Partial > NotSupported > SourceUnavailable > Supported):

  • SP42-vs-alex exact agreement: 75.1%
  • accuracy vs ground truth: SP42 62.0% / alex 66.1%
  • source_unavailable: SP42 14 / alex 14 (voting out-votes the single-model over-aggression —
    converged)
  • but the not_supported lean persists — panel confusion: SP42 said not_supported where alex
    said supported ×10 / partial ×12.
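
For reference, the voting rule described above can be sketched as follows. This is not the text of voting.js or of SP42's port; the enum and function names are made up for illustration, and the tiebreak order is my reading of "Partial > NotSupported > SourceUnavailable > Supported" (the leftmost class wins a tie).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// Tiebreak rank, lower wins. Assumed order:
/// Partial > NotSupported > SourceUnavailable > Supported.
fn tiebreak_rank(v: Verdict) -> u8 {
    match v {
        Verdict::Partial => 0,
        Verdict::NotSupported => 1,
        Verdict::SourceUnavailable => 2,
        Verdict::Supported => 3,
    }
}

/// 4-class plurality vote; ties go to the most skeptical class.
/// Returns None for an empty panel.
fn panel_verdict(votes: &[Verdict]) -> Option<Verdict> {
    let mut counts: HashMap<Verdict, usize> = HashMap::new();
    for &v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Larger count wins; on equal counts the lower tiebreak rank wins.
        .max_by(|a, b| {
            a.1.cmp(&b.1)
                .then(tiebreak_rank(b.0).cmp(&tiebreak_rank(a.0)))
        })
        .map(|(v, _)| v)
}
```

Under this rule a 1-1-1 split between Supported, NotSupported and Partial resolves to Partial, and a 1-1 split between Supported and NotSupported resolves to NotSupported, so voting alone cannot cancel a systematic not_supported lean in the individual votes.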

Net: at the panel level SP42 trails alex by ~4 pts on alex's own ground truth, almost entirely via
extra Not supported calls.

Root cause — two deliberate layers SP42 has that alex's benchmark does NOT

Not an accidental port bug. Both are the anti-hallucination discipline (ADR-0007), both stricter than
alex's raw benchmark:

  1. Prompt-level. SP42's verify SYSTEM prompt (crates/sp42-core/src/citation/prompts.rs)
    hard-instructs:

    "For SUPPORTED or PARTIAL you MUST quote a short, VERBATIM span copied exactly… If you cannot
    find such a verbatim span, the verdict is NOT_SUPPORTED."

    alex's prompt has no downgrade instruction — it just asks for a quote inside comments
    ({verdict, comments} vs SP42's {verdict, quote}). So the prompt itself nudges toward
    NOT_SUPPORTED. (SP42 ported this faithfully from wikiharness, which had already strengthened
    alex's prompt.)

  2. Pipeline grounding gate. assemble_citation_finding
    (crates/sp42-core/src/citation/verify.rs) string-locates the model's quote via locate_quote
    and downgrades Supported/Partial → Not supported if it doesn't locate. alex's benchmark
    records the raw model verdict (run_benchmark.js: predicted_verdict: result.verdict;
    parsing.js does zero locate).

    locate_quote (crates/sp42-core/src/citation/locate_quote.rs) today: contiguous substring,
    CASE-SENSITIVE, normalizing only NFC + whitespace-run collapse + curly→straight quotes. There
    is even a test, case_difference_does_not_match, that deliberately encodes the conflation
    ("a model that re-cases a quote gets no free pass"), but re-casing is a transcription artifact,
    not a fabrication.
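
To make the over-firing concrete, here is a minimal sketch of the gate as described above. It is not the code in verify.rs or locate_quote.rs: the function names are hypothetical, and NFC normalization is left out because it needs an external crate. It only reproduces the described behavior (case-sensitive contiguous substring, whitespace-run collapse, curly→straight quotes, downgrade on a miss).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// Collapse whitespace runs and map curly quotes to straight ones.
/// (NFC is omitted in this sketch.)
fn normalize(s: &str) -> String {
    let mapped: String = s
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'',
            '\u{201C}' | '\u{201D}' => '"',
            other => other,
        })
        .collect();
    mapped.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Case-sensitive contiguous-substring check on the normalized text.
fn quote_locates(source: &str, quote: &str) -> bool {
    let q = normalize(quote);
    !q.is_empty() && normalize(source).contains(&q)
}

/// A positive verdict survives only if its quote locates;
/// every other verdict passes through unchanged.
fn gate(verdict: Verdict, quote: &str, source: &str) -> Verdict {
    match verdict {
        Verdict::Supported | Verdict::Partial if !quote_locates(source, quote) => {
            Verdict::NotSupported
        }
        other => other,
    }
}
```

With the source "The bridge opened in 1932", a Supported verdict quoting "the bridge opened in 1932" comes out of this gate as NotSupported: the assessment was right, only the casing of the transcription differed.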

The invariant we MUST preserve

A Supported/Partial verdict must be backed by evidence the code can confirm exists in the
source we fetched this session (the model isn't inventing it from memory). Keep that requirement;
make the confirmation robust to transcription noise. The line between "robust" and "permissive":
the adversarial proptest (a fabricated / disjoint-alphabet quote must still be rejected) stays
green. Anything that keeps it green has not re-opened the hole.

Plan (layered, cheapest → deepest)

  1. Normalization the matcher is missing (free, safe): case-insensitive matching (return the
    original-cased span), unify dashes (‐ ‑ – — − → -), ellipsis (...), strip
    NBSP/zero-width. Flip case_difference_does_not_match. A fabricated span still won't match
    case-folded, so (A) is fixed without touching the guarantee.
  2. Multi-fragment / ellipsis-aware locate: split the quote on ellipsis and require each fragment
    to locate in order within a bounded window (models quote non-contiguous evidence). Each
    fragment is still a real substring → grounded.
  3. Repair turn: on a miss after (1)+(2), one cheap extra turn — "your quote didn't match the
    source verbatim; here is the source again; return the exact shortest substring that backs the
    claim, or NO_SPAN" — re-located deterministically. Separates assessment (turn 1) from
    transcription (turn 2). Bounded to 1 retry.
  4. Model points, code extracts (anchor-extract): model returns a short unique anchor phrase
    (5–10 words); code locates it and expands to the surrounding sentence for display. The model only
    points (easy); code does the exact extraction (grounded). Luis: we'll likely want this
    eventually, within the architectural change (6).
  5. Bounded fuzzy match — guarded, last resort: best-window normalized similarity ≥ a high
    threshold AND the window must contain the claim's load-bearing tokens (dates/numbers/names). Only
    reach for this if 1–4 leave residual false-negatives; gate behind the proptest.
  6. Architectural (decouple at the output): emit (verdict, grounding_status ∈ {located,
    located-fuzzy, unlocated}) instead of collapsing into one verdict. Today's silent
    Supported → NotSupported asserts a different false thing (the source lacks/contradicts the
    evidence) when the truth is "we couldn't verify the transcription." For a system that never
    auto-edits, surface Supported (quote unverified, please confirm) to the human and keep the
    hard located gate only for any auto-accept/auto-edit path. anchor-extract (4) lives here.
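
A rough sketch of layers (1) and (2) together, std-only. Everything here is an assumption for illustration: the names fold, find_folded and locate_fragments, the window measured in folded characters between consecutive fragments, and the exact character sets. NFC is again omitted. The key property is that every returned span is a real slice of the fetched source, in its original casing.

```rust
/// Fold `s` for comparison; each folded char carries the byte offset of
/// the original char it came from, so matches map back to the source.
fn fold(s: &str) -> Vec<(char, usize)> {
    let mut out: Vec<(char, usize)> = Vec::new();
    for (i, c) in s.char_indices() {
        match c {
            // Zero-width characters are dropped.
            '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}' => {}
            // Whitespace runs (NBSP included) collapse to one space.
            c if c.is_whitespace() => {
                if !matches!(out.last(), Some((' ', _))) {
                    out.push((' ', i));
                }
            }
            // Dash family unified to ASCII hyphen-minus.
            '\u{2010}' | '\u{2011}' | '\u{2013}' | '\u{2014}' | '\u{2212}' => out.push(('-', i)),
            // Case-fold; one char may lowercase to several.
            c => out.extend(c.to_lowercase().map(|l| (l, i))),
        }
    }
    out
}

/// First occurrence of `needle` in `hay` at or after `start` (folded indices).
fn find_folded(hay: &[(char, usize)], needle: &[char], start: usize) -> Option<(usize, usize)> {
    if needle.is_empty() || hay.len() < needle.len() {
        return None;
    }
    (start..=hay.len() - needle.len())
        .find(|&i| hay[i..i + needle.len()].iter().map(|p| p.0).eq(needle.iter().copied()))
        .map(|i| (i, i + needle.len()))
}

/// Split the quote on ellipses and require each fragment to locate in order,
/// each starting within `window` folded chars of the previous fragment's end.
/// Returns the original-cased source spans, or None if any fragment misses.
fn locate_fragments<'a>(source: &'a str, quote: &str, window: usize) -> Option<Vec<&'a str>> {
    let hay = fold(source);
    let unified = quote.replace('\u{2026}', "...");
    let mut spans: Vec<&'a str> = Vec::new();
    let mut cursor = 0usize; // folded index just past the previous fragment
    for frag in unified.split("...") {
        let folded: String = fold(frag).into_iter().map(|p| p.0).collect();
        let needle: Vec<char> = folded.trim().chars().collect();
        if needle.is_empty() {
            continue; // leading or trailing ellipsis
        }
        let (i, j) = find_folded(&hay, &needle, cursor)?;
        if !spans.is_empty() && i - cursor > window {
            return None; // fragments too far apart to count as one quote
        }
        let start_b = hay[i].1;
        let last = hay[j - 1].1;
        let end_b = last + source[last..].chars().next().map_or(0, |c| c.len_utf8());
        spans.push(&source[start_b..end_b]);
        cursor = j;
    }
    if spans.is_empty() { None } else { Some(spans) }
}
```

The first occurrence of each fragment is taken greedily, so a production version might need to backtrack when an earlier fragment appears more than once. A quote whose words are not in the source still fails at find_folded, which is the behavior the adversarial proptest is meant to guard.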

Recommended order: ship (1)+(2) now → (3) → architectural (6) carrying (4); (5) only if needed.
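
One possible shape for the decoupled output in (6), as a sketch only. The type and function names are hypothetical, the label wording is illustrative, and how SP42 would actually thread this through VerificationOutcome is not decided here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroundingStatus {
    Located,
    LocatedFuzzy,
    Unlocated,
}

/// The model's verdict and the code's grounding result travel side by side
/// instead of being collapsed into one value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Finding {
    verdict: Verdict,
    grounding: GroundingStatus,
}

fn is_positive(v: Verdict) -> bool {
    matches!(v, Verdict::Supported | Verdict::Partial)
}

/// Human-facing label: the verdict is never rewritten, only annotated
/// when the quote could not be confirmed exactly.
fn reviewer_label(f: Finding) -> String {
    let base = match f.verdict {
        Verdict::Supported => "Supported",
        Verdict::Partial => "Partial",
        Verdict::NotSupported => "Not supported",
        Verdict::SourceUnavailable => "Source unavailable",
    };
    match (is_positive(f.verdict), f.grounding) {
        (true, GroundingStatus::Unlocated) => {
            format!("{} (quote unverified, please confirm)", base)
        }
        (true, GroundingStatus::LocatedFuzzy) => {
            format!("{} (quote matched approximately)", base)
        }
        _ => base.to_string(),
    }
}

/// Hard gate for any auto-accept/auto-edit path: a positive verdict
/// passes only when its quote located exactly.
fn passes_hard_gate(f: Finding) -> bool {
    !is_positive(f.verdict) || f.grounding == GroundingStatus::Located
}
```

The point of the split is that an unlocated quote changes what the reviewer is told about the evidence, not what the model said about the claim.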

Measurement (harness already built)

For each layer, on the 185-case benchmark:

  • (i) false-negatives recovered: SP42 Not supported verdicts that flip to Supported/Partial
    and now agree with GT;
  • (ii) agreement-with-alex and GT-accuracy deltas;
  • (iii) guardrail — the fabricated-quote proptest stays green + a negative-control set gains
    zero spurious Supported.

Plus a ceiling measurement first: emit the pre-gate verdict (already present in
VerificationOutcome.votes, just not surfaced) to count how many Not supported are gate-downgrades
= the maximum (1)–(4) can recover. That tells us if this is a 2-pt or a 10-pt fix before building
much.
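
The ceiling count could look roughly like this. The Row struct is hypothetical (the real pre-gate verdict would come from VerificationOutcome.votes, whose shape is not shown in this issue); the sketch only pins down what "gate-downgrade" and "recoverable" mean for the tally.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// One benchmark case: verdict before the grounding gate, verdict after it,
/// and the human label. Hypothetical shape, for illustration only.
#[derive(Debug, Clone, Copy)]
struct Row {
    pre_gate: Verdict,
    post_gate: Verdict,
    ground_truth: Verdict,
}

#[derive(Debug, PartialEq, Eq)]
struct Ceiling {
    /// Not supported verdicts that were Supported/Partial before the gate.
    downgraded: usize,
    /// Of those, cases where the pre-gate verdict matches ground truth.
    recoverable: usize,
}

fn ceiling(rows: &[Row]) -> Ceiling {
    let mut downgraded = 0;
    let mut recoverable = 0;
    for r in rows {
        let was_positive = matches!(r.pre_gate, Verdict::Supported | Verdict::Partial);
        if was_positive && r.post_gate == Verdict::NotSupported {
            downgraded += 1;
            if r.pre_gate == r.ground_truth {
                recoverable += 1;
            }
        }
    }
    Ceiling { downgraded, recoverable }
}
```

recoverable divided by the number of benchmark cases is the most GT accuracy that layers (1)–(4) could add; downgraded minus recoverable counts cases where the gate happened to land on the right answer or the model was wrong either way.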

Where the code lives

  • Branch impl/citation-verification (worktree ../SP42-impl-citation), local/unpushed.
  • Files: crates/sp42-core/src/citation/{locate_quote.rs, verify.rs, prompts.rs}.
  • Reusable comparison harnesses: /tmp/sp42-vs-alex.cjs, /tmp/sp42-vs-alex-panel.cjs.

Baseline to beat (panel: mistral+granite+gemma, cached sources): SP42 62.0% GT accuracy /
75.1% agreement with alex / source_unavailable 14.

Labels

architecture (Architecture and structural design), bug (Something isn't working)
