TL;DR
The citation-verification grounding gate conflates two different model capabilities:
- (A) transcription — can the model emit a long, character-perfect verbatim quote?
- (B) assessment — can the model correctly judge whether the source supports the claim?
The gate uses success at (A) as a hard precondition for trusting (B): a Supported/Partial
verdict whose quoted span doesn't string-locate in the fetched source is silently downgraded to
Not supported. So failures at (A) — paraphrase, re-casing, non-contiguous spans joined with "…",
dash/ellipsis drift, truncation — masquerade as failures at (B), producing false Not supported
verdicts on citations that are genuinely supported.
We want to give the model room to "fix" (A) — normalize, retry with a shorter span, or point instead
of transcribe — without weakening the anti-hallucination guarantee.
This is the wikiharness→SP42 citation port; the issue is that the port is too conservative, not
that it's wrong. Both stricter-than-reference layers are deliberate anti-fabrication discipline
(ADR-0007); the problem is they over-fire on transcription noise.
Evidence (measured 2026-06-09)
Benchmark: alex-cite-checker's labeled corpus (189 rows — claim + cached source_text + human
ground truth). Method: serve each case's cached source text over localhost so SP42 verifies the
exact labeled bytes (no live-fetch / extraction drift). Compare SP42 to alex's own per-case
verdicts, identical model(s). (alex's benchmark = the reference implementation this was ported from.)
Single model (mistralai/mistral-small-3.2-24b-instruct, the same model both ran):
- SP42-vs-alex exact agreement: 68.5%
- accuracy vs ground truth: SP42 54.4% / alex 55.1% (≈ parity — the port reproduces the
reference's accuracy)
- source_unavailable: SP42 38 / alex 26 (SP42 more aggressive)
Panel (alex's PANEL_FAST = mistral + granite + gemma; SP42's voting is a verbatim port of alex's
voting.js — 4-class plurality + skeptical tiebreaker Partial > NotSupported > SourceUnavailable > Supported):
- SP42-vs-alex exact agreement: 75.1%
- accuracy vs ground truth: SP42 62.0% / alex 66.1%
- source_unavailable: SP42 14 / alex 14 (voting out-votes the single-model over-aggression; the
two converge)
- but the not_supported lean persists. Panel confusion: SP42 said not_supported where alex
said supported ×10 / partial ×12.
Net: at the panel level SP42 trails alex by ~4 pts on alex's own ground truth, almost entirely via
extra Not supported calls.
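The voting rule named above (4-class plurality with a skeptical tiebreaker) can be sketched as follows. This is a minimal illustration, not the voting.js port itself: the `Verdict` enum, `tiebreak_rank`, and `panel_vote` are hypothetical names, and returning `None` for an empty panel is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// Tiebreak rank, lower wins. The order is the one quoted in the text:
/// Partial > NotSupported > SourceUnavailable > Supported.
fn tiebreak_rank(v: Verdict) -> u8 {
    match v {
        Verdict::Partial => 0,
        Verdict::NotSupported => 1,
        Verdict::SourceUnavailable => 2,
        Verdict::Supported => 3,
    }
}

/// Plurality vote over the panel; ties on count go to the more skeptical class.
/// Returns None for an empty panel (an assumption of this sketch).
fn panel_vote(votes: &[Verdict]) -> Option<Verdict> {
    let mut counts: HashMap<Verdict, usize> = HashMap::new();
    for &v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Larger count is "greater"; on equal counts the lower rank is "greater".
        .max_by(|a, b| {
            a.1.cmp(&b.1)
                .then(tiebreak_rank(b.0).cmp(&tiebreak_rank(a.0)))
        })
        .map(|(v, _)| v)
}
```

With a three-model panel, a 1/1/1 split between Supported, NotSupported and Partial resolves to Partial, while a 2/1 majority for Supported stays Supported.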
Root cause — two deliberate layers SP42 has that alex's benchmark does NOT
Not an accidental port bug. Both are the anti-hallucination discipline (ADR-0007), both stricter than
alex's raw benchmark:
- Prompt-level. SP42's verify SYSTEM prompt (crates/sp42-core/src/citation/prompts.rs)
hard-instructs:
"For SUPPORTED or PARTIAL you MUST quote a short, VERBATIM span copied exactly… If you cannot
find such a verbatim span, the verdict is NOT_SUPPORTED."
alex's prompt has no downgrade instruction — it just asks for a quote inside comments
({verdict, comments} vs SP42's {verdict, quote}). So the prompt itself nudges toward
NOT_SUPPORTED. (SP42 ported this faithfully from wikiharness, which had already strengthened
alex's prompt.)
- Pipeline grounding gate. assemble_citation_finding
(crates/sp42-core/src/citation/verify.rs) string-locates the model's quote via locate_quote
and downgrades Supported/Partial → Not supported if it doesn't locate. alex's benchmark
records the raw model verdict (run_benchmark.js: predicted_verdict: result.verdict;
parsing.js does zero locate).
locate_quote (crates/sp42-core/src/citation/locate_quote.rs) today: contiguous substring,
CASE-SENSITIVE, normalizing only NFC + whitespace-run collapse + curly→straight quotes. There
is even a test, case_difference_does_not_match, that deliberately encodes the conflation
("a model that re-cases a quote gets no free pass") — but re-casing is a transcription artifact,
not a fabrication.
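To make the conflation concrete, here is a minimal sketch of the gate behaviour described above. It is not the code in verify.rs or locate_quote.rs: `normalize_strict`, `quote_locates`, and `gate` are hypothetical names, and NFC normalization is left out because the standard library has no normalizer.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// Strict normalization as described in the text: curly quotes are
/// straightened and whitespace runs collapsed. Case is left untouched.
/// (NFC is omitted in this sketch.)
fn normalize_strict(s: &str) -> String {
    let straightened: String = s
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'',
            '\u{201C}' | '\u{201D}' => '"',
            other => other,
        })
        .collect();
    straightened.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Contiguous, case-sensitive substring check on the normalized text.
fn quote_locates(source: &str, quote: &str) -> bool {
    let q = normalize_strict(quote);
    !q.is_empty() && normalize_strict(source).contains(&q)
}

/// The gate: a positive verdict whose quote does not locate is downgraded.
fn gate(model_verdict: Verdict, quote: &str, source: &str) -> Verdict {
    match model_verdict {
        Verdict::Supported | Verdict::Partial if !quote_locates(source, quote) => {
            Verdict::NotSupported
        }
        other => other,
    }
}
```

Under this sketch, a Supported verdict quoting "the bridge opened in 1932" against a source that reads "The bridge opened in 1932" comes out as NotSupported: the assessment was right, and only the casing of the transcription differed.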
The invariant we MUST preserve
A Supported/Partial verdict must be backed by evidence the code can confirm exists in the
source we fetched this session (the model isn't inventing it from memory). Keep that requirement;
make the confirmation robust to transcription noise. The line between "robust" and "permissive":
the adversarial proptest (a fabricated / disjoint-alphabet quote must still be rejected) stays
green. Anything that keeps it green has not re-opened the hole.
Plan (layered, cheapest → deepest)
- (1) Normalization the matcher is missing (free, safe): case-insensitive matching (return the
original-cased span), unify dashes (‐ ‑ – — − → -), ellipsis (… → ...), strip
NBSP/zero-width. Flip case_difference_does_not_match. A fabricated span still won't match
case-folded, so (A) is fixed without touching the guarantee.
- (2) Multi-fragment / ellipsis-aware locate: split the quote on ellipsis and require each fragment
to locate in order within a bounded window (models quote non-contiguous evidence). Each
fragment is still a real substring → grounded.
- (3) Repair turn: on a miss after (1)+(2), one cheap extra turn ("your quote didn't match the
source verbatim; here is the source again; return the exact shortest substring that backs the
claim, or NO_SPAN"), re-located deterministically. Separates assessment (turn 1) from
transcription (turn 2). Bounded to 1 retry.
- (4) Model points, code extracts (anchor-extract): model returns a short unique anchor phrase
(5–10 words); code locates it and expands to the surrounding sentence for display. The model only
points (easy); code does the exact extraction (grounded). Luis: we'll likely want this
eventually, within the architectural change (6).
- (5) Bounded fuzzy match (guarded, last resort): best-window normalized similarity ≥ a high
threshold AND the window must contain the claim's load-bearing tokens (dates/numbers/names). Only
reach for this if (1)–(4) leave residual false-negatives; gate behind the proptest.
- (6) Architectural, decouple at the output: emit
(verdict, grounding_status ∈ {located, located-fuzzy, unlocated}) instead of collapsing into one
verdict. Today's silent Supported → NotSupported asserts a different false thing (the source
lacks/contradicts the evidence) when the truth is "we couldn't verify the transcription." For a
system that never auto-edits, surface Supported (quote unverified, please confirm) to the human
and keep the hard located gate only for any auto-accept/auto-edit path. anchor-extract (4) lives
here.
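Layers (1) and (2) can be sketched together as below. This is an illustration under stated assumptions, not the planned locate_quote change: `normalize_lenient` and `locate_fragments` are hypothetical names, `max_gap` is a hypothetical parameter measured in bytes of normalized text, NFC is omitted (no normalizer in the standard library), and the function returns only yes/no rather than mapping back to the original-cased span.

```rust
/// Lenient normalization: case-fold, unify dashes, expand the ellipsis
/// character, straighten quotes, drop zero-width characters, collapse whitespace.
fn normalize_lenient(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Zero-width characters are dropped.
            '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}' => {}
            // NBSP becomes a plain space so adjacent words do not fuse.
            '\u{00A0}' => out.push(' '),
            // Hyphen, non-breaking hyphen, en dash, em dash, minus sign.
            '\u{2010}' | '\u{2011}' | '\u{2013}' | '\u{2014}' | '\u{2212}' => out.push('-'),
            '\u{2026}' => out.push_str("..."),
            '\u{2018}' | '\u{2019}' => out.push('\''),
            '\u{201C}' | '\u{201D}' => out.push('"'),
            other => out.extend(other.to_lowercase()),
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split the quote on "..." and require every fragment to appear in the
/// source, in order, each starting within `max_gap` bytes of the end of
/// the previous one. The search is greedy (first occurrence only), so it
/// can miss a match that backtracking would find, but it never accepts a
/// fragment that is absent from the source.
fn locate_fragments(source: &str, quote: &str, max_gap: usize) -> bool {
    let src = normalize_lenient(source);
    let q = normalize_lenient(quote);
    let fragments: Vec<&str> = q
        .split("...")
        .map(str::trim)
        .filter(|f| !f.is_empty())
        .collect();
    if fragments.is_empty() {
        return false;
    }
    let mut cursor = 0usize; // byte offset just past the previous fragment
    for (i, &frag) in fragments.iter().enumerate() {
        let rel = match src[cursor..].find(frag) {
            Some(r) => r,
            None => return false,
        };
        if i > 0 && rel > max_gap {
            return false;
        }
        cursor += rel + frag.len();
    }
    true
}
```

A production version would probably also want a minimum fragment length, so that a one-word fragment cannot ground a verdict on its own; that threshold is left out here.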
Recommended order: ship (1)+(2) now → (3) → architectural (6) carrying (4); (5) only if needed.
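The decoupled output in (6) might look roughly like the sketch below. The type and method names (`GroundingStatus`, `Finding`, `auto_accept_ok`, `review_label`) and the label wording are hypothetical, and treating Partial the same as Supported for auto-accept is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroundingStatus {
    Located,
    LocatedFuzzy,
    Unlocated,
}

/// The model's verdict and the code's grounding result, kept side by side
/// instead of being collapsed into one verdict.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Finding {
    verdict: Verdict,
    grounding: GroundingStatus,
}

impl Finding {
    fn is_positive(&self) -> bool {
        matches!(self.verdict, Verdict::Supported | Verdict::Partial)
    }

    /// Hard gate: only an exactly located positive verdict may feed an
    /// auto-accept / auto-edit path.
    fn auto_accept_ok(&self) -> bool {
        self.is_positive() && self.grounding == GroundingStatus::Located
    }

    /// Label for a human reviewer; the verdict is kept and the grounding
    /// caveat is appended rather than rewriting the verdict.
    fn review_label(&self) -> String {
        let base = match self.verdict {
            Verdict::Supported => "Supported",
            Verdict::Partial => "Partial",
            Verdict::NotSupported => "Not supported",
            Verdict::SourceUnavailable => "Source unavailable",
        };
        match (self.is_positive(), self.grounding) {
            (true, GroundingStatus::Unlocated) => {
                format!("{base} (quote unverified, please confirm)")
            }
            (true, GroundingStatus::LocatedFuzzy) => {
                format!("{base} (quote matched approximately)")
            }
            _ => base.to_string(),
        }
    }
}
```

The point of the split is that the reviewer still sees the model's assessment, while anything automated keeps depending on `auto_accept_ok` alone.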
Measurement (harness already built)
For each layer, on the 185-case benchmark:
- (i) false-negatives recovered: SP42 Not supported that flip to Supported/Partial and
now agree with GT;
- (ii) agreement-with-alex and GT-accuracy deltas;
- (iii) guardrail — the fabricated-quote proptest stays green + a negative-control set gains
zero spurious Supported.
Plus a ceiling measurement first: emit the pre-gate verdict (already present in
VerificationOutcome.votes, just not surfaced) and count how many Not supported verdicts are
gate-downgrades. That count is the maximum (1)–(4) can recover, and it tells us whether this is a
2-pt or a 10-pt fix before building much.
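The ceiling count could be computed along these lines. This is a sketch with assumed shapes: `Case`, `Ceiling`, and their fields are hypothetical, and each case is reduced to a single pre-gate verdict, whereas the real VerificationOutcome.votes may hold one vote per panel model.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Supported,
    Partial,
    NotSupported,
    SourceUnavailable,
}

/// One benchmark case (hypothetical shape).
struct Case {
    pre_gate: Verdict,     // what the model said before the grounding gate
    post_gate: Verdict,    // what the pipeline reported
    ground_truth: Verdict, // human label
}

#[derive(Debug, Default, PartialEq, Eq)]
struct Ceiling {
    /// Final Not supported verdicts that were positive before the gate.
    gate_downgrades: usize,
    /// Of those, the ones whose pre-gate verdict equals the ground truth.
    downgrades_matching_gt: usize,
}

impl Ceiling {
    /// Upper bound on the GT-accuracy gain, in percentage points.
    fn max_gain_pts(&self, total_cases: usize) -> f64 {
        if total_cases == 0 {
            0.0
        } else {
            100.0 * self.downgrades_matching_gt as f64 / total_cases as f64
        }
    }
}

fn ceiling(cases: &[Case]) -> Ceiling {
    let mut c = Ceiling::default();
    for case in cases {
        let was_positive = matches!(case.pre_gate, Verdict::Supported | Verdict::Partial);
        if was_positive && case.post_gate == Verdict::NotSupported {
            c.gate_downgrades += 1;
            if case.pre_gate == case.ground_truth {
                c.downgrades_matching_gt += 1;
            }
        }
    }
    c
}
```

Only `downgrades_matching_gt` translates into accuracy: a downgrade whose pre-gate verdict disagreed with the ground truth would not help even if the gate let it through.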
Where the code lives
- Branch: impl/citation-verification (worktree ../SP42-impl-citation), local/unpushed.
- Files: crates/sp42-core/src/citation/{locate_quote.rs, verify.rs, prompts.rs}.
- Reusable comparison harnesses: /tmp/sp42-vs-alex.cjs, /tmp/sp42-vs-alex-panel.cjs.
Baseline to beat (panel: mistral+granite+gemma, cached sources): SP42 62.0% GT accuracy /
75.1% agreement with alex / source_unavailable 14.