Luis's port of wikiharness code is too conservative on non-hallucinated quotes

## TL;DR

The citation-verification **grounding gate conflates two different model capabilities**:

- **(A) transcription** — can the model emit a long, *character-perfect* verbatim quote?
- **(B) assessment** — can the model correctly judge whether the source supports the claim?

The gate uses success at (A) as a *hard precondition* for trusting (B): a `Supported`/`Partial`
verdict whose quoted span doesn't string-locate in the fetched source is silently **downgraded to
`Not supported`**. So failures at (A) — paraphrase, re-casing, non-contiguous spans joined with "…",
dash/ellipsis drift, truncation — **masquerade as failures at (B)**, producing false `Not supported`
verdicts on citations that are genuinely supported.

We want to give the model room to "fix" (A) — normalize, retry with a shorter span, or point instead
of transcribe — **without weakening the anti-hallucination guarantee**.

This is the wikiharness→SP42 citation port; the issue is that the port is *too conservative*, not
that it's wrong. Both stricter-than-reference layers are deliberate anti-fabrication discipline
(ADR-0007); the problem is they over-fire on transcription noise.

## Evidence (measured 2026-06-09)

Benchmark: alex-cite-checker's labeled corpus (189 rows — claim + cached `source_text` + human
ground truth). Method: serve each case's **cached** source text over localhost so SP42 verifies the
*exact labeled bytes* (no live-fetch / extraction drift). Compare SP42 to alex's **own** per-case
verdicts, identical model(s). (alex's benchmark = the reference implementation this was ported from.)

**Single model** (`mistralai/mistral-small-3.2-24b-instruct`, the same model both ran):
- SP42-vs-alex exact agreement: **68.5%**
- accuracy vs ground truth: **SP42 54.4% / alex 55.1%** (≈ parity — the port reproduces the
  reference's *accuracy*)
- `source_unavailable`: **SP42 38 / alex 26** (SP42 more aggressive)

**Panel** (alex's PANEL_FAST = mistral + granite + gemma; SP42's voting is a verbatim port of alex's
`voting.js` — 4-class plurality + skeptical tiebreaker `Partial > NotSupported > SourceUnavailable >
Supported`):
- SP42-vs-alex exact agreement: **75.1%**
- accuracy vs ground truth: **SP42 62.0% / alex 66.1%**
- `source_unavailable`: **SP42 14 / alex 14** (voting out-votes the single-model over-aggression —
  converged)
- **but the `not_supported` lean persists** — panel confusion: SP42 said `not_supported` where alex
  said `supported` ×10 / `partial` ×12.

Net: at the panel level SP42 trails alex by ~4 pts on alex's own ground truth, almost entirely via
*extra* `Not supported` calls.

## Root cause — two deliberate layers SP42 has that alex's benchmark does NOT

Not an accidental port bug. Both are the anti-hallucination discipline (ADR-0007), both stricter than
alex's raw benchmark:

1. **Prompt-level.** SP42's verify `SYSTEM` prompt (`crates/sp42-core/src/citation/prompts.rs`)
   hard-instructs:
   > "For SUPPORTED or PARTIAL you MUST quote a short, VERBATIM span copied exactly… If you cannot
   > find such a verbatim span, the verdict is NOT_SUPPORTED."

   alex's prompt has **no downgrade instruction** — it just asks for a quote inside `comments`
   (`{verdict, comments}` vs SP42's `{verdict, quote}`). So the prompt itself nudges toward
   `NOT_SUPPORTED`. (SP42 ported this faithfully from wikiharness, which had already strengthened
   alex's prompt.)

2. **Pipeline grounding gate.** `assemble_citation_finding`
   (`crates/sp42-core/src/citation/verify.rs`) string-locates the model's quote via `locate_quote`
   and **downgrades `Supported`/`Partial` → `Not supported` if it doesn't locate**. alex's benchmark
   records the *raw* model verdict (`run_benchmark.js`: `predicted_verdict: result.verdict`;
   `parsing.js` does zero locate).

   `locate_quote` (`crates/sp42-core/src/citation/locate_quote.rs`) today: **contiguous substring,
   CASE-SENSITIVE**, normalizing only NFC + whitespace-run collapse + curly→straight quotes. There
   is even a test, `case_difference_does_not_match`, that *deliberately encodes the conflation*
   ("a model that re-cases a quote gets no free pass") — but re-casing is a transcription artifact,
   not a fabrication.

## The invariant we MUST preserve

A `Supported`/`Partial` verdict must be backed by evidence the **code can confirm exists in the
source we fetched this session** (the model isn't inventing it from memory). Keep that requirement;
make the *confirmation* robust to transcription noise. The line between "robust" and "permissive":
the adversarial proptest (a fabricated / disjoint-alphabet quote must still be rejected) **stays
green**. Anything that keeps it green has not re-opened the hole.

## Plan (layered, cheapest → deepest)

1. **Normalization the matcher is missing** (free, safe): case-insensitive matching (return the
   original-cased span), unify dashes (`‐ ‑ – — −` → `-`), ellipsis (`…` → `...`), strip
   NBSP/zero-width. Flip `case_difference_does_not_match`. A fabricated span still won't match
   case-folded, so (A) is fixed without touching the guarantee.
2. **Multi-fragment / ellipsis-aware locate**: split the quote on ellipsis and require each fragment
   to locate *in order within a bounded window* (models quote non-contiguous evidence). Each
   fragment is still a real substring → grounded.
3. **Repair turn**: on a miss after (1)+(2), one cheap extra turn — "your quote didn't match the
   source verbatim; here is the source again; return the exact *shortest* substring that backs the
   claim, or NO_SPAN" — re-located deterministically. Separates assessment (turn 1) from
   transcription (turn 2). Bounded to 1 retry.
4. **Model points, code extracts (anchor-extract)**: model returns a short *unique anchor phrase*
   (5–10 words); code locates it and expands to the surrounding sentence for display. The model only
   *points* (easy); code does the exact extraction (grounded). _Luis: we'll likely want this
   eventually, within the architectural change (6)._
5. **Bounded fuzzy match — guarded, last resort**: best-window normalized similarity ≥ a *high*
   threshold AND the window must contain the claim's load-bearing tokens (dates/numbers/names). Only
   reach for this if 1–4 leave residual false-negatives; gate behind the proptest.
6. **Architectural — decouple at the output**: emit `(verdict, grounding_status ∈ {located,
   located-fuzzy, unlocated})` instead of collapsing into one verdict. Today's silent
   `Supported → NotSupported` asserts a *different false thing* (the source lacks/contradicts the
   evidence) when the truth is "we couldn't verify the transcription." For a system that **never
   auto-edits**, surface `Supported (quote unverified — please confirm)` to the human and keep the
   hard `located` gate only for any auto-accept/auto-edit path. anchor-extract (4) lives here.

**Recommended order:** ship (1)+(2) now → (3) → architectural (6) carrying (4); (5) only if needed.

## Measurement (harness already built)

For each layer, on the 185-case benchmark:
- **(i)** false-negatives recovered — SP42 `Not supported` that flip to `Supported`/`Partial` *and
  now agree with GT*;
- **(ii)** agreement-with-alex and GT-accuracy deltas;
- **(iii) guardrail** — the fabricated-quote proptest stays green + a negative-control set gains
  **zero** spurious `Supported`.

Plus a **ceiling** measurement first: emit the *pre-gate* verdict (already present in
`VerificationOutcome.votes`, just not surfaced) to count how many `Not supported` are gate-downgrades
= the maximum (1)–(4) can recover. That tells us if this is a 2-pt or a 10-pt fix before building
much.

## Where the code lives

- Branch `impl/citation-verification` (worktree `../SP42-impl-citation`), local/unpushed.
- Files: `crates/sp42-core/src/citation/{locate_quote.rs, verify.rs, prompts.rs}`.
- Reusable comparison harnesses: `/tmp/sp42-vs-alex.cjs`, `/tmp/sp42-vs-alex-panel.cjs`.

**Baseline to beat** (panel: mistral+granite+gemma, cached sources): SP42 **62.0%** GT accuracy /
**75.1%** agreement with alex / `source_unavailable` 14.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Luis's port of wikiharness code is too conservative on non-hallucinated quotes #25

TL;DR

Evidence (measured 2026-06-09)

Root cause — two deliberate layers SP42 has that alex's benchmark does NOT

The invariant we MUST preserve

Plan (layered, cheapest → deepest)

Measurement (harness already built)

Where the code lives

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Luis's port of wikiharness code is too conservative on non-hallucinated quotes #25

Description

TL;DR

Evidence (measured 2026-06-09)

Root cause — two deliberate layers SP42 has that alex's benchmark does NOT

The invariant we MUST preserve

Plan (layered, cheapest → deepest)

Measurement (harness already built)

Where the code lives

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions