v1.2.0 — catch-rate +15pp, SCoRe leak caught, X1-X4 + ID-anchored

@rpatrik96 rpatrik96 released this 30 May 09:56
· 41 commits to main since this release
1a62981

Headline (HALLMARK v1.0 corrected gold, apples-to-apples vs v1.1.0)

| Metric | v1.1.0 | v1.2.0 | Δ |
|---|---|---|---|
| dev_public FPR | 1.59% | 1.99% | +2 FPs (documented) |
| test_public FPR (held-out) | 2.32% | 2.32% | unchanged |
| dev_public caught-on-hallucinated | 60.4% | 75.2% | +14.8pp |
| test_public caught-on-hallucinated | 58.0% | 73.7% | +15.7pp |
| dev_public leak | 0.65% (4) | 0.65% (4) | unchanged |
| test_public leak | 0.76% (4) | 0.57% (3) | −1 leak (SCoRe caught) |

The +14.8pp / +15.7pp catch-rate improvements come from the new ID-anchored field-mismatch helper (X3) + relaxed-author retrieval fallback (X4) unblocking ~110 entries the v1.1.0 cascade abstained on. The SCoRe leak (entry claims NeurIPS, real venue is ICLR 2021) is caught by the new cross-source venue verification (X1).

This is a minor release per semver — four new behavioral capabilities, not bugfixes.

What's new

X1 — Cross-source venue verification

fact_checker.py:_detect_cross_source_venue_mismatch. When ≥2 order-reliable sources contributed candidate records and agree on a canonical venue that differs from the entry's canonical venue, downgrade the venue outcome to MISMATCH. The venue analogue of the existing _detect_author_fabrication. Catches the SCoRe wrong-venue leak class (the v1.1.0 cheap_fix target).
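
The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in fact_checker.py: the `Candidate` dataclass, its field names, and the function name `cross_source_venue_mismatch` are hypothetical, and the sketch assumes that "agree" means every contributing order-reliable source reports the same canonical venue.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record; the real record type may differ."""
    source: str                      # e.g. "crossref", "dblp"
    canonical_venue: Optional[str]   # venue after canonicalization, if known
    order_reliable: bool             # whether the source is order-reliable


def cross_source_venue_mismatch(
    entry_venue: Optional[str],
    candidates: Sequence[Candidate],
    min_sources: int = 2,
) -> bool:
    """Return True when enough order-reliable sources agree on a venue
    that differs from the entry's canonical venue."""
    if not entry_venue:
        return False

    # Keep one venue per distinct order-reliable source.
    venue_by_source: Dict[str, str] = {}
    for cand in candidates:
        if cand.order_reliable and cand.canonical_venue:
            venue_by_source.setdefault(cand.source, cand.canonical_venue)

    if len(venue_by_source) < min_sources:
        return False  # not enough independent evidence

    agreed = set(venue_by_source.values())
    if len(agreed) != 1:
        return False  # sources disagree with each other: no downgrade

    return next(iter(agreed)) != entry_venue
```

With an entry claiming NeurIPS and two order-reliable sources both reporting ICLR, this sketch returns `True`, which corresponds to downgrading the venue outcome to MISMATCH.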

X2 — arXiv DataCite DOI extraction

fact_checker.py:_arxiv_id_from_entry now mines entry["doi"] for 10.48550/arXiv.<id> (case-insensitive, version-stripping). The rest of the arXiv-ID-anchored machinery is unchanged. Unblocks HALLMARK's 2026-synthetic batch (50 entries) where the only ID was an arXiv DataCite DOI.
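
As an illustration of the DOI mining described above, the following sketch extracts an arXiv ID from a DataCite-style DOI. The regular expression and the helper name `arxiv_id_from_doi` are assumptions made for this example; the actual pattern used by `_arxiv_id_from_entry` may differ.

```python
import re
from typing import Optional

# Assumed pattern: "10.48550/arXiv.<id>" with an optional version suffix.
# Covers new-style IDs (2103.01234) and old-style IDs (hep-th/9901001).
_ARXIV_DOI_RE = re.compile(
    r"10\.48550/arxiv\."
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[a-z]{2})?/\d{7})"
    r"(?:v\d+)?\s*$",
    re.IGNORECASE,
)


def arxiv_id_from_doi(doi: Optional[str]) -> Optional[str]:
    """Return the version-stripped arXiv ID embedded in a DataCite DOI,
    or None if the DOI is missing or is not an arXiv DOI."""
    if not doi:
        return None
    match = _ARXIV_DOI_RE.search(doi.strip())
    return match.group("id") if match else None


print(arxiv_id_from_doi("10.48550/arXiv.2103.01234v2"))  # 2103.01234
print(arxiv_id_from_doi("10.1145/3292500.3330701"))      # None
```

Using `search` rather than `match` lets the same helper accept a DOI written as a full `https://doi.org/...` URL.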

X3 — ID-anchored venue/year mismatch

fact_checker.py:_id_anchored_field_mismatch. Fires when (a) entry DOI resolves via Crossref, (b) the DOI record's title confirms the entry, AND (c) compare_venue returns a hard MISMATCH OR compare_year returns a hard MISMATCH beyond tolerance (gated against preprint-twin records via the existing _doi_is_preprint helper). Emits VENUE_MISMATCH / YEAR_MISMATCH on DOI-confirmed entries. The field analogue of the existing author-only helper. Drives the bulk of the catch-rate increase (~70-90 entries in the dominant HALL-CNV cluster E).
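
The gating conditions (a) to (c) can be sketched as below. The `Outcome` enum, the boolean parameters, and the function name are hypothetical stand-ins for the real comparison results. The sketch also assumes that the preprint gate suppresses both venue and year flags, which the description above does not state explicitly.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    """Hypothetical result type for compare_venue / compare_year."""
    MATCH = "match"
    MISMATCH = "mismatch"   # hard mismatch (for year: beyond tolerance)
    UNKNOWN = "unknown"


def id_anchored_field_mismatch(
    doi_resolved: bool,
    title_confirmed: bool,
    venue_outcome: Outcome,
    year_outcome: Outcome,
    doi_is_preprint: bool,
) -> List[str]:
    """Return the mismatch labels to emit for a DOI-confirmed entry."""
    # (a) and (b): the DOI must resolve and its title must confirm the entry.
    if not (doi_resolved and title_confirmed):
        return []

    # Assumed gate: a preprint-twin DOI record legitimately differs in
    # venue/year from the published version, so no flag is emitted.
    if doi_is_preprint:
        return []

    # (c): either hard mismatch is sufficient.
    labels: List[str] = []
    if venue_outcome is Outcome.MISMATCH:
        labels.append("VENUE_MISMATCH")
    if year_outcome is Outcome.MISMATCH:
        labels.append("YEAR_MISMATCH")
    return labels
```

An empty list means the helper stays silent and the entry's outcome is left to the rest of the cascade.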

X4 — Relaxed-author retrieval fallback

fact_checker.py:_query_cascade. When the standard cascade returns zero candidates (or all candidates below abstention_below), retry Crossref and OpenAlex with the raw title (no first-author constraint). The transition is never not_found → VERIFIED; the realistic transition is not_found → AUTHOR_MISMATCH (the cascade now finds a wrong-paper candidate whose authors disagree).
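
A rough sketch of the fallback control flow follows. The `Candidate` dataclass, the `SearchFn` callable signature, and the function name are hypothetical; each search function stands in for a Crossref or OpenAlex client, and no real API call is made here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record with a match score in [0, 1]."""
    source: str
    score: float


# A search function takes (title, first_author); first_author=None means
# "query by raw title only".
SearchFn = Callable[[str, Optional[str]], List[Candidate]]


def query_with_relaxed_fallback(
    title: str,
    first_author: Optional[str],
    searchers: Sequence[SearchFn],
    abstention_below: float,
) -> List[Candidate]:
    """Run the standard author-constrained queries, then retry title-only
    if nothing usable came back."""
    standard = [c for search in searchers for c in search(title, first_author)]

    # Usable means at least one candidate at or above the abstention threshold.
    if any(c.score >= abstention_below for c in standard):
        return standard

    # Fallback: drop the first-author constraint and retry with the raw title.
    relaxed = [c for search in searchers for c in search(title, None)]
    return standard + relaxed
```

Candidates found by the fallback still pass through the usual author comparison, which is why a fabricated entry tends to surface as AUTHOR_MISMATCH rather than VERIFIED.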

Regression fix (commit 1e37f7c)

The combination of X2 + the existing given-name audit machinery could let an arXiv-API record win selection over a structured Crossref/DBLP candidate, skipping the audit on Least-to-Most-shape entries. Fixed by adding an _ORDER_RELIABLE_PREFERENCE_BAND = 0.02 constant to _select_best_candidate: inside a 0.02 score sub-band, order-reliable structured records win selection.
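
The preference band can be illustrated with the sketch below. Only the 0.02 band width comes from the release notes; the `Candidate` dataclass, the function body, and the choice of an inclusive band measured from the top score are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

ORDER_RELIABLE_PREFERENCE_BAND = 0.02  # band width stated in the release notes


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record."""
    source: str
    score: float
    order_reliable: bool


def select_best_candidate(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the top-scoring candidate, but let an order-reliable record win
    when it scores within the preference band of the top score."""
    if not candidates:
        return None

    top_score = max(c.score for c in candidates)
    in_band = [
        c for c in candidates
        if top_score - c.score <= ORDER_RELIABLE_PREFERENCE_BAND
    ]

    # Prefer structured, order-reliable records inside the band; otherwise
    # fall back to plain best-score selection.
    reliable = [c for c in in_band if c.order_reliable]
    pool = reliable or in_band
    return max(pool, key=lambda c: c.score)
```

For example, an arXiv-API record at 0.95 and a Crossref record at 0.94 would select the Crossref record, while a Crossref record at 0.90 would fall outside the band and lose to the arXiv record.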

Known regressions (documented; flagged for v1.2.1 triage)

3 new dev FPs and 1 new test FP introduced by the catch-rate work:

  • ed071a6dfa34 (Improving Robustness using Generated Data): verified → arxiv_id_mismatch via X2.
  • e59d381d98e6 (2026-synthetic VALID): verified → given_name_substitution via X2 + 1e37f7c interaction.
  • f185501f556e + d07ee00b0c0f (Beyond log2(T) + 𝒩-WL): not_found → partial_match via X4 fallback retrieving wrong-paper candidates.

Offsetting clears: adf6c58262bf (Community Concealment) and e51140f8b514 (RLang) — both v1.1.0 FPs now correctly verified.

Residual leaks (transparency)

4 policy-adjusted residual leaks (down from 5 in v1.1.0):

  • 3 letter-add near_miss_title (Privacys, Explanations, Models) — --strict catches all 3
  • 1 author-list truncation (OSAKA) — --strict catches as AUTHOR_TRUNCATED

Plus 3 hyphen-only differences explicitly not counted as leaks by default (hyphenation is bibliographic noise; --strict still catches via Levenshtein-1).
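
To make the Levenshtein-1 idea concrete, here is a small sketch of an edit-distance-one check. It is not the `--strict` implementation: the function name is hypothetical, and any normalization the real rule applies before comparing titles (case folding, whitespace handling) is not modeled.

```python
def is_edit_distance_one(a: str, b: str) -> bool:
    """Return True when a and b differ by exactly one insertion,
    deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False

    # Make `a` the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a

    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1

    if len(a) == len(b):
        # Same length: one substitution at position i.
        return a[i + 1:] == b[i + 1:]
    # Lengths differ by one: b has one extra character at position i.
    return a[i:] == b[i + 1:]


print(is_edit_distance_one("Privacy", "Privacys"))                # True
print(is_edit_distance_one("self-supervised", "selfsupervised"))  # True
print(is_edit_distance_one("Models", "Models"))                   # False
```

Both the letter-add cases and the hyphen-only cases are single-character edits, which is why one distance-1 rule covers them.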

See docs/KNOWN_LEAKS.md for the per-leak enumeration with --strict rule mapping.

Tests

1088 → 1122 passing (+30 fix tests + 4 regression-fix tests; new modules test_cross_source_venue.py, test_arxiv_datacite_doi.py, test_id_anchored_field_mismatch.py, test_relaxed_author_fallback.py).

See CHANGELOG.md for the full per-rule breakdown.