v1.2.0 — catch-rate +15pp, SCoRe leak caught, X1-X4 + ID-anchored
Headline (HALLMARK v1.0 corrected gold, apples-to-apples vs v1.1.0)
| Metric | v1.1.0 | v1.2.0 | Δ |
|---|---|---|---|
| dev_public FPR | 1.59% | 1.99% | +2 FPs (documented) |
| test_public FPR (held-out) | 2.32% | 2.32% | unchanged |
| dev_public caught-on-hallucinated | 60.4% | 75.2% | +14.8pp |
| test_public caught-on-hallucinated | 58.0% | 73.7% | +15.7pp |
| dev_public leak | 0.65% (4) | 0.65% (4) | unchanged |
| test_public leak | 0.76% (4) | 0.57% (3) | −1 leak (SCoRe caught) |
The +14.8pp / +15.7pp catch-rate improvements come from the new ID-anchored field-mismatch helper (X3) + relaxed-author retrieval fallback (X4) unblocking ~110 entries the v1.1.0 cascade abstained on. The SCoRe leak (entry claims NeurIPS, real venue is ICLR 2021) is caught by the new cross-source venue verification (X1).
This is a minor release per semver — four new behavioral capabilities, not bugfixes.
What's new
X1 — Cross-source venue verification
fact_checker.py:_detect_cross_source_venue_mismatch. When ≥2 order-reliable sources contributed candidate records and agree on a canonical venue that differs from the entry's canonical venue, downgrade the venue outcome to MISMATCH. The venue analogue of the existing _detect_author_fabrication. Catches the SCoRe wrong-venue leak class (the v1.1.0 cheap_fix target).
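The X1 rule can be sketched roughly as follows. This is a minimal illustration, not the code in fact_checker.py: the function names, the alias table, and the dictionary-of-sources input shape are all assumptions made for the example.

```python
import re
from typing import Dict

# Hypothetical alias table; the real canonicalizer is not shown in these notes.
_VENUE_ALIASES = {
    "nips": "neurips",
    "advances in neural information processing systems": "neurips",
    "international conference on learning representations": "iclr",
}


def canonical_venue(raw: str) -> str:
    """Lowercase, drop digits/punctuation, then map known long forms to a short key."""
    text = re.sub(r"[^a-z ]+", " ", raw.lower())
    text = " ".join(text.split())
    return _VENUE_ALIASES.get(text, text)


def cross_source_venue_mismatch(
    entry_venue: str,
    source_venues: Dict[str, str],
    min_sources: int = 2,
) -> bool:
    """Return True when >= min_sources sources agree on a venue that differs
    from the entry's venue. `source_venues` maps a source name (assumed to be
    order-reliable already) to the venue string it reported."""
    entry_canon = canonical_venue(entry_venue)
    if not entry_canon:
        return False  # nothing to contradict

    reported = {}
    for source, venue in source_venues.items():
        canon = canonical_venue(venue or "")
        if canon:
            reported[source] = canon

    if len(reported) < min_sources:
        return False  # not enough independent evidence
    agreed = set(reported.values())
    if len(agreed) != 1:
        return False  # sources disagree with each other, so stay silent
    return agreed.pop() != entry_canon
```

With the SCoRe-shaped input (entry claims NeurIPS, two sources report ICLR), the function returns True; with a single source or disagreeing sources it returns False.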
X2 — arXiv DataCite DOI extraction
fact_checker.py:_arxiv_id_from_entry now mines entry["doi"] for 10.48550/arXiv.<id> (case-insensitive, version-stripping). The rest of the arXiv-ID-anchored machinery is unchanged. Unblocks HALLMARK's 2026-synthetic batch (50 entries) where the only ID was an arXiv DataCite DOI.
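A sketch of the DOI-mining step, assuming the arXiv ID follows either the modern `YYMM.NNNNN` shape or the older `archive/NNNNNNN` shape. The function name and the exact regular expression are illustrative and may differ from what _arxiv_id_from_entry does.

```python
import re
from typing import Optional

# Case-insensitive match on the arXiv DataCite prefix, with an optional
# trailing version suffix (v1, v2, ...) that is excluded from the captured ID.
_ARXIV_DOI_RE = re.compile(
    r"^10\.48550/arxiv\."
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[a-z]{2})?/\d{7})"
    r"(?:v\d+)?$",
    re.IGNORECASE,
)


def arxiv_id_from_doi(doi: Optional[str]) -> Optional[str]:
    """Return the versionless arXiv ID embedded in an arXiv DataCite DOI, else None."""
    if not doi:
        return None
    match = _ARXIV_DOI_RE.match(doi.strip())
    return match.group("id") if match else None


print(arxiv_id_from_doi("10.48550/arXiv.2103.01234v2"))  # 2103.01234
print(arxiv_id_from_doi("10.1145/3292500.3330701"))      # None
```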
X3 — ID-anchored venue/year mismatch
fact_checker.py:_id_anchored_field_mismatch. Fires when (a) entry DOI resolves via Crossref, (b) the DOI record's title confirms the entry, AND (c) compare_venue returns a hard MISMATCH OR compare_year returns a hard MISMATCH beyond tolerance (gated against preprint-twin records via the existing _doi_is_preprint helper). Emits VENUE_MISMATCH / YEAR_MISMATCH on DOI-confirmed entries. The field analogue of the existing author-only helper. Drives the bulk of the catch-rate increase (~70-90 entries in the dominant HALL-CNV cluster E).
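The three conditions above compose as a short guard chain. The sketch below uses a hypothetical `DoiRecord` container in place of the real Crossref record and comparison helpers. The one-year tolerance and the choice to apply the preprint gate to both fields are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DoiRecord:
    """Hypothetical summary of a resolved DOI record."""
    title_confirms_entry: bool   # condition (b)
    venue_hard_mismatch: bool    # stands in for compare_venue's hard MISMATCH
    year: Optional[int]
    is_preprint: bool            # stands in for _doi_is_preprint


def id_anchored_field_mismatch(
    entry_year: Optional[int],
    record: Optional[DoiRecord],
    year_tolerance: int = 1,     # assumed tolerance, not taken from the source
) -> List[str]:
    """Return the mismatch flags to emit for a DOI-confirmed entry."""
    if record is None:                    # (a) DOI did not resolve
        return []
    if not record.title_confirms_entry:   # (b) DOI points at a different paper
        return []
    if record.is_preprint:                # preprint twins differ legitimately
        return []

    flags = []
    if record.venue_hard_mismatch:        # (c) venue branch
        flags.append("VENUE_MISMATCH")
    if (
        entry_year is not None
        and record.year is not None
        and abs(entry_year - record.year) > year_tolerance
    ):                                    # (c) year branch
        flags.append("YEAR_MISMATCH")
    return flags
```

Returning a list rather than a single flag keeps the venue and year checks independent, so an entry that is wrong on both fields reports both.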
X4 — Relaxed-author retrieval fallback
fact_checker.py:_query_cascade. When the standard cascade returns zero candidates (or all candidates below abstention_below), retry Crossref and OpenAlex with the raw title (no first-author constraint). The transition is never not_found → VERIFIED; the realistic transition is not_found → AUTHOR_MISMATCH (the cascade now finds a wrong-paper candidate whose authors disagree).
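In outline, the fallback is a second query with the author constraint dropped. The sketch below takes the search backend as a callable so it stays self-contained. The `search` signature, the `(record_id, score)` candidate shape, and the 0.5 default for `abstention_below` are assumptions, and a single callable stands in for the Crossref and OpenAlex retries.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (record id, match score); hypothetical shape
SearchFn = Callable[[str, Optional[str]], List[Candidate]]


def query_with_relaxed_fallback(
    title: str,
    first_author: Optional[str],
    search: SearchFn,
    abstention_below: float = 0.5,  # assumed threshold value
) -> Tuple[List[Candidate], bool]:
    """Run the standard query; if nothing clears the abstention threshold,
    retry on the raw title alone. The returned flag records whether the
    relaxed path was used, so callers can refuse to upgrade such results
    to VERIFIED."""
    strict = search(title, first_author)
    if any(score >= abstention_below for _, score in strict):
        return strict, False
    return search(title, None), True
```

Carrying the `relaxed` flag alongside the candidates is one way to keep the "never not_found → VERIFIED" guarantee visible to downstream code.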
Regression fix (commit 1e37f7c)
The combination of X2 + the existing given-name audit machinery could let an arXiv-API record win selection over a structured Crossref/DBLP candidate, skipping the audit on Least-to-Most-shape entries. Fixed by adding an _ORDER_RELIABLE_PREFERENCE_BAND = 0.02 constant to _select_best_candidate: inside a 0.02 score sub-band, order-reliable structured records win selection.
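The band rule can be illustrated as below. Only the 0.02 constant comes from the release; the `Candidate` type, its fields, and the tie-breaking details are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

ORDER_RELIABLE_PREFERENCE_BAND = 0.02


@dataclass
class Candidate:
    source: str
    score: float
    order_reliable: bool  # True for structured records such as Crossref/DBLP


def select_best_candidate(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the top-scoring candidate, except that an order-reliable record
    within the preference band of the top score wins over a non-reliable one."""
    if not candidates:
        return None
    top_score = max(c.score for c in candidates)
    in_band = [
        c for c in candidates
        if top_score - c.score <= ORDER_RELIABLE_PREFERENCE_BAND
    ]
    reliable = [c for c in in_band if c.order_reliable]
    pool = reliable or in_band  # fall back to plain best score if none are reliable
    return max(pool, key=lambda c: c.score)
```

For example, an arXiv-API record at 0.95 loses to a Crossref record at 0.94, but still wins against a Crossref record at 0.90 because that one falls outside the band.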
Known regressions (documented; flagged for v1.2.1 triage)
3 new dev FPs and 1 new test FP introduced by the catch-rate work:
- ed071a6dfa34 (Improving Robustness using Generated Data): verified → arxiv_id_mismatch via X2.
- e59d381d98e6 (2026-synthetic VALID): verified → given_name_substitution via X2 + 1e37f7c interaction.
- f185501f556e + d07ee00b0c0f (Beyond log2(T) + 𝒩-WL): not_found → partial_match via X4 fallback retrieving wrong-paper candidates.
Offsetting clears: adf6c58262bf (Community Concealment) and e51140f8b514 (RLang) — both v1.1.0 FPs now correctly verified.
Residual leaks (transparency)
4 policy-adjusted residual leaks (down from 5 in v1.1.0):
- 3 letter-add near_miss_title (Privacys, Explanations, Models): --strict catches all 3
- 1 author-list truncation (OSAKA): --strict catches as AUTHOR_TRUNCATED
Plus 3 hyphen-only differences explicitly not counted as leaks by default (hyphenation is bibliographic noise; --strict still catches via Levenshtein-1).
See docs/KNOWN_LEAKS.md for the per-leak enumeration with --strict rule mapping.
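For readers unfamiliar with the term, a Levenshtein-1 check asks whether two strings differ by exactly one insertion, deletion, or substitution. The sketch below is a generic linear-time version for illustration only. The function name is hypothetical, and the actual --strict rule may normalize titles or apply extra conditions not shown here.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """Return True when the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) string

    # Advance past the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1

    if len(a) == len(b):
        # One substitution: the remainders after the differing character must match.
        return a[i + 1:] == b[i + 1:]
    # One insertion into a: skipping b[i] must make the remainders match.
    return a[i:] == b[i + 1:]


print(is_one_edit_apart("Privacy", "Privacys"))          # True (letter-add)
print(is_one_edit_apart("pre-training", "pretraining"))  # True (hyphen-only)
print(is_one_edit_apart("abc", "abde"))                  # False
```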
Tests
1088 → 1122 passing (+30 fix tests + 4 regression-fix tests; new modules test_cross_source_venue.py, test_arxiv_datacite_doi.py, test_id_anchored_field_mismatch.py, test_relaxed_author_fallback.py).
See CHANGELOG.md for the full per-rule breakdown.