v1.1.0 — held-out FPR -74%, --strict mode (arXiv 2026 policy), 32 HALLMARK mislabels corrected
Headline (HALLMARK v1.0 corrected gold, apples-to-apples)
| Pre-fix | Post-fix | Δ | |
|---|---|---|---|
| dev_public FPR | 2.58% | 1.59% | −38.5% |
| test_public FPR (held-out) | 8.94% | 2.32% | −74.1% |
| dev_public leak | 0.49% | 0.65% / 0.32% (policy-adjusted) | +1 raw |
| test_public leak | 0.38% | 0.76% / 0.57% (policy-adjusted) | +2 raw |
The −74% held-out FPR drop is primarily driven by the new CNV venue + retrieval refinements.
What's new
--strict mode (BIBTEX_CHECK_STRICT=1)
Aligned with arXiv's May 2026 1-year-ban policy for hallucinated references. Tightens five gates for high-stakes audits where leak ≫ FP:
- Title: Levenshtein-1 catches 1-character title perturbations (
Privacys/Privacy,Schema Variable/Schema-Variable, etc.) asTITLE_NEAR_MISS. - Year: tolerance 0; preprint-twin records route to a new
STRICT_WARN_PREPRINT_YEARstatus. - Author-set: single-source single-extra threshold (vs the default's ≥2/≥2).
- Author order: no alphabetization escape.
- Truncated author list without an
and others/et alsentinel flagsAUTHOR_TRUNCATED.
Plus --strict-warn-cnv promotes unconfirmed/not_found to a fourth visible category STRICT_WARN_CNV.
Cross-source author-fabrication detection
fact_checker.py:_detect_author_fabrication downgrades the author outcome to AUTHOR_MISMATCH when the entry has ≥2 surnames absent from every order-reliable candidate's full author set (≥2 sources contributing, no and others sentinel). Catches fabricated trailing authors that slip past the prefix-N slice.
Could-not-verify reductions on real refs
Six tightly-scoped venue + retrieval fixes (word-boundary venue acronym match, OpenReview venueid normalization, track/decoration suffix stripping, TMLR/JMLR ISO-4 alias expansion, diacritic-preserving paperhash + term= fallback, DBLP cascade query LaTeX-strip + Unicode-fold) lift ~50 of 65 VALID could-not-verify entries to VERIFIED.
Plus four systematic FP fixes
latex_to_plainhtml.unescapes DBLP-scraped XML entities (d'Amore,Ch'ng,D'Hondt,&titles).symmetric_author_matchhonors record alphabetization (CrossRef NeurIPS/ICML proceedings sort A-Z; that's a record-sort artifact, not a swap).- OpenReview/OpenReview.net treated as a hosting-platform venue → NON_COMPARABLE.
- Preprint-twin year → NON_COMPARABLE.
Upstream HALLMARK dataset
hallmark#9 corrects 32 entries the v1.0 auto-labeller flagged as fabricated but are in fact real, correctly-cited papers (4 batches; FlashAttention, DDPM, Imagen, SimCLR, Performers, ViT-vs-CNN, Improved-DDPM, Classifier-Free Diffusion Guidance, MERLOT, SimSiam, AdaFed, Chain-of-Thought, Zero-Shot Reasoner, ...). Failure mode: arXiv DOIs register with DataCite, not CrossRef — the auto-labeller's CrossRef-resolution check returned "no resolve" for legitimate arXiv-published papers. Three corrections override prior-audit rejections (FlashAttention, DDPM-Dhariwal, Imagen) on independent arXiv-grounded evidence; provenance + conflict notes in scripts/patch_mislabels.py.
Residual leaks (transparency)
5 policy-adjusted residual leaks documented per-entry in docs/KNOWN_LEAKS.md:
- 3 letter-add
near_miss_title(Privacys, Explanations, Models) —--strictcatches all 3 asTITLE_NEAR_MISS - 1 author-list truncation (OSAKA) —
--strictcatches asAUTHOR_TRUNCATED - 1 wrong-venue (SCoRe: claims NeurIPS, real venue is ICLR 2021) — v1.1.1
cheap_fixtarget via cross-source venue verification
Plus 3 hyphen-only differences explicitly not counted as leaks by default mode (hyphenation is bibliographic noise; --strict still catches via Levenshtein-1).
Tests
1064 → 1088 passing (+24 strict-mode + 8 cross-source author-fab + 3 regression). CI lint (black 24.10.0 + ruff 0.7.4) clean.
See CHANGELOG.md for the full per-rule breakdown.