Skip to content

v1.1.0 — held-out FPR -74%, --strict mode (arXiv 2026 policy), 32 HALLMARK mislabels corrected

Choose a tag to compare

@rpatrik96 rpatrik96 released this 30 May 08:09
· 42 commits to main since this release

Headline (HALLMARK v1.0 corrected gold, apples-to-apples)

Pre-fix Post-fix Δ
dev_public FPR 2.58% 1.59% −38.5%
test_public FPR (held-out) 8.94% 2.32% −74.1%
dev_public leak 0.49% 0.65% / 0.32% (policy-adjusted) +1 raw
test_public leak 0.38% 0.76% / 0.57% (policy-adjusted) +2 raw

The −74% held-out FPR drop is primarily driven by the new CNV venue + retrieval refinements.

What's new

--strict mode (BIBTEX_CHECK_STRICT=1)

Aligned with arXiv's May 2026 1-year-ban policy for hallucinated references. Tightens five gates for high-stakes audits where leak ≫ FP:

  • Title: Levenshtein-1 catches 1-character title perturbations (Privacys/Privacy, Schema Variable/Schema-Variable, etc.) as TITLE_NEAR_MISS.
  • Year: tolerance 0; preprint-twin records route to a new STRICT_WARN_PREPRINT_YEAR status.
  • Author-set: single-source single-extra threshold (vs the default's ≥2/≥2).
  • Author order: no alphabetization escape.
  • Truncated author list without an and others/et al sentinel flags AUTHOR_TRUNCATED.

Plus --strict-warn-cnv promotes unconfirmed/not_found to a fourth visible category STRICT_WARN_CNV.

Cross-source author-fabrication detection

fact_checker.py:_detect_author_fabrication downgrades the author outcome to AUTHOR_MISMATCH when the entry has ≥2 surnames absent from every order-reliable candidate's full author set (≥2 sources contributing, no and others sentinel). Catches fabricated trailing authors that slip past the prefix-N slice.

Could-not-verify reductions on real refs

Six tightly-scoped venue + retrieval fixes (word-boundary venue acronym match, OpenReview venueid normalization, track/decoration suffix stripping, TMLR/JMLR ISO-4 alias expansion, diacritic-preserving paperhash + term= fallback, DBLP cascade query LaTeX-strip + Unicode-fold) lift ~50 of 65 VALID could-not-verify entries to VERIFIED.

Plus four systematic FP fixes

  • latex_to_plain html.unescapes DBLP-scraped XML entities (d'Amore, Ch'ng, D'Hondt, & titles).
  • symmetric_author_match honors record alphabetization (CrossRef NeurIPS/ICML proceedings sort A-Z; that's a record-sort artifact, not a swap).
  • OpenReview/OpenReview.net treated as a hosting-platform venue → NON_COMPARABLE.
  • Preprint-twin year → NON_COMPARABLE.

Upstream HALLMARK dataset

hallmark#9 corrects 32 entries the v1.0 auto-labeller flagged as fabricated but are in fact real, correctly-cited papers (4 batches; FlashAttention, DDPM, Imagen, SimCLR, Performers, ViT-vs-CNN, Improved-DDPM, Classifier-Free Diffusion Guidance, MERLOT, SimSiam, AdaFed, Chain-of-Thought, Zero-Shot Reasoner, ...). Failure mode: arXiv DOIs register with DataCite, not CrossRef — the auto-labeller's CrossRef-resolution check returned "no resolve" for legitimate arXiv-published papers. Three corrections override prior-audit rejections (FlashAttention, DDPM-Dhariwal, Imagen) on independent arXiv-grounded evidence; provenance + conflict notes in scripts/patch_mislabels.py.

Residual leaks (transparency)

5 policy-adjusted residual leaks documented per-entry in docs/KNOWN_LEAKS.md:

  • 3 letter-add near_miss_title (Privacys, Explanations, Models) — --strict catches all 3 as TITLE_NEAR_MISS
  • 1 author-list truncation (OSAKA) — --strict catches as AUTHOR_TRUNCATED
  • 1 wrong-venue (SCoRe: claims NeurIPS, real venue is ICLR 2021) — v1.1.1 cheap_fix target via cross-source venue verification

Plus 3 hyphen-only differences explicitly not counted as leaks by default mode (hyphenation is bibliographic noise; --strict still catches via Levenshtein-1).

Tests

1064 → 1088 passing (+24 strict-mode + 8 cross-source author-fab + 3 regression). CI lint (black 24.10.0 + ruff 0.7.4) clean.

See CHANGELOG.md for the full per-rule breakdown.