Release v1.1.0 — held-out FPR -74%, --strict mode (arXiv 2026 policy), 32 HALLMARK mislabels corrected · rpatrik96/bibtexupdater

Headline (HALLMARK v1.0 corrected gold, apples-to-apples)

| Metric | Pre-fix | Post-fix | Δ |
|---|---|---|---|
| dev_public FPR | 2.58% | 1.59% | −38.5% |
| test_public FPR (held-out) | 8.94% | 2.32% | −74.1% |
| dev_public leak | 0.49% | 0.65% / 0.32% (policy-adjusted) | +1 raw |
| test_public leak | 0.38% | 0.76% / 0.57% (policy-adjusted) | +2 raw |

The −74% held-out FPR drop is driven primarily by the new could-not-verify (CNV) venue and retrieval refinements.

What's new

`--strict` mode (`BIBTEX_CHECK_STRICT=1`)

Aligned with arXiv's May 2026 1-year-ban policy for hallucinated references. Tightens five gates for high-stakes audits where a leak is far costlier than a false positive (leak ≫ FP):

Title: Levenshtein-1 catches 1-character title perturbations (Privacys/Privacy, Schema Variable/Schema-Variable, etc.) as TITLE_NEAR_MISS.
Year: tolerance 0; preprint-twin records route to a new STRICT_WARN_PREPRINT_YEAR status.
Author-set: single-source single-extra threshold (vs the default's ≥2/≥2).
Author order: no alphabetization escape.
Truncation: an author list cut short without an `and others`/`et al` sentinel is flagged AUTHOR_TRUNCATED.

Plus --strict-warn-cnv promotes unconfirmed/not_found to a fourth visible category STRICT_WARN_CNV.
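The title gate can be illustrated with a minimal sketch of a Levenshtein-1 check, assuming titles are case-folded before comparison. The function names `within_one_edit` and `classify_title` and the `MATCH`/`MISMATCH` labels are hypothetical and not taken from bibtexupdater; only `TITLE_NEAR_MISS` comes from the notes above.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if i == len(a):
        return True  # identical, or b has exactly one trailing extra character
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # a single substitution at position i
    return a[i:] == b[i + 1:]  # a single extra character in b at position i


def classify_title(entry_title: str, record_title: str) -> str:
    """Hypothetical strict-mode title gate: exact match, one-edit near miss, or mismatch."""
    a = entry_title.casefold().strip()
    b = record_title.casefold().strip()
    if a == b:
        return "MATCH"
    if within_one_edit(a, b):
        return "TITLE_NEAR_MISS"
    return "MISMATCH"
```

Under this sketch, a letter-add such as Privacys/Privacy and a hyphen swap such as Schema Variable/Schema-Variable are both one edit apart, which is why strict mode flags hyphen-only differences that default mode may treat as noise.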

Cross-source author-fabrication detection

fact_checker.py:_detect_author_fabrication downgrades the author outcome to AUTHOR_MISMATCH when the entry has ≥2 surnames absent from every order-reliable candidate's full author set (≥2 sources contributing, no and others sentinel). Catches fabricated trailing authors that slip past the prefix-N slice.
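The rule above can be sketched roughly as follows. This is an illustration, not the code in `fact_checker.py`: the function name, the parameters, and the choice to check the `others` sentinel on the entry side are all assumptions.

```python
from typing import Iterable, Sequence


def looks_fabricated(
    entry_surnames: Sequence[str],
    candidate_author_sets: Iterable[Iterable[str]],
    min_sources: int = 2,
    min_absent: int = 2,
) -> bool:
    """Hypothetical check: True if enough entry surnames appear in no candidate source."""
    # Assumption: a BibTeX "and others" sentinel shows up as the surname "others".
    if any(s.casefold() == "others" for s in entry_surnames):
        return False
    candidates = [set(a.casefold() for a in authors) for authors in candidate_author_sets]
    if len(candidates) < min_sources:
        return False  # too few contributing sources to call it fabrication
    known = set().union(*candidates)
    absent = [s for s in entry_surnames if s.casefold() not in known]
    return len(absent) >= min_absent
```

Comparing against the union of full author sets, rather than a prefix of each list, is what lets a check like this catch fabricated trailing authors.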

Could-not-verify reductions on real refs

Six tightly-scoped venue + retrieval fixes (word-boundary venue acronym match, OpenReview venueid normalization, track/decoration suffix stripping, TMLR/JMLR ISO-4 alias expansion, diacritic-preserving paperhash + term= fallback, DBLP cascade query LaTeX-strip + Unicode-fold) lift ~50 of 65 VALID could-not-verify entries to VERIFIED.
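Two of these fixes, the word-boundary acronym match and the LaTeX-strip plus Unicode-fold of query strings, can be sketched as below. The helper names are hypothetical, and the LaTeX handling covers only simple accent macros and braces; the project's actual normalization may differ.

```python
import re
import unicodedata

# Simple LaTeX accent macros such as \" \' \` \^ \~ (assumption: enough for this sketch).
_ACCENT_MACRO = re.compile(r'''\\[`'^"~=.]''')


def venue_has_acronym(venue: str, acronym: str) -> bool:
    """Match the acronym only as a whole word, so ACL does not match inside NAACL."""
    pattern = rf"\b{re.escape(acronym)}\b"
    return re.search(pattern, venue, flags=re.IGNORECASE) is not None


def fold_query(text: str) -> str:
    """Strip simple LaTeX accents and braces, then drop Unicode combining marks."""
    text = _ACCENT_MACRO.sub("", text)
    text = text.replace("{", "").replace("}", "")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this folding, a LaTeX-encoded surname and its Unicode spelling reduce to the same ASCII query string, so either form can hit the same record.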

Plus four systematic FP fixes

latex_to_plain now applies html.unescape to DBLP-scraped XML entities (`&apos;` in d'Amore, Ch'ng, D'Hondt; `&amp;` in titles).
symmetric_author_match honors record alphabetization (CrossRef NeurIPS/ICML proceedings sort A-Z; that's a record-sort artifact, not a swap).
OpenReview/OpenReview.net treated as a hosting-platform venue → NON_COMPARABLE.
Preprint-twin year → NON_COMPARABLE.
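The first two fixes can be illustrated with a short sketch. `html.unescape` is the standard-library call; the wrapper names `unescape_entities` and `record_is_alphabetized` are hypothetical and only stand in for the project's own logic.

```python
import html
from typing import Sequence


def unescape_entities(text: str) -> str:
    """Decode XML/HTML entities such as &apos; and &amp; left over from scraping."""
    return html.unescape(text)


def record_is_alphabetized(surnames: Sequence[str]) -> bool:
    """True if a record lists two or more surnames in A-Z order.

    Assumption for this sketch: when the record is sorted, an order
    difference against the entry is a record artifact, not an author swap.
    """
    keys = [s.casefold() for s in surnames]
    return len(keys) > 1 and keys == sorted(keys)
```

A matcher built this way would skip the order comparison whenever `record_is_alphabetized` holds for the record side, and compare author sets instead.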

Upstream HALLMARK dataset

hallmark#9 corrects 32 entries that the v1.0 auto-labeller flagged as fabricated but that are in fact real, correctly-cited papers (4 batches; FlashAttention, DDPM, Imagen, SimCLR, Performers, ViT-vs-CNN, Improved-DDPM, Classifier-Free Diffusion Guidance, MERLOT, SimSiam, AdaFed, Chain-of-Thought, Zero-Shot Reasoner, ...). Failure mode: arXiv DOIs register with DataCite, not CrossRef, so the auto-labeller's CrossRef-resolution check returned "no resolve" for legitimate arXiv-published papers. Three corrections override prior-audit rejections (FlashAttention, DDPM-Dhariwal, Imagen) on independent arXiv-grounded evidence; provenance + conflict notes in scripts/patch_mislabels.py.
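One way a labeller could avoid this failure mode is to recognize arXiv-issued DOIs before consulting CrossRef. The sketch below relies on arXiv DOIs using the `10.48550/arXiv.` prefix; the function name is hypothetical and this is not the logic in the HALLMARK scripts.

```python
ARXIV_DOI_PREFIX = "10.48550/arxiv."


def is_arxiv_doi(doi: str) -> bool:
    """True if the DOI was issued by arXiv (registered with DataCite, not CrossRef)."""
    normalized = doi.strip().lower()
    # Accept bare DOIs as well as common URL and "doi:" forms.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):]
            break
    return normalized.startswith(ARXIV_DOI_PREFIX)
```

A CrossRef "no resolve" result for a DOI where `is_arxiv_doi` returns True would then be treated as inconclusive rather than as evidence of fabrication.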

Residual leaks (transparency)

5 policy-adjusted residual leaks documented per-entry in docs/KNOWN_LEAKS.md:

3 letter-add near_miss_title (Privacys, Explanations, Models) — --strict catches all 3 as TITLE_NEAR_MISS
1 author-list truncation (OSAKA) — --strict catches as AUTHOR_TRUNCATED
1 wrong-venue (SCoRe: claims NeurIPS, real venue is ICLR 2021) — v1.1.1 cheap_fix target via cross-source venue verification

Plus 3 hyphen-only differences explicitly not counted as leaks by default mode (hyphenation is bibliographic noise; --strict still catches via Levenshtein-1).

Tests

1064 → 1088 passing (+24 strict-mode + 8 cross-source author-fab + 3 regression). CI lint (black 24.10.0 + ruff 0.7.4) clean.

See CHANGELOG.md for the full per-rule breakdown.
