any2md 1.0.2 — heuristics module + 11 quality fixes
Patch release. Closes #15 plus 8 additional quality issues found during a deep investigation of four real-world inputs (an arxiv academic paper, the ISO/IEC 27002 standard, a COMP4441 academic DOCX, and a Wikipedia article fetched via URL).
What's new
- New `any2md/heuristics.py` module: a pure-functions module for frontmatter field refinement: `refine_title`, `refine_abstract`, `extract_authors` (with arxiv API enrichment), `filter_organization`, `arxiv_lookup`, `is_arxiv_filename`.
- New `produced_by` extension field on the frontmatter contract. Records the software that produced the source file (PDF Creator, DOCX Application). Distinct from `extracted_via`, which records the any2md backend that produced the markdown.
- 5 new pipeline stages: C8 `decode_html_entities` (universal HTML-entity removal, code-block aware) + T7-T10 (table-formatted-TOC dedupe, cover-page-artifact strip, repeated-byline removal, web-fragment cleanup).
- New `--no-arxiv-lookup` CLI flag to disable the arxiv API metadata enrichment (default-on, useful for airgapped envs).
- Project logo and brand assets under `assets/`.
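To give a feel for the kind of check a helper like `is_arxiv_filename` implies, here is a minimal sketch. The function name `looks_like_arxiv_id`, the regex, and the decision to return the bare identifier are all assumptions made for illustration; the actual module may work differently. The sketch only recognises new-style arXiv identifiers (`YYMM.NNNN` or `YYMM.NNNNN`, optionally followed by a version suffix such as `v2`).

```python
import re
from pathlib import Path
from typing import Optional

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
# Hypothetical pattern for illustration; not taken from any2md itself.
_ARXIV_ID = re.compile(r"^(?P<id>\d{4}\.\d{4,5})(?P<version>v\d+)?$")


def looks_like_arxiv_id(filename: str) -> Optional[str]:
    """Return the bare arXiv identifier if the file name looks like one, else None."""
    name = Path(filename).name
    # Strip only a trailing ".pdf"; Path.stem would also eat the ".12345" part
    # of an extensionless name such as "2401.12345".
    if name.lower().endswith(".pdf"):
        name = name[:-4]
    match = _ARXIV_ID.match(name)
    return match.group("id") if match else None
```

An identifier recovered this way could then be passed to a metadata lookup, which is the step that `--no-arxiv-lookup` switches off.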
Eleven quality fixes (full list)
| # | Severity | Symptom | Affected |
|---|---|---|---|
| A1 | HIGH | Authors not extracted from PDF body byline | Academic PDFs |
| A2 | HIGH | Abstract picked the byline / cover blurb / TOC line | Most docs |
| A3 | MED | Organization populated with PDF Creator software junk | Most PDFs |
| B1 | HIGH | HTML entities (`&amp;`, `&lt;`) in body | All formats |
| C1 | MED | ISO/TR titles detected as cover-page boilerplate | Standards docs |
| C3 | LOW | DOCX titles concatenated course code + project | Some DOCX |
| C4 | LOW | Wikipedia titles kept "Wikipedia:" namespace prefix | Wikipedia |
| D1 | MED | TOC dumped as markdown table after abstract | Academic PDFs |
| D2 | MED | trafilatura fragments leaked into web outputs | Some web pages |
| D3 | MED | ISO cover-page QR/license blurb in body | Standards docs |
| E1 | LOW | "Author's Contact Information:" duplicated byline | Academic PDFs |
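The B1 fix depends on decoding entities without touching code samples. A minimal sketch of that idea follows, assuming fenced code blocks are the only regions to protect. The function name `decode_entities_outside_code` is hypothetical, and the real `decode_html_entities` stage may also handle inline code spans or indented blocks, which this sketch does not.

```python
import html

# Build the fence marker programmatically so this example can itself sit inside a fence.
FENCE = "`" * 3


def decode_entities_outside_code(markdown: str) -> str:
    """Decode HTML entities in prose lines, leaving fenced code blocks untouched."""
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            # A fence line toggles code mode and is copied through unchanged.
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            # Inside a code block: keep entities exactly as written.
            out.append(line)
        else:
            out.append(html.unescape(line))
    return "\n".join(out)
```

With this approach, `A &amp; B` in body text becomes `A & B`, while the same text inside a fenced block is preserved verbatim.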
Real-world before/after on the AI Governance paper
# v1.0.1
authors: [] # ❌
organization: "LaTeX with acmart 2024/08/25 v2.09 ..." # ❌ tooling, not org
abstract_for_rag: "PHILIP MOREIRA TOMEI 1, 2, RUPAL JAIN ..." # ❌ byline
# v1.0.2
authors: ["Philip Moreira Tomei", "Rupal Jain", "Matija Franklin"] # ✓ from arxiv API
organization: "" # ✓
produced_by: "LaTeX with acmart 2024/08/25 v2.09 ..." # ✓ tooling
abstract_for_rag: "This paper argues that market governance # ✓ real abstract
mechanisms should be considered a key approach in the
governance of artificial intelligence (AI), alongside
traditional regulatory frameworks. ..."
Install
pip install --upgrade "any2md[high-fidelity]" # Docling + heuristics
pip install --upgrade any2md # Lightweight
Documentation
- README: refreshed with v1.0.2 features and architecture diagram
- docs/output-format.md: `produced_by` field reference
- docs/cli-reference.md: `--no-arxiv-lookup` flag
- docs/architecture.md: diagram refresh + Heuristics module section
- docs/troubleshooting.md: 6 new "Resolved by v1.0.2" rows
- Full design spec
Backwards compatibility
All v1.0.1 frontmatter parsing continues to work. The new `produced_by` field is additive, so downstream parsers that ignore unknown keys are unaffected. The change to the `organization` value for software-creator PDFs is a bug fix in the strict sense: the prior value was incorrect, and the new value (empty) is correct.
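For downstream consumers, a tolerant reader might look like the sketch below. It assumes the frontmatter has already been loaded into a plain dictionary by whatever YAML parser the consumer uses. The `Frontmatter` dataclass and `parse_frontmatter` function are hypothetical names for illustration and are not part of any2md. Only field names that appear in these notes are modelled.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class Frontmatter:
    """Subset of frontmatter fields a consumer cares about (illustrative only)."""
    authors: List[str] = field(default_factory=list)
    organization: str = ""
    produced_by: str = ""  # new in v1.0.2; absent in v1.0.1 output
    abstract_for_rag: str = ""


def parse_frontmatter(raw: Dict[str, Any]) -> Frontmatter:
    """Build a Frontmatter from a dict, ignoring keys this consumer does not know."""
    known = {f.name for f in fields(Frontmatter)}
    return Frontmatter(**{k: v for k, v in raw.items() if k in known})
```

A reader written this way accepts both v1.0.1 and v1.0.2 output: a missing `produced_by` falls back to an empty string, and an empty `organization` can be treated as "unknown" rather than as an error.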