any2md 1.0.2 — heuristics module + 11 quality fixes

rocklambros released this 26 Apr 23:14

Patch release. Closes #15 plus 8 additional quality issues discovered during deep investigation against four real-world inputs (arxiv academic paper, ISO/IEC 27002 standard, COMP4441 academic DOCX, Wikipedia article via URL).

What's new

  • New any2md/heuristics.py module — pure-functions module for frontmatter field refinement: refine_title, refine_abstract, extract_authors (with arxiv API enrichment), filter_organization, arxiv_lookup, is_arxiv_filename.
  • New produced_by extension field on the frontmatter contract. Records the software that produced the source file (PDF Creator, DOCX Application). Distinct from extracted_via which records the any2md backend that produced the markdown.
  • Five new pipeline stages: C8 decode_html_entities (universal HTML-entity decoding, code-block aware) plus T7-T10 (table-formatted-TOC dedupe, cover-page-artifact strip, repeated-byline removal, web-fragment cleanup).
  • New --no-arxiv-lookup CLI flag to disable the arxiv API metadata enrichment. The lookup is on by default; the flag is useful in air-gapped environments.
  • Project logo and brand assets under assets/.
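
As a rough illustration of what "code-block aware" entity decoding can mean, the sketch below decodes entities in prose while leaving fenced blocks and inline code spans untouched. It is not the any2md implementation: the function name decode_entities_outside_code and the regex are assumptions made for this example, and only the standard-library html.unescape call is real API.

```python
import html
import re

# Built programmatically so the fence marker never appears literally in this sketch.
FENCE = "`" * 3

# Fenced blocks are tried first, then inline code spans.
# DOTALL lets a fenced block span several lines.
_CODE_RE = re.compile(
    "(" + re.escape(FENCE) + ".*?" + re.escape(FENCE) + r"|`[^`\n]+`)",
    re.DOTALL,
)


def decode_entities_outside_code(markdown: str) -> str:
    """Decode HTML entities in prose; leave code segments byte-for-byte intact.

    Hypothetical helper, not part of any2md.
    """
    parts = _CODE_RE.split(markdown)
    # With a single capturing group, re.split puts prose at even indices
    # and the matched code segments at odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = html.unescape(parts[i])
    return "".join(parts)
```

Splitting on code segments rather than decoding the whole document keeps literal entities in code samples, which matters when a document is itself about HTML.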

Eleven quality fixes (full list)

| #  | Severity | Symptom                                              | Affected       |
|----|----------|------------------------------------------------------|----------------|
| A1 | HIGH     | Authors not extracted from PDF body byline           | Academic PDFs  |
| A2 | HIGH     | Abstract picked the byline / cover blurb / TOC line  | Most docs      |
| A3 | MED      | Organization populated with PDF Creator software junk | Most PDFs     |
| B1 | HIGH     | HTML entities (&amp;, &lt;) in body                  | All formats    |
| C1 | MED      | ISO/TR titles detected as cover-page boilerplate     | Standards docs |
| C3 | LOW      | DOCX titles concatenated course code + project       | Some DOCX      |
| C4 | LOW      | Wikipedia titles kept "Wikipedia:" namespace prefix  | Wikipedia      |
| D1 | MED      | TOC dumped as markdown table after abstract          | Academic PDFs  |
| D2 | MED      | trafilatura fragments leaked into web outputs        | Some web pages |
| D3 | MED      | ISO cover-page QR/license blurb in body              | Standards docs |
| E1 | LOW      | "Author's Contact Information:" duplicated byline    | Academic PDFs  |

Real-world before/after on the AI Governance paper

# v1.0.1
authors: []                                                 #
organization: "LaTeX with acmart 2024/08/25 v2.09 ..."     # ❌ tooling, not org
abstract_for_rag: "PHILIP MOREIRA TOMEI 1, 2, RUPAL JAIN ..." # ❌ byline

# v1.0.2
authors: ["Philip Moreira Tomei", "Rupal Jain", "Matija Franklin"]  # ✓ from arxiv API
organization: ""                                                    #
produced_by: "LaTeX with acmart 2024/08/25 v2.09 ..."              # ✓ tooling
# ✓ real abstract
abstract_for_rag: "This paper argues that market governance
  mechanisms should be considered a key approach in the
  governance of artificial intelligence (AI), alongside
  traditional regulatory frameworks. ..."
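
The organization / produced_by split in the before/after above can be pictured with a small sketch. The function split_creator and its keyword list are hypothetical, chosen only to reproduce the behaviour shown; the real filter_organization logic in any2md may work differently.

```python
import re

# Hypothetical markers of authoring software. This is NOT any2md's real list.
_TOOLING_RE = re.compile(
    r"\b(latex|pdftex|acrobat|microsoft|word|libreoffice|ghostscript)\b",
    re.IGNORECASE,
)


def split_creator(creator: str) -> dict:
    """Route a PDF Creator string to produced_by or organization.

    Strings that look like tooling go to produced_by and leave
    organization empty; anything else is kept as the organization.
    """
    creator = creator.strip()
    if _TOOLING_RE.search(creator):
        return {"organization": "", "produced_by": creator}
    return {"organization": creator, "produced_by": ""}


if __name__ == "__main__":
    print(split_creator("LaTeX with acmart 2024/08/25 v2.09 ..."))
```

Keeping the raw string in produced_by instead of discarding it preserves provenance while stopping tooling names from being reported as an organization.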

Install

pip install --upgrade "any2md[high-fidelity]"   # Docling + heuristics
pip install --upgrade any2md                    # Lightweight

Backwards compatibility

Parsers written against v1.0.1 frontmatter continue to work. The new produced_by field is additive, so downstream parsers that ignore unknown keys are unaffected. The organization value change for software-creator PDFs is a bug fix in the strict sense: the prior value was incorrect, and the new value (empty) is correct.
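
To make the additive-field point concrete, here is a minimal sketch of a downstream consumer that tolerates both versions. It assumes flat "key: value" frontmatter lines and uses a deliberately tiny hand-rolled parser instead of a YAML library; parse_flat_frontmatter, legacy_view, and produced_by are hypothetical names for this example, not any2md API.

```python
# Keys a hypothetical v1.0.1-era consumer reads (taken from the example above).
LEGACY_KEYS = ("authors", "organization", "abstract_for_rag")


def parse_flat_frontmatter(block: str) -> dict:
    """Parse flat `key: value` lines into a dict of strings.

    Simplified on purpose: no nesting, no multi-line scalars.
    """
    fields = {}
    for line in block.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        fields[key.strip()] = value.strip().strip('"')
    return fields


def legacy_view(fields: dict) -> dict:
    """Read only the keys an older consumer knows; unknown keys are ignored."""
    return {key: fields.get(key, "") for key in LEGACY_KEYS}


def produced_by(fields: dict) -> str:
    """Read the new field with a default, so v1.0.1 output still works."""
    return fields.get("produced_by", "")
```

A consumer written this way needs no version check: older files simply yield an empty produced_by, and newer files do not disturb the legacy fields.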