Release any2md 1.0.2 — heuristics module + 11 quality fixes · rocklambros/any2md

Patch release. Closes #15 plus 8 additional quality issues discovered during deep investigation against four real-world inputs (arxiv academic paper, ISO/IEC 27002 standard, COMP4441 academic DOCX, Wikipedia article via URL).

What's new

New any2md/heuristics.py module — pure-functions module for frontmatter field refinement: refine_title, refine_abstract, extract_authors (with arxiv API enrichment), filter_organization, arxiv_lookup, is_arxiv_filename.
New produced_by extension field on the frontmatter contract. Records the software that produced the source file (PDF Creator, DOCX Application). Distinct from extracted_via which records the any2md backend that produced the markdown.
Five new pipeline stages: C8 decode_html_entities (universal HTML-entity decoding, code-block aware) plus T7-T10 (table-formatted-TOC dedupe, cover-page-artifact strip, repeated-byline removal, web-fragment cleanup).
New --no-arxiv-lookup CLI flag to disable the arxiv API metadata enrichment (default-on, useful for airgapped envs).
Project logo and brand assets under assets/.
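
To illustrate what "code-block aware" means for the C8 stage, here is a minimal sketch of entity decoding that skips code. The function name decode_entities_outside_code and the regular expression are assumptions made for this example, not the actual any2md implementation, which may handle more cases (tilde fences, indented code, nested backticks).

```python
import html
import re

# Matches fenced code blocks (``` ... ```) and single-line inline code spans.
# Assumption: this simplified pattern is enough for illustration only.
_CODE_PATTERN = re.compile(r"(```.*?```|`[^`\n]+`)", re.DOTALL)


def decode_entities_outside_code(markdown: str) -> str:
    """Decode HTML entities in prose while leaving code segments untouched."""
    parts = _CODE_PATTERN.split(markdown)
    # re.split with one capturing group places the code segments at odd
    # indices, so only the even-indexed (prose) segments are decoded.
    for i in range(0, len(parts), 2):
        parts[i] = html.unescape(parts[i])
    return "".join(parts)
```

With this sketch, an entity in running prose is decoded, while the same entity inside backticks or a fenced block is preserved verbatim, which matters when the source document is itself about HTML.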

Eleven quality fixes (full list)

#	Severity	Symptom	Affected
A1	HIGH	Authors not extracted from PDF body byline	Academic PDFs
A2	HIGH	Abstract picked the byline / cover blurb / TOC line	Most docs
A3	MED	Organization populated with PDF Creator software junk	Most PDFs
B1	HIGH	HTML entities (`&amp;`, `&lt;`) in body	All formats
C1	MED	ISO/TR titles detected as cover-page boilerplate	Standards docs
C3	LOW	DOCX titles concatenated course code + project	Some DOCX
C4	LOW	Wikipedia titles kept "Wikipedia:" namespace prefix	Wikipedia
D1	MED	TOC dumped as markdown table after abstract	Academic PDFs
D2	MED	trafilatura fragments leaked into web outputs	Some web pages
D3	MED	ISO cover-page QR/license blurb in body	Standards docs
E1	LOW	"Author's Contact Information:" duplicated byline	Academic PDFs
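
The A1 fix relies on recognizing arxiv papers before calling the arxiv API. As a rough sketch of how a filename check in the spirit of is_arxiv_filename could work, the helper below matches new-style arxiv identifiers (YYMM.NNNN or YYMM.NNNNN, with an optional version suffix). The name arxiv_id_from_filename and its return value are hypothetical; the real function in any2md/heuristics.py may behave differently, and old-style identifiers such as hep-th/9901001 are not covered here.

```python
import re
from typing import Optional

# New-style arxiv identifier: four digits, a dot, four or five digits,
# then an optional version (v2) and an optional .pdf extension.
_ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?(\.pdf)?$", re.IGNORECASE)


def arxiv_id_from_filename(filename: str) -> Optional[str]:
    """Return the arxiv identifier embedded in a filename, or None."""
    # Keep only the last path component, accepting both separator styles.
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    match = _ARXIV_ID.match(name)
    return match.group(1) if match else None
```

A caller could use the returned identifier as the lookup key for metadata enrichment and fall back to body-byline extraction when the result is None.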

Real-world before/after on the AI Governance paper

# v1.0.1
authors: []                                                 # ❌
organization: "LaTeX with acmart 2024/08/25 v2.09 ..."     # ❌ tooling, not org
abstract_for_rag: "PHILIP MOREIRA TOMEI 1, 2, RUPAL JAIN ..." # ❌ byline

# v1.0.2
authors: ["Philip Moreira Tomei", "Rupal Jain", "Matija Franklin"]  # ✓ from arxiv API
organization: ""                                                    # ✓
produced_by: "LaTeX with acmart 2024/08/25 v2.09 ..."              # ✓ tooling
abstract_for_rag: "This paper argues that market governance        # ✓ real abstract
  mechanisms should be considered a key approach in the
  governance of artificial intelligence (AI), alongside
  traditional regulatory frameworks. ..."
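
The organization / produced_by change above can be pictured as a routing step on the PDF Creator string. The sketch below is an assumption about the general approach, not the actual filter_organization code: the marker list, the function name split_creator, and the choice to leave produced_by empty for non-software values are all hypothetical.

```python
# Hypothetical substrings that suggest the value names software, not an
# organization. The real heuristic may use a different or longer list.
_SOFTWARE_MARKERS = ("latex", "pdftex", "microsoft word", "acrobat", "libreoffice")


def split_creator(creator: str) -> dict:
    """Route a document's creator string to organization or produced_by."""
    value = creator.strip()
    if any(marker in value.lower() for marker in _SOFTWARE_MARKERS):
        # Tooling string: record it as provenance, leave organization empty.
        return {"organization": "", "produced_by": value}
    # Assumption: anything else is treated as a plausible organization name.
    return {"organization": value, "produced_by": ""}
```

Applied to the creator string from the example above, this sketch would yield an empty organization and move the LaTeX string into produced_by, matching the v1.0.2 output shown.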

Install

pip install --upgrade "any2md[high-fidelity]"   # Docling + heuristics
pip install --upgrade any2md                    # Lightweight

Documentation

README — refreshed with v1.0.2 features and architecture diagram
docs/output-format.md — produced_by field reference
docs/cli-reference.md — --no-arxiv-lookup flag
docs/architecture.md — diagram refresh + Heuristics module section
docs/troubleshooting.md — 6 new "Resolved by v1.0.2" rows
Full design spec

Backwards compatibility

All v1.0.1 frontmatter parsing continues to work. The new produced_by field is additive (downstream parsers ignore unknown keys). The organization value change for software-creator PDFs is a bug fix in the strict sense — the prior value was incorrect; the new value (empty) is correct.
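
For downstream consumers, a tolerant reader keeps working across both versions. The following is a minimal sketch assuming the frontmatter has already been parsed into a dict; the function name describe_provenance and the fallback strings are illustrative and not part of any2md.

```python
def describe_provenance(frontmatter: dict) -> str:
    """Summarize provenance fields, tolerating missing or empty values."""
    # produced_by is absent in v1.0.1 output, so .get() avoids a KeyError.
    produced_by = frontmatter.get("produced_by") or "unknown"
    extracted_via = frontmatter.get("extracted_via") or "unknown"
    # organization may now be an empty string for software-creator PDFs.
    organization = frontmatter.get("organization") or "(none)"
    return (
        f"org={organization}; produced_by={produced_by}; "
        f"extracted_via={extracted_via}"
    )
```

Reading optional keys with .get() and treating an empty organization as "no value" means the same consumer handles v1.0.1 and v1.0.2 files without a version check.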
