any2md 1.0.0 — first stable release
First stable release of any2md v1.0.
any2md converts PDFs, DOCX files, HTML pages (and live URLs), and plain text into structured, machine-consumable Markdown for downstream RAG pipelines. Every output carries SSRM-compatible frontmatter — title, authors, dates, deterministic SHA-256 content_hash, token estimates, chunking guidance, and a derived abstract — followed by NFC- and LF-normalized body content.
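To illustrate why a content hash over NFC- and LF-normalized text is deterministic, here is a minimal Python sketch. It is not any2md's actual implementation: the helper names `normalize_body` and `content_hash` are hypothetical, and the exact bytes any2md hashes (for example, whether frontmatter is excluded) are an assumption here; docs/output-format.md is the place to check the real contract.

```python
import hashlib
import unicodedata


def normalize_body(text: str) -> str:
    """Convert CRLF/CR line endings to LF, then apply Unicode NFC."""
    # Hypothetical helper: the real normalization order in any2md may differ.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the normalized body as UTF-8."""
    return hashlib.sha256(normalize_body(text).encode("utf-8")).hexdigest()


# The same visible text in two encodings: a precomposed "e with acute" and
# CRLF endings versus "e" + combining accent and LF endings.
composed = "Caf\u00e9 menu\r\nLine two\r\n"
decomposed = "Cafe\u0301 menu\nLine two\n"
assert content_hash(composed) == content_hash(decomposed)
```

Normalizing before hashing means the digest depends only on the text itself, not on the platform or extractor that produced the bytes.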
Install
Lightweight (pymupdf4llm fallback only, ~50 MB):
pip install any2md
High-fidelity (adds Docling, ~2 GB ML models, much better tables and multi-column layouts):
pip install "any2md[high-fidelity]"
What's in 1.0.0
- SSRM-compatible YAML frontmatter with deterministic content hashing.
- Two-lane post-processing pipeline: structured (Docling) and text (trafilatura, mammoth, pymupdf4llm fallback, TXT heuristic). Six text-lane stages (line-wrap repair, dehyphenation, paragraph dedupe, TOC dedupe, header/footer strip, list/code restoration), four structured-lane stages (figure caption lift, table compaction, citation normalization, heading hierarchy enforcement), and seven shared cleanup stages.
- Docling primary backend for PDF/DOCX with automatic fallback to pymupdf4llm/mammoth when Docling isn't installed.
- Configuration and ergonomics: --profile {conservative,aggressive,maximum}, --high-fidelity/-H, --ocr-figures, --save-images, --auto-id, --meta KEY=VAL, --meta-file, .any2md.toml auto-discovery, --strict, --quiet, --verbose. New exit codes 0/1/2/3.
- Comprehensive documentation set: rewritten README, docs/output-format.md (the SSRM-compat contract), docs/cli-reference.md, docs/architecture.md, docs/troubleshooting.md, docs/upgrading-from-0.7.md, CONTRIBUTING.md, plus GitHub issue templates.
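As a rough illustration of two of the text-lane stages named above (dehyphenation and line-wrap repair), the sketch below shows one simple regex-based approach. It is an assumption-laden toy, not any2md's code: the function names and patterns are hypothetical, and a real stage would also need to protect genuinely hyphenated words, lists, and code blocks.

```python
import re

# A word split by a hyphen at a line break, e.g. "conver-\nsion".
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
# A single newline inside a paragraph (not part of a blank-line break).
_SOFT_WRAP = re.compile(r"(?<!\n)\n(?!\n)")


def dehyphenate(text: str) -> str:
    """Join words that were hyphenated across a line break (naive)."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def repair_line_wraps(text: str) -> str:
    """Replace soft wraps with spaces, keeping blank-line paragraph breaks."""
    return _SOFT_WRAP.sub(" ", text)


raw = "Structured out-\nput for downstream\npipelines.\n\nNext paragraph."
print(repair_line_wraps(dehyphenate(raw)))
```

Running dehyphenation first matters in this sketch: once soft wraps are replaced with spaces, the hyphen-plus-newline pattern no longer exists to match.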
Migration from v0.7
The frontmatter shape changed, which is a breaking change for downstream consumers that parse v0.7 output. A field-by-field migration table is in docs/upgrading-from-0.7.md. Pin v0.7 if you need the old shape.
Validation
Validated against:
- ISO/IEC 27002:2022 (164-page technical standard PDF, Docling)
- A multi-page academic DOCX (Docling)
- A Wikipedia article via URL (trafilatura)
All produce SSRM-compatible output with correct titles, dates, and content hashes.
Acknowledgements
Built on Docling, PyMuPDF, pymupdf4llm, mammoth, markdownify, trafilatura, and BeautifulSoup.