any2md 1.0.0 — first stable release

@rocklambros released this 26 Apr 18:59

First stable release of any2md v1.0.

any2md converts PDFs, DOCX files, HTML pages (and live URLs), and plain text into structured, machine-consumable Markdown for downstream RAG pipelines. Every output carries SSRM-compatible frontmatter — title, authors, dates, deterministic SHA-256 content_hash, token estimates, chunking guidance, and a derived abstract — followed by NFC- and LF-normalized body content.
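
To illustrate why normalizing before hashing makes content_hash deterministic, here is a minimal sketch using only the Python standard library. The function names normalize_body and content_hash are hypothetical, and the exact bytes any2md hashes (for example, whether frontmatter or trailing whitespace is included) are an assumption here; docs/output-format.md is the authoritative contract.

```python
import hashlib
import unicodedata


def normalize_body(text: str) -> str:
    """Convert CRLF/CR line endings to LF, then apply Unicode NFC.

    Hypothetical helper: mirrors the "NFC- and LF-normalized" description,
    not any2md's actual implementation.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)


def content_hash(text: str) -> str:
    """Return the hex SHA-256 digest of the normalized body, encoded as UTF-8.

    Assumption: the hash covers the body only.
    """
    return hashlib.sha256(normalize_body(text).encode("utf-8")).hexdigest()


# Same visible text, different byte-level representations:
# precomposed "é" with CRLF versus "e" + combining accent with LF.
a = "caf\u00e9\r\nsecond line"
b = "cafe\u0301\nsecond line"
same = content_hash(a) == content_hash(b)  # True once both are normalized
```

Without the normalization step the two inputs would hash differently even though they render identically, which is what would make hashes unstable across platforms and extraction backends.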

Install

Lightweight (pymupdf4llm fallback only, ~50 MB):

pip install any2md

High-fidelity (adds Docling, ~2 GB of ML models, much better handling of tables and multi-column layouts):

pip install "any2md[high-fidelity]"

What's in 1.0.0

  • SSRM-compatible YAML frontmatter with deterministic content hashing.
  • Two-lane post-processing pipeline: structured (Docling) and text (trafilatura, mammoth, pymupdf4llm fallback, TXT heuristic). Six text-lane stages (line-wrap repair, dehyphenation, paragraph dedupe, TOC dedupe, header/footer strip, list/code restoration), four structured-lane stages (figure caption lift, table compaction, citation normalization, heading hierarchy enforcement), and seven shared cleanup stages.
  • Docling primary backend for PDF/DOCX with automatic fallback to pymupdf4llm/mammoth when Docling isn't installed.
  • Configuration and ergonomics: --profile {conservative,aggressive,maximum}, --high-fidelity / -H, --ocr-figures, --save-images, --auto-id, --meta KEY=VAL, --meta-file, .any2md.toml auto-discovery, --strict, --quiet, --verbose. New exit codes 0/1/2/3.
  • Comprehensive documentation set: rewritten README, docs/output-format.md (the SSRM-compat contract), docs/cli-reference.md, docs/architecture.md, docs/troubleshooting.md, docs/upgrading-from-0.7.md, CONTRIBUTING.md, plus GitHub issue templates.
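
As a rough illustration of how a text-lane pipeline of this kind can be composed, the sketch below chains two of the named stages, dehyphenation and line-wrap repair, as plain string-to-string functions. All names (Stage, dehyphenate, repair_line_wraps, run_stages), the regular expressions, and the stage ordering are assumptions made for illustration; they are not taken from any2md's source, and docs/architecture.md describes the real design.

```python
import re
from typing import Callable, Iterable

# Hypothetical stage type: each stage takes text and returns cleaned text.
Stage = Callable[[str], str]


def dehyphenate(text: str) -> str:
    """Rejoin a lowercase word split by a hyphen at a line break."""
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def repair_line_wraps(text: str) -> str:
    """Replace single newlines inside a paragraph with a space.

    Blank-line paragraph breaks (two or more newlines) are left alone.
    """
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def run_stages(text: str, stages: Iterable[Stage]) -> str:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        text = stage(text)
    return text


raw = "The implemen-\ntation wraps\nacross lines.\n\nNext paragraph."
cleaned = run_stages(raw, [dehyphenate, repair_line_wraps])
# cleaned == "The implementation wraps across lines.\n\nNext paragraph."
```

In this sketch dehyphenation runs first so that the hyphen is still adjacent to the newline when it is matched; a real pipeline may order or implement these stages differently.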

Migration from v0.7

The frontmatter shape has changed, which is a breaking change for downstream consumers that parse v0.7 output. A field-by-field migration table is in docs/upgrading-from-0.7.md. Pin v0.7 if you need the old shape.

Validation

Validated against:

  • ISO/IEC 27002:2022 (164-page technical standard PDF, Docling)
  • A multi-page academic DOCX (Docling)
  • A Wikipedia article via URL (trafilatura)

All three produce SSRM-compatible output with correct titles, dates, and content hashes.

Acknowledgements

Built on Docling, PyMuPDF, pymupdf4llm, mammoth, markdownify, trafilatura, and BeautifulSoup.