v1.0.5 — DOCX auto-retry through mammoth on Docling msword warnings
Patch release. Recovers DOCX content that Docling's msword backend silently drops on malformed list structures.
Highlights
- DOCX auto-retry on Docling warnings. When Docling's DOCX backend emits any msword_backend warning (most commonly: "Parent element of the list item is not a ListGroup. The list item will be ignored.", from msword_backend.py:1377/1675), any2md now re-runs the file through the mammoth lane and uses that output. Docling's guard fires when it can't reconcile a DOCX list-item parent and silently returns without emitting the item, so a warning means the Markdown is missing content. Default ON; disable with --no-docx-fallback-on-warn.
  - Real-world impact: A policy-template DOCX that previously rendered to 12,740 words via Docling (with 29 dropped list items) now renders to 15,566 words via the mammoth fallback: +22% content recovered on that single file.
- Captured Docling warnings are forwarded into any2md's run-level warning bucket so --strict still fails on them. A FALLBACK: line is printed to stderr to make the swap visible.
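
The capture-then-retry flow described above can be sketched roughly as follows. This is a minimal illustration, not any2md's actual code: it assumes Docling reports these warnings through Python's standard logging module under the logger name shown, and the docling_extract / mammoth_extract callables, the convert_docx helper, and the exact wording of the FALLBACK: line are all hypothetical.

```python
import logging
import sys
from contextlib import contextmanager

# Assumption: Docling's DOCX backend logs under this logger name.
DOCLING_DOCX_LOGGER = "docling.backend.msword_backend"


class _ListHandler(logging.Handler):
    """Collects the messages of WARNING-or-higher log records."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def capture_warnings(logger_name=DOCLING_DOCX_LOGGER):
    """Yield a list that fills with warnings logged inside the with-block."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        # Always detach, even if the extractor raised.
        logger.removeHandler(handler)


def convert_docx(path, docling_extract, mammoth_extract, fallback_on_warn=True):
    """Return (markdown, extracted_via, captured_warnings).

    docling_extract and mammoth_extract are hypothetical callables
    taking a path and returning Markdown text.
    """
    with capture_warnings() as captured:
        markdown = docling_extract(path)
    if captured and fallback_on_warn:
        # Hypothetical message format; only the FALLBACK: prefix comes from the notes.
        print(f"FALLBACK: {path}: {len(captured)} Docling warning(s), using mammoth",
              file=sys.stderr)
        return mammoth_extract(path), "mammoth", list(captured)
    # No warnings, or fallback disabled: keep the Docling output.
    return markdown, "docling", list(captured)
```

In this sketch the captured warnings are returned in every case, so a caller can append them to a run-level list and still fail under --strict regardless of which lane produced the final Markdown.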
Added
- PipelineOptions.docx_fallback_on_warn: bool = True.
- CLI flag pair --docx-fallback-on-warn / --no-docx-fallback-on-warn.
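
As an illustration of how an on/off flag pair can map onto a boolean option that defaults to True, here is a small sketch using argparse from the standard library. It is an assumption that any2md wires its CLI this way (it may use a different CLI library); the PipelineOptions class below is a stand-in holding only the field named in these notes, and build_parser / options_from_args are hypothetical helpers.

```python
import argparse
from dataclasses import dataclass


@dataclass
class PipelineOptions:
    # Stand-in for the real class; only the field from these notes is shown.
    docx_fallback_on_warn: bool = True


def build_parser():
    parser = argparse.ArgumentParser(prog="any2md")
    # BooleanOptionalAction (Python 3.9+) registers both
    # --docx-fallback-on-warn and --no-docx-fallback-on-warn.
    parser.add_argument(
        "--docx-fallback-on-warn",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Retry DOCX through mammoth when Docling logs a warning.",
    )
    return parser


def options_from_args(argv):
    """Parse argv and copy the flag into the options object."""
    args = build_parser().parse_args(argv)
    return PipelineOptions(docx_fallback_on_warn=args.docx_fallback_on_warn)
```

With this shape, omitting the flag keeps the default-on behavior, and passing --no-docx-fallback-on-warn sets the field to False.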
Tests
- New tests/integration/test_docx_fallback_on_warn.py (7 tests): warning-capture context manager, default-on fallback path (output swap, extracted_via, captured warnings forwarded into collected_warnings(), user-visible message), --no-docx-fallback-on-warn opt-out, no-warning fast-path, and the explicit --backend docling interaction.
- Full suite: 290 passed, 1 skipped.
Internal
- _extract_via_docling() now returns (markdown, "docling", captured_warnings). Internal-only signature change.