v1.0.5 — DOCX auto-retry through mammoth on Docling msword warnings

@rocklambros rocklambros released this 27 Apr 12:07
· 23 commits to main since this release
1c3d990

Patch release. Recovers DOCX content that Docling's msword backend silently drops on malformed list structures.

Highlights

  • DOCX auto-retry on Docling warnings. When Docling's DOCX backend emits any msword_backend warning, any2md now re-runs the file through the mammoth lane and uses that output. The most common warning is "Parent element of the list item is not a ListGroup. The list item will be ignored." (msword_backend.py:1377/1675). Docling's guard fires when it can't reconcile a list item's parent, and it then returns without emitting the item, so a warning means the Markdown is missing content. Default ON; disable with --no-docx-fallback-on-warn.
  • Real-world impact: a policy-template DOCX that previously rendered to 12,740 words via Docling (with 29 dropped list items) now renders to 15,566 words via the mammoth fallback. That is 2,826 more words (about +22%) recovered on that single file.
  • Captured Docling warnings are forwarded into any2md's run-level warning bucket so --strict still fails on them. A FALLBACK: line is printed to stderr to make the swap visible.
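
The capture-then-fallback flow described above can be sketched roughly as follows. This is an illustration, not any2md's actual code. It assumes Docling reports these warnings through Python's standard logging module under a logger named docling.backend.msword_backend. The names capture_warnings and convert_docx are hypothetical, and the two extractors are passed in as plain callables so the sketch runs without Docling or mammoth installed.

```python
import logging
import sys
from contextlib import contextmanager

# Assumption: Docling logs msword_backend warnings under this logger name.
DOCLING_LOGGER = "docling.backend.msword_backend"


class _ListHandler(logging.Handler):
    """Collects the text of WARNING-or-higher records in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def capture_warnings(logger_name=DOCLING_LOGGER):
    """Hypothetical helper: yield a list that fills with warnings logged inside the block."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        # Always detach the handler, even if extraction raises.
        logger.removeHandler(handler)


def convert_docx(path, extract_docling, extract_mammoth, fallback_on_warn=True):
    """Hypothetical driver: return (markdown, extracted_via, captured_warnings)."""
    with capture_warnings() as captured:
        markdown = extract_docling(path)
    warnings_seen = list(captured)
    if warnings_seen and fallback_on_warn:
        # Only the "FALLBACK:" prefix comes from the release notes;
        # the rest of this message is illustrative.
        print(
            f"FALLBACK: {path}: {len(warnings_seen)} Docling warning(s), using mammoth",
            file=sys.stderr,
        )
        markdown = extract_mammoth(path)
        return markdown, "mammoth", warnings_seen
    return markdown, "docling", warnings_seen
```

In this sketch the captured warnings are returned on both paths, so a caller can still add them to a run-level bucket and let a strict mode fail on them even after the output has been swapped.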

Added

  • PipelineOptions.docx_fallback_on_warn: bool = True.
  • CLI flag pair --docx-fallback-on-warn / --no-docx-fallback-on-warn.
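
One way an on/off flag pair like this can map onto a boolean option is argparse's BooleanOptionalAction (Python 3.9+), which generates the --no- variant automatically. The sketch below is an assumption about the wiring, not any2md's actual implementation: the PipelineOptions class here is a stand-in that carries only the one field named in these notes, and parse_args is a hypothetical helper.

```python
import argparse
from dataclasses import dataclass


@dataclass
class PipelineOptions:
    # Stand-in for illustration; only this field and its default are
    # taken from the release notes.
    docx_fallback_on_warn: bool = True


def parse_args(argv):
    """Hypothetical helper: parse the flag pair into PipelineOptions."""
    parser = argparse.ArgumentParser(prog="any2md")
    # BooleanOptionalAction registers both --docx-fallback-on-warn
    # and --no-docx-fallback-on-warn from a single declaration.
    parser.add_argument(
        "--docx-fallback-on-warn",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Retry DOCX files through mammoth when Docling emits warnings.",
    )
    ns = parser.parse_args(argv)
    return PipelineOptions(docx_fallback_on_warn=ns.docx_fallback_on_warn)
```

With this wiring, passing no flag keeps the default-on behavior, and --no-docx-fallback-on-warn turns the retry off.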

Tests

  • New tests/integration/test_docx_fallback_on_warn.py (7 tests): warning-capture context manager, default-on fallback path (output swap, extracted_via, captured warnings forwarded into collected_warnings(), user-visible message), --no-docx-fallback-on-warn opt-out, no-warning fast-path, and the explicit --backend docling interaction.
  • Full suite: 290 passed, 1 skipped.

Internal

  • _extract_via_docling() now returns (markdown, "docling", captured_warnings). Internal-only signature change.