Release v1.0.3 — empty-title fix, Docling-lane orphan-punctuation, double-encode loop · rocklambros/any2md

Patch release closing two regressions surfaced by an audit of the v1.0.2-regenerated corpus, plus an extension of the v1.0.2 body-cleanup stages to the Docling lane.

Highlights

Empty-title fix (heuristics.refine_title): cover-page-titled docs (# INTERNATIONAL STANDARD etc.) emitted title: "" when the first H2 stripped to empty (markdown emphasis only, NBSP-equivalent Unicode, or a regex \s+ span crossing into the next paragraph). refine_title now walks H2 lines one at a time and skips any that strip to empty after dropping markdown emphasis. The Wikipedia-prefix strip got the same non-empty guard.
Docling-lane orphan-punctuation: lines consisting only of | or >, left behind by Docling's malformed table parsing, leaked through because T10 was text-lane-only by design. The orphan-punct portion is now extracted into a new lane-agnostic stage, strip_orphan_punctuation. T10's trafilatura-specific short-fragment heuristic stays text-lane-only.
Lane-agnostic body cleanup on Docling lane: T7 dedupe_toc_table, T8 strip_cover_artifacts, and T9 strip_repeated_byline are now appended to the structured-lane STAGES list so Docling output gets the same body-cleanup pass that text-lane output already had.
Double-encoded HTML entities (C8 decode_html_entities): now loops html.unescape until output stabilizes (max 5 iterations) so &amp;amp; → &amp; → & is fully decoded.
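
The decode-until-stable loop in the last item can be sketched as follows. This is a minimal illustration, not the C8 stage's actual code: the function name decode_entities and the constant MAX_PASSES are assumptions made for the example; only html.unescape and the 5-iteration cap come from the note above.

```python
import html

# Cap taken from the release note; the constant name is illustrative.
MAX_PASSES = 5


def decode_entities(text: str, max_passes: int = MAX_PASSES) -> str:
    """Apply html.unescape repeatedly until the text stops changing.

    Hypothetical stand-in for the C8 stage, not the project's code.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # Fixed point reached: nothing left to decode.
            break
        text = decoded
    return text


if __name__ == "__main__":
    # Double-encoded ampersand: "&amp;amp;" -> "&amp;" -> "&"
    print(decode_entities("a &amp;amp; b"))
```

The cap bounds the work on pathological input: a string nested more than five levels deep comes back still partly encoded rather than looping further.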

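The line-by-line H2 walk described in the empty-title fix might look roughly like this. It is a sketch under assumptions, not heuristics.refine_title itself: the helper name first_nonempty_h2 and the emphasis pattern are hypothetical, and only NBSP (which Python's str.strip already treats as whitespace) is covered among the "NBSP-equivalent" characters.

```python
import re
from typing import Optional

# Hypothetical pattern for markdown emphasis markers; the real stage may differ.
_EMPHASIS = re.compile(r"[*_`~]+")


def first_nonempty_h2(markdown: str) -> Optional[str]:
    """Return the text of the first H2 that is non-empty once emphasis is dropped.

    Illustrative helper only. Working one line at a time means a
    whitespace match can never spill into the following paragraph.
    """
    for line in markdown.splitlines():
        if not line.startswith("## "):
            continue
        text = line[3:].strip()  # str.strip() also removes NBSP (U+00A0)
        if _EMPHASIS.sub("", text).strip():
            return text
    return None


if __name__ == "__main__":
    doc = "# INTERNATIONAL STANDARD\n\n## ** **\n\n## Information security\n"
    print(first_nonempty_h2(doc))
```

The emphasis markers are dropped only for the emptiness check, so a heading such as ## **Scope** is returned with its markup intact for later stages to handle.
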
Verified

280 unit/integration tests pass; 1 skipped, 0 failures.
Audit on the SafeBreach + ISO regen corpus shows 0 empty-title, 0 orphan-punctuation, 0 cover-page-title flags.
v1.0.3-rc1 published cleanly to TestPyPI.

Full per-fix detail in CHANGELOG. Design rationale in docs/superpowers/specs/2026-04-26-v1.0.3-empty-title-orphan-punct-design.md.
