v1.0.3 — empty-title fix, Docling-lane orphan-punctuation, double-encode loop

@rocklambros rocklambros released this 27 Apr 01:48
· 25 commits to main since this release
41d5770

Patch release closing two regressions found by an audit of the v1.0.2-regenerated corpus, and extending the v1.0.2 body-cleanup stages to the Docling lane.

Highlights

  • Empty-title fix (heuristics.refine_title): cover-page-titled documents (# INTERNATIONAL STANDARD etc.) emitted title: "" when the first H2 stripped to empty. This happened when the H2 held only markdown emphasis or NBSP-equivalent Unicode, or when a regex \s+ span crossed into the next paragraph. refine_title now walks H2 headings line by line and skips any that strip to empty after markdown emphasis is dropped. The Wikipedia-prefix strip has the same non-empty guard.
  • Docling-lane orphan-punctuation: lines containing only | or >, left behind by Docling's malformed table parsing, leaked through because T10 was text-lane-only by design. The orphan-punctuation portion of T10 is now extracted into a new lane-agnostic stage, strip_orphan_punctuation. T10's trafilatura-specific short-fragment heuristic stays text-lane-only.
  • Lane-agnostic body cleanup on Docling lane: T7 dedupe_toc_table, T8 strip_cover_artifacts, and T9 strip_repeated_byline are now appended to the structured-lane STAGES list so Docling output gets the same body-cleanup pass that text-lane output already had.
  • Double-encoded HTML entities (C8 decode_html_entities): now loops html.unescape until the output stabilizes (max 5 iterations), so a multiply-encoded entity such as &amp;amp;amp; is fully decoded to &.
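
The decode-until-stable loop from the last highlight can be sketched as follows. This is a minimal illustration, not the C8 stage itself: the function name decode_entities_until_stable and its max_iterations parameter are hypothetical, and only html.unescape and the cap of 5 come from the notes above.

```python
import html


def decode_entities_until_stable(text: str, max_iterations: int = 5) -> str:
    """Apply html.unescape repeatedly until the text stops changing.

    Hypothetical helper for illustration; the cap bounds the work done on
    pathological input that keeps producing new entities.
    """
    for _ in range(max_iterations):
        decoded = html.unescape(text)
        if decoded == text:
            # A full pass changed nothing, so the text is fully decoded.
            break
        text = decoded
    return text


print(decode_entities_until_stable("&amp;amp;lt;b&amp;amp;gt;"))
```

A single html.unescape call removes only one layer of encoding, which is why one pass is not enough for text that was escaped more than once upstream.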

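The line-by-line H2 walk described in the empty-title highlight might look roughly like this. It is a sketch under assumptions: first_nonempty_h2 is a hypothetical name, the real heuristics.refine_title may differ, and the emphasis handling here only trims * and _ from the edges of a heading.

```python
from typing import Optional


def first_nonempty_h2(markdown: str) -> Optional[str]:
    """Return the first H2 heading that still has text after cleanup.

    Hypothetical helper for illustration. Working one line at a time means
    no pattern can run past the heading into the next paragraph.
    """
    for line in markdown.splitlines():
        if not line.startswith("## "):
            continue
        # str.strip() with no arguments also removes NBSP (U+00A0).
        candidate = line[3:].strip().strip("*_").strip()
        if candidate:
            return candidate
    return None


doc = "# INTERNATIONAL STANDARD\n\n## **\u00a0**\n\n## Information security\n\nBody text."
print(first_nonempty_h2(doc))
```

Returning None when no usable H2 exists lets the caller fall back to another title source rather than emitting an empty string.
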
Verified

  • 280 unit/integration tests pass; 1 skipped, 0 failures.
  • Audit on the SafeBreach + ISO regen corpus shows 0 empty-title, 0 orphan-punctuation, 0 cover-page-title flags.
  • v1.0.3-rc1 published cleanly to TestPyPI.

Full per-fix detail in CHANGELOG. Design rationale in docs/superpowers/specs/2026-04-26-v1.0.3-empty-title-orphan-punct-design.md.