v1.0.3 — empty-title fix, Docling-lane orphan-punctuation, double-encode loop
Patch release closing two regressions surfaced by audit on the v1.0.2-regenerated corpus, plus a Docling-lane reach for the v1.0.2 body-cleanup stages.
Highlights
- Empty-title fix (`heuristics.refine_title`): cover-page-titled docs (`# INTERNATIONAL STANDARD` etc.) emitted `title: ""` when the first H2 stripped to empty (markdown emphasis only, NBSP-equivalent unicode, regex `\s+` span crossing into the next paragraph). Now walks H2 lines line-by-line and skips any that strip to empty after dropping markdown emphasis. Wikipedia-prefix strip got the same non-empty guard.
- Docling-lane orphan-punctuation: lone `|` / `>` lines from Docling's malformed table parsing leaked through because T10 was text-lane-only by design. The orphan-punct portion is extracted into a new lane-agnostic stage `strip_orphan_punctuation`. T10's trafilatura-specific short-fragment heuristic stays text-lane-only.
- Lane-agnostic body cleanup on Docling lane: T7 `dedupe_toc_table`, T8 `strip_cover_artifacts`, and T9 `strip_repeated_byline` are now appended to the structured-lane `STAGES` list so Docling output gets the same body-cleanup pass that text-lane output already had.
- Double-encoded HTML entities (C8 `decode_html_entities`): now loops `html.unescape` until output stabilizes (max 5 iterations) so `&amp;amp;` → `&amp;` → `&` is fully decoded.
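
For readers who want a concrete picture of the double-encoded entity fix and the orphan-punctuation stage, here is a minimal Python sketch. It is not the project's actual code. The function names `decode_entities_until_stable` and `strip_orphan_punctuation_sketch`, the `_ORPHAN_LINE` pattern, and the exact definition of an "orphan" line are assumptions made for illustration. Only `html.unescape` and the five-pass cap come from the notes above.

```python
import html
import re

MAX_PASSES = 5  # cap taken from the release note; the rest is illustrative


def decode_entities_until_stable(text: str, max_passes: int = MAX_PASSES) -> str:
    """Apply html.unescape repeatedly until the text stops changing."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # fixed point reached, nothing left to decode
        text = decoded
    return text


# Hypothetical pattern: a line made up only of '|', '>' and horizontal whitespace,
# containing at least one of the two punctuation characters.
_ORPHAN_LINE = re.compile(r"^[ \t]*[|>][ \t|>]*$")


def strip_orphan_punctuation_sketch(body: str) -> str:
    """Drop lines that consist solely of stray '|' or '>' characters."""
    kept = [line for line in body.split("\n") if not _ORPHAN_LINE.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    print(decode_entities_until_stable("&amp;amp;"))
    print(strip_orphan_punctuation_sketch("Intro\n|\n>\n| a | b |\n> quote"))
```

The pass cap bounds the work on pathological input. One trade-off of looping is that text which intentionally contains a literal `&amp;` gets decoded as well. The orphan pattern leaves real table rows and non-empty blockquote lines alone because they contain characters other than `|`, `>`, and whitespace.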
Verified
- 280 unit/integration tests pass; 1 skipped, 0 failures.
- Audit on the SafeBreach + ISO regen corpus shows 0 empty-title, 0 orphan-punctuation, 0 cover-page-title flags.
- v1.0.3-rc1 published cleanly to TestPyPI.
Full per-fix detail in CHANGELOG. Design rationale in docs/superpowers/specs/2026-04-26-v1.0.3-empty-title-orphan-punct-design.md.