Skip to content

refactor: paragraph-level hardwrap detection with three invariants#20

Merged
ultimatile merged 1 commit into
mainfrom
fix/19-hardwrap-clause-vs-structural
May 19, 2026
Merged

refactor: paragraph-level hardwrap detection with three invariants#20
ultimatile merged 1 commit into
mainfrom
fix/19-hardwrap-clause-vs-structural

Conversation

@ultimatile
Copy link
Copy Markdown
Owner

@ultimatile ultimatile commented May 19, 2026

Summary

Closes #19.

Rewrites detect_hardwrap from a streaming counter machine into paragraph-level analysis.
The streaming design reset all per-line state on any out-of-band line, which left the fat-fat-thin column-wrap shape (two in-band non-terminal lines followed by a short terminal final line) undetectable: the short final line was out of band and reset both lanes before either could fire on the paragraph.
Reframes detection around three hard-reject invariants applied to each paragraph, with a normalized terminator that distinguishes clause boundaries from structural markup closes.

A pre-hypothesis external review pass framed the detection problem and proposed the three-invariant rule set summarized in the issue plan; this PR implements it.

Changes

  • gh_post/validators.py:
    split _TERMINATORS into _CLAUSE_TERMINATORS (.!?:;,) and _STRUCTURAL_CLOSERS (`)]}>"'`);
    add _is_normalized_terminal which strips trailing whitespace and structural closers before classifying the line;
    rewrite detect_hardwrap as paragraph-level analysis with three lanes (semantic-run, width-control, remainder-tail).
  • gh_post/__init__.py: drop the retired _TERMINATORS re-export, add _CLAUSE_TERMINATORS, _STRUCTURAL_CLOSERS, _is_normalized_terminal.
  • README.md: rewrite the Validators section to describe paragraph-level detection and the three invariants.
  • test_gh_post.py: 16 new tests — 6 for _is_normalized_terminal, 6 positive hardwrap fixtures (issue detect_hardwrap misses column wraps ending in structural closers #19 reproduction, each structural-closer class, fat-fat-thin, periods-clustered), 4 negative fixtures (two-nonterm-without-tail, short-opener, short-emphasis-final, real-terminator-inside-closer).

Three invariants (priority order, per paragraph)

  1. Semantic-run. Three or more consecutive in-band lines whose normalized endings are non-terminal.
    Multiple runs in one paragraph each emit a flag.
  2. Width-control. A sliding 3-window of in-band widths spanning at most _HARDWRAP_WIDTH_SPAN bytes.
    Suppressed when semantic-run fired in the paragraph.
  3. Remainder-tail. Final line below the band and clause-terminator-ending, preceded by two or more in-band non-terminal-normalized lines.
    Suppressed when semantic-run fired.

Normalized termination

A line is normalized-terminal iff stripping trailing whitespace and any run of structural closers leaves a last character in _CLAUSE_TERMINATORS.
...code` is mid-clause (the backtick is a markup close, not a sentence end); foo.) and foo." remain terminal because the clause terminator survives the strip.

Test plan

  • uv run pytest test_gh_post.py -q — 188 passed (16 new + 172 prior).
  • uv run ruff check / uv run ruff format --check — clean.
  • All 172 prior test_hardwrap_* cases continue to pass against the new detector.
  • Codex review against this branch — clean, no regressions identified.

Out of scope

  • Warn-only tier (one-signal weak detections written to stderr without rejecting).
  • CJK / mixed-script display-width detection (the byte-band skip posture is retained).
  • Lowercase / conjunction continuation-start as a tiebreaker signal.

Each of these is deferred to a separate issue when an author surfaces a concrete need; the issue body lays out the rationale.

Closes #19.

Rewrites `detect_hardwrap` from a streaming counter machine into
paragraph-level analysis. The streaming design reset all per-line
state on any out-of-band line, which left the fat-fat-thin
column-wrap shape (two in-band non-terminal lines followed by a
short terminal final line) undetectable: the short final line was
out of band and reset both lanes before either could fire on the
paragraph.

Split `_TERMINATORS` into two disjoint sets:

- `_CLAUSE_TERMINATORS` = `.!?:;,` (semantic boundaries)
- `_STRUCTURAL_CLOSERS` = `)]}>"'`` (markup closes)

Add `_is_normalized_terminal` which strips trailing whitespace and
structural closers before checking the remaining last character
against `_CLAUSE_TERMINATORS`. A line ending in inline-code
closing backtick (`...comment`) is now correctly classified as
mid-clause; `foo.)` and `foo."` remain terminal because the
clause-terminator survives the strip.

Apply three hard-reject invariants per paragraph:

- semantic-run — 3+ consecutive in-band non-terminal-normalized
  lines. Multiple independent runs in one paragraph each emit a
  flag (preserves the pre-paragraph-level behavior).
- width-control — sliding 3-window of in-band widths spanning at
  most `_HARDWRAP_WIDTH_SPAN` bytes. Suppressed if semantic-run
  fired in the paragraph.
- remainder-tail — final line below the band and clause-terminator-
  ending, preceded by 2+ in-band non-terminal-normalized lines.
  Catches the issue #19 fat-fat-thin shape. Suppressed if
  semantic-run fired.

Add 16 tests: 6 for `_is_normalized_terminal`, 6 positive
hardwrap fixtures (issue #19 reproduction, each structural-closer
class, fat-fat-thin, periods-clustered), 4 negative (two-nonterm-
without-tail, short-opener, short-emphasis-final, real-terminator-
inside-closer). All 172 prior tests continue to pass.

CJK / mixed-script display-width support and a warn-only tier are
explicitly deferred to separate issues.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors detect_hardwrap from a streaming counter machine to paragraph-level analysis, fixing a class of column-wrap misses (issue #19) where structural closers like `, ), ] were treated as clause terminators and prematurely reset the no-terminator counter.

Changes:

  • Split _TERMINATORS into disjoint _CLAUSE_TERMINATORS and _STRUCTURAL_CLOSERS sets and add _is_normalized_terminal (strips trailing whitespace and structural closers before checking).
  • Rewrite detect_hardwrap to buffer prose paragraphs and evaluate three invariants (semantic-run, width-control, remainder-tail) with suppression rules to avoid duplicate flags.
  • Update README Validators section and add 16 unit tests covering _is_normalized_terminal, the issue #19 reproduction, structural-closer variants, fat-fat-thin shape, and negative fixtures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
gh_post/validators.py Splits terminator sets, adds _is_normalized_terminal, and replaces streaming detector with paragraph-level _evaluate_paragraph + three-lane invariants.
gh_post/init.py Drops _TERMINATORS re-export and adds the new public-by-test symbols.
README.md Documents paragraph-level detection and the three invariants.
test_gh_post.py Adds 16 tests for normalization and new positive/negative hardwrap fixtures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ultimatile ultimatile merged commit 78d5691 into main May 19, 2026
1 check passed
@ultimatile ultimatile deleted the fix/19-hardwrap-clause-vs-structural branch May 19, 2026 05:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

detect_hardwrap misses column wraps ending in structural closers

2 participants