refactor: paragraph-level hardwrap detection with three invariants#20
Merged
Conversation
Closes #19. Rewrites `detect_hardwrap` from a streaming counter machine into paragraph-level analysis. The streaming design reset all per-line state on any out-of-band line, which left the fat-fat-thin column-wrap shape (two in-band non-terminal lines followed by a short terminal final line) undetectable: the short final line was out of band and reset both lanes before either could fire on the paragraph. Split `_TERMINATORS` into two disjoint sets: - `_CLAUSE_TERMINATORS` = `.!?:;,` (semantic boundaries) - `_STRUCTURAL_CLOSERS` = `)]}>"'`` (markup closes) Add `_is_normalized_terminal` which strips trailing whitespace and structural closers before checking the remaining last character against `_CLAUSE_TERMINATORS`. A line ending in inline-code closing backtick (`...comment`) is now correctly classified as mid-clause; `foo.)` and `foo."` remain terminal because the clause-terminator survives the strip. Apply three hard-reject invariants per paragraph: - semantic-run — 3+ consecutive in-band non-terminal-normalized lines. Multiple independent runs in one paragraph each emit a flag (preserves the pre-paragraph-level behavior). - width-control — sliding 3-window of in-band widths spanning at most `_HARDWRAP_WIDTH_SPAN` bytes. Suppressed if semantic-run fired in the paragraph. - remainder-tail — final line below the band and clause-terminator- ending, preceded by 2+ in-band non-terminal-normalized lines. Catches the issue #19 fat-fat-thin shape. Suppressed if semantic-run fired. Add 16 tests: 6 for `_is_normalized_terminal`, 6 positive hardwrap fixtures (issue #19 reproduction, each structural-closer class, fat-fat-thin, periods-clustered), 4 negative (two-nonterm- without-tail, short-opener, short-emphasis-final, real-terminator- inside-closer). All 172 prior tests continue to pass. CJK / mixed-script display-width support and a warn-only tier are explicitly deferred to separate issues.
There was a problem hiding this comment.
Pull request overview
Refactors detect_hardwrap from a streaming counter machine to paragraph-level analysis, fixing a class of column-wrap misses (issue #19) where structural closers like `, ), ] were treated as clause terminators and prematurely reset the no-terminator counter.
Changes:
- Split
_TERMINATORSinto disjoint_CLAUSE_TERMINATORSand_STRUCTURAL_CLOSERSsets and add_is_normalized_terminal(strips trailing whitespace and structural closers before checking). - Rewrite
detect_hardwrapto buffer prose paragraphs and evaluate three invariants (semantic-run, width-control, remainder-tail) with suppression rules to avoid duplicate flags. - Update README Validators section and add 16 unit tests covering
_is_normalized_terminal, the issue #19 reproduction, structural-closer variants, fat-fat-thin shape, and negative fixtures.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| gh_post/validators.py | Splits terminator sets, adds _is_normalized_terminal, and replaces streaming detector with paragraph-level _evaluate_paragraph + three-lane invariants. |
| gh_post/init.py | Drops _TERMINATORS re-export and adds the new public-by-test symbols. |
| README.md | Documents paragraph-level detection and the three invariants. |
| test_gh_post.py | Adds 16 tests for normalization and new positive/negative hardwrap fixtures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #19.
Rewrites
detect_hardwrapfrom a streaming counter machine into paragraph-level analysis.The streaming design reset all per-line state on any out-of-band line, which left the fat-fat-thin column-wrap shape (two in-band non-terminal lines followed by a short terminal final line) undetectable: the short final line was out of band and reset both lanes before either could fire on the paragraph.
Reframes detection around three hard-reject invariants applied to each paragraph, with a normalized terminator that distinguishes clause boundaries from structural markup closes.
A pre-hypothesis external review pass framed the detection problem and proposed the three-invariant rule set summarized in the issue plan; this PR implements it.
Changes
gh_post/validators.py:split
_TERMINATORSinto_CLAUSE_TERMINATORS(.!?:;,) and_STRUCTURAL_CLOSERS(`)]}>"'`);add
_is_normalized_terminalwhich strips trailing whitespace and structural closers before classifying the line;rewrite
detect_hardwrapas paragraph-level analysis with three lanes (semantic-run, width-control, remainder-tail).gh_post/__init__.py: drop the retired_TERMINATORSre-export, add_CLAUSE_TERMINATORS,_STRUCTURAL_CLOSERS,_is_normalized_terminal.README.md: rewrite the Validators section to describe paragraph-level detection and the three invariants.test_gh_post.py: 16 new tests — 6 for_is_normalized_terminal, 6 positive hardwrap fixtures (issue detect_hardwrap misses column wraps ending in structural closers #19 reproduction, each structural-closer class, fat-fat-thin, periods-clustered), 4 negative fixtures (two-nonterm-without-tail, short-opener, short-emphasis-final, real-terminator-inside-closer).Three invariants (priority order, per paragraph)
Multiple runs in one paragraph each emit a flag.
_HARDWRAP_WIDTH_SPANbytes.Suppressed when semantic-run fired in the paragraph.
Suppressed when semantic-run fired.
Normalized termination
A line is normalized-terminal iff stripping trailing whitespace and any run of structural closers leaves a last character in
_CLAUSE_TERMINATORS....code`is mid-clause (the backtick is a markup close, not a sentence end);foo.)andfoo."remain terminal because the clause terminator survives the strip.Test plan
uv run pytest test_gh_post.py -q— 188 passed (16 new + 172 prior).uv run ruff check/uv run ruff format --check— clean.test_hardwrap_*cases continue to pass against the new detector.Out of scope
Each of these is deferred to a separate issue when an author surfaces a concrete need; the issue body lays out the rationale.