Skip to content

feat: opt-out paragraph-reflow auto-format via markdown-it-py #21

Merged: ultimatile merged 2 commits into main from feat/12-markdown-it-py-autoformat on May 20, 2026

@ultimatile (Owner)

Summary

Adopt markdown-it-py as the wrapper's first runtime dependency and add a default-on paragraph-reflow auto-format step before every body is forwarded to gh. The reflow collapses soft-break newlines inside top-level CommonMark paragraphs to single spaces and preserves every other region of the body byte-identical to the input, so column-wrapped and over-fragmented bodies render cleanly on GitHub without the author having to hand-tune line breaks per post. Opt-out is supported per-invocation via --no-format on every body-bearing subcommand and per-process via GH_POST_NO_FORMAT.
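
The reflow described above can be sketched roughly as follows. This is a minimal illustration, not the code in gh_post/markdown.py: `top_level_paragraph_spans` and `reflow_spans` are hypothetical names, the parser configuration (CommonMark preset plus GFM tables) is an assumption, and the hard-break check is a simplification. It relies on markdown-it-py's `Token.map`, which holds the `[start, end)` source-line range of each block token.

```python
import re

# Split on \r\n, \r, or \n while keeping each terminator attached to its line.
_LINE_RE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def top_level_paragraph_spans(text: str) -> list[tuple[int, int]]:
    """Return [start, end) line spans of top-level paragraphs (hypothetical helper)."""
    # Third-party import kept local so reflow_spans below also runs without it.
    from markdown_it import MarkdownIt

    # Assumption: CommonMark preset plus GFM tables; the PR's setup is not shown.
    tokens = MarkdownIt("commonmark").enable("table").parse(text)
    return [
        (tok.map[0], tok.map[1])
        for tok in tokens
        if tok.type == "paragraph_open" and tok.level == 0 and tok.map
    ]


def _is_hard_break(body: str) -> bool:
    """True for a trailing two-space or unescaped-backslash hard break."""
    trailing_backslashes = len(body) - len(body.rstrip("\\"))
    return body.endswith("  ") or trailing_backslashes % 2 == 1


def reflow_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Join soft-broken lines inside each span; copy every other line verbatim."""
    lines = _LINE_RE.findall(text)
    out: list[str] = []
    cursor = 0
    for start, end in sorted(spans):
        out.extend(lines[cursor:start])  # non-prose region, byte-identical
        strip_indent = False
        for index in range(start, end):
            line = lines[index]
            body = line.rstrip("\r\n")
            terminator = line[len(body):]
            if strip_indent:
                # Continuation indentation is insignificant after a soft break.
                body = body.lstrip(" \t")
            if index == end - 1 or _is_hard_break(body):
                out.append(body + terminator)  # keep the original line ending
                strip_indent = False
            else:
                out.append(body.rstrip(" \t") + " ")  # soft break -> one space
                strip_indent = True
        cursor = end
    out.extend(lines[cursor:])
    return "".join(out)
```

With markdown-it-py installed, the two pieces compose as `reflow_spans(body, top_level_paragraph_spans(body))`. Nested paragraphs (inside lists or blockquotes) have `level > 0` and are skipped, which corresponds to the Strategy A behavior described later in this PR.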

Closes #12

Changes

  • pyproject.toml: add markdown-it-py>=3.0,<4 to [project].dependencies.
  • gh_post/markdown.py (new): adapter over markdown-it-py exposing iter_prose_paragraphs, iter_non_prose_spans, and reflow_paragraphs. iter_non_prose_spans is implemented as the complement of yielded top-level paragraph spans, so link reference definitions, duplicate references, and any future block type are handled without enumerating the parser's token vocabulary.
  • gh_post/_format.py (new): opt-out resolution helper (resolve_no_format, apply_format, emit_env_error). The --no-format flag wins over GH_POST_NO_FORMAT; invalid env values raise before any body read.
  • gh_post/validators.py: drop the bespoke _FENCE_OPEN_RE / _FENCE_CLOSE_RE / _STRUCTURAL_RE regexes and refactor detect_hardwrap to consume the adapter. validate_body gains a keyword-only format_mode parameter that defaults to False (rejector mode) so existing direct callers and tests are unaffected; format-on entry paths pass format_mode=True to switch the hardwrap detector to a silent tripwire.
  • gh_post/__init__.py: drop the regex re-exports, add the new adapter symbols to the flat-package surface.
  • gh_post/cli.py: register --no-format on the shared issue|pr argparse parser before parse_known_args, so the flag is consumed by the wrapper and never leaks to gh.
  • gh_post/subcommands/post.py, comment_edit.py, reply_inline.py: register --no-format on each local parser, resolve the opt-out before any body read, and apply reflow_paragraphs between body read and validate_body. reply-inline applies reflow per JSONL line inside _parse_reply_entries so the body that gets POSTed matches the body that the validator approved.
  • test_gh_post.py: 50 new tests; three existing rejection tests acquired --no-format so their assertion semantics survive the default change, with new sibling tests covering the default-on path.
  • README.md: new "Auto-format" section documenting scope, preservation guarantees, opt-out flag, and env precedence.
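
The complement construction mentioned for iter_non_prose_spans can be illustrated with a small stand-alone sketch. The function name `complement_spans` and its signature are hypothetical (the real adapter's signature is not shown in this PR); the point is only that non-prose regions fall out of the paragraph spans without naming any other token type.

```python
from collections.abc import Iterable, Iterator


def complement_spans(
    total_lines: int, prose_spans: Iterable[tuple[int, int]]
) -> Iterator[tuple[int, int]]:
    """Yield the [start, end) line ranges not covered by any prose span."""
    cursor = 0
    for start, end in sorted(prose_spans):
        if start > cursor:
            # Everything between the previous paragraph and this one is non-prose.
            yield (cursor, start)
        cursor = max(cursor, end)
    if cursor < total_lines:
        # Trailing region after the last paragraph (blank lines, fences, ...).
        yield (cursor, total_lines)
```

Because the result is defined purely by subtraction, a block type the parser introduces later would land in the non-prose set automatically, which is the property the Changes entry calls out.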

Impact

  • Every body-bearing subcommand (issue create|edit|comment, pr create|edit|comment, comment-edit, reply-inline) reflows paragraphs by default. The body that GitHub stores is therefore the reflowed body, not the raw input, unless --no-format / env opt-out is set.
  • detect_hardwrap under --no-format is now CommonMark-conformant in three classes where the pre-feature regex detector diverged: lazy continuations after lists/blockquotes are now part of the nested paragraph (no longer over-eagerly flagged as top-level prose); link reference definitions are non-prose (no longer treated as paragraph content); over-broad regex matches like #not heading, 1234567890. item, and | not a table are correctly treated as prose (no longer wrongly exempted from the hardwrap lanes). All three classes are covered by dedicated regression tests.
  • First declared runtime dependency for the project. At the Python 3.11+ floor, markdown-it-py 3.x pulls in only mdurl (no typing_extensions, no optional linkify-it-py or plugins). Both markdown-it-py and mdurl are pure-Python wheels, so the uv tool install . posture is unchanged in shape.

Test plan

uv run pytest: 238 passed (188 original + 50 new). uv run ruff check: clean. uv run ruff format --check: clean. uv run pyright: 0 errors.

Block-preservation tests assert reflow_paragraphs(input) == input byte-identical for each preserved shape: backtick fence, tilde fence containing literal backticks (the mandatory fence-char collision regression), bullet list both tight and loose, ordered list, blockquote with nested multi-line paragraph (Strategy A: nested paragraphs are not reflowed), ATX and setext headings, HTML comment block, HTML <div> block with nested markdown, GFM table, horizontal rule, indented code block, paragraph ending with trailing-two-space hard break, paragraph ending with unescaped backslash hard break, link reference definitions including duplicates.

Positive reflow tests cover: multi-line paragraph collapse, two consecutive paragraphs reflowed independently with the blank line preserved between them, paragraph immediately followed by a fenced code block with no blank line, paragraph immediately preceded by a list, and the over-fragmentation shape that motivated the feature (a paragraph with line breaks on prepositions and intra-clause commas).

Opt-out wiring tests cover: --no-format on every subcommand sends the raw body byte-for-byte; GH_POST_NO_FORMAT=true without the flag has the same effect; GH_POST_NO_FORMAT=false keeps format-on; a garbage value exits non-zero before any body read (including against --body-stdin, which would otherwise block); the flag wins over a conflicting env value; --no-format is consumed by the wrapper and does not appear in the argv forwarded to gh.
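
A rough sketch of the wiring these tests exercise, using only the standard library. `split_wrapper_args` is a hypothetical helper, and this `resolve_no_format` is a stand-in with an assumed signature. The accepted env spellings (only "true" and "false", case-insensitive, with unset or empty meaning format-on) are also an assumption; the PR text confirms only that true, false, and garbage values behave as described.

```python
import argparse
from collections.abc import Mapping

ENV_NAME = "GH_POST_NO_FORMAT"


def split_wrapper_args(argv: list[str]) -> tuple[bool, list[str]]:
    """Consume --no-format and return (flag, argv to forward to gh)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--no-format", action="store_true")
    # parse_known_args leaves every unrecognized argument in the second value.
    known, passthrough = parser.parse_known_args(argv)
    return known.no_format, passthrough


def resolve_no_format(flag: bool, environ: Mapping[str, str]) -> bool:
    """Return True when formatting should be skipped; the flag wins over the env."""
    raw = environ.get(ENV_NAME, "")
    env_value = False
    if raw != "":
        lowered = raw.strip().lower()
        if lowered not in ("true", "false"):
            # Raised before any body is read, so --body-stdin never blocks.
            raise ValueError(f"{ENV_NAME} must be 'true' or 'false', got {raw!r}")
        env_value = lowered == "true"
    return True if flag else env_value
```

In an entry point, `resolve_no_format(flag, os.environ)` would be called first, and only then would the body be read. Validating the env value even when the flag is set is a design choice of this sketch, not something the PR states.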

Tripwire tests cover: validate_body(clean_body, format_mode=True) is silent; validate_body(hardwrap_body, format_mode=True) emits a stderr diagnostic but returns [] (no rejection).
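
The rejector-versus-tripwire split might look like the sketch below. The `detect_hardwrap` here is a deliberately naive stand-in (it flags any two adjacent non-blank lines and is not Markdown-aware), and the diagnostic wording is invented for illustration; only the keyword-only `format_mode` parameter, its default, and the "diagnostic on stderr, empty result" behavior come from the PR text.

```python
import sys


def detect_hardwrap(body: str) -> list[str]:
    """Naive stand-in: report adjacent non-blank lines as possible hard wraps."""
    lines = body.splitlines()
    return [
        f"possible hardwrap at line {number}"
        for number, (current, following) in enumerate(zip(lines, lines[1:]), start=1)
        if current.strip() and following.strip()
    ]


def validate_body(body: str, *, format_mode: bool = False) -> list[str]:
    """Return rejection reasons; in format mode, only warn on stderr."""
    findings = detect_hardwrap(body)
    if not format_mode:
        return findings  # rejector mode: the caller refuses to post
    for finding in findings:
        # Tripwire mode: reflow should already have removed these, so a hit
        # signals a formatter bug rather than an author mistake.
        print(f"gh-post: tripwire: {finding}", file=sys.stderr)
    return []
```

Keeping `format_mode` keyword-only with a `False` default means existing positional calls such as `validate_body(body)` continue to reject exactly as before.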

The CommonMark-conformance regression tests for the three behavior-change classes named in Impact are individually pinned.

Post-implement codex review surfaced two P2 correctness findings, both fixed in the second commit: reflow_paragraphs("para\n\n") was dropping the trailing blank line (algorithm rewritten to walk source lines uniformly), and GH_POST_NO_FORMAT validation ran after the body read so an invalid env value with --body-stdin blocked on stdin (validation moved to the start of each entry point). Both fixes have dedicated regression tests.
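
The trailing-blank-line finding can be reproduced in miniature. The first function below is a simplified stand-in that shows the same symptom (it is not the PR's previous algorithm, which is only described as using "\n".join on the tail span); the second follows the described fix of emitting every line with a terminator and stripping the final newline only when the input had none. Both function names are hypothetical, and the sketch normalizes terminators to "\n", so it is byte-identical only for "\n"-terminated input.

```python
def rebuild_with_join(text: str) -> str:
    """Symptom reproduction: joining split lines loses a trailing blank line."""
    return "\n".join(text.splitlines())


def rebuild_uniform(text: str) -> str:
    """Emit each line plus a terminator, then undo the last one if needed."""
    out = "".join(line + "\n" for line in text.splitlines())
    if out and not text.endswith("\n"):
        out = out[:-1]  # the input had no final newline, so neither may the output
    return out
```

Here `rebuild_with_join("para\n\n")` yields `"para\n"`, while `rebuild_uniform` returns the input unchanged for each of the four regression shapes listed in the second commit (single trailing blank, double trailing blank, no trailing newline, blank-only body).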

Discovery contract status

The plan's Inconclusive / Deferred items section lists two deferrals, both honored verbatim in this PR:

  • Strategy B (reflow nested paragraphs inside list items and blockquotes) — deferred. The current Strategy A skips nested paragraphs because they sit under structural prefixes that paragraph-level reflow would have to recompute to preserve. A follow-up issue can lift this restriction if a real nested-paragraph case appears in the wild.
  • mdit-py-plugins-based extras (footnote, deflist, anchors, frontmatter) — deferred. GitHub bodies do not use these extensions in the wrapper's validation path; adding them would grow the configuration surface without current benefit.

No new Inconclusive items surfaced during implementation. The two pre-PR review findings were implementation-level bugs against the documented contract, not premise gaps.

Notes

The plan that produced this PR (with the pre-implementation review trail and derivations for the three core invariants — format idempotency, GFM HTML rendering equivalence, tripwire correctness) is preserved locally; the public contract surface is the README's Auto-format section plus the regression tests committed here.

Adopts markdown-it-py as the wrapper's first runtime dependency and
enables default-on paragraph reflow before every gh-post body forward.
The reflow collapses soft-break newlines inside top-level CommonMark
paragraphs to single spaces and preserves every other region
byte-identical to input: code fences (both backtick and tilde,
including fence-char content-collision cases), lists, blockquotes,
tables, headings, HTML blocks (including HTML comments), link
reference definitions, and hard breaks (trailing two-space or
unescaped backslash).

Opt-out per invocation via --no-format on every body-bearing
subcommand (issue/pr create/edit/comment, comment-edit, reply-inline)
and per process via GH_POST_NO_FORMAT env. The flag wins over env when
both are set; invalid env values exit non-zero before any body is read.

New gh_post.markdown adapter exposes iter_prose_paragraphs,
iter_non_prose_spans (complement-of-paragraphs construction so link
refdefs and any future block type are handled automatically), and
reflow_paragraphs. detect_hardwrap is refactored to consume the
adapter, removing the in-file fence and structural regexes.
validate_body gains a keyword-only format_mode kwarg (default False)
so existing direct callers and tests keep their current behavior.

Test suite: 232 passing (188 original + 44 new covering adapter,
reflow byte-identity for every preserved block type, opt-out flag
and env wiring, env value validation, and CommonMark behavior-change
regression cases for lazy continuation, overlong ordered-list marker,
hash-without-space, pipe-without-delimiter).

Closes #12

Two corrections to the auto-format path surfaced by post-implement
review:

- reflow_paragraphs("para\n\n") returned "para\n", silently dropping
  trailing blank lines. The previous algorithm used "\n".join on the
  tail non-prose span, which collapsed an empty trailing line into
  the body's existing terminator. Rewrites the function to walk
  source lines one at a time, emitting each line plus its terminator
  uniformly, then strips the final newline only when the input
  itself had none.

- GH_POST_NO_FORMAT validation in post.py and comment_edit.py ran
  after read_body, so an invalid env value with --body-stdin blocked
  on stdin before the error could surface, and with --body-file
  read the file before reporting the env error. Moves the env
  resolution to the start of cmd_post / cmd_comment_edit so the
  invalid-env contract ("exit non-zero before any body is read")
  actually holds. reply-inline already resolved env before reading
  stdin and is unchanged.

Adds six regression tests: four for trailing-blank-line byte-
identity (single trailing blank, double trailing blank, no trailing
newline, blank-only body), two for env-validation-before-stdin-block
across post and comment-edit.

Copilot AI left a comment

Pull request overview

This PR introduces a Markdown-aware auto-formatting step (default-on) that reflows soft line breaks inside top-level CommonMark paragraphs before bodies are sent to gh, with opt-out via --no-format and GH_POST_NO_FORMAT. It also refactors the existing hardwrap detector to use markdown-it-py tokenization rather than bespoke regex heuristics, improving CommonMark conformance while preserving byte-identical behavior outside reflowed paragraphs.

Changes:

  • Add markdown-it-py as the first runtime dependency and introduce a Markdown adapter (iter_prose_paragraphs, iter_non_prose_spans, reflow_paragraphs) to drive both validation and formatting.
  • Implement shared opt-out resolution + formatting application (--no-format flag precedence over GH_POST_NO_FORMAT, env validation before any body read).
  • Update all body-bearing entry points to apply reflow by default and adjust validator behavior via validate_body(..., format_mode=...) tripwire semantics; expand tests and document behavior in README.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds markdown-it-py>=3.0,<4 runtime dependency. |
| gh_post/markdown.py | New adapter over markdown-it-py providing top-level paragraph spans and paragraph reflow with byte-preservation outside paragraphs. |
| gh_post/_format.py | New shared opt-out/env parsing and formatting application helpers. |
| gh_post/validators.py | Refactors detect_hardwrap to consume adapter paragraphs; adds format_mode tripwire behavior to validate_body. |
| gh_post/cli.py | Registers --no-format for issue/pr subcommands so it is consumed by the wrapper parser. |
| gh_post/subcommands/post.py | Resolves opt-out before reading bodies, applies formatting, and validates in rejector vs tripwire mode depending on opt-out. |
| gh_post/subcommands/comment_edit.py | Adds --no-format, resolves opt-out before body read, applies formatting before validation. |
| gh_post/subcommands/reply_inline.py | Adds --no-format; applies formatting per JSONL entry prior to validation to keep “validated == sent”. |
| gh_post/__init__.py | Updates flat re-exports: removes regex exports and exposes new markdown adapter API. |
| README.md | Documents auto-format scope, preservation guarantees, and opt-out precedence/behavior. |
| test_gh_post.py | Adds extensive coverage for adapter behavior, preservation invariants, opt-out wiring, and tripwire semantics. |

@ultimatile merged commit c530f3b into main on May 20, 2026 (1 check passed)
@ultimatile deleted the feat/12-markdown-it-py-autoformat branch on May 20, 2026 01:54

Development

Successfully merging this pull request may close these issues.

research: evaluate markdown-it-py for Markdown-aware validators
