fix(parser): preserve blank lines inside multi-line YAML block scalars (#387) #388
…scalars

Multi-line YAML block scalars (`- prompt: |`, `- command: |`, etc.) were silently truncated at the first blank line; the post-blank content was absorbed as entity prose. Workflows validated and executed, but the LLM/shell never saw most of the value. The official `pflow guide prompt-caching` example hit this.

Root cause: the YAML-continuation state machine in `markdown_parser.py` treated any blank line as an item terminator. It never inspected the `|`/`>` sigil; block-scalar handling was an emergent side effect of which lines were collected before `yaml.safe_load` ran.

Fix: collapse the continuation rule from two conditions to one: a line continues the current item if it is blank OR indented past the bullet. Strip trailing blanks in `_flush_yaml_item` so single-line items stay on the `_coerce_yaml_scalar` fast path (preserves intentional PyYAML divergences for octal/hex/dates). This matches YAML's actual block-scalar termination semantics and applies uniformly to every multi-line value in every section.

Also updates the `prompt-caching.md` guide examples to use the preferred fenced-block syntax (```prompt ... ```) and adds a note to `core/CLAUDE.md` documenting the continuation rule.

9 new tests cover: `|`/`>`/plain multi-line preservation, non-prompt params (command/body/description), Inputs/Outputs sections, trailing-blank handling, and a fast-path regression guard.
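The continuation rule described in this commit message can be illustrated with a minimal, self-contained sketch. This is not the code in `markdown_parser.py`: the function name `collect_yaml_items`, the choice of "bullet column + 2" as the indentation threshold, and the handling of non-bullet lines are all assumptions made for illustration. Only the rule itself (a line continues the item if it is blank or sufficiently indented, and trailing blanks are stripped at flush time) comes from the text above.

```python
from typing import List, Optional


def collect_yaml_items(lines: List[str]) -> List[List[str]]:
    """Group '- key: value' bullets and their continuation lines.

    Hypothetical sketch of the rule: a line continues the current item
    if it is blank OR indented at least to `indent_level`.
    """
    items: List[List[str]] = []
    current: Optional[List[str]] = None
    indent_level = 0  # assumed threshold: column just past the "- " marker

    def flush() -> None:
        nonlocal current
        if current is None:
            return
        # Strip trailing blank lines so a single-line item stays single-line.
        while current and current[-1].strip() == "":
            current.pop()
        items.append(current)
        current = None

    for line in lines:
        stripped = line.strip()
        content_start = len(line) - len(line.lstrip())
        if current is not None and (not stripped or content_start >= indent_level):
            current.append(line)  # blank or indented: still part of the item
            continue
        flush()
        if stripped.startswith("- "):
            current = [line]
            indent_level = content_start + 2
        # Any other line (prose, headings) is ignored in this sketch.
    flush()
    return items
```

Fed a `- prompt: |` bullet whose body contains a blank line, followed by a single-line bullet, the sketch keeps the whole block in the first item and leaves the second item as exactly one line, which is the property the fast-path regression guard relies on.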
Code Review: PR #388
Reviewed as an …

TL;DR
Well-scoped fix. Root cause correctly identified (the continuation rule had two conditions where YAML semantics demand one). The diff replaces the wrong rule rather than papering it over with a …

Critical: must fix before merge
(none)

Warnings: should be addressed
1. Test comment is wrong about what …
# `_coerce_yaml_scalar` keeps "010" as the literal string "010"; PyYAML's
# yaml.safe_load would parse it as int 8 (octal). The fast path must run.
assert node["params"]["max-retries"] == 10
The assertion is correct, but the comment is wrong:
# `_coerce_yaml_scalar` coerces "010" to decimal int 10 (Python int()); PyYAML's
# yaml.safe_load would parse it as int 8 (YAML 1.1 octal). The fast path must run.

Suggestions: optional improvements
2. Mention the chomp-indicator caveat in …
# is lossless for multi-line items (yaml.safe_load chomps trailing
# newlines for `|` block scalars by default).
True for the default (CLIP) chomping, but for …
3. Pre-existing-issue acknowledgement is good; consider opening the follow-up issue …
4. Guide-doc rewrite is helpful but slight scope creep …

Things I verified and liked

Test quality
9 new tests; none look redundant. The shapes are well-chosen:
No tests to remove. The only minor gap is the |
Code Review
This pull request enhances the Markdown parser to correctly handle and preserve blank lines within multi-line YAML items, such as block scalars. Key changes include updating the continuation logic to include blank lines and modifying _flush_yaml_item to strip trailing whitespace. The documentation and test suite have been updated to reflect these improvements. Review feedback suggests optimizing the continuation check by using the pre-calculated stripped variable for better consistency and performance.
yaml_current_item_lines.append(line)
continue
content_start = len(line) - len(line.lstrip())
if line.strip() == "" or content_start >= yaml_indent_level:
The variable stripped is already calculated at line 384. Using it here instead of calling line.strip() again is more efficient and consistent with the rest of the loop.
Suggested change:
-    if line.strip() == "" or content_start >= yaml_indent_level:
+    if not stripped or content_start >= yaml_indent_level:
…ip review]

Addresses two review comments on #388:

- `_flush_yaml_item`: document that the trailing-blank strip is lossless only for default-chomping (`|`/`>`) block scalars; `|+`/`>+` keep-chomping would preserve all trailing blanks and this strip would drop them. Accepted edge case: no shipped `.pflow.md` uses `|+`/`>+`.
- `test_single_line_item_between_block_scalars_uses_fast_path`: fix incorrect comment claiming `_coerce_yaml_scalar("010")` returns the string `"010"`. It actually coerces to decimal int 10 (Python `int()`), and the divergence vs PyYAML is 10 vs 8 (YAML 1.1 octal). Assertion was correct, comment was misleading.

Follow-up issue for the pre-existing heading/code-fence-detection-inside-block-scalar issue filed as #389.
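The 10-versus-8 divergence named in the corrected test comment can be reproduced with the standard library alone. In the sketch below, both function names are hypothetical stand-ins: `fast_path_int` assumes the integer branch of `_coerce_yaml_scalar` is a plain Python `int()` call (as the commit message states), and `yaml11_style_int` is a deliberately simplified imitation of YAML 1.1 octal resolution, not PyYAML's actual resolver.

```python
def fast_path_int(text: str) -> int:
    # Assumed behaviour of the fast path: Python int() ignores leading
    # zeros in a decimal string, so "010" becomes 10.
    return int(text)


def yaml11_style_int(text: str) -> int:
    # Simplified imitation of YAML 1.1: a leading "0" followed only by
    # octal digits is read as base 8. Real resolvers handle more cases.
    if len(text) > 1 and text[0] == "0" and all(c in "01234567" for c in text[1:]):
        return int(text, 8)
    return int(text)
```

With these stand-ins, `max-retries: 010` yields 10 on the fast path and 8 under the octal interpretation, which is why the regression test asserts `== 10`.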
Summary
Multi-line YAML block scalars in `.pflow.md` files (`- prompt: |`, `- command: |`, etc.) were silently truncated at the first blank line, with post-blank content absorbed into the entity's `purpose`. Workflows validated and executed, but only a fraction of the intended value reached the LLM/shell/HTTP body. The official `pflow guide prompt-caching` example hit this exact pattern.

Fixes #387
Changes
- `src/pflow/core/markdown_parser.py`: collapse the YAML-continuation rule from two conditions to one (blank OR indented ≥ bullet column); strip trailing blanks in `_flush_yaml_item` so single-line items stay on the `_coerce_yaml_scalar` fast path.
- `src/pflow/core/CLAUDE.md`: document the continuation rule so future agents don't reintroduce the bug.
- `src/pflow/guide/features/prompt-caching.md`: rewrite the two `- prompt: |` examples (`score-each-chorus`, `process-items`) to use the preferred `` ```prompt `` fenced-block syntax.
- `tests/test_core/test_markdown_parser.py`: 9 new regression tests.

Explanation
The parser's YAML-continuation state machine never inspected the `|`/`>` sigil; block-scalar handling was an emergent side effect of which lines were collected before `yaml.safe_load` ran. A blank line unconditionally flushed the in-progress item, dropping every subsequent line from PyYAML's view. The dropped lines then matched the parser's "prose" path (no leading `-`, `line.strip() != ""`) and were silently captured as the entity's description.

The fix replaces the wrong rule rather than adding a special case: under standard YAML semantics, a `|` block ends only when a line is less indented than the block's content, and blank lines are preserved as empty literal lines. The new continuation logic implements exactly that within markdown's nested context. PyYAML then handles `|`, `>`, plain, octal, hex, dates, and every other block-scalar variant correctly because it now sees the full block.

Trailing-blank stripping at flush time keeps the `_coerce_yaml_scalar` fast path active for single-line items between blank-padded multi-line ones. This matters because `_coerce_yaml_scalar` intentionally diverges from PyYAML for edge cases (e.g., `010` stays `10`, not octal `8`), as documented in `core/CLAUDE.md`.

Coverage map (parser-fix scope is the first row; rows 2-3 already worked):

- `- key: |` / `- key: >` / `- key:\n ...` in any section, any node type
- … `` ```prompt ``, `` ```yaml batch ``, `` ```shell command ``, etc.)
- … `- prompt: ./foo.prompt.md`)

File stats:
Testing
- `test_yaml_block_scalar_literal_preserves_blank_lines`: exact `score-each-chorus` guide pattern; asserts prompt body + clean `purpose`.
- `test_yaml_folded_scalar_preserves_blank_lines`: `>` folded scalar with paragraph break.
- `test_plain_multiline_scalar_preserves_blank_lines`: `- prompt:` indented plain scalar.
- `test_block_scalar_with_blank_lines_in_non_prompt_params`: `command:` (shell), `body:` (http).
- `test_block_scalar_blank_lines_in_inputs_and_outputs`: `description:` in `## Inputs` and `## Outputs`.
- `test_block_scalar_followed_by_another_bullet`: trailing-blank handling vs next bullet.
- `test_single_line_item_between_block_scalars_uses_fast_path`: regression guard; asserts `max-retries: 010` parses as decimal `10` (fast path), not octal `8` (PyYAML path).
- `make check` clean.

Manual verification (verification-specialist pass):

- command: |+ internal blanks: resolved command in trace matches source byte-for-byte; stdout shows expected paragraphs.
- … `${message}` template substitution works.
- `pflow analyze-cache` on a multi-line-prompt workflow: sees the full prompt (17 tokens, not the truncated ~5) and correctly fires `cache.prompt-body-duplicates-cache` for `${rubric}` duplication.
- `pflow save` → re-load → re-run: file preserved verbatim, saved workflow re-runs correctly.
- `${check.nonexistent_field}` template error in `## Outputs` after a multi-line block with 2 internal blank lines correctly reports `file:21` (the `- source:` line).
- … `- item`) inside `|` block scalar content: preserved as content, not stolen as new YAML items.
- `examples/invalid/` (10 files): all still fail with their original errors; no false-positive regressions.
- `pflow --dry-run` and `pflow visualize` on multi-line-prompt workflows: parse and render correctly.

Pre-existing issue flagged (out of scope)
Indented markdown-heading-looking content (`# foo`, `## bar`, `### baz`) inside a `|` block scalar is incorrectly detected as a heading because the heading check uses `stripped.startswith(...)` and runs before the YAML-continuation check. Verified same behavior on `main` (this PR did not introduce it). Worth a follow-up issue: change heading checks to `line.startswith(...)` (column-0 only, matching markdown spec).
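The difference between the two heading checks can be seen in a small sketch. The function names and the `HEADING_PREFIXES` tuple are assumptions for illustration; the PR text elides the actual arguments passed to `startswith(...)`, so the prefixes here are only a plausible guess.

```python
# Hypothetical prefixes; the real arguments to startswith(...) are not shown in the PR.
HEADING_PREFIXES = ("# ", "## ", "### ")


def looks_like_heading_stripped(line: str) -> bool:
    # Behaviour described as pre-existing: indentation is discarded first,
    # so indented block-scalar content can be mistaken for a heading.
    return line.strip().startswith(HEADING_PREFIXES)


def looks_like_heading_column0(line: str) -> bool:
    # Proposed follow-up: only a '#' in column 0 counts as a heading.
    return line.startswith(HEADING_PREFIXES)
```

A line such as `    ## Scoring rubric` inside a `|` block is flagged by the first check and ignored by the second, while a real `## Steps` heading passes both. Note that CommonMark allows up to three leading spaces before an ATX heading, so a column-0-only check is slightly stricter than that spec; whether this matters depends on how deeply block-scalar content is indented.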