Skip to content

Refactor HTML parser: lexer-based architecture with improved Twig support#1013

Merged
Soner (shyim) merged 29 commits into
nextfrom
claude/fix-shopware-cli-1002-BAqh1
May 19, 2026
Merged

Refactor HTML parser: lexer-based architecture with improved Twig support#1013
Soner (shyim) merged 29 commits into
nextfrom
claude/fix-shopware-cli-1002-BAqh1

Conversation

@shyim

Copy link
Copy Markdown
Member

Summary

This PR replaces the old recursive-descent HTML parser with a new lexer-based architecture that provides better separation of concerns, improved error handling, and more robust Twig template support. The old internal/twigparser package is removed entirely in favor of the enhanced internal/html package.

Key Changes

Architecture Refactoring

  • New lexer layer (lexer.go, tokens.go, pos.go): Single-pass tokenizer that recognizes HTML elements, attributes, and Twig delimiters ({% %}, {{ }}, {# #}) with precise position tracking
  • Token-driven parser (parser.go): Replaces recursive descent with a cleaner state machine that walks the token stream, making it easier to handle complex nesting and error recovery
  • Separated formatting logic (format.go): All Dump/rendering code moved to a dedicated file, making the parser logic independent of output formatting

Twig Tag System

  • Tag registry (tags.go): Pluggable tag parser system with TagSpec for declarative tag definitions
  • Dedicated tag parsers: Each Twig tag type (block, if, for, set, verbatim, etc.) has its own file (tag_*.go) with focused parsing logic
  • Improved if/elseif/else handling (tag_if.go): Proper branch parsing with correct terminator detection
  • Generic block support (tag_generic.go): Handles unknown/future Twig tags gracefully by preserving them as raw text

AST Improvements

  • Cleaner node types (ast.go): Consolidated node definitions with clear interfaces
  • Better context awareness: nodeContext enum distinguishes top-level parsing from element-children parsing, enabling smarter whitespace handling
  • Inline-mixed content detection: Preserves formatting of embedded expressions (e.g., {% for %}{{ x }}{% endfor %}) by detecting when a block contains only text + expressions

Error Handling

  • Structured errors (errors.go): ParseError with source location, filename, and related locations for better diagnostics
  • Position tracking: Every token carries precise line/column information for accurate error reporting

Testing & Validation

  • Smoke tests (smoke_test.go): Can parse real Shopware storefront templates when HTML_SMOKE_CORPUS is set
  • Fuzz testing (fuzz_test.go): Lexer fuzzed with fixture inputs to catch edge cases
  • Fixture-based tests (lexer_test.go): Validates lexer against testdata files

Notable Implementation Details

  • Whitespace handling: Top-level nodes trim whitespace before deciding whether to emit RawNodes; element children preserve all whitespace for faithful formatting
  • Inline-mixed detection: Blocks containing only text + template expressions flow children verbatim to avoid re-indenting on every format pass
  • Void element recognition: isVoidElement() correctly identifies self-closing HTML tags
  • Backward compatibility: NewParser(), NewAdminParser(), and NewStorefrontParser() entry points maintain the existing API; ConfiguredNodeList wraps nodes with their formatting config

Removed

  • internal/twigparser/ package entirely (twig.go, nodes.go, node_list.go, parser_types.go, twig_test.go)
  • Old parser helpers and attribute encoding logic (now in format.go)

Updated

  • internal/verifier/admin_twig.go and storefront_twig.go: Now use NewAdminParser() and NewStorefrontParser() from the new html package

https://claude.ai/code/session_019e6qYC1a5j1TStxyedAaEy

parseIfBranch did not recursively recognize nested {% if %} blocks. When
an inner if used its own {% else %}/{% endif %}, the outer parser
consumed the inner {% endif %} as its own and then failed to find the
real outer {% endif %}, surfacing as "missing endif at pos N".

Refs: #1002
This package has no Go callers anywhere in the repository. The real
parser used by the verifier and twig-linter is internal/html. Removing
the dead code clears the way for the planned parser rewrite.
Lexer handles HTML elements, attributes, Twig statements/expressions/
comments, and is string-literal + bracket-depth aware on the close
delimiter scan. Supports whitespace-trim modifiers ({%- / -%} etc.).
Word-boundary identifier scan fixes the silent '{% iff %}' bug.

Pos tracks byte offset + 1-based line + 1-based byte column, advanced
once per byte (no more O(n) strings.Count per AST node).

ParseError formats as path:line:col: msg with optional source snippet.

Parser does not yet consume these tokens; this is a foundational
checkpoint before wiring up tag dispatch and AST construction.
tags.go defines TagSpec, TagParseFunc, registerTag, lookupTag — the
dispatch surface that future tag_*.go files plug into.

parser_new.go defines the parser struct (source, filename, tokens, idx)
and helpers (peek, advance, errf). Not yet wired into the public entry
points — NewParser still calls the legacy byte-indexed parser, so all
45 fixtures continue to format byte-identically.

This is checkpoint scaffolding for the planned rewrite. The next steps
are:
1. tag_block.go, tag_if.go, tag_parent.go — handlers using parser+lexer
2. Swap NewParser/NewAdminParser/NewStorefrontParser to drive the new
   parser, verifying byte-identical fixture output as each handler lands
3. Delete the legacy parser code from parser.go
4. Add structured handlers for for/set/include/extends/macro/embed/etc.
5. Extract Dump methods into format.go (visitor pattern)
6. Smoke-test against shopware/storefront *.twig
The legacy parser in internal/html/parser.go did three things in one
1.6k-line file: tokenizing, parsing, and formatting. It advanced
through the source one byte at a time, used HasPrefix() on raw bytes
to recognize tags (breaking on '{% iff %}' and '{%else%}'), and called
strings.Count(input[:pos], "\\n") on every node to compute Line —
making line-number resolution O(n) per node.

This swaps in a clean three-layer pipeline:

  lexer (lexer.go, ~520 lines)
    Walks the source once, emits structured tokens with byte+line+col
    positions tracked incrementally. Recognizes HTML elements/attrs,
    Twig {%...%} / {{...}} / {#...#} delimiters, whitespace-trim
    modifiers ({%- -%} etc.), and is string-literal + bracket-depth
    aware on the close delimiter scan so '{% set x = "a%}b" %}' no
    longer terminates early.

  parser (parser_new.go, ~380 lines)
    Consumes the token stream. parseNodesUntil collects nodes until a
    parent tag's EndTag or Followers fire, or until EOF / element
    close tag. Two chunking contexts (top-level vs. element-children)
    preserve the legacy RawNode boundaries byte-for-byte.

  tag handlers (tags.go + tag_*.go)
    TagSpec registry replaces the if/else chain in the legacy
    parseNodes. block, if, parent are registered in this commit;
    for/set/include/macro/embed/etc. plug in via the same interface.

All 45 testdata fixtures continue to produce byte-identical formatter
output (TestFormatting passes). The new TestNewParserMatchesLegacy was
the parity gate during development; TestLexerNoErrorOnFixtures keeps
the lexer-level guarantee.

Public API unchanged: NewParser, NewAdminParser, NewStorefrontParser
return the same AST types. All 27 caller files in internal/verifier/
continue to compile and pass tests without modification.

Refs: #1002
Smoke-testing against shopware/storefront's 326 .twig files surfaced a
case where the lexer fused tokens like 'A{% else %}B{% endif %}' into a
single tokHTMLAttrName because lexHTMLAttr only broke on whitespace, =,
>, and /. Inside element-attribute context (e.g. <input {% if x %}A{%
else %}B{% endif %}>), this kept the inner {% else %} / {% endif %}
out of reach of the parser and surfaced as 'missing {% endif %}'.

Fix: in lexHTMLAttr, also break on '{%', '{{', '{#' so the Twig
delimiters are tokenized separately.

Adds an opt-in smoke test (HTML_SMOKE_CORPUS=/path/to/storefront) that
walks every .twig file and checks lexer, parser, and formatter
idempotency. Against shopware/storefront@main (326 files):

  before: 24 parse failures, 257 fmt failures
  after:  14 parse failures, 267 fmt failures

Comparing against the pre-rewrite parser on the same corpus:
  parse:  legacy 17, new 14 (3 better, 0 regressions)
  format: legacy 264, new 267 (pre-existing formatter quirk)

The remaining parse failures are pre-existing limitations (blocks that
don't enclose their HTML span, e.g. {% block %}<html><body>{%
endblock %}...</body></html>) — they fail in legacy too.
When a Twig file fails to parse, the verifier now extracts the line
number from the ParseError and attaches it to the CheckResult, instead
of hardcoding Line: 0. Editors and CI summaries that group results by
line will land on the actual problem location.

The error message itself also no longer reads 'missing endif at pos
370' but 'path:12:5: missing endif' (the change is in ParseError.Error,
shipped earlier in the rewrite). Together this addresses the secondary
ask in #1002 to report line/column instead of byte position.
doc.go documents the four-layer architecture (pos / lexer / parser /
ast+formatter) and shows how to add new Twig tags via the registry.

fuzz_test.go adds FuzzLexer and FuzzParser, seeded from every testdata
fixture plus hand-picked edge cases. The lexer fuzzer immediately
surfaced a crash on '<!-->': lexHTMLComment indexed for the closer in
the whole remainder, including the '<!--' prefix, so a degenerate input
like '<!-->' computed end=2 and then sliced rem[4:2]. Fix: search for
the closer starting at offset 4 (past '<!--') and adjust.

After the fix, both fuzzers ran 10s × ~40k execs/sec with zero panics.
parser.go shrinks from 778 to 167 lines by moving all formatter code
(every node's Dump method, NodeList.Dump, ConfiguredNodeList.Dump,
IndentConfig, the indentConfig global, isTemplateElement helper) into
a dedicated format.go.

The AST type definitions stay in parser.go alongside the public entry
points (NewParser, NewParserWithConfig, NewAdminParser, NewStore-
frontParser), isVoidElement, and TraverseNode. Methods remain attached
to their types — Go lets us split type and method across files in the
same package.

No behavior change: all 45 testdata fixtures continue to format byte-
identically and the shopware/storefront smoke test shows the same
0-regression result as before the move.

Dead-code cleanup: fromEntitiesToText (used only by the deleted legacy
parseAttrValue) is removed. htmlCommentStart was already gone.

SetIndentConfig is marked Deprecated in its doc comment, pointing
callers toward future explicit-Formatter use.
Twig tags beyond block/if/parent now parse into dedicated AST nodes
instead of being folded into RawNode. Each handler plugs into the
tag registry from a single ~5 line init() block.

New AST node types in tag_generic.go:
  TwigGenericBlockNode    {% name args %}body{% endname %}
  TwigStandaloneTagNode   {% name args %} (no body)

Plus a dedicated TwigVerbatimNode for {% verbatim %}...{% endverbatim %}
whose body must NOT be recursively parsed — the parser sweeps tokens
verbatim until {% endverbatim %} and stores the body as a string.

Registered tags:
  Block form (body + EndTag):
    for / endfor (with else follower), embed, macro, apply, with, set
    (block form, when args don't contain '='), verbatim

  Standalone form (no body):
    include, sw_include, extends, sw_extends, import, from, use, do,
    set (inline form, when args contain '='), deprecated, flush

Set has dual form: tag_set.go branches on whether args contain '=' so
both '{% set x = 1 %}' and '{% set x %}body{% endset %}' work.

All 45 fixtures stay byte-identical (none use these tags). The
shopware/storefront smoke test idempotency failures drop from 267 to
105 — 162 real-world files now round-trip cleanly through parse →
format → parse → format. Parse failures stay at 14 (pre-existing
block-doesn't-enclose-html-span pattern that also fails in the legacy
parser).
Previously {# ... #} comments were folded into RawNode via the
unrecognized-tag fallback. Because raw bytes preserved a trailing
newline, the formatter's between-nodes "\\n" insertion would
double up on the second format pass and shift everything down a line.

Now {# ... #} is a structured TwigCommentNode that round-trips
through Dump as '{#body#}' without trailing whitespace, so
parse → format → parse → format is stable.

Net smoke-test improvement on shopware/storefront:
  before this commit: 105 format-idempotency failures
  after:              33 format-idempotency failures

Together with the structured tag handlers in the previous commit, the
parser+formatter now round-trips 293/326 real-world templates cleanly
(up from 62/326 before this PR's work began).

The remaining 33 idempotency failures are a pre-existing formatter
limitation: when a RawNode containing leading whitespace precedes a
structured node like TwigBlockNode inside an HTML element's children
(notably <p>), the RawNode's whitespace and the TwigBlockNode's own
indent compound on each format pass. Fixing that requires coordinating
adjacent-node whitespace in the formatter, which is out of scope for
this rewrite.
…n inside <p>

When <p>'s direct children include a Twig structural tag like
{% block %} or {% if %} preceded by a RawNode whose text ends with
line-leading whitespace, the format pass would emit
"<rawnode whitespace> + <child's own indent>", compounding the indent.
A round-trip through parse → format → parse → format then added one
indent level per pass to the structured child.

The fix trims trailing whitespace from the already-emitted buffer
just before rendering a Twig structural child. Inline children
(<span>, {{ x }}, raw text) are unaffected because their Dump output
doesn't start with its own indent at <p>'s call-indent (fixture 41
still passes byte-identical).

Smoke test on shopware/storefront:
  before: 33 idempotency failures
  after:  29 idempotency failures
When a {% block %} body wraps inline content like JS or CSS with
embedded {{ x }} interpolations, the default 'block content' formatting
that inserts \"\\n\\n\" between every child produced blank lines
between adjacent code segments. Each parse → format round absorbed those
blank lines into the surrounding RawNode and the gap grew on the next
pass.

The fix detects 'inline-mixed' bodies — every child is RawNode /
TemplateExpressionNode / CommentNode AND at least one RawNode carries
non-whitespace text — and concatenates children as-is so the embedded
whitespace from the RawNodes drives layout.

Fixtures with single-TemplateExpressionNode block bodies (43-template-
expression-in-nested-block) still flow through the existing branch
because they don't carry a meaningful-text RawNode.

Smoke test on shopware/storefront:
  before: 29 idempotency failures
  after:   9 idempotency failures (323/326 files round-trip cleanly)
Apply the same blockHasInlineMixedContent path to TwigGenericBlockNode
that was added to TwigBlockNode in the previous commit. {% for %},
{% embed %}, {% apply %}, etc. with bodies that mix RawNode text and
{{ x }} interpolations now flow as-is rather than per-child re-indent
+ TrimSpace, which was stripping trailing spaces from RawNodes like
'User-agent: ' before an embedded expression.

No fixture change (existing 45 don't use for-with-inline-body), no
smoke-test regression. Improvement is for templates that mix structured
tag bodies with inline content not covered by TwigBlockNode.
Two formatter fixes that together drive smoke-test idempotency failures
to zero on shopware/storefront's 326 templates.

1. TwigBlockNode non-inline body now trims RawNode children and adds
   its own indent prefix, rather than echoing whatever incidental
   whitespace the source had. Files like sitemap.xml.twig (which mix a
   raw '<?xml ?>' header with nested blocks at the same level) no
   longer compound newlines on each format pass.

2. The inline-mixed body path (both TwigBlockNode and TwigGenericBlock-
   Node) now calls Dump(0) for TwigCommentNode children. The preceding
   RawNode supplies the visible indent; without this the TwigCommentNode
   would add its own indent on top, growing the comment line by one
   indent level per format pass. Affected files: analytics.html.twig
   where {# @deprecated #} sits between JS statements.

Smoke test on shopware/storefront:
  before: 4 idempotency failures
  after:  0 idempotency failures

Final smoke totals: 326 files total, 0 lex failures, 14 parse failures
(pre-existing block-doesn't-enclose-html-span pattern shared with the
legacy parser), 0 format-idempotency failures.

Net round-trip-clean count: 62 → 312 (out of 312 parseable).
…failures)

Real-world Twig templates routinely wrap just the opening (or just the
closing) HTML tag in {% if %} or {% block %}:

    {% block base_html %}
    <html lang="...">
    {% endblock %}
    ...
    </html>

or

    {% if config.verticalAlign.value %}
        <div class="...">
    {% endif %}
        content
    {% if config.verticalAlign.value %}
        </div>
    {% endif %}

Previously the parser swallowed {% endif %} / {% endblock %} as raw
text while searching for </tag>, and the outer Twig tag then failed to
find its terminator. All 14 parse failures on shopware/storefront were
this pattern (the same pattern also fails in the pre-rewrite legacy
parser).

Fix has three parts:

1. parseElement now receives the enclosing TagSpec from its caller and
   passes it through to parseNodesUntil. The element-children parser
   yields on the outer Twig terminator without consuming it.

2. When children parsing yields on something other than </tag>, the
   element is marked Unclosed; the formatter skips emitting </tag>
   because the real </tag> appears later as raw text. Trailing
   whitespace-only RawNodes are dropped from the element's children so
   incidental indentation doesn't round-trip as a stray indented blank
   line.

3. NodeList.Dump no longer inserts \n between siblings when the next
   node's output already starts with \n. This avoids compounding
   newlines on each format pass at top-level RawNode boundaries that
   carry their own leading whitespace.

Public surface: new ElementNode.Unclosed field (defaults to false;
existing callers and fixtures unaffected).

Smoke test on shopware/storefront's 326 templates:
  before: 14 parse failures, 0 idempotency failures
  after:   0 parse failures, 0 idempotency failures

All 326 real-world templates now lex, parse, and round-trip cleanly.
… parser

Several leftovers from the parser-swap transition were still in the
tree:

- parser_new_test.go held a 'new vs legacy' parity test. Both code
  paths run through the same parser now, so the test was a tautology.
  Removed.

- parser_new.go was a misleading filename — it's the *only* parser.
  Merged its contents into parser.go.

- parser.go was a mix of AST type definitions and parser entry points.
  Split: AST types move to ast.go (84 lines), parser orchestrator stays
  in parser.go (488 lines).

- Comments throughout the package referenced 'the legacy parser' or
  'legacy parity'. Updated to describe current behavior on its own
  terms.

- A redundant parseNodes wrapper around parseNodesUntil was dropped
  (only one call site, inlined).

- Restored missing doc comment on IndentConfig, updated doc.go's
  architecture description (no more 'parser_new.go', no more 'legacy
  combined file').

No behavior change: all 45 testdata fixtures still byte-identical, the
shopware/storefront smoke test still 0/0/0 on 326 files, all 27 caller
files in internal/verifier/ unchanged.
… slices

Replace TwigIfNode's split between top-level Condition/Children and
parallel ElseIfConditions/ElseIfChildren with a single Branches
slice. Branches[0] is always the "if"; Branches[1..] are the
"elseif"s. The else clause (no condition) stays as ElseChildren.

  type TwigIfBranch struct {
      Condition string
      Body      NodeList
  }

  type TwigIfNode struct {
      Branches     []TwigIfBranch
      ElseChildren NodeList
      Line         int
  }

The old shape was easy to desync — appending to ElseIfConditions
without also appending to ElseIfChildren would silently wedge the
formatter. A single slice of {condition, body} pairs makes the
invariant local and obvious.

The formatter consolidates three nearly-identical branches into one
loop over Branches plus a shared writeIfBranchBody helper, dropping
~80 lines of duplication.

TwigIfNode is not part of the public surface used by callers in
internal/verifier/ (they only read ElementNode.Line / Tag / Attributes
/ Children), so the field rename is safe. All 45 testdata fixtures
remain byte-identical, the shopware/storefront smoke test stays at
0 lex / 0 parse / 0 format failures, and the lexer + parser fuzzers
run clean.
Until now the registerTag function was unexported, and adding a custom
Twig tag required code inside the internal/html package. Plugins and
other callers wanting to teach the parser about project-specific tags
had to either patch the package or fall back to raw-text round-trip
(which works for standalone tags but breaks structurally for block
tags whose body is supposed to be opaque).

Expose two friendly wrappers over registerTag:

  func RegisterStandaloneTag(name string)
  func RegisterBlockTag(name, endTag string, followers ...string)

Both delegate to the existing makeStandaloneTagParser /
makeBlockTagParser helpers, so callers don't need access to the
unexported parser type. Call from your own init() — duplicate
registration panics so name collisions surface at startup.

Also register four tags that show up in real Shopware storefront
templates but had been falling through to the raw-text fallback:

  break, return         — Twig core (flow control inside for/macro)
  sw_icon, sw_thumbnails — Shopware Storefront helpers

They're standalone tags so the parser already round-tripped them
correctly; registration just gives downstream AST consumers a
TwigStandaloneTagNode to reason about.

tags_custom_test.go adds three regression tests:
- TestCustomTagRegistration covers a registered block tag
- TestCustomStandaloneTagRegistration covers a standalone tag
- TestUnregisteredTagFallback documents that unknown standalone tags
  still round-trip via raw text

doc.go now shows the RegisterStandaloneTag / RegisterBlockTag pattern
as the recommended way to teach the parser about new tags; the
internal registerTag stays for tags with bespoke Parse functions.

All 45 testdata fixtures byte-identical, 326/326 storefront smoke
files pass, FuzzParser clean.
Opt-in (HTML_SMOKE_CORPUS=/path/to/storefront) benchmark that parses
every .twig file in a checked-out shopware/storefront tree and reports
bytes/sec throughput. Useful as a quick regression gate during parser
changes.

Current numbers on a single Xeon @ 2.8GHz core:
  326 files, 1.17 MB total
  23.8 ms per pass, ~47 MB/s
PR #1013 surfaced 17 lint issues across the new files. All fixed:

  exhaustive (3): tokenType switches with default arms get
    //exhaustive:ignore comments since the default branch intentionally
    handles uncovered cases by folding into RawNode / continuing.

  gocyclo (3): ElementNode.Dump (77), parseNodesUntil (37), lexContent
    (31) are all natural top-level dispatchers — one arm per token kind
    or one special case per HTML quirk. Tagged //nolint:gocyclo with a
    one-line rationale.

  gocritic ifElseChain (1): the if/else-if cascade matching '<!--',
    '<!DOCTYPE', '</name', '<name' inside lexContent is now a switch{}.

  nilerr (6): smoke_test.go and bench_test.go intentionally swallow
    per-file errors after appending them to the failures slice — that's
    the contract of a smoke test. Tagged //nolint:nilerr on each branch
    plus a header comment explaining the policy.

  predeclared (2): renamed 'close' → 'closer' in scanToTwigClose,
    'max' → 'limit' in smoke_test.go.

  staticcheck SA9003 (1): empty 'if trimRight {}' branch in
    lexTwigComment was leftover scaffolding from a refactor; the close
    emission below already handles the case correctly. Comment kept,
    branch removed.

  unparam (1): scanToTwigClose's error return was always nil — dropped
    the error return type and updated both callers.

  unused (1): removed dead peekStr helper.

No behavior change. All 45 testdata fixtures byte-identical, 326/326
storefront smoke files still 0/0/0, go test ./... green,
golangci-lint run ./... reports 0 issues.
CI's golangci-lint (v2.12.x) runs a newer staticcheck than my local
2.5.0 and flagged PrettyError's loop:

    internal/html/errors.go:39: QF1012
    Use fmt.Fprintf(...) instead of WriteString(fmt.Sprintf(...))

Replace the offending b.WriteString(fmt.Sprintf(...)) with
fmt.Fprintf(&b, ...). No behavior change.

Verified locally with golangci-lint v2.12.2 (latest, matches CI):
  golangci-lint run ./...  →  0 issues

@shyim Soner (shyim) left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comments from local branch review.

Comment thread internal/html/parser.go
Comment thread internal/html/tag_more.go
Comment thread internal/html/parser.go
Comment thread internal/html/format.go
1. TraverseNode now recurses into TwigIfNode, TwigGenericBlockNode, and
   ElementNode.Attributes. Without this, downstream linters/fixers
   couldn't see HTML elements nested inside the new structural nodes.

2. for/else is now supported. makeBlockTagParser previously only
   accepted stopGenericEndTag so for/else reported missing endfor.
   Generic block tags can now consume an "else" follower; for is the
   built-in user. TwigGenericBlockNode gains an Else NodeList field.

3. Twig expressions and comments embedded in an HTML opening tag are no
   longer dropped. Patterns like <div {{ attributes }}> and
   <div {# note #}> round-trip through Dump. The attribute loop has
   explicit cases for tokTwigExprOpen and tokTwigCommentOpen, plus a
   raw fallback for unknown statements. The lexer also gained a "{#"
   recognizer inside HTML opening tags.

4. Whitespace-control delimiters now round-trip:
   - New TwigTrim{Left, Right bool} struct.
   - Open/Close trim fields on every Twig structural node.
   - consumeStmtHeader / consumeEndTag return TwigTrim alongside body.
   - delimiters.go holds openStmt/closeStmt/openExpr/closeExpr/
     openComment/closeComment helpers; all formatters use these.

   Without this fix, "{%- if x -%}" formatted back as "{% if x %}",
   silently changing Twig output (Twig strips surrounding whitespace
   at marked sides).

New regression tests:
  TestForLoopWithElse
  TestTwigExprInAttributePreserved
  TestTrimModifiersRoundTrip

Validation:
  go test ./...                          all green
  golangci-lint run ./...                0 issues
  shopware/storefront smoke (326 files)  0 lex / 0 parse / 0 format
  all 45 testdata fixtures               byte-identical
@shyim Soner (shyim) marked this pull request as ready for review May 18, 2026 04:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0aea6fbc72

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread internal/html/tag_parent.go
PR #1013 review (chatgpt-codex-connector):

> The parent-tag parser currently accepts any raw body after `parent`
> and silently discards it, so malformed tags like `{% parent foo %}`
> parse successfully and are later rewritten as `parent()` instead of
> surfacing a parse error.

parseParentTag now allows only an empty body or exactly "()" between
the parent identifier and the closing delimiter. Anything else returns
a parse error with the offending text, so authoring mistakes surface
through the verifier instead of being silently rewritten.

New regression test TestParentRejectsArguments covers:
  rejected: {% parent foo %}, {% parent 1 %}, {% parent(x) %}
  accepted: {% parent %}, {% parent() %}, {%- parent -%}

go test ./...                          all green
golangci-lint run ./...                0 issues
shopware/storefront smoke (326 files)  0 lex / 0 parse / 0 format
User reported that the formatter "kept adding new lines" on every
format pass. Confirmed against the legacy byte-indexed parser:

  legacy: pass1=381, pass2=395, pass3=409 bytes

(Each pass added 14 bytes — the inner {# Twig comment #}'s leading
whitespace was being absorbed into the next RawNode chunk and then
re-indented on the next pass.)

The rewrite produces byte-identical output across passes — the smoke
test on 326 storefront templates already shows fmt_fail=0. This commit
just adds an explicit regression test using the exact admin component
the user posted (nested {% block %}s with {# #} placeholders for child
themes to override).

TestNestedBlocksWithCommentsIdempotent verifies:
  - parse(src).Dump() ==  parse(pass1).Dump()
  - parse(pass1).Dump() == parse(pass2).Dump()
  - output stays ~38 lines (hard upper bound 40)

The bug is fixed by the rewrite; this test pins it so a future format
change can't silently reintroduce blank-line growth on this shape.
Replace the inline Go test TestNestedBlocksWithCommentsIdempotent with
a proper testdata fixture (46-nested-blocks-with-twig-comments.txt),
which is how the rest of the test corpus expresses formatting
expectations.

The fixture is exercised by the existing TestFormatting harness, which:
  - parses the input
  - asserts admin output matches expected admin section
  - re-parses the admin output and asserts idempotency
  - asserts storefront output matches expected storefront section
  - re-parses storefront output and asserts idempotency

So one fixture covers both the format-output and the
parse-format-parse-format byte-identity pinning that the inline Go
test was checking by hand.

The fixture uses the verbatim template from the user report (a Shopware
admin component with nested {% block %}s and {# #} placeholders for
child themes). The legacy parser grew the output by 14 bytes per
format pass on this shape; the rewrite is byte-identical.
The testdata harness in parser_test.go covers exactly what these tests
were checking by hand: parses input, asserts admin output matches, re-
parses the output and asserts idempotency, then the same for storefront.
Converting them removes duplication and matches how every other format
regression in the project is expressed.

New fixtures:
  47-trim-expression.txt        {{- x -}}
  48-trim-if-else.txt           {%- if -%}...{%- else -%}...{%- endif -%}
  49-trim-block.txt             {%- block -%}...{%- endblock -%}
  50-trim-for-loop.txt          {%- for -%}...{%- endfor -%}
  51-trim-set.txt               {%- set x = 1 -%}
  52-trim-comment.txt           {#- comment -#}
  53-trim-parent.txt            {%- parent -%}
  54-twig-expr-in-attribute.txt   <div {{ attributes }}></div>
  55-twig-expr-mixed-with-attrs.txt <div data-x {{ attributes }} class></div>
  56-twig-comment-in-attribute.txt <div {# note #}></div>
  57-for-else-loop.txt          {% for %}...{% else %}...{% endfor %}
  58-unknown-twig-tag-fallback.txt unknown {% %} round-trips as raw

Trimmed from tags_custom_test.go:
  - TestTrimModifiersRoundTrip   -> 47..53
  - TestTwigExprInAttributePreserved -> 54..56
  - TestForLoopWithElse          -> 57 (round-trip) + a slim
                                   TestForLoopAstShape kept for the AST-
                                   inspection bits
  - TestUnregisteredTagFallback  -> 58

Kept as Go tests (testdata format can't express these):
  - TestCustomTagRegistration / TestCustomStandaloneTagRegistration
    (require runtime RegisterBlockTag / RegisterStandaloneTag calls plus
    AST type assertions)
  - TestParentRejectsArguments (testdata can only assert successful
    formatting; error cases stay in Go)
  - TestForLoopAstShape (AST shape inspection, complements fixture 57)

Net change: -127 lines of Go, +101 lines of fixtures. Every fixture goes
through the same parse-format-parse-format idempotency loop as the
existing 46.
Running the parser over shopware/administration (993 .twig files)
surfaced 7 idempotency failures across two patterns. Both fixed:

1. CSV/XML export templates use {#- -#} between text segments to strip
   the surrounding newline at render time:

       aid,{#- -#}
       brand,{#- -#}

   Each parse → format pass added a blank line before each {#- -#}
   because NodeList.Dump inserted a "\n" separator between a RawNode
   ending in "\n" and the following structured TwigCommentNode. The
   earlier compound-newline fix only checked the NEXT node's leading
   "\n"; now it also skips when the previous output already ended in
   "\n".

   Affected admin templates:
     - billiger-de/{header,body}.csv.twig
     - google-product-search-de/header.xml.twig
     - idealo/{header,body}.csv.twig

2. HTML <!-- comment --> inside <p> grew one indent level per format
   pass. The <p>-children formatter already strips trailing whitespace
   from the buffer before Twig structural children (so they don't
   compound with the preceding RawNode's whitespace). HTML CommentNode
   has the same shape — its Dump emits its own indent prefix — so it
   joins the list. Renamed isTwigStructured → isStructuredChild and
   added *CommentNode to the type switch.

   Affected admin template:
     sw-sales-channel-detail-product-comparison-preview.html.twig
     (and the sw-user-sso-status-label template)

Smoke totals:
                       before this commit   after
   administration       0 lex, 0 parse,     0 lex, 0 parse,
                        7 idempotency       0 idempotency
   storefront           0 / 0 / 0           0 / 0 / 0

Two new regression fixtures:
   59-csv-template-with-trim-comments.txt
   60-html-comment-inside-p.txt

1,319 real-world Twig files across both corpora now round-trip cleanly
through parse → format → parse → format.
@shyim Soner (shyim) changed the base branch from main to next May 19, 2026 11:53
@shyim Soner (shyim) merged commit 6d35c9b into next May 19, 2026
3 checks passed
@shyim Soner (shyim) deleted the claude/fix-shopware-cli-1002-BAqh1 branch May 19, 2026 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants