Refactor HTML parser: lexer-based architecture with improved Twig support#1013
Merged
Conversation
parseIfBranch did not recursively recognize nested {% if %} blocks. When
an inner if used its own {% else %}/{% endif %}, the outer parser
consumed the inner {% endif %} as its own and then failed to find the
real outer {% endif %}, surfacing as "missing endif at pos N".
Refs: #1002
This package has no Go callers anywhere in the repository. The real parser used by the verifier and twig-linter is internal/html. Removing the dead code clears the way for the planned parser rewrite.
Lexer handles HTML elements, attributes, Twig statements/expressions/
comments, and is string-literal + bracket-depth aware on the close
delimiter scan. Supports whitespace-trim modifiers ({%- / -%} etc.).
Word-boundary identifier scan fixes the silent '{% iff %}' bug.
Pos tracks byte offset + 1-based line + 1-based byte column, advanced
once per byte (no more O(n) strings.Count per AST node).
ParseError formats as path:line:col: msg with optional source snippet.
Parser does not yet consume these tokens; this is a foundational
checkpoint before wiring up tag dispatch and AST construction.
tags.go defines TagSpec, TagParseFunc, registerTag, lookupTag — the dispatch surface that future tag_*.go files plug into. parser_new.go defines the parser struct (source, filename, tokens, idx) and helpers (peek, advance, errf). Not yet wired into the public entry points — NewParser still calls the legacy byte-indexed parser, so all 45 fixtures continue to format byte-identically. This is checkpoint scaffolding for the planned rewrite. The next steps are: 1. tag_block.go, tag_if.go, tag_parent.go — handlers using parser+lexer 2. Swap NewParser/NewAdminParser/NewStorefrontParser to drive the new parser, verifying byte-identical fixture output as each handler lands 3. Delete the legacy parser code from parser.go 4. Add structured handlers for for/set/include/extends/macro/embed/etc. 5. Extract Dump methods into format.go (visitor pattern) 6. Smoke-test against shopware/storefront *.twig
The legacy parser in internal/html/parser.go did three things in one
1.6k-line file: tokenizing, parsing, and formatting. It advanced
through the source one byte at a time, used HasPrefix() on raw bytes
to recognize tags (breaking on '{% iff %}' and '{%else%}'), and called
strings.Count(input[:pos], "\\n") on every node to compute Line —
making line-number resolution O(n) per node.
This swaps in a clean three-layer pipeline:
lexer (lexer.go, ~520 lines)
Walks the source once, emits structured tokens with byte+line+col
positions tracked incrementally. Recognizes HTML elements/attrs,
Twig {%...%} / {{...}} / {#...#} delimiters, whitespace-trim
modifiers ({%- -%} etc.), and is string-literal + bracket-depth
aware on the close delimiter scan so '{% set x = "a%}b" %}' no
longer terminates early.
parser (parser_new.go, ~380 lines)
Consumes the token stream. parseNodesUntil collects nodes until a
parent tag's EndTag or Followers fire, or until EOF / element
close tag. Two chunking contexts (top-level vs. element-children)
preserve the legacy RawNode boundaries byte-for-byte.
tag handlers (tags.go + tag_*.go)
TagSpec registry replaces the if/else chain in the legacy
parseNodes. block, if, parent are registered in this commit;
for/set/include/macro/embed/etc. plug in via the same interface.
All 45 testdata fixtures continue to produce byte-identical formatter
output (TestFormatting passes). The new TestNewParserMatchesLegacy was
the parity gate during development; TestLexerNoErrorOnFixtures keeps
the lexer-level guarantee.
Public API unchanged: NewParser, NewAdminParser, NewStorefrontParser
return the same AST types. All 27 caller files in internal/verifier/
continue to compile and pass tests without modification.
Refs: #1002
Smoke-testing against shopware/storefront's 326 .twig files surfaced a
case where the lexer fused tokens like 'A{% else %}B{% endif %}' into a
single tokHTMLAttrName because lexHTMLAttr only broke on whitespace, =,
>, and /. Inside element-attribute context (e.g. <input {% if x %}A{%
else %}B{% endif %}>), this kept the inner {% else %} / {% endif %}
out of reach of the parser and surfaced as 'missing {% endif %}'.
Fix: in lexHTMLAttr, also break on '{%', '{{', '{#' so the Twig
delimiters are tokenized separately.
Adds an opt-in smoke test (HTML_SMOKE_CORPUS=/path/to/storefront) that
walks every .twig file and checks lexer, parser, and formatter
idempotency. Against shopware/storefront@main (326 files):
before: 24 parse failures, 257 fmt failures
after: 14 parse failures, 267 fmt failures
Comparing against the pre-rewrite parser on the same corpus:
parse: legacy 17, new 14 (3 better, 0 regressions)
format: legacy 264, new 267 (pre-existing formatter quirk)
The remaining parse failures are pre-existing limitations (blocks that
don't enclose their HTML span, e.g. {% block %}<html><body>{%
endblock %}...</body></html>) — they fail in legacy too.
When a Twig file fails to parse, the verifier now extracts the line number from the ParseError and attaches it to the CheckResult, instead of hardcoding Line: 0. Editors and CI summaries that group results by line will land on the actual problem location. The error message itself also no longer reads 'missing endif at pos 370' but 'path:12:5: missing endif' (the change is in ParseError.Error, shipped earlier in the rewrite). Together this addresses the secondary ask in #1002 to report line/column instead of byte position.
doc.go documents the four-layer architecture (pos / lexer / parser / ast+formatter) and shows how to add new Twig tags via the registry. fuzz_test.go adds FuzzLexer and FuzzParser, seeded from every testdata fixture plus hand-picked edge cases. The lexer fuzzer immediately surfaced a crash on '<!-->': lexHTMLComment indexed for the closer in the whole remainder, including the '<!--' prefix, so a degenerate input like '<!-->' computed end=2 and then sliced rem[4:2]. Fix: search for the closer starting at offset 4 (past '<!--') and adjust. After the fix, both fuzzers ran 10s × ~40k execs/sec with zero panics.
parser.go shrinks from 778 to 167 lines by moving all formatter code (every node's Dump method, NodeList.Dump, ConfiguredNodeList.Dump, IndentConfig, the indentConfig global, isTemplateElement helper) into a dedicated format.go. The AST type definitions stay in parser.go alongside the public entry points (NewParser, NewParserWithConfig, NewAdminParser, NewStore- frontParser), isVoidElement, and TraverseNode. Methods remain attached to their types — Go lets us split type and method across files in the same package. No behavior change: all 45 testdata fixtures continue to format byte- identically and the shopware/storefront smoke test shows the same 0-regression result as before the move. Dead-code cleanup: fromEntitiesToText (used only by the deleted legacy parseAttrValue) is removed. htmlCommentStart was already gone. SetIndentConfig is marked Deprecated in its doc comment, pointing callers toward future explicit-Formatter use.
Twig tags beyond block/if/parent now parse into dedicated AST nodes
instead of being folded into RawNode. Each handler plugs into the
tag registry from a single ~5 line init() block.
New AST node types in tag_generic.go:
TwigGenericBlockNode {% name args %}body{% endname %}
TwigStandaloneTagNode {% name args %} (no body)
Plus a dedicated TwigVerbatimNode for {% verbatim %}...{% endverbatim %}
whose body must NOT be recursively parsed — the parser sweeps tokens
verbatim until {% endverbatim %} and stores the body as a string.
Registered tags:
Block form (body + EndTag):
for / endfor (with else follower), embed, macro, apply, with, set
(block form, when args don't contain '='), verbatim
Standalone form (no body):
include, sw_include, extends, sw_extends, import, from, use, do,
set (inline form, when args contain '='), deprecated, flush
Set has dual form: tag_set.go branches on whether args contain '=' so
both '{% set x = 1 %}' and '{% set x %}body{% endset %}' work.
All 45 fixtures stay byte-identical (none use these tags). The
shopware/storefront smoke test idempotency failures drop from 267 to
105 — 162 real-world files now round-trip cleanly through parse →
format → parse → format. Parse failures stay at 14 (pre-existing
block-doesn't-enclose-html-span pattern that also fails in the legacy
parser).
Previously {# ... #} comments were folded into RawNode via the
unrecognized-tag fallback. Because raw bytes preserved a trailing
newline, the formatter's between-nodes "\\n" insertion would
double up on the second format pass and shift everything down a line.
Now {# ... #} is a structured TwigCommentNode that round-trips
through Dump as '{#body#}' without trailing whitespace, so
parse → format → parse → format is stable.
Net smoke-test improvement on shopware/storefront:
before this commit: 105 format-idempotency failures
after: 33 format-idempotency failures
Together with the structured tag handlers in the previous commit, the
parser+formatter now round-trips 293/326 real-world templates cleanly
(up from 62/326 before this PR's work began).
The remaining 33 idempotency failures are a pre-existing formatter
limitation: when a RawNode containing leading whitespace precedes a
structured node like TwigBlockNode inside an HTML element's children
(notably <p>), the RawNode's whitespace and the TwigBlockNode's own
indent compound on each format pass. Fixing that requires coordinating
adjacent-node whitespace in the formatter, which is out of scope for
this rewrite.
…n inside <p>
When <p>'s direct children include a Twig structural tag like
{% block %} or {% if %} preceded by a RawNode whose text ends with
line-leading whitespace, the format pass would emit
"<rawnode whitespace> + <child's own indent>", compounding the indent.
A round-trip through parse → format → parse → format then added one
indent level per pass to the structured child.
The fix trims trailing whitespace from the already-emitted buffer
just before rendering a Twig structural child. Inline children
(<span>, {{ x }}, raw text) are unaffected because their Dump output
doesn't start with its own indent at <p>'s call-indent (fixture 41
still passes byte-identical).
Smoke test on shopware/storefront:
before: 33 idempotency failures
after: 29 idempotency failures
When a {% block %} body wraps inline content like JS or CSS with
embedded {{ x }} interpolations, the default 'block content' formatting
that inserts \"\\n\\n\" between every child produced blank lines
between adjacent code segments. Each parse → format round absorbed those
blank lines into the surrounding RawNode and the gap grew on the next
pass.
The fix detects 'inline-mixed' bodies — every child is RawNode /
TemplateExpressionNode / CommentNode AND at least one RawNode carries
non-whitespace text — and concatenates children as-is so the embedded
whitespace from the RawNodes drives layout.
Fixtures with single-TemplateExpressionNode block bodies (43-template-
expression-in-nested-block) still flow through the existing branch
because they don't carry a meaningful-text RawNode.
Smoke test on shopware/storefront:
before: 29 idempotency failures
after: 9 idempotency failures (323/326 files round-trip cleanly)
Apply the same blockHasInlineMixedContent path to TwigGenericBlockNode
that was added to TwigBlockNode in the previous commit. {% for %},
{% embed %}, {% apply %}, etc. with bodies that mix RawNode text and
{{ x }} interpolations now flow as-is rather than per-child re-indent
+ TrimSpace, which was stripping trailing spaces from RawNodes like
'User-agent: ' before an embedded expression.
No fixture change (existing 45 don't use for-with-inline-body), no
smoke-test regression. Improvement is for templates that mix structured
tag bodies with inline content not covered by TwigBlockNode.
Two formatter fixes that together drive smoke-test idempotency failures
to zero on shopware/storefront's 326 templates.
1. TwigBlockNode non-inline body now trims RawNode children and adds
its own indent prefix, rather than echoing whatever incidental
whitespace the source had. Files like sitemap.xml.twig (which mix a
raw '<?xml ?>' header with nested blocks at the same level) no
longer compound newlines on each format pass.
2. The inline-mixed body path (both TwigBlockNode and TwigGenericBlock-
Node) now calls Dump(0) for TwigCommentNode children. The preceding
RawNode supplies the visible indent; without this the TwigCommentNode
would add its own indent on top, growing the comment line by one
indent level per format pass. Affected files: analytics.html.twig
where {# @deprecated #} sits between JS statements.
Smoke test on shopware/storefront:
before: 4 idempotency failures
after: 0 idempotency failures
Final smoke totals: 326 files total, 0 lex failures, 14 parse failures
(pre-existing block-doesn't-enclose-html-span pattern shared with the
legacy parser), 0 format-idempotency failures.
Net round-trip-clean count: 62 → 312 (out of 312 parseable).
…failures)
Real-world Twig templates routinely wrap just the opening (or just the
closing) HTML tag in {% if %} or {% block %}:
{% block base_html %}
<html lang="...">
{% endblock %}
...
</html>
or
{% if config.verticalAlign.value %}
<div class="...">
{% endif %}
content
{% if config.verticalAlign.value %}
</div>
{% endif %}
Previously the parser swallowed {% endif %} / {% endblock %} as raw
text while searching for </tag>, and the outer Twig tag then failed to
find its terminator. All 14 parse failures on shopware/storefront were
this pattern (the same pattern also fails in the pre-rewrite legacy
parser).
Fix has three parts:
1. parseElement now receives the enclosing TagSpec from its caller and
passes it through to parseNodesUntil. The element-children parser
yields on the outer Twig terminator without consuming it.
2. When children parsing yields on something other than </tag>, the
element is marked Unclosed; the formatter skips emitting </tag>
because the real </tag> appears later as raw text. Trailing
whitespace-only RawNodes are dropped from the element's children so
incidental indentation doesn't round-trip as a stray indented blank
line.
3. NodeList.Dump no longer inserts \n between siblings when the next
node's output already starts with \n. This avoids compounding
newlines on each format pass at top-level RawNode boundaries that
carry their own leading whitespace.
Public surface: new ElementNode.Unclosed field (defaults to false;
existing callers and fixtures unaffected).
Smoke test on shopware/storefront's 326 templates:
before: 14 parse failures, 0 idempotency failures
after: 0 parse failures, 0 idempotency failures
All 326 real-world templates now lex, parse, and round-trip cleanly.
… parser Several leftovers from the parser-swap transition were still in the tree: - parser_new_test.go held a 'new vs legacy' parity test. Both code paths run through the same parser now, so the test was a tautology. Removed. - parser_new.go was a misleading filename — it's the *only* parser. Merged its contents into parser.go. - parser.go was a mix of AST type definitions and parser entry points. Split: AST types move to ast.go (84 lines), parser orchestrator stays in parser.go (488 lines). - Comments throughout the package referenced 'the legacy parser' or 'legacy parity'. Updated to describe current behavior on its own terms. - A redundant parseNodes wrapper around parseNodesUntil was dropped (only one call site, inlined). - Restored missing doc comment on IndentConfig, updated doc.go's architecture description (no more 'parser_new.go', no more 'legacy combined file'). No behavior change: all 45 testdata fixtures still byte-identical, the shopware/storefront smoke test still 0/0/0 on 326 files, all 27 caller files in internal/verifier/ unchanged.
… slices
Replace TwigIfNode's split between top-level Condition/Children and
parallel ElseIfConditions/ElseIfChildren with a single Branches
slice. Branches[0] is always the "if"; Branches[1..] are the
"elseif"s. The else clause (no condition) stays as ElseChildren.
type TwigIfBranch struct {
Condition string
Body NodeList
}
type TwigIfNode struct {
Branches []TwigIfBranch
ElseChildren NodeList
Line int
}
The old shape was easy to desync — appending to ElseIfConditions
without also appending to ElseIfChildren would silently wedge the
formatter. A single slice of {condition, body} pairs makes the
invariant local and obvious.
The formatter consolidates three nearly-identical branches into one
loop over Branches plus a shared writeIfBranchBody helper, dropping
~80 lines of duplication.
TwigIfNode is not part of the public surface used by callers in
internal/verifier/ (they only read ElementNode.Line / Tag / Attributes
/ Children), so the field rename is safe. All 45 testdata fixtures
remain byte-identical, the shopware/storefront smoke test stays at
0 lex / 0 parse / 0 format failures, and the lexer + parser fuzzers
run clean.
Until now the registerTag function was unexported, and adding a custom Twig tag required code inside the internal/html package. Plugins and other callers wanting to teach the parser about project-specific tags had to either patch the package or fall back to raw-text round-trip (which works for standalone tags but breaks structurally for block tags whose body is supposed to be opaque). Expose two friendly wrappers over registerTag: func RegisterStandaloneTag(name string) func RegisterBlockTag(name, endTag string, followers ...string) Both delegate to the existing makeStandaloneTagParser / makeBlockTagParser helpers, so callers don't need access to the unexported parser type. Call from your own init() — duplicate registration panics so name collisions surface at startup. Also register four tags that show up in real Shopware storefront templates but had been falling through to the raw-text fallback: break, return — Twig core (flow control inside for/macro) sw_icon, sw_thumbnails — Shopware Storefront helpers They're standalone tags so the parser already round-tripped them correctly; registration just gives downstream AST consumers a TwigStandaloneTagNode to reason about. tags_custom_test.go adds three regression tests: - TestCustomTagRegistration covers a registered block tag - TestCustomStandaloneTagRegistration covers a standalone tag - TestUnregisteredTagFallback documents that unknown standalone tags still round-trip via raw text doc.go now shows the RegisterStandaloneTag / RegisterBlockTag pattern as the recommended way to teach the parser about new tags; the internal registerTag stays for tags with bespoke Parse functions. All 45 testdata fixtures byte-identical, 326/326 storefront smoke files pass, FuzzParser clean.
Opt-in (HTML_SMOKE_CORPUS=/path/to/storefront) benchmark that parses every .twig file in a checked-out shopware/storefront tree and reports bytes/sec throughput. Useful as a quick regression gate during parser changes. Current numbers on a single Xeon @ 2.8GHz core: 326 files, 1.17 MB total 23.8 ms per pass, ~47 MB/s
PR #1013 surfaced 17 lint issues across the new files. All fixed: exhaustive (3): tokenType switches with default arms get //exhaustive:ignore comments since the default branch intentionally handles uncovered cases by folding into RawNode / continuing. gocyclo (3): ElementNode.Dump (77), parseNodesUntil (37), lexContent (31) are all natural top-level dispatchers — one arm per token kind or one special case per HTML quirk. Tagged //nolint:gocyclo with a one-line rationale. gocritic ifElseChain (1): the if/else-if cascade matching '<!--', '<!DOCTYPE', '</name', '<name' inside lexContent is now a switch{}. nilerr (6): smoke_test.go and bench_test.go intentionally swallow per-file errors after appending them to the failures slice — that's the contract of a smoke test. Tagged //nolint:nilerr on each branch plus a header comment explaining the policy. predeclared (2): renamed 'close' → 'closer' in scanToTwigClose, 'max' → 'limit' in smoke_test.go. staticcheck SA9003 (1): empty 'if trimRight {}' branch in lexTwigComment was leftover scaffolding from a refactor; the close emission below already handles the case correctly. Comment kept, branch removed. unparam (1): scanToTwigClose's error return was always nil — dropped the error return type and updated both callers. unused (1): removed dead peekStr helper. No behavior change. All 45 testdata fixtures byte-identical, 326/326 storefront smoke files still 0/0/0, go test ./... green, golangci-lint run ./... reports 0 issues.
CI's golangci-lint (v2.12.x) runs a newer staticcheck than my local
2.5.0 and flagged PrettyError's loop:
internal/html/errors.go:39: QF1012
Use fmt.Fprintf(...) instead of WriteString(fmt.Sprintf(...))
Replace the offending b.WriteString(fmt.Sprintf(...)) with
fmt.Fprintf(&b, ...). No behavior change.
Verified locally with golangci-lint v2.12.2 (latest, matches CI):
golangci-lint run ./... → 0 issues
Soner (shyim)
commented
May 15, 2026
Soner (shyim)
left a comment
Member
Author
There was a problem hiding this comment.
Inline comments from local branch review.
1. TraverseNode now recurses into TwigIfNode, TwigGenericBlockNode, and
ElementNode.Attributes. Without this, downstream linters/fixers
couldn't see HTML elements nested inside the new structural nodes.
2. for/else is now supported. makeBlockTagParser previously only
accepted stopGenericEndTag so for/else reported missing endfor.
Generic block tags can now consume an "else" follower; for is the
built-in user. TwigGenericBlockNode gains an Else NodeList field.
3. Twig expressions and comments embedded in an HTML opening tag are no
longer dropped. Patterns like <div {{ attributes }}> and
<div {# note #}> round-trip through Dump. The attribute loop has
explicit cases for tokTwigExprOpen and tokTwigCommentOpen, plus a
raw fallback for unknown statements. The lexer also gained a "{#"
recognizer inside HTML opening tags.
4. Whitespace-control delimiters now round-trip:
- New TwigTrim{Left, Right bool} struct.
- Open/Close trim fields on every Twig structural node.
- consumeStmtHeader / consumeEndTag return TwigTrim alongside body.
- delimiters.go holds openStmt/closeStmt/openExpr/closeExpr/
openComment/closeComment helpers; all formatters use these.
Without this fix, "{%- if x -%}" formatted back as "{% if x %}",
silently changing Twig output (Twig strips surrounding whitespace
at marked sides).
New regression tests:
TestForLoopWithElse
TestTwigExprInAttributePreserved
TestTrimModifiersRoundTrip
Validation:
go test ./... all green
golangci-lint run ./... 0 issues
shopware/storefront smoke (326 files) 0 lex / 0 parse / 0 format
all 45 testdata fixtures byte-identical
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0aea6fbc72
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
PR #1013 review (chatgpt-codex-connector): > The parent-tag parser currently accepts any raw body after `parent` > and silently discards it, so malformed tags like `{% parent foo %}` > parse successfully and are later rewritten as `parent()` instead of > surfacing a parse error. parseParentTag now allows only an empty body or exactly "()" between the parent identifier and the closing delimiter. Anything else returns a parse error with the offending text, so authoring mistakes surface through the verifier instead of being silently rewritten. New regression test TestParentRejectsArguments covers: rejected: {% parent foo %}, {% parent 1 %}, {% parent(x) %} accepted: {% parent %}, {% parent() %}, {%- parent -%} go test ./... all green golangci-lint run ./... 0 issues shopware/storefront smoke (326 files) 0 lex / 0 parse / 0 format
User reported that the formatter "kept adding new lines" on every
format pass. Confirmed against the legacy byte-indexed parser:
legacy: pass1=381, pass2=395, pass3=409 bytes
(Each pass added 14 bytes — the inner {# Twig comment #}'s leading
whitespace was being absorbed into the next RawNode chunk and then
re-indented on the next pass.)
The rewrite produces byte-identical output across passes — the smoke
test on 326 storefront templates already shows fmt_fail=0. This commit
just adds an explicit regression test using the exact admin component
the user posted (nested {% block %}s with {# #} placeholders for child
themes to override).
TestNestedBlocksWithCommentsIdempotent verifies:
- parse(src).Dump() == parse(pass1).Dump()
- parse(pass1).Dump() == parse(pass2).Dump()
- output stays ~38 lines (hard upper bound 40)
The bug is fixed by the rewrite; this test pins it so a future format
change can't silently reintroduce blank-line growth on this shape.
Replace the inline Go test TestNestedBlocksWithCommentsIdempotent with
a proper testdata fixture (46-nested-blocks-with-twig-comments.txt),
which is how the rest of the test corpus expresses formatting
expectations.
The fixture is exercised by the existing TestFormatting harness, which:
- parses the input
- asserts admin output matches expected admin section
- re-parses the admin output and asserts idempotency
- asserts storefront output matches expected storefront section
- re-parses storefront output and asserts idempotency
So one fixture covers both the format-output and the
parse-format-parse-format byte-identity pinning that the inline Go
test was checking by hand.
The fixture uses the verbatim template from the user report (a Shopware
admin component with nested {% block %}s and {# #} placeholders for
child themes). The legacy parser grew the output by 14 bytes per
format pass on this shape; the rewrite is byte-identical.
The testdata harness in parser_test.go covers exactly what these tests
were checking by hand: parses input, asserts admin output matches, re-
parses the output and asserts idempotency, then the same for storefront.
Converting them removes duplication and matches how every other format
regression in the project is expressed.
New fixtures:
47-trim-expression.txt {{- x -}}
48-trim-if-else.txt {%- if -%}...{%- else -%}...{%- endif -%}
49-trim-block.txt {%- block -%}...{%- endblock -%}
50-trim-for-loop.txt {%- for -%}...{%- endfor -%}
51-trim-set.txt {%- set x = 1 -%}
52-trim-comment.txt {#- comment -#}
53-trim-parent.txt {%- parent -%}
54-twig-expr-in-attribute.txt <div {{ attributes }}></div>
55-twig-expr-mixed-with-attrs.txt <div data-x {{ attributes }} class></div>
56-twig-comment-in-attribute.txt <div {# note #}></div>
57-for-else-loop.txt {% for %}...{% else %}...{% endfor %}
58-unknown-twig-tag-fallback.txt unknown {% %} round-trips as raw
Trimmed from tags_custom_test.go:
- TestTrimModifiersRoundTrip -> 47..53
- TestTwigExprInAttributePreserved -> 54..56
- TestForLoopWithElse -> 57 (round-trip) + a slim
TestForLoopAstShape kept for the AST-
inspection bits
- TestUnregisteredTagFallback -> 58
Kept as Go tests (testdata format can't express these):
- TestCustomTagRegistration / TestCustomStandaloneTagRegistration
(require runtime RegisterBlockTag / RegisterStandaloneTag calls plus
AST type assertions)
- TestParentRejectsArguments (testdata can only assert successful
formatting; error cases stay in Go)
- TestForLoopAstShape (AST shape inspection, complements fixture 57)
Net change: -127 lines of Go, +101 lines of fixtures. Every fixture goes
through the same parse-format-parse-format idempotency loop as the
existing 46.
Running the parser over shopware/administration (993 .twig files)
surfaced 7 idempotency failures across two patterns. Both fixed:
1. CSV/XML export templates use {#- -#} between text segments to strip
the surrounding newline at render time:
aid,{#- -#}
brand,{#- -#}
Each parse → format pass added a blank line before each {#- -#}
because NodeList.Dump inserted a "\n" separator between a RawNode
ending in "\n" and the following structured TwigCommentNode. The
earlier compound-newline fix only checked the NEXT node's leading
"\n"; now it also skips when the previous output already ended in
"\n".
Affected admin templates:
- billiger-de/{header,body}.csv.twig
- google-product-search-de/header.xml.twig
- idealo/{header,body}.csv.twig
2. HTML <!-- comment --> inside <p> grew one indent level per format
pass. The <p>-children formatter already strips trailing whitespace
from the buffer before Twig structural children (so they don't
compound with the preceding RawNode's whitespace). HTML CommentNode
has the same shape — its Dump emits its own indent prefix — so it
joins the list. Renamed isTwigStructured → isStructuredChild and
added *CommentNode to the type switch.
Affected admin template:
sw-sales-channel-detail-product-comparison-preview.html.twig
(and the sw-user-sso-status-label template)
Smoke totals:
before this commit after
administration 0 lex, 0 parse, 0 lex, 0 parse,
7 idempotency 0 idempotency
storefront 0 / 0 / 0 0 / 0 / 0
Two new regression fixtures:
59-csv-template-with-trim-comments.txt
60-html-comment-inside-p.txt
1,319 real-world Twig files across both corpora now round-trip cleanly
through parse → format → parse → format.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR replaces the old recursive-descent HTML parser with a new lexer-based architecture that provides better separation of concerns, improved error handling, and more robust Twig template support. The old
internal/twigparserpackage is removed entirely in favor of the enhancedinternal/htmlpackage.Key Changes
Architecture Refactoring
lexer.go,tokens.go,pos.go): Single-pass tokenizer that recognizes HTML elements, attributes, and Twig delimiters ({% %},{{ }},{# #}) with precise position trackingparser.go): Replaces recursive descent with a cleaner state machine that walks the token stream, making it easier to handle complex nesting and error recoveryformat.go): All Dump/rendering code moved to a dedicated file, making the parser logic independent of output formattingTwig Tag System
tags.go): Pluggable tag parser system withTagSpecfor declarative tag definitionstag_*.go) with focused parsing logictag_if.go): Proper branch parsing with correct terminator detectiontag_generic.go): Handles unknown/future Twig tags gracefully by preserving them as raw textAST Improvements
ast.go): Consolidated node definitions with clear interfacesnodeContextenum distinguishes top-level parsing from element-children parsing, enabling smarter whitespace handling{% for %}{{ x }}{% endfor %}) by detecting when a block contains only text + expressionsError Handling
errors.go):ParseErrorwith source location, filename, and related locations for better diagnosticsTesting & Validation
smoke_test.go): Can parse real Shopware storefront templates whenHTML_SMOKE_CORPUSis setfuzz_test.go): Lexer fuzzed with fixture inputs to catch edge caseslexer_test.go): Validates lexer against testdata filesNotable Implementation Details
isVoidElement()correctly identifies self-closing HTML tagsNewParser(),NewAdminParser(), andNewStorefrontParser()entry points maintain the existing API;ConfiguredNodeListwraps nodes with their formatting configRemoved
internal/twigparser/package entirely (twig.go, nodes.go, node_list.go, parser_types.go, twig_test.go)Updated
internal/verifier/admin_twig.goandstorefront_twig.go: Now useNewAdminParser()andNewStorefrontParser()from the new html packagehttps://claude.ai/code/session_019e6qYC1a5j1TStxyedAaEy