feat(parser): recognize bare < as Str token (bd-j9cf) #213
Merged
A literal `<` outside of math, code, or a recognized HTML construct
used to produce a hard parse error in qmd. The minimal trigger was
a one-line document containing `1 < 2`. After this change, a `<`
that is not the start of a recognized HTML element / autolink /
HTML comment / raw specifier parses as `Str "<"` instead.
Scope is conservative: the scanner still attempts the existing
HTML constructs first and only falls back to the new
`LT_STR_LITERAL` token when the scan loop reaches EOF without
finding a closing `>` / `}`. So `<5>`, `<,>`, etc. continue to
lex as HTML_ELEMENT and emit Q-2-9 — no user-visible change there.
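The fallback rule described above can be sketched as a toy model. This is not the code in `scanner.c`: the enum, the function name `scan_open_angle`, and the byte-offset interface are assumptions made for illustration, and the real scanner also distinguishes autolinks, comments, and raw specifiers, which this sketch lumps together.

```rust
/// Toy model of the fallback rule; names and types are hypothetical.
#[derive(Debug, PartialEq)]
enum AngleToken {
    /// A closing delimiter was found; `end` is the byte offset just past it.
    HtmlLike { end: usize },
    /// No closing delimiter before EOF; the token covers only the `<`.
    LtStrLiteral { end: usize },
}

/// `input` is the text being scanned and `start` is the byte offset of a `<`.
/// Returns `None` if there is no `<` at `start`.
fn scan_open_angle(input: &str, start: usize) -> Option<AngleToken> {
    let bytes = input.as_bytes();
    if bytes.get(start) != Some(&b'<') {
        return None;
    }
    // Fallback end position: one past the `<` (the "mark_end at <+1" step).
    let fallback_end = start + 1;
    for (offset, &b) in bytes[fallback_end..].iter().enumerate() {
        if b == b'>' || b == b'}' {
            // A closing delimiter exists, so the end moves past it.
            return Some(AngleToken::HtmlLike { end: fallback_end + offset + 1 });
        }
    }
    // The loop hit EOF without a closing delimiter: emit the bare `<`.
    Some(AngleToken::LtStrLiteral { end: fallback_end })
}
```

Under this model `1 < 2` yields the literal token covering only the `<`, while `<5>` still takes the HTML path.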
Implementation:
- `scanner.c::parse_open_angle_brace` advances past `<`, calls
`mark_end` at `<+1` as a fallback, and emits `LT_STR_LITERAL`
if the scan loop reaches EOF without finding a closing
delimiter. Each successful HTML/autolink emit gained an
explicit `mark_end` after its advance past `>` so the fallback
end position is updated.
- `grammar.js` declares `$._pandoc_lt_str` as an external token
and adds it as a third choice inside `pandoc_str` so downstream
consumers see a uniform `Str` shape.
- `treesitter.rs` pandoc_str handler now splits leading ASCII
whitespace into a separate `Space` inline (mirrors the existing
html_element / autolink handlers). Leading-only is deliberate —
`\<space>` backslash escapes match the `\\.` regex branch and
rely on trailing whitespace in the token text to produce U+00A0
via `process_backslash_escapes`.
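The leading-only whitespace split in the `pandoc_str` handler can be illustrated with a minimal sketch. The `Inline` enum and `split_leading_space` below are hypothetical stand-ins, not the types or functions in `treesitter.rs`, and collapsing any run of leading whitespace into a single `Space` is an assumption.

```rust
/// Hypothetical stand-in for the inline nodes the handler emits.
#[derive(Debug, PartialEq)]
enum Inline {
    Space,
    Str(String),
}

/// Splits leading ASCII whitespace off a token's text.
/// Trailing whitespace is left in place on purpose, so a token such as
/// `\ ` (backslash + space) still reaches later escape processing intact.
fn split_leading_space(token_text: &str) -> Vec<Inline> {
    let rest = token_text.trim_start_matches(|c: char| c.is_ascii_whitespace());
    let mut out = Vec::new();
    if rest.len() < token_text.len() {
        // Some leading whitespace was removed: represent it as one Space.
        out.push(Inline::Space);
    }
    if !rest.is_empty() {
        out.push(Inline::Str(rest.to_string()));
    }
    out
}
```

With this shape, ` <` becomes `Space` followed by `Str "<"`, while a backslash-space token keeps its trailing space inside the `Str` text.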
Generated files updated:
- `parser.c` and `grammar.json` regenerated by
`tree-sitter generate; tree-sitter build`.
- `_autogen-table.json` regenerated by
`crates/pampa/scripts/build_error_table.ts` — adding the new
external token shifted LR state numbers, so the frozen Merr-style
state→error-code mapping needed a refresh. All error codes are
unchanged; only the (state, sym) keys moved.
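One way to sanity-check a claim like "same error codes, new keys" is to compare the old and new tables by value and by key separately. The sketch below assumes the table can be loaded as a map from `(state, sym)` to an error code; the real layout of `_autogen-table.json` is not shown here, so the `Table` type and both helper names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory form of the table: (LR state, symbol) -> error code.
type Table = BTreeMap<(u32, String), String>;

/// True when both tables refer to exactly the same set of error codes.
fn same_error_codes(old: &Table, new: &Table) -> bool {
    let old_codes: BTreeSet<&String> = old.values().collect();
    let new_codes: BTreeSet<&String> = new.values().collect();
    old_codes == new_codes
}

/// True when the (state, sym) keys differ between the two tables.
fn keys_moved(old: &Table, new: &Table) -> bool {
    old.keys().ne(new.keys())
}
```

A regeneration of the kind described would be expected to satisfy `same_error_codes` while `keys_moved` returns true.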
Tests:
- New corpus file
`crates/tree-sitter-qmd/tree-sitter-markdown/test/corpus/lt-as-str.txt`
with 7 cases (4 new behavior + 3 regression).
- New `crates/pampa/tests/test_bare_lt_str.rs` — 10 unit tests
through `readers::qmd::read` (4 new behavior + 6 regression for
html_element, autolink, comment, `\<`, math, code span).
- New roundtrip fixtures `bare_lt_{simple,eol,unclosed_tag}.qmd`
under `tests/roundtrip_tests/qmd-json-qmd/`.
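The roundtrip fixtures rely on a property noted in the PR description: the qmd writer escapes `<` as `\<`, and the reader turns `\<` back into `<`. A minimal sketch of that loop follows. Both functions are hypothetical simplifications, not the code in `writers/qmd.rs` or the reader; in particular the reader here ignores the `\<space>` to U+00A0 rule and the writer does not escape backslashes.

```rust
/// Hypothetical writer step: escape every `<` as `\<`.
fn write_str(text: &str) -> String {
    text.replace('<', "\\<")
}

/// Hypothetical reader step: a backslash followed by any character
/// yields that character; a lone trailing backslash is kept as-is.
fn read_escapes(source: &str) -> String {
    let mut out = String::new();
    let mut chars = source.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some(next) => out.push(next),
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

For text without backslashes, reading back what the writer produced returns the original text, which is the property the fixtures check end to end.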
Verification: `cargo nextest run --workspace --no-fail-fast` →
8998/8998. `cargo xtask verify` (Rust + lints + hub-client + WASM
+ trace-viewer) → all steps passed.
Plan: `claude-notes/plans/2026-05-18-bare-lt-as-str.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- A literal `<` outside math/code/HTML used to produce a hard parse error in qmd (minimal trigger: `1 < 2`). After this PR, a `<` that does not start a recognized HTML construct parses as `Str "<"`.
- Scope is conservative: the scanner still attempts the existing HTML constructs first and only falls back to the new `LT_STR_LITERAL` token when the scan loop reaches EOF without finding a closing `>` / `}`. `<5>`, `<,>`, etc. continue to lex as `HTML_ELEMENT` and emit Q-2-9, so there is no behavior change for those.
- Plan: `claude-notes/plans/2026-05-18-bare-lt-as-str.md`.

What changed
- `scanner.c::parse_open_angle_brace`: advances past `<`, calls `mark_end` at `<+1` as a fallback, and emits `LT_STR_LITERAL` if the scan loop reaches EOF without finding a closing delimiter. Each successful HTML/autolink emit gained an explicit `mark_end` after its `advance` past `>` so the fallback end position is updated when a real HTML construct does match.
- `grammar.js`: declares `$._pandoc_lt_str` as an external token and adds it as a third choice inside `pandoc_str` so downstream consumers see a uniform `Str` shape.
- `treesitter.rs`: extends the `pandoc_str` handler to split leading ASCII whitespace into a separate `Space` inline (the block-level scanner chomps preceding indentation into the external-token range; same pattern as the existing `html_element` / `autolink` handlers). Leading-only is deliberate: `\<space>` backslash escapes match the `\\.` regex branch and rely on trailing whitespace to produce U+00A0 via `process_backslash_escapes`.

Generated / regen files
- `parser.c` and `grammar.json` regenerated by `tree-sitter generate; tree-sitter build`.
- `_autogen-table.json` regenerated by `crates/pampa/scripts/build_error_table.ts`: the new external token shifted LR state numbers, so the Merr-style state→error-code mapping needed a refresh. All error codes are unchanged; only the `(state, sym)` keys moved.

Tests
- `crates/tree-sitter-qmd/tree-sitter-markdown/test/corpus/lt-as-str.txt`: 7 corpus cases (4 new behavior + 3 regression for `<b>` / autolink / HTML comment).
- `crates/pampa/tests/test_bare_lt_str.rs`: 10 unit tests through `readers::qmd::read` (4 new behavior + 6 regression: html_element, autolink, comment, `\<`, math `$1 < 2$`, code span `` `1 < 2` ``).
- `crates/pampa/tests/roundtrip_tests/qmd-json-qmd/bare_lt_{simple,eol,unclosed_tag}.qmd`: auto-picked up by the existing roundtrip runner.

End-to-end verification
The qmd writer already escaped `<` as `\<` defensively (`crates/pampa/src/writers/qmd.rs:1414`), so round-trip works for free: emit `Str "<"`, write `\<`, re-read via the `\\.` branch of `pandoc_str`.

Test plan
- `tree-sitter test`: 508/508 passing (501 existing + 7 new)
- `cargo nextest run -p pampa --no-fail-fast`: 3742/3742 passing, 2 skipped
- `cargo nextest run --workspace --no-fail-fast`: 8998/8998 passing, 195 skipped
- `cargo xtask verify --skip-hub-build --skip-hub-tests`: all steps passed (Rust + lints + fmt + trace-viewer)
- `cargo xtask verify --skip-rust-build --skip-rust-tests`: all steps passed (hub-client build + WASM tests)
- `1 < 2`, `foo <`, `a <foo` fixtures with native / qmd / html / json writers: outputs inspected manually

Out of scope / notes
- […] `<` is the regenerated `_autogen-table.json` (same error codes, new state keys).
- […] `>` is unnecessary: `>` is already inside `PANDOC_REGEX_STR`'s punctuation class (`[>.,;!?]`).

🤖 Generated with Claude Code