Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis)#217
rundel wants to merge 10 commits into
…uote"
When a paragraph contains an unmatched emphasis marker between matching double quotes (e.g. `The "_blank" word.`), the parser reported the double quote as unclosed (Q-2-11) instead of the emphasis. The "unclosed quote" message sent users hunting for a missing `"` that was already present.
Add four Merr-style corpus cases (one per emphasis marker) that redirect the (LR state, lookahead) pair to the matching emphasis diagnostic:
    The "_blank" word. → Q-2-5 (Unclosed Underscore Emphasis)
    The "__blank" word. → Q-2-15 (Unclosed Strong Underscore Emphasis)
    The "*blank" word. → Q-2-12 (Unclosed Star Emphasis)
    The "**blank" word. → Q-2-13 (Unclosed Strong Star Emphasis)
Each case also gets a hint telling the user to escape the literal marker (`\_` / `\*`). The hint bumps the diagnostic score so the emphasis-aware mapping wins the (state, sym) tie against the older Q-2-11 entry.
Out of scope: single-quote contexts (`'_blank'`) collapse to the same LR (state, sym) as `qmd-syntax-helper`'s apostrophe-quotes test shapes (`*a' b.*`, `**a' b.**`, `_a' b._`, `__a' b.__`). Redirecting those would silently break the auto-fix path, so they're left at Q-2-7/Q-2-10. Double-quote shapes only.
Side effect of the (state, sym) sharing: rare shapes that previously fell under Q-2-11 like `*a" b.*` (an inch-mark inside emphasis) now emit Q-2-12. No regression test covers those, but the matrix is documented in the plan.
Tests
-----
New regression file `crates/pampa/tests/test_emphasis_in_quote.rs` covers all four emphasis variants plus a pipe-table fixture. Modeled on `tests/test_link_destination_linebreak.rs`.
Verified `cargo nextest run --workspace` → 8993 passed, 0 failed. No snapshot files were affected (the snapshot test iterates over `resources/error-corpus/*.qmd`, which is empty in the new format).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the parser misclassified errors when emphasis/quote markers nested inside other emphasis/quote scopes. For example, `The '_blank' word.` reported Q-2-10 (Closed Quote Without Matching Open) instead of Q-2-5 (Unclosed Underscore Emphasis), because tree-sitter's LR state minimisation collapsed otherwise-distinct (state, sym) parse-error keys between inputs like `_a' b._` and `The '_blank' word.`. See inline_issue.md for the full bug write-up.
Fix: extend the Merr error-table key from (state, sym) to (state, sym, outer_scope), where outer_scope is the innermost open inline scope at the error position. A walker in quarto-parse-errors reconstructs the scope stack from the parser's token log (all_tokens + consumed_tokens), pushing on each scope-open token, popping on each top-of-stack matching close, and clearing on every block-boundary token. The same walker runs at corpus-build time so each test case records its natural outer_scope alongside (state, sym).
Additional fixes folded in:
- crates/pampa/src/readers/qmd.rs: when had_errors() is true but the resulting tree has no ERROR nodes, treat the parse as successful. GLR speculation hits detect_error in dead branches; if another branch reaches accept cleanly, those speculative errors shouldn't surface. Fixes `*a\" b.\"*` parsing as <em>a<dq>b.</dq></em>.
- Reverts the Option 1 scanner-gate experiment from the WIP version of this commit. LR state minimisation merges post-OPEN and post-CLOSE destination states, so scanner-level gating was invisible at the parser level; the experiment also broke valid nested-emphasis parses (e.g. `*_*triple*_*`).
- New corpus cases:
  - Q-2-5/12/13/15 in-single-quote: `The '_blank' word.` family
  - Q-2-15 in-single-quoted-emphasis: `The ' *__blank*' word.` family
- 22 tests in test_emphasis_in_quote cover the original 5 shipped cases, the 4 single-quoted variants, the 4 unclosed-double-quote-in-emphasis variants, the 4 apostrophe-in-emphasis controls, the multi-paragraph regression case, three-level-nesting Q-2-15 cases, and the clean-parse case for nested double-quote-in-emphasis.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
inline_issue.md and options.md were committed at the repo root in a01e3bc but belong nowhere in tree per the claude-notes/ convention. The implementation they describe is now fully captured in the code and tests, so they are obsolete as forward plans. Three code-comment back-references to inline_issue.md are rewritten to stand on their own. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactor the inner-in-outer corpus cases for Q-2-5/12/13/15 to a single `inside-paired-outer` case each, using prefixesAndSuffixes to cover all six outer wrappers (four emphasis flavours plus single and double quote) with both bare and trailing-text variants. The diagnostic content doesn't depend on the outer scope, so the dedicated per-outer cases added earlier were redundant.
Add Q-2-9 "Unclosed Single Quote": previously no diagnostic existed for this shape because qmd's apostrophe heuristic treats a whitespace-prefixed `'` as an opener that can be unclosed (distinct from Q-2-10's stray-apostrophe-close interpretation). The new code uses the same unified prefixesAndSuffixes pattern.
Add trailing-text variants (`* x\n`-style suffixes) to every unified case so realistic prose inputs like `*a _b c* word.` map to the right Q-code instead of falling through to generic Parse error.
Add `trimLeadingSpace`/`trimTrailingSpace` to Q-2-5/13/15 notes so the opening-delimiter indicator renders 1-wide instead of including the parser's leading-whitespace prefix.
Fix prior mislabeling: the test asserting Q-2-10 for `The "a 'b c" word.` was wrong; that input has a whitespace-prefixed `'` (Q-2-9), not a stray-close apostrophe (Q-2-10). Updated the test and removed the corresponding mislabeled case from Q-2-10.json.
The trailing "I reached the end of the block..." indicator still anchors at the parser-failure position (one past the closing outer delimiter for these cases). A follow-up will reroute it to the closing outer delimiter itself.
Test suite: 51 tests covering all bare and long-form combinations, including the original same-marker location regression and the new Q-2-9 cases. Full workspace nextest run (9055 tests) passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route the primary diagnostic location (header and "Reached the end of the outer inline scope..." indicator) to the closing delimiter of the outermost open inline scope when one exists, rather than letting it float at the parser-failure position just past the end of the block.
Add `find_outermost_close` in `outer_scope.rs`: walks the parse log with the same push/pop and block-boundary semantics as `compute_outer_scope`, identifies the bottom of the scope stack, and returns the position of the latest token of that scope's kind appearing after the opener. The returned span is trimmed to the consecutive delimiter bytes so the indicator renders 1-wide on `*` (not 2-wide on whitespace + `*`).
Wire this into `error_generation.rs`: when `outer_scope != None`, use `find_outermost_close` to override the parse_state-derived source_info; fall back to parse_state when no closer exists (e.g. `*hello`).
Reword the unclosed-inline diagnostic messages (Q-2-5/9/11/12/13/15): drop the first-person "I reached" wording and replace "block" with "outer inline scope" so the message describes where the indicator actually lands.
For `a *b _c* jeloasd asdasd`, the indicator now sits on the closing `*` at col 7 instead of floating past the end of the line.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…stics
Replace the combinatorial `prefixesAndSuffixes` corpus entries with a
single fallback dispatch path in error_generation.rs. The previous
approach required one corpus variant per (state, sym, outer_scope)
triple, which doesn't scale to arbitrary nesting depth — 3-level
nestings like `'*b _c*'` already need new variants per outer-outer
combination, and N-level nestings explode further.
Key insight: for "unclosed inline scope" diagnostics, the diagnostic
content depends only on `outer_scope` (the innermost open scope at
error time). The state and sym don't add information — they're noise
from tree-sitter's LR state minimisation.
Add `find_innermost_open_position` in outer_scope.rs: returns the
innermost open scope's opener span (trimmed to the consecutive
delimiter bytes) so the "opening mark" indicator lands on the actual
delimiter regardless of nesting depth.
Add `OuterScope::to_qcode`: maps each open inline scope to its default
Q-code (single_quote → Q-2-9, emph_star → Q-2-12, etc.).
In error_generation.rs, after the Merr lookup fails, check
`outer_scope != None && sym in {_close_block, _whitespace}`. If so,
look up the Q-code via `to_qcode`, find any existing corpus entry with
that Q-code to source the title/message/notes/hints content, and
build a synthetic diagnostic whose positions come from the scope-stack
walker (opener) and from find_outermost_close (closing-outer
indicator, already added in the previous commit).
This handles arbitrary-depth nesting uniformly. Specific Merr entries
still take priority where they exist, preserving distinctions like
Q-2-10 (apostrophe-close interpretation) vs Q-2-12 (unclosed `*`).
Remove the now-redundant corpus cases (`inside-paired-outer`,
`same-marker`) from Q-2-5/12/13/15 and Q-2-11. Simplify Q-2-9 to a
single minimal case used only as the fallback content template. Net:
~150 LOC of new code, ~2,750 lines of corpus/autogen-table removed,
78 generated case files deleted.
Add 4 regression tests for 3-level nesting shapes that previously
emitted generic "Parse error" and now route to the correct Q-code via
the fallback path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Exercise the generic fallback dispatch with shapes where the scope stack at error time isn't trivially "everything still open":
- `a '*b _c_ jeloasd' asdasd`: 3-level nesting where the middle `*` is unmatched but the inner `_c_` pairs. Stack at error is depth 2 (single_quote + emph_star); fallback emits Q-2-12.
- `a "**b _c_ jeloasd" asdasd`: same shape with strong-star middle; Q-2-13.
- `a '*"_b c"*' x`: 4-level deep nest, innermost `_` unmatched. Stack at error is depth 4; fallback walks down to the innermost emph_underscore and emits Q-2-5.
- `a '*_b "c d_*' x`: 4-level deep, innermost is a double quote; Q-2-11.
These verify the walker correctly tracks push/pop across arbitrary depths and that the to_qcode mapping fires off the innermost open scope regardless of how many scopes are stacked above the actual unmatched marker.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atch
The bug fixes on this branch (`The "_blank" word.` and friends mapping to
the correct Q-code) had grown to extend the Merr error-table key from
`(state, sym)` to `(state, sym, outer_scope)` across `ErrorTableEntry`,
both lookup helpers, the `include_error_table!` proc-macro, the build
script, every Q-*.json corpus file, every row of `_autogen-table.json`,
and the QMD-side wrappers. Of 115 `(state, sym)` collision buckets, only
8 actually needed `outer_scope` to disambiguate, and the same approach
did not scale to the other inline scopes (span, code span, math, etc.)
that share the same bug class.
Revert the schema change and replace it with a smaller mechanism that
keeps the corpus as the single source of truth for which Q-code owns
which scope kind.
Schema revert
- Drop `outer_scope` from `ErrorTableEntry`, `lookup_error_message`,
`lookup_error_entry`, the `include_error_table!` macro, the
`qmd_error_message_table.rs` wrappers, the build script's autogen-row
emission, and `produce_error_message_json` (back to taking only the
observer; the `input_bytes` argument is gone).
- Regenerate `_autogen-table.json` (~700 entries, no per-row `outerScope`
field).
Corpus-driven scope-owner table
- Each "Unclosed X" Q-*.json gains one top-level annotation:
`"unclosedScope": "<scope-name>"`.
- The build script collects those into
`resources/error-corpus/_autogen-scope-owners.json` and a new sibling
proc-macro `include_scope_owner_table!` embeds it at compile time.
- `OuterScope::to_qcode` now reads from this table instead of a
hardcoded `match`, so renaming a Q-code in the corpus propagates
automatically — same trust model as the Merr table.
Outer-scope dispatch ahead of Merr lookup
- For innermost scopes whose walker view matches the parser's
interpretation (emphasis flavors + double-quote), the dispatcher in
`error_diagnostic_from_parse_state` builds the diagnostic directly
from the scope's owning Q-code template and skips the Merr lookup.
- SingleQuote falls through to Merr by default (the apostrophe-as-close
heuristic means a `'` after a letter is Q-2-10 / Q-2-7 territory,
which the existing Merr entries already handle). When Merr's match
doesn't carry one of those apostrophe-heuristic codes — i.e. the
match is a state-collision false positive — fall through to the
synthetic Q-2-9 path instead.
Walker unification
- Collapse `compute_outer_scope`, `find_outermost_close`, and
`find_innermost_open_position` in `outer_scope.rs` onto a single
internal `scope_walk` helper that returns the sorted token list plus
the final stack. The three public functions become thin readers of
the snapshot.
Corpus cleanup
- Remove the now-redundant `in-double-quote` / `in-single-quote` /
`in-single-quoted-emphasis` corpus cases from Q-2-5/12/13/15 (the
schema extension was their only consumer; the fallback dispatch
subsumes them).
Snapshot test changes (per repo policy)
- 9 generated case-files (`Q-2-{5,12,13,15}-in-*.qmd`) deleted.
- 1 new generated case-file (`Q-2-9-simple.qmd`).
- `_autogen-table.json` regenerated. Row count unchanged at ~691;
per-row `outerScope` field removed (~1100 lines smaller).
- `_autogen-scope-owners.json` new (6 entries).
Verification
- `cargo nextest run -p pampa --test test_emphasis_in_quote`: 59/59
pass (same as before this commit).
- End-to-end CLI runs verify Q-codes and indicator positions match the
pre-refactor branch for all canonical bug shapes (`The "_blank"
word.` → Q-2-5, `_a' b._` → Q-2-10, `*a 'b c*` → Q-2-9,
`'foo bar` → Q-2-9, etc.).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nclosed-X messages
The unclosed-inline diagnostics (Q-2-5/9/11/12/13/15) previously read
"Reached the end of the outer inline scope before finding a closing ..."
in every context. For a block-level input like `a * b` the parser
actually gave up at the end of the block, not inside any enclosing
scope, and the wording read awkwardly.
Switch the corpus messages to a `{anchor}` placeholder and substitute at
emit time based on where the diagnostic's primary indicator landed:
- "the block" when the anchor falls back to the parser-failure
position (no enclosing scope, or no candidate
outer closer exists)
- "the inline scope" when the anchor lands on the closing delimiter
of an enclosing inline scope
The selection is driven by `find_outermost_close` returning `Some` /
`None` — exactly the same logic that already decides the indicator's
column position, so the wording and the indicator stay in lockstep.
Verification examples
a * b → Q-2-12, "Reached the end of the block before ..."
'a * b' → Q-2-12, "Reached the end of the inline scope before ..."
'foo bar → Q-2-9, "Reached the end of the block before ..." (truly unclosed)
The "_blank" word.
→ Q-2-5, "Reached the end of the inline scope before ..."
Snapshot test changes (per repo policy)
- `_autogen-table.json` regenerated: the `message` field of every
Unclosed-X entry now carries the `{anchor}` placeholder rather than
the literal "outer inline scope" phrase. Row count (691) and key
positions unchanged.
Tests
- Two new regression tests pin the wording for both branches on
otherwise-identical shapes:
- `block_level_unclosed_star_says_block` ("a * b")
- `single_quoted_unclosed_star_says_inline_scope` ("'a * b'")
- All 61 tests in `test_emphasis_in_quote.rs` pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
👀 This is a little terrifying. I... I don't know if I want to merge this before understanding this very thoroughly!
Yeah, this is all a bit cursed (yay Markdown). Consider the following: Should this parse correctly? I can make arguments in both directions. Commonmark doesn't appear to say anything about which way it would go, but it does have opinions about nested links, namely that they're not allowed (https://spec.commonmark.org/0.31.2/#links). So one could argue that the spirit of Commonmark would require us to disallow nested quotations. This in principle can be solved by creating a (much) more complex version of the grammar that restricts all inlines inside of quotations to be non-quotation inlines. But there's an argument to be made that nested quotations should be allowed, actually. For example, And if that's an acceptable Markdown document, why shouldn't this be? And from there, the only reason we don't support truly nested quotes is typographical. But I note in passing that this example parses successfully in pampa:
More generally:
:lolsob: I get it, I think you're going through the same process I went through when I learned of this parsing technique: "oh my god it solves everything" three weeks later "oh no, this is kind of stiflingly limited actually". In general, I think your proposal is a version of something I thought of: we could benefit from using a stack of context "unreduced shifts" in a direct generalization of the Jeffery technique. At the point where we read the token after But, to be quite frank, I don't want to spend time trying to study this. I was an academic for a decade. I just want to build tools for people to use :) I genuinely think the risks outweigh the benefits here. We're already playing with fire by doing this trick inside tree-sitter and needing to get around its error correction. I don't want to compound that risk.
Totally fair - I in no way can claim to understand a fraction of the code generated, this was the result of many iterations where the proposed solution either did not work or got way more invasive in terms of the necessary changes in the corpuses or scanner.c or even worse. This was the closest I got to something that felt at least like a semi-plausible solution.
This was the direction I was originally interested in taking - Claude was able to modify scanner.c such that it successfully built a context stack that kept track of open and closed inlines via pushing and popping, but I wasn't able to get that passed into tree-sitter / the Jeffery technique in such a way that it could be surfaced as a useful error.
Apologies in advance for all of the nonsense that is about to follow - everything started with good intentions and a minor issue around error locations for unclosed emphasis markers:
How hard could that possibly be to fix?
It got very ugly very quickly - I'm not sure what follows is a good solution or not but given it took a lot of iteration to even get close to something that mostly worked I think it's worthwhile documenting it all and passing it off. At a minimum it will hopefully provide a partial starting point for a better / cleaner solution.
The commits document much (but not all) of the iteration that was eventually scrapped (e.g. adding an additional key to the corpus lookup table). So it may also be useful to have Claude summarize those for you individually.
Claude's write-up (w/ light editing)
Summary
This branch fixes a class of misleading parse errors where an unclosed
emphasis or quote marker nested inside another inline scope was reported as
if the outer scope were unclosed. The fix evolves through nine commits
that progressively generalize the mechanism, refactor it for
maintainability, and improve the user-facing diagnostic. Along the way it
picks up two unrelated parser fixes that surfaced during the work.
Final shape of the change:
- … time so the diagnostic can target the correct inline element regardless of how deeply it was nested.
- … delimiter of the enclosing scope (when one exists) instead of floating at the parser-failure position past end of line.
- … level, "end of the inline scope" when nested.
- … `unclosedScope` field rather than any hardcoded Rust mapping or per-(state, sym) corpus authoring. Adding a new inline scope kind is a one-line corpus edit plus a walker enum variant.
Diff size: 20 files, +2,883 / −466. Most of the insertion volume is the new `outer_scope.rs` module (740 LOC, ~half of it tests), the expanded `test_emphasis_in_quote.rs` regression suite (61 tests), and the regenerated `_autogen-table.json`. The actual implementation logic is ~400 LOC.
Before / after on concrete inputs
Case 1: Unclosed emphasis between paired double quotes
Before (on `main`):
After:
The diagnostic now points at the unclosed `_` rather than sending the user hunting for a missing `"` that's already in the input. The indicator is anchored at the closing `"` because the parser actually gave up at that point (the `_` couldn't close before the `"` arrived). A hint suggests the escape if the user meant the marker literally.
Case 2: Same shape, single quotes
Before:
After:
The single-quote case was even more confusing on `main`: the diagnostic talked about apostrophe interpretation while the actual problem was the same unclosed `_`.
Case 3: Unclosed emphasis inside single quotes (with whitespace flanks)
Before:
After:
`main` insisted the `'` was unclosed even though both `'` markers are present and the actual problem is the orphan `*`. The new diagnostic points at the `*` and uses "the inline scope" wording because the indicator lands on the closing `'`.
Case 4: Same emphasis at block level (no enclosing scope)
Before:
After:
Two small improvements visible here even though the Q-code is unchanged:
- … `*` (col 2) rather than the start of the line (col 0). On `main`, the opener anchor was off because the parser's leading-whitespace token was reported untrimmed; the new walker trims to the actual delimiter byte.
- … appropriate for a block-level failure with no enclosing inline scope.
Compare this against Case 3 above: the same `Q-2-12` template, but the wording switches between "the block" and "the inline scope" depending on whether an outer closer anchors the indicator.
Case 5: Unclosed inner emphasis inside paired outer emphasis
Before:
After:
`main` insisted the outer `*` was unclosed (and let the indicator float at end-of-line, far from any source feature). The actual issue is the inner `_`, with the outer `*..*` correctly closing at col 7. The new diagnostic identifies the right marker and anchors at the closing `*`.
Case 6: Truly unclosed single quote at block level
Before:
After:
The behavior is essentially unchanged here: both branches correctly identify the unclosed `'` at block level. The Q-code has shifted from Q-2-7 to Q-2-9 (a new code introduced earlier in the branch to distinguish whitespace-prefixed openers from the apostrophe-as-text-close interpretation that `qmd-syntax-helper` relies on), but the diagnostic content is equivalent.
The bug
The misclassification is a consequence of how tree-sitter's LR state generator minimises distinct states with identical outgoing transitions. Multiple input shapes that semantically fail for different reasons collapse onto the same `(LR state, lookahead symbol)` parse-error pair, and the Merr-style error table can only see that pair. For example, both of these reach `(state=704, sym=_whitespace)` at error time:
- `_a' b._`: … `_..._`
- `The '_blank' word.`: … `_` inside paired `'..'`
The same shape repeats across all six "Unclosed X" inline scopes (single quote, double quote, single/double emphasis with `*` or `_`).
Approach: scope-stack walker plus context-aware dispatch
The core idea is to give the dispatcher a third piece of context beyond the LR state and lookahead: which inline scope was open at error time. This is recovered by walking the parser's token log.
`outer_scope.rs` walker
`crates/quarto-parse-errors/src/outer_scope.rs` (new, 740 LOC) provides one internal `scope_walk` helper plus three public readers:
- `compute_outer_scope`: returns the innermost open inline scope at the error position (the top of the stack).
- `find_outermost_close`: returns the position of the would-be closer for the outermost open scope (the bottom of the stack), or `None` if no candidate closer exists in the remaining tokens.
- `find_innermost_open_position`: returns the position of the innermost open scope's opening delimiter.
The walker combines the parser's `all_tokens` and `consumed_tokens` logs, sorts them by source position, and iterates applying the same push / pop-if-top / block-boundary-clear rules the external scanner uses. The result is the scope-stack snapshot at the moment the parser failed.
Recognised scope kinds: `single_quote`, `double_quote`, emphasis delimiters (`*` and `_`, distinguished by reading the actual byte from the input because emphasis tokens can include leading whitespace), and strong emphasis delimiters.
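The walker rules above can be illustrated with a small, self-contained sketch. This is not the code in `outer_scope.rs`: the `Token` and `Snapshot` types, the `error_pos` cut-off, and the reader signatures are assumptions made for illustration, and the real walker reads tree-sitter token logs rather than a hand-built vector.

```rust
/// Inline scope kinds tracked by the walker (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Hypothetical, simplified entry of the merged token log.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    /// A delimiter of the given scope kind at byte offset `pos`.
    Delim { scope: Scope, pos: usize },
    /// A block boundary (for example the end of a paragraph).
    BlockBoundary { pos: usize },
}

impl Token {
    fn pos(&self) -> usize {
        match *self {
            Token::Delim { pos, .. } | Token::BlockBoundary { pos } => pos,
        }
    }
}

/// Sorted tokens plus the stack of (scope, opener offset), bottom first.
pub struct Snapshot {
    pub tokens: Vec<Token>,
    pub stack: Vec<(Scope, usize)>,
}

/// Replays every token that starts before `error_pos` (an assumed cut-off).
pub fn scope_walk(mut tokens: Vec<Token>, error_pos: usize) -> Snapshot {
    tokens.sort_by_key(|t| t.pos());
    let mut stack: Vec<(Scope, usize)> = Vec::new();
    for tok in tokens.iter().take_while(|t| t.pos() < error_pos) {
        match *tok {
            Token::BlockBoundary { .. } => stack.clear(),
            Token::Delim { scope, pos } => {
                if stack.last().map(|&(s, _)| s) == Some(scope) {
                    stack.pop(); // same kind as the top of the stack: a close
                } else {
                    stack.push((scope, pos)); // anything else opens a scope
                }
            }
        }
    }
    Snapshot { tokens, stack }
}

/// Innermost open scope at the error position (top of the stack).
pub fn compute_outer_scope(snap: &Snapshot) -> Option<Scope> {
    snap.stack.last().map(|&(scope, _)| scope)
}

/// Opener offset of the innermost open scope.
pub fn find_innermost_open_position(snap: &Snapshot) -> Option<usize> {
    snap.stack.last().map(|&(_, pos)| pos)
}

/// Latest same-kind delimiter after the outermost opener, within its block.
pub fn find_outermost_close(snap: &Snapshot) -> Option<usize> {
    let &(scope, open_pos) = snap.stack.first()?;
    let mut close = None;
    for tok in &snap.tokens {
        match *tok {
            Token::BlockBoundary { pos } if pos > open_pos => break,
            Token::Delim { scope: s, pos } if s == scope && pos > open_pos => {
                close = Some(pos)
            }
            _ => {}
        }
    }
    close
}
```

For `The "_blank" word.`, with delimiters at byte offsets 4, 5 and 11 and the failure at offset 11, the stack holds the double quote and the underscore, so the innermost scope is the underscore emphasis while the outermost closer is the `"` at offset 11.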
Corpus-driven `OuterScope → Q-code` table
Each "Unclosed X" corpus file carries a top-level annotation:
    { "code": "Q-2-5", "title": "Unclosed Underscore Emphasis", "unclosedScope": "emph_underscore", ... }
`build_error_table.ts` collects these into a generated artifact `_autogen-scope-owners.json` (6 entries), and a sibling proc-macro `include_scope_owner_table!` embeds it at compile time. `OuterScope::to_qcode` reads from this generated table rather than a hardcoded `match`, so renaming a Q-code in the corpus propagates automatically, the same trust model the Merr table itself uses.
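A table-driven lookup of this kind might look roughly like the sketch below. The field names on `ScopeOwnerEntry`, the `corpus_name` helper, and the `strong_star` / `strong_underscore` scope strings are assumptions; the six Q-codes are the ones named in this PR, and the static slice stands in for what `include_scope_owner_table!` would embed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// One row of the generated scope-owner table (field names are assumed).
pub struct ScopeOwnerEntry {
    pub scope: &'static str,
    pub code: &'static str,
}

/// Stand-in for the table embedded at compile time from the corpus.
pub const SCOPE_OWNERS: &[ScopeOwnerEntry] = &[
    ScopeOwnerEntry { scope: "single_quote", code: "Q-2-9" },
    ScopeOwnerEntry { scope: "double_quote", code: "Q-2-11" },
    ScopeOwnerEntry { scope: "emph_star", code: "Q-2-12" },
    ScopeOwnerEntry { scope: "emph_underscore", code: "Q-2-5" },
    ScopeOwnerEntry { scope: "strong_star", code: "Q-2-13" },
    ScopeOwnerEntry { scope: "strong_underscore", code: "Q-2-15" },
];

impl OuterScope {
    /// The `unclosedScope` string this variant corresponds to.
    pub fn corpus_name(self) -> &'static str {
        match self {
            OuterScope::SingleQuote => "single_quote",
            OuterScope::DoubleQuote => "double_quote",
            OuterScope::EmphStar => "emph_star",
            OuterScope::EmphUnderscore => "emph_underscore",
            OuterScope::StrongStar => "strong_star",
            OuterScope::StrongUnderscore => "strong_underscore",
        }
    }

    /// Looks the Q-code up in the table instead of hardcoding it.
    pub fn to_qcode(self, owners: &[ScopeOwnerEntry]) -> Option<&'static str> {
        let name = self.corpus_name();
        owners.iter().find(|e| e.scope == name).map(|e| e.code)
    }
}
```

Because the only hardcoded piece is the variant-to-name mapping, changing which Q-code owns a scope would only touch the table rows in this sketch.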
Dispatch in `error_generation.rs`
For each parse-error state, `error_diagnostic_from_parse_state`:
1. … enclosing scope.
2. … directly or through the Merr `(state, sym)` lookup:
   - … `double_quote` innermost scope → outer-scope dispatch: build the diagnostic from the scope's owning Q-code template and skip the Merr table entirely. This handles `The "_blank" word.`-style cases as well as arbitrary nesting depth (3+ levels) without needing per-shape corpus entries.
   - `single_quote` innermost scope → fall through to Merr by default, because the apostrophe-as-close heuristic (a `'` preceded by a letter is text, not an opener) means the walker's view can diverge from the parser's view here. Pre-existing Q-2-7 / Q-2-10 entries handle the parser-interpretation overrides correctly. When Merr's match does not carry one of those apostrophe-heuristic codes (i.e. the match is a state-collision false positive from some other input that happened to reach the same `(state, sym)`), fall back to the synthetic Q-2-9 path.
3. … outermost open scope (via `find_outermost_close`) when one exists; otherwise falls back to the parser-failure position.
4. … `{anchor}` placeholder in the corpus message with "the inline scope" or "the block" depending on which anchor branch fired in step 3.
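The placeholder substitution in the last step can be sketched in a few lines. The function name `render_message` and its signature are hypothetical; the only parts taken from this PR are the `{anchor}` placeholder, the two replacement phrases, and the rule that the choice follows whether `find_outermost_close` produced a position.

```rust
/// Fills the `{anchor}` placeholder of a corpus message.
///
/// `outermost_close` is assumed to be the result of `find_outermost_close`:
/// `Some(offset)` when the indicator lands on an enclosing scope's closer,
/// `None` when it falls back to the parser-failure position.
pub fn render_message(template: &str, outermost_close: Option<usize>) -> String {
    let anchor = match outermost_close {
        Some(_) => "the inline scope",
        None => "the block",
    };
    template.replace("{anchor}", anchor)
}
```

Driving both the indicator position and the wording from the same `Option` is what keeps the two from drifting apart.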
The asymmetric handling of single-quote vs. the other scopes is
principled, not a hack: single-quote is the one scope kind in qmd where
the parser applies a heuristic that diverges from a naive push/pop
model of the source characters. Every other inline delimiter has
unambiguous open/close semantics, so the walker's stack and the parser's
interpretation agree.
Q-2-9 vs. Q-2-10: the apostrophe-heuristic split
Single-quote is the one scope kind that needs two distinct "unclosed-X" diagnostics rather than one, because qmd's apostrophe heuristic gives `'` two parser-side interpretations. The walker pushes `SingleQuote` on every `'` token it sees; the parser decides per-`'` whether the character is an opener or a close-apostrophe based on what precedes it.
- Q-2-9 "Unclosed Single Quote" is for shapes where the `'` is a real opener (typically whitespace-prefixed) and the user forgot the matching close. Example: `'foo bar` (Case 6 above), or `*a 'b c*` (an unclosed inner quote inside paired emphasis). The walker's stack matches the parser's view here, so the dispatcher hits the synthetic Q-2-9 path through `OuterScope::to_qcode`.
- Q-2-10 "Closed Quote Without Matching Open Quote" is for shapes where the `'` follows a letter, so the parser treats it as text (an apostrophe-as-close), while some surrounding scope is what actually failed. Example: `_a' b._`, where the parser's view is "stray apostrophe close inside emphasis" and the user's actual problem is the unmatched `_`. (`qmd-syntax-helper`'s auto-fix pipeline depends on Q-2-10 firing for these shapes so it can suggest escaping the apostrophe.)
The two interpretations frequently land on the same `(LR state, lookahead)` parse-error bucket, so the table-only approach can't distinguish them. The dispatch rule that resolves this is in the single-quote arm of the gate:
In effect, Q-2-7 and Q-2-10 act as explicit overrides for parser-side reinterpretations of `'`, and Q-2-9 is the default for everything else with a `SingleQuote` on the stack. The dispatch gate codes this check inline (`matches!(e.error_info.code, Some("Q-2-7") | Some("Q-2-10"))`) rather than via another corpus annotation; with only two override codes and no path to growth (apostrophe is the only heuristic of its kind in qmd), the hardcoded set is more readable than a generated table would be. If a future inline delimiter grows a similar parser-side heuristic, the same pattern can be lifted into a corpus annotation at that point.
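As a rough illustration of the gate described in this section, the routing decision can be written as a pure function. The `Route` enum, the `route` function, and the treatment of "single quote with no Merr match at all" as the synthetic path are assumptions for this sketch; only the `Q-2-7` / `Q-2-10` override check mirrors the inline `matches!` quoted above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Where the diagnostic content comes from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Route<'a> {
    /// Build from the owning Q-code template of this scope; Merr is skipped.
    ScopeTemplate(OuterScope),
    /// Use the Merr table entry carrying this Q-code.
    Merr(&'a str),
    /// Neither an open scope nor a Merr match: generic parse error.
    Generic,
}

/// Decides the route from the walker's innermost scope and the Q-code of
/// the Merr `(state, sym)` match, if any.
pub fn route<'a>(outer: Option<OuterScope>, merr_code: Option<&'a str>) -> Route<'a> {
    match outer {
        Some(OuterScope::SingleQuote) => {
            // Apostrophe-heuristic codes are explicit overrides; any other
            // Merr match is treated as a state-collision false positive.
            if matches!(merr_code, Some("Q-2-7") | Some("Q-2-10")) {
                Route::Merr(merr_code.unwrap())
            } else {
                Route::ScopeTemplate(OuterScope::SingleQuote)
            }
        }
        Some(scope) => Route::ScopeTemplate(scope),
        None => merr_code.map(Route::Merr).unwrap_or(Route::Generic),
    }
}
```

Under this sketch `_a' b._` (single quote on the walker's stack, Merr match Q-2-10) stays on the Merr path, while a single quote whose Merr match carries some unrelated code is rerouted to the Q-2-9 template.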
Unrelated fixes folded in
Scanner `:::` peek gate (in commit `a01e3bc6`)
`crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c` had a regression where the pipe-table-line-ending detection's `:::` lookahead peek ran in any context, not just when `PIPE_TABLE_LINE_ENDING` was a valid symbol. The peek calls `advance()` to step past the colons; in non-pipe-table contexts this corrupted the lexer's position. Subsequent `SOFT_LINE_ENDING` emissions then `mark_end`-ed at the wrong byte, breaking fenced divs whose body contained an inline-bearing block followed by `:::{attrs}` on the next line.
Fix: gate the peek on `valid_symbols[PIPE_TABLE_LINE_ENDING]`. 19-line change with a long comment explaining the failure mode in detail.
GLR speculative-error filter (in commit `a01e3bc6`)
Already described above. When `had_errors()` is true but the final tree has no ERROR nodes, treat the parse as successful: the errors came from dead GLR speculation branches.
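The filter amounts to a tree walk before trusting the error flag. The sketch below uses a hypothetical, minimal `Node` type instead of tree-sitter's real node API, so it only illustrates the decision rule, not the code in `qmd.rs`.

```rust
/// Minimal stand-in for a parse-tree node (not the tree-sitter type).
pub struct Node {
    pub kind: &'static str,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(kind: &'static str, children: Vec<Node>) -> Self {
        Node { kind, children }
    }
}

/// True if this node or any descendant is an ERROR node.
pub fn contains_error_node(node: &Node) -> bool {
    node.kind == "ERROR" || node.children.iter().any(contains_error_node)
}

/// An error flag alone is not enough: if the accepted tree carries no ERROR
/// node, the flag came from abandoned speculative branches.
pub fn parse_succeeded(had_errors: bool, root: &Node) -> bool {
    !had_errors || !contains_error_node(root)
}
```

With this rule, an input whose final tree is clean counts as a successful parse even if the error flag was raised along the way.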
What this branch changes structurally
New files
- `crates/quarto-parse-errors/src/outer_scope.rs`: scope-stack walker and `OuterScope` enum.
- `crates/pampa/resources/error-corpus/_autogen-scope-owners.json`: generated `OuterScope → Q-code` mapping.
- `crates/pampa/tests/test_emphasis_in_quote.rs`: 61-test regression suite covering all the bug shapes, control cases, and deep nesting.
Modified APIs
- `produce_diagnostic_messages`: new `scope_owners: &[ScopeOwnerEntry]` parameter. Pampa's wrapper supplies `get_scope_owner_table()`.
- `OuterScope::to_qcode`: now takes a `&[ScopeOwnerEntry]` parameter.
- `include_scope_owner_table!` proc-macro is a new sibling to `include_error_table!`, sharing the same input syntax.
Unchanged APIs
- `lookup_error_message`, `lookup_error_entry`, `ErrorTableEntry`: all back to their pre-branch signatures (no `outer_scope` parameter or field).
- `produce_error_message_json`: back to its pre-branch signature (no `input_bytes` parameter).
Corpus contract
Each "Unclosed X" Q-*.json file now declares:
    { "code": "Q-2-X", "title": "Unclosed X", "unclosedScope": "<scope-name>", "message": "Reached the end of {anchor} before finding ...", ... }
Extending to other inline scopes
The walker today recognises only emphasis and quote scopes. The same bug class exists for every other inline kind in qmd: span `[..]`, image `![..]`, inline footnote `^[..]`, code span `` `..` ``, inline math `$..$`, superscript `^..^`, subscript `~..~`, strikeout `~~..~~`, editorial markers (`{+..+}`, `{-..-}`, `{>..<}`, `{!..!}`, `{=..=}`), and inline attribute specifiers `{..}`.
A quick probe shows these are still misclassified on both `main` and this branch:
The diagnostic should say `[Q-2-24] Unclosed Code Span` and point at the backtick. The walker doesn't track code-span tokens, so the dispatcher sees `outer_scope = double_quote` and emits the (incorrect) Q-2-11 template.
This is now a mechanical fix, not a design problem. Adding a new inline scope kind is exactly three steps:
1. … `OuterScope` (e.g. `CodeSpan`) and a matching arm in `scope_for_token` mapping the tree-sitter token kind (e.g. `code_span_delimiter`) to that variant. This is the only parser-bound piece: it requires knowing what tokens the grammar emits, which is inherently in code.
2. … `"unclosedScope": "code_span"` to the corresponding `Q-*.json` corpus file (e.g. `Q-2-24.json` for unclosed code span). Regenerate via `./scripts/build_error_table.ts`. The build script automatically pulls the annotation into `_autogen-scope-owners.json`, and `OuterScope::to_qcode` will find the new mapping at compile time.
3. … any parser-side heuristic that diverges from the walker's view (the way single-quote does with its apostrophe-as-text-close rule). If yes, add an arm to the dispatch gate in `error_generation.rs` that routes that scope through Merr. If no, let it use the default outer-scope dispatch like emphasis and double-quote already do. Most inline scopes have unambiguous open/close semantics, so this step is a no-op for them.
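The parser-bound step might look roughly like this. The signature of `scope_for_token` and every token-kind string except `code_span_delimiter` are assumptions made for illustration; the emphasis arm shows the "read the actual delimiter byte" idea, since emphasis tokens can carry leading whitespace.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    /// The newly added scope kind.
    CodeSpan,
}

/// Maps a token kind (plus its source text) to a scope, if it delimits one.
/// Kind names other than `code_span_delimiter` are placeholders.
pub fn scope_for_token(kind: &str, text: &str) -> Option<OuterScope> {
    match kind {
        "single_quote" => Some(OuterScope::SingleQuote),
        "double_quote" => Some(OuterScope::DoubleQuote),
        // Skip any leading whitespace, then look at the delimiter byte.
        "emphasis_delimiter" => match text.trim_start().as_bytes().first() {
            Some(b'*') => Some(OuterScope::EmphStar),
            Some(b'_') => Some(OuterScope::EmphUnderscore),
            _ => None,
        },
        // The one new arm needed for code spans.
        "code_span_delimiter" => Some(OuterScope::CodeSpan),
        _ => None,
    }
}
```

In this sketch, supporting code spans costs one enum variant and one match arm; the Q-code itself would still come from the corpus annotation rather than from this function.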
No new corpus entries per (LR state, lookahead) bucket. No new schema fields. No combinatorial expansion across N outer × M inner pairs. The infrastructure built in Phases 2–4 of this branch was specifically designed so that each new inline scope adds constant work, not quadratic.
Possible parser-side heuristics worth checking for the remaining inline kinds, in case they need step 3:
- … `{..}`: Pandoc's syntax overlaps; pampa may treat `{` after a span specially. Worth checking the scanner for context-sensitive cases.
- … `$..$`: `$` followed by a digit-or-letter is math; `$` in prose contexts is just a dollar sign. Like the apostrophe heuristic, this is a parser-level distinction the walker would need to mirror if it pushed naively on every `$`.
- … `^..^` vs footnote prefix `^[..]`: distinguishable by lookahead, but the walker would need to consult the next token before deciding which variant of `^` to push.
- … `~~..~~` vs subscript `~..~`: distinguishable by token kind in tree-sitter, so the walker handles them with separate scope variants.
Each of these is a self-contained follow-up issue. The framework established here scales to all of them without architectural change.
Coverage
Test counts
`test_emphasis_in_quote.rs`: 61 tests covering all six "Unclosed X" shapes across single/double quote, emphasis-star/underscore, and strong-star/underscore outer wrappers, plus controls for apostrophe-in-emphasis, deep nesting, multi-paragraph regression shapes, and the new anchor-wording disambiguation.
CI parity
`cargo xtask verify --skip-hub-build --skip-hub-tests --skip-trace-viewer-build --skip-trace-viewer-tests` passes: lint, fmt, build with `-D warnings`, tree-sitter tests, workspace nextest.
Known limitations
These cases are documented here rather than addressed in this PR because each requires independent design work larger than the scope of this branch.
Multiple unclosed inlines in a single block
Before and after both produce only one diagnostic (for the innermost `_`), even though both `*` and `_` are unclosed:
The walker correctly sees both on its stack, but the dispatcher emits one diagnostic per parser error state and currently picks the innermost frame. Fixable in the dispatch layer (~20 LOC): when `find_outermost_close` returns `None`, emit one diagnostic per stack frame outermost-first.
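The proposed dispatch-layer change could be sketched as follows. Nothing here exists on the branch: `codes_to_report` is a hypothetical helper, and the hardcoded `qcode` match is a stand-in for the corpus-driven table so that the example stays self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
}

impl OuterScope {
    /// Stand-in for the table lookup; codes are the ones named in this PR.
    fn qcode(self) -> &'static str {
        match self {
            OuterScope::SingleQuote => "Q-2-9",
            OuterScope::DoubleQuote => "Q-2-11",
            OuterScope::EmphStar => "Q-2-12",
            OuterScope::EmphUnderscore => "Q-2-5",
        }
    }
}

/// Picks which Q-codes to emit for one parser error state.
///
/// `stack` is the walker's scope stack, bottom (outermost) first.
pub fn codes_to_report(stack: &[OuterScope], outermost_close: Option<usize>) -> Vec<&'static str> {
    match outermost_close {
        // No closer anywhere: every open frame is unclosed, outermost first.
        None => stack.iter().map(|s| s.qcode()).collect(),
        // An outer closer exists: keep reporting only the innermost frame.
        Some(_) => stack.last().map(|s| s.qcode()).into_iter().collect(),
    }
}
```

For a stack of star emphasis followed by underscore emphasis with no outer closer, this would yield Q-2-12 and then Q-2-5 instead of the single innermost code.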
Errors in subsequent paragraphs after the first failure
Only the first paragraph's error is reported; the second paragraph's unclosed `_` is silent. The verbose trace shows tree-sitter hits `detect_error` once, then enters `recover_to_previous` and `skip_token`-s through the rest of the document without re-entering a productive state, so no second parse-error event ever fires.
This is a parser-level limitation, not a dispatch-layer one. Possible approaches for a follow-up:
- … failures, then re-run the scope-walker on each.
- … boundaries.
- … instead of skipping to EOF.
The first option is the most plausible without touching tree-sitter internals and would reuse the existing walker.