Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis) by rundel · Pull Request #217 · quarto-dev/q2

rundel · 2026-05-19T23:09:45Z

Apologies in advance for all of the nonsense that is about to follow. Everything started with good intentions and a minor issue around error locations for unclosed emphasis markers:

$ printf -- 'The "_blank" word.\n' | cargo run --bin pampa -- 2>&1

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The "_blank" word.
   │     ┬┬
   │     ╰─── This is the opening quote mark.
   │      │
   │      ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

How hard could that possibly be to fix?

It got very ugly very quickly. I'm not sure whether what follows is a good solution, but given that it took a lot of iteration to even get close to something that mostly worked, I think it's worthwhile documenting it all and passing it off. At a minimum it will hopefully provide a partial starting point for a better / cleaner solution.

The commits document a lot (but not all) of the iterations (e.g. adding an additional key to the corpus lookup table) that were eventually scrapped, so it may also be useful to have Claude summarize those for you individually.

Claude's write-up (w/ light editing)

Summary

This branch fixes a class of misleading parse errors where an unclosed
emphasis or quote marker nested inside another inline scope was reported as
if the outer scope were unclosed. The fix evolves through nine commits
that progressively generalize the mechanism, refactor it for
maintainability, and improve the user-facing diagnostic. Along the way it
picks up two unrelated parser fixes that surfaced during the work.

Final shape of the change:

- A scope-stack walker reconstructs the parser's nesting context at error
  time so the diagnostic can target the correct inline element regardless
  of how deeply it was nested.
- The "Unclosed X" diagnostics now anchor their indicator at the closing
  delimiter of the enclosing scope (when one exists) instead of floating at
  the parser-failure position past end of line.
- The message wording adjusts to context: "end of the block" at block
  level, "end of the inline scope" when nested.
- Dispatch is driven by a corpus-annotated unclosedScope field rather
  than any hardcoded Rust mapping or per-(state, sym) corpus authoring.
  Adding a new inline scope kind is a one-line corpus edit plus a walker
  enum variant.

Diff size: 20 files, +2,883 / −466. Most of the insertion volume is the
new outer_scope.rs module (740 LOC, ~half of it tests), the expanded
test_emphasis_in_quote.rs regression suite (61 tests), and the
regenerated _autogen-table.json. The actual implementation logic is ~400
LOC.

Before / after on concrete inputs

Case 1 — Unclosed emphasis between paired double quotes

printf 'The "_blank" word.\n' | cargo run --bin pampa --

Before (on main):

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The "_blank" word.
   │            ┬┬
   │            ╰─── This is the opening quote mark.
   │             │
   │             ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:12 ]
   │
 1 │ The "_blank" word.
   │      ┬     ┬
   │      ╰──────── This is the opening '_' mark.
   │            │
   │            ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

The diagnostic now points at the unclosed _ rather than sending the user
hunting for a missing " that's already in the input. The indicator is
anchored at the closing " because the parser actually gave up at that
point (the _ couldn't close before the " arrived). A hint suggests the
escape if the user meant the marker literally.

Case 2 — Same shape, single quotes

printf "The '_blank' word.\n" | cargo run --bin pampa --

Before:

Error: [Q-2-10] Closed Quote Without Matching Open Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The '_blank' word.
   │            ┬┬
   │            ╰─── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │             │
   │             ╰── A space is causing a quote mark to be interpreted as a quotation close.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:12 ]
   │
 1 │ The '_blank' word.
   │      ┬     ┬
   │      ╰──────── This is the opening '_' mark.
   │            │
   │            ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

The single-quote case was even more confusing on main — the diagnostic
talked about apostrophe interpretation while the actual problem was the
same unclosed _.

Case 3 — Unclosed emphasis inside single quotes (with whitespace flanks)

printf "'a * b'\n" | cargo run --bin pampa --

Before:

Error: [Q-2-7] Unclosed Single Quote
   ╭─[ <stdin>:1:8 ]
   │
 1 │ 'a * b'
   │       ┬┬
   │       ╰─── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │        │
   │        ╰── I reached the end of the block before finding a closing "'" for the quote.
───╯

After:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:7 ]
   │
 1 │ 'a * b'
   │    ┬  ┬
   │    ╰───── This is the opening '*' mark.
   │       │
   │       ╰── Reached the end of the inline scope before finding a closing '*' for the emphasis.
───╯
ℹ If you want a literal asterisk, escape it with `\*`.

main insisted the ' was unclosed even though both ' markers are
present and the actual problem is the orphan *. The new diagnostic
points at the * and uses "the inline scope" wording because the
indicator lands on the closing '.

Case 4 — Same emphasis at block level (no enclosing scope)

printf 'a * b\n' | cargo run --bin pampa --

Before:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:6 ]
   │
 1 │ a * b
   │ ┬    ┬
   │ ╰─────── This is the opening '*' mark.
   │      │
   │      ╰── I reached the end of the block before finding a closing '*' for the emphasis.
───╯

After:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:6 ]
   │
 1 │ a * b
   │   ┬  ┬
   │   ╰───── This is the opening '*' mark.
   │      │
   │      ╰── Reached the end of the block before finding a closing '*' for the emphasis.
───╯
ℹ If you want a literal asterisk, escape it with `\*`.

Two small improvements visible here even though the Q-code is unchanged:

- The opening-mark indicator now points at the * (col 2) rather than
  the start of the line (col 0). On main, the opener anchor was off
  because the parser's leading-whitespace token was reported untrimmed;
  the new walker trims to the actual delimiter byte.
- The wording stays "Reached the end of the block before…", which is
  context-appropriate for a block-level failure with no enclosing inline
  scope.

Compare this against Case 3 above: the same Q-2-12 template, but the
wording switches between "the block" and "the inline scope" depending on
whether an outer closer anchors the indicator.

Case 5 — Unclosed inner emphasis inside paired outer emphasis

printf '*a _b c* trailing\n' | cargo run --bin pampa --

Before:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:18 ]
   │
 1 │ *a _b c* trailing
   │                  ┬
   │                  ╰── I reached the end of the block before finding a closing '*' for the emphasis.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:8 ]
   │
 1 │ *a _b c* trailing
   │    ┬   ┬
   │    ╰────── This is the opening '_' mark.
   │        │
   │        ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

main insisted the outer * was unclosed (and let the indicator float at
end-of-line, far from any source feature). The actual issue is the inner
_, with the outer *..* correctly closing at col 7. The new diagnostic
identifies the right marker and anchors at the closing *.

Case 6 — Truly unclosed single quote at block level

printf "'foo bar\n" | cargo run --bin pampa --

Before:

Error: [Q-2-7] Unclosed Single Quote
   ╭─[ <stdin>:1:9 ]
   │
 1 │ 'foo bar
   │ ┬       ┬
   │ ╰────────── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │         │
   │         ╰── I reached the end of the block before finding a closing "'" for the quote.
───╯

After:

Error: [Q-2-9] Unclosed Single Quote
   ╭─[ <stdin>:1:9 ]
   │
 1 │ 'foo bar
   │ ┬       ┬
   │ ╰────────── This is the opening single quote.
   │         │
   │         ╰── Reached the end of the block before finding a closing single quote.
───╯
ℹ If you meant a literal apostrophe, escape it with `\'`.

The behavior is essentially unchanged here — both branches correctly
identify the unclosed ' at block level. The Q-code has shifted from
Q-2-7 to Q-2-9 (a new code introduced earlier in the branch to
distinguish whitespace-prefixed openers from the apostrophe-as-text-close
interpretation that qmd-syntax-helper relies on), but the diagnostic
content is equivalent.

The bug

The misclassification is a consequence of how tree-sitter's LR state
generator minimises distinct states with identical outgoing transitions.
Multiple input shapes that semantically fail for different reasons
collapse onto the same (LR state, lookahead symbol) parse-error pair,
and the Merr-style error table can only see that pair. For example, both
of these reach (state=704, sym=_whitespace) at error time:

Input	Real failure	Wanted
`_a' b._`	apostrophe-as-text-close collides with `_..._`	Q-2-10 "Closed Quote Without Matching Open"
`The '_blank' word.`	unclosed `_` inside paired `'..'`	Q-2-5 "Unclosed Underscore Emphasis"

The same shape repeats across all six "Unclosed X" inline scopes (single
quote, double quote, single/double emphasis with * or _).
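
To make the collision concrete, here is a minimal sketch of a table keyed only on (state, sym). The `ErrorKey` alias and both functions are hypothetical stand-ins, not the real Merr table; only the state number, the symbol name, and the Q-code come from the example above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the key of a Merr-style table: this pair is
/// all the dispatcher can see at error time.
type ErrorKey = (u32, &'static str);

/// Build a toy table holding the single entry authored for `_a' b._`.
fn toy_error_table() -> HashMap<ErrorKey, &'static str> {
    let mut table = HashMap::new();
    table.insert((704, "_whitespace"), "Q-2-10");
    table
}

/// Look up the Q-code for a parse-error pair. The input text is not part
/// of the key, so any two inputs that reach the same pair get the same code.
fn lookup(
    table: &HashMap<ErrorKey, &'static str>,
    state: u32,
    sym: &'static str,
) -> Option<&'static str> {
    table.get(&(state, sym)).copied()
}
```

Because `The '_blank' word.` reaches the same pair, a lookup like this can only ever return the Q-2-10 entry for it, which is the misclassification described above.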

Approach: scope-stack walker plus context-aware dispatch

The core idea is to give the dispatcher a third piece of context beyond
the LR state and lookahead: which inline scope was open at error time.
This is recovered by walking the parser's token log.

`outer_scope.rs` walker

crates/quarto-parse-errors/src/outer_scope.rs (new, 740 LOC) provides
one internal scope_walk helper plus three public readers:

- compute_outer_scope: returns the innermost open inline scope at the
  error position (the top of the stack).
- find_outermost_close: returns the position of the would-be closer
  for the outermost open scope (the bottom of the stack), or None if
  no candidate closer exists in the remaining tokens.
- find_innermost_open_position: returns the position of the innermost
  open scope's opening delimiter.

The walker combines the parser's all_tokens and consumed_tokens logs,
sorts them by source position, and iterates applying the same push /
pop-if-top / block-boundary-clear rules the external scanner uses. The
result is the scope-stack snapshot at the moment the parser failed.
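
The push / pop-if-top / block-boundary-clear rules can be sketched roughly as follows. This is a simplified illustration, not the code in outer_scope.rs: the `Scope`, `TokenKind`, and `Token` types are hypothetical, and the sketch assumes the caller passes an `error_offset` so that tokens at or after the failure are ignored (how the real walker treats the would-be closer is not shown in this write-up).

```rust
/// Inline scope kinds tracked by this sketch (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
}

/// Hypothetical, simplified classification of one logged token.
#[derive(Debug, Clone, Copy)]
enum TokenKind {
    Delimiter(Scope),
    BlockBoundary,
    Other,
}

/// Hypothetical logged token: a byte offset plus its classification.
#[derive(Debug, Clone, Copy)]
struct Token {
    offset: usize,
    kind: TokenKind,
}

/// Replay tokens in source order up to `error_offset` and return the
/// open-scope stack (first = outermost, last = innermost), pairing each
/// scope with the offset of its opener.
fn scope_stack(mut tokens: Vec<Token>, error_offset: usize) -> Vec<(Scope, usize)> {
    tokens.sort_by_key(|t| t.offset);
    let mut stack: Vec<(Scope, usize)> = Vec::new();
    for tok in tokens.into_iter().take_while(|t| t.offset < error_offset) {
        match tok.kind {
            TokenKind::Delimiter(scope) => {
                if stack.last().map(|(s, _)| *s) == Some(scope) {
                    // Same kind as the top of the stack: treat as its closer.
                    stack.pop();
                } else {
                    // Anything else opens a new scope.
                    stack.push((scope, tok.offset));
                }
            }
            // A block boundary ends every open inline scope.
            TokenKind::BlockBoundary => stack.clear(),
            TokenKind::Other => {}
        }
    }
    stack
}
```

In this sketch the innermost open scope is `stack.last()` and the outermost is `stack.first()`, which corresponds to the top-of-stack and bottom-of-stack readers listed above.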

Recognised scope kinds: single_quote, double_quote, emphasis
delimiters (* and _, distinguished by reading the actual byte from
the input because emphasis tokens can include leading whitespace), and
strong emphasis delimiters.
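
A minimal sketch of the "read the actual byte" step, assuming a token is described by a byte range into the input. The `EmphMarker` enum and the `classify_emphasis` function are hypothetical names used only for illustration.

```rust
/// Which emphasis marker a delimiter token uses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EmphMarker {
    Star,
    Underscore,
}

/// Given the source bytes and a token's byte range `start..end`, skip any
/// leading ASCII whitespace inside the token and classify the first
/// delimiter byte. Returns the marker and the offset of that byte, or
/// `None` if the range holds no `*` or `_` after the whitespace.
fn classify_emphasis(input: &[u8], start: usize, end: usize) -> Option<(EmphMarker, usize)> {
    let end = end.min(input.len());
    let mut pos = start;
    while pos < end && input[pos].is_ascii_whitespace() {
        pos += 1;
    }
    if pos >= end {
        return None;
    }
    match input[pos] {
        b'*' => Some((EmphMarker::Star, pos)),
        b'_' => Some((EmphMarker::Underscore, pos)),
        _ => None,
    }
}
```

For the input `a * b` with a token covering the space and the star, this returns the star at offset 2, which is the trimmed opener position shown in Case 4.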

Corpus-driven `OuterScope → Q-code` table

Each "Unclosed X" corpus file carries a top-level annotation:

{
  "code": "Q-2-5",
  "title": "Unclosed Underscore Emphasis",
  "unclosedScope": "emph_underscore",
  ...
}

build_error_table.ts collects these into a generated artifact
_autogen-scope-owners.json (6 entries), and a sibling proc-macro
include_scope_owner_table! embeds it at compile time.
OuterScope::to_qcode reads from this generated table rather than a
hardcoded match, so renaming a Q-code in the corpus propagates
automatically — the same trust model the Merr table itself uses.
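
As a rough illustration of a table-driven lookup, the sketch below resolves a scope name to its owning Q-code from a slice of entries. The field names on `ScopeOwnerEntry` and the `qcode_for_scope` helper are assumptions (the real struct layout is not shown here), and the rows are hand-written rather than generated; the scope-name/Q-code pairs are the ones mentioned elsewhere in this write-up.

```rust
/// Hypothetical shape for one row of the generated scope-owner table.
#[derive(Debug, Clone, Copy)]
struct ScopeOwnerEntry {
    /// Value of the corpus file's `unclosedScope` annotation.
    unclosed_scope: &'static str,
    /// Q-code of the corpus file that carries the annotation.
    code: &'static str,
}

/// Illustrative rows; in the branch these come from the generated JSON.
const SCOPE_OWNERS: &[ScopeOwnerEntry] = &[
    ScopeOwnerEntry { unclosed_scope: "emph_underscore", code: "Q-2-5" },
    ScopeOwnerEntry { unclosed_scope: "emph_star", code: "Q-2-12" },
    ScopeOwnerEntry { unclosed_scope: "single_quote", code: "Q-2-9" },
    ScopeOwnerEntry { unclosed_scope: "double_quote", code: "Q-2-11" },
];

/// Find the Q-code that owns a scope name, if the corpus declares one.
fn qcode_for_scope(owners: &[ScopeOwnerEntry], scope_name: &str) -> Option<&'static str> {
    owners
        .iter()
        .find(|entry| entry.unclosed_scope == scope_name)
        .map(|entry| entry.code)
}
```

A scope with no annotated corpus file simply yields `None`, so nothing in the lookup itself needs to change when a new annotation is added.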

Dispatch in `error_generation.rs`

For each parse-error state, error_diagnostic_from_parse_state:

1. Computes the outer scope and the would-be closer of the outermost
   enclosing scope.
2. Decides whether to route the diagnostic through outer-scope dispatch
   directly or through the Merr (state, sym) lookup:
   - Emphasis-flavored or double_quote innermost scope →
     outer-scope dispatch: build the diagnostic from the scope's owning
     Q-code template and skip the Merr table entirely. This handles
     `The "_blank" word.`-style cases as well as arbitrary nesting
     depth (3+ levels) without needing per-shape corpus entries.
   - single_quote innermost scope → fall through to Merr by
     default, because the apostrophe-as-close heuristic (a '
     preceded by a letter is text, not an opener) means the walker's
     view can diverge from the parser's view here. Pre-existing
     Q-2-7 / Q-2-10 entries handle the parser-interpretation
     overrides correctly. When Merr's match does not carry one of
     those apostrophe-heuristic codes (i.e. the match is a
     state-collision false positive from some other input that
     happened to reach the same (state, sym)), fall back to the
     synthetic Q-2-9 path.
   - No outer scope → Merr lookup as usual.
3. Anchors the primary indicator at the closing delimiter of the
   outermost open scope (via find_outermost_close) when one exists;
   otherwise falls back to the parser-failure position.
4. Substitutes the {anchor} placeholder in the corpus message with
   "the inline scope" or "the block" depending on which anchor branch
   fired in step 3.

The asymmetric handling of single-quote vs. the other scopes is
principled, not a hack: single-quote is the one scope kind in qmd where
the parser applies a heuristic that diverges from a naive push/pop
model of the source characters. Every other inline delimiter has
unambiguous open/close semantics, so the walker's stack and the parser's
interpretation agree.
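
The routing and anchoring decisions can be sketched as two small pure functions. This is an illustration of the rules as described, not the code in error_generation.rs: the `Route` and `Anchor` enums and both function names are hypothetical, and the single-quote fallback to Q-2-9 is left out here.

```rust
/// Inline scope kinds named in this write-up (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Which path builds the diagnostic.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Route {
    /// Build directly from the scope's owning Q-code template.
    OuterScopeDispatch,
    /// Use the Merr (state, sym) table.
    MerrLookup,
}

/// Choose a route from the innermost open scope, if any. Single quote and
/// "no scope" go to Merr; every other scope uses outer-scope dispatch.
fn route_for(innermost: Option<OuterScope>) -> Route {
    match innermost {
        None | Some(OuterScope::SingleQuote) => Route::MerrLookup,
        Some(_) => Route::OuterScopeDispatch,
    }
}

/// Where the primary indicator lands, which also decides the wording.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Anchor {
    /// Closing delimiter of the outermost open scope.
    InlineScope(usize),
    /// Parser-failure position (no candidate closer was found).
    Block(usize),
}

/// Prefer the outermost scope's closer; otherwise use the failure position.
fn primary_anchor(outermost_close: Option<usize>, failure_pos: usize) -> Anchor {
    match outermost_close {
        Some(pos) => Anchor::InlineScope(pos),
        None => Anchor::Block(failure_pos),
    }
}
```

Keeping both decisions as plain functions of the walker's output makes the asymmetry for single quote a single match arm rather than a separate code path.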

Q-2-9 vs. Q-2-10: the apostrophe-heuristic split

Single-quote is the one scope kind that needs two distinct
"unclosed-X" diagnostics rather than one, because qmd's apostrophe
heuristic gives ' two parser-side interpretations. The walker pushes
SingleQuote on every ' token it sees; the parser decides per-'
whether the character is an opener or a close-apostrophe based on
what precedes it.

- Q-2-9 "Unclosed Single Quote" is for shapes where the ' is
  a real opener (typically whitespace-prefixed) and the user forgot
  the matching close. Example: `'foo bar` (Case 6 above), or
  `*a 'b c*` (an unclosed inner quote inside paired emphasis). The
  walker's stack matches the parser's view here, so the dispatcher
  hits the synthetic Q-2-9 path through OuterScope::to_qcode.
- Q-2-10 "Closed Quote Without Matching Open Quote" is for shapes
  where the ' follows a letter, so the parser treats it as text
  (an apostrophe-as-close), while some surrounding scope is what
  actually failed. Example: `_a' b._`, where the parser's view is
  "stray apostrophe close inside emphasis"; the user's actual
  problem is the unmatched _. (qmd-syntax-helper's auto-fix
  pipeline depends on Q-2-10 firing for these shapes so it can
  suggest escaping the apostrophe.)

The two interpretations frequently land on the same (LR state, lookahead) parse-error bucket, so the table-only approach can't
distinguish them. The dispatch rule that resolves this is in the
single-quote arm of the gate:

When the innermost open scope is SingleQuote, look at the Merr
entries for the current (state, sym). If any entry carries
Q-2-7 or Q-2-10, take it — those are the parser-interpretation
overrides authored deliberately into the corpus. If neither code
is present, the matched entry is a state-collision false positive
from some unrelated shape that happened to reach the same bucket;
fall through to the synthetic Q-2-9 path that the walker's view
suggests.

In effect, Q-2-7 and Q-2-10 act as explicit overrides for
parser-side reinterpretations of ', and Q-2-9 is the default
for everything else with a SingleQuote on the stack. The dispatch
gate codes this check inline (matches!(e.error_info.code, Some("Q-2-7") | Some("Q-2-10"))) rather than via another corpus annotation; with
only two override codes and no path to growth (apostrophe is the only
heuristic of its kind in qmd), the hardcoded set is more readable than
a generated table would be. If a future inline delimiter grows a
similar parser-side heuristic, the same pattern can be lifted into a
corpus annotation at that point.
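
A small sketch of the single-quote arm, assuming the Merr entries for the current (state, sym) have already been reduced to their optional Q-codes. The function names are hypothetical; the `matches!` check mirrors the one quoted above.

```rust
/// True for the two codes that encode the apostrophe heuristic.
fn is_apostrophe_override(code: Option<&str>) -> bool {
    matches!(code, Some("Q-2-7") | Some("Q-2-10"))
}

/// Pick the Q-code when the innermost open scope is a single quote:
/// take the first Merr entry carrying an override code, otherwise fall
/// back to the synthetic Q-2-9 path.
fn single_quote_qcode<'a>(merr_codes: &[Option<&'a str>]) -> &'a str {
    merr_codes
        .iter()
        .copied()
        .find(|code| is_apostrophe_override(*code))
        .flatten()
        .unwrap_or("Q-2-9")
}
```

Any other code found in the bucket (for example a Q-2-12 entry from an unrelated shape) is ignored, which is the "state-collision false positive" case.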

Unrelated fixes folded in

Scanner `:::` peek gate (in commit `a01e3bc6`)

crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c had a
regression where the pipe-table-line-ending detection's :::
lookahead peek ran in any context, not just when
PIPE_TABLE_LINE_ENDING was a valid symbol. The peek calls advance()
to step past the colons; in non-pipe-table contexts this corrupted
the lexer's position. Subsequent SOFT_LINE_ENDING emissions then
mark_end-ed at the wrong byte, breaking fenced divs whose body
contained an inline-bearing block followed by :::{attrs} on the next
line.

Fix: gate the peek on valid_symbols[PIPE_TABLE_LINE_ENDING]. 19-line
change with a long comment explaining the failure mode in detail.

GLR speculative-error filter (in commit `a01e3bc6`)

When had_errors() is true but the final tree has no ERROR nodes,
treat the parse as successful: the errors came from dead GLR
speculation branches.
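
The filter amounts to the check sketched below. The `Node` type is a hypothetical, simplified tree used only to keep the example self-contained; it is not the tree-sitter API, and the function names are illustrative.

```rust
/// Hypothetical, simplified syntax-tree node (not the tree-sitter API).
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

/// Depth-first search for any node whose kind is "ERROR".
fn contains_error_node(node: &Node) -> bool {
    node.kind == "ERROR" || node.children.iter().any(contains_error_node)
}

/// A parse counts as failed only if the parser reported errors AND an
/// ERROR node survived into the final tree.
fn parse_succeeded(had_errors: bool, root: &Node) -> bool {
    !had_errors || !contains_error_node(root)
}
```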

What this branch changes structurally

New files

- crates/quarto-parse-errors/src/outer_scope.rs: scope-stack walker
  and OuterScope enum.
- crates/pampa/resources/error-corpus/_autogen-scope-owners.json:
  generated OuterScope → Q-code mapping.
- crates/pampa/tests/test_emphasis_in_quote.rs: 61-test regression
  suite covering all the bug shapes, control cases, and deep nesting.

Modified APIs

- produce_diagnostic_messages: new scope_owners: &[ScopeOwnerEntry]
  parameter. Pampa's wrapper supplies get_scope_owner_table().
- OuterScope::to_qcode: now takes a &[ScopeOwnerEntry] parameter.
- The include_scope_owner_table! proc-macro is a new sibling to
  include_error_table!, sharing the same input syntax.

Unchanged APIs

- lookup_error_message, lookup_error_entry, ErrorTableEntry: all
  back to their pre-branch signatures (no outer_scope parameter or
  field).
- produce_error_message_json: back to its pre-branch signature (no
  input_bytes parameter).

Corpus contract

Each "Unclosed X" Q-*.json file now declares:

{
  "code": "Q-2-X",
  "title": "Unclosed X",
  "unclosedScope": "<scope-name>",
  "message": "Reached the end of {anchor} before finding ...",
  ...
}
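
A minimal sketch of how the {anchor} placeholder in such a message could be filled in. The `render_message` function and its boolean parameter are hypothetical; the two replacement strings are the ones quoted in the dispatch description.

```rust
/// Fill the `{anchor}` placeholder of a corpus message. The flag says
/// whether the primary indicator was anchored at an enclosing scope's
/// closing delimiter (true) or at the parser-failure position (false).
fn render_message(template: &str, anchored_at_outer_close: bool) -> String {
    let anchor = if anchored_at_outer_close {
        "the inline scope"
    } else {
        "the block"
    };
    template.replace("{anchor}", anchor)
}
```

With the Q-2-12 wording as the template, the two outputs correspond to the messages shown in Case 3 and Case 4.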

Extending to other inline scopes

The walker today recognises only emphasis and quote scopes. The same
bug class exists for every other inline kind in qmd: span [..],
image ![..], inline footnote ^[..], code span `..`, inline
math $..$ , superscript ^..^, subscript ~..~, strikeout
~~..~~, editorial markers ({+..+}, {-..-}, {>..<}, {!..!},
{=..=}), and inline attribute specifiers {..}.

A quick probe shows these are still misclassified on both main and
this branch:

printf 'The "`code foo" word.\n' | cargo run --bin pampa --

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:6 ]
   │
 1 │ The "`code foo" word.
   │     ┬┬
   │     ╰─── This is the opening quote mark.
   │      │
   │      ╰── Reached the end of the block before finding a closing '"' for the quote.
───╯

The diagnostic should say [Q-2-24] Unclosed Code Span and point at the
backtick. The walker doesn't track code-span tokens, so the dispatcher
sees outer_scope = double_quote and emits the (incorrect) Q-2-11
template.

This is now a mechanical fix, not a design problem. Adding a new
inline scope kind is exactly three steps:

1. Walker recognition. Add a new variant on OuterScope (e.g.
   CodeSpan) and a matching arm in scope_for_token mapping the
   tree-sitter token kind (e.g. code_span_delimiter) to that
   variant. This is the only parser-bound piece: it requires knowing
   what tokens the grammar emits, which is inherently in code.
2. Corpus annotation. Add "unclosedScope": "code_span" to the
   corresponding Q-*.json corpus file (e.g. Q-2-24.json for
   unclosed code span). Regenerate via
   ./scripts/build_error_table.ts. The build script automatically
   pulls the annotation into _autogen-scope-owners.json, and
   OuterScope::to_qcode will find the new mapping at compile time.
3. Dispatch gate (if needed). Check whether the new scope kind has
   any parser-side heuristic that diverges from the walker's view (the
   way single-quote does with its apostrophe-as-text-close rule). If
   yes, add an arm to the dispatch gate in error_generation.rs that
   routes that scope through Merr. If no, let it use the default
   outer-scope dispatch like emphasis and double-quote already do.
   Most inline scopes have unambiguous open/close semantics, so this
   step is a no-op for them.

No new corpus entries per (LR state, lookahead) bucket. No new schema
fields. No combinatorial expansion across N outer × M inner pairs. The
infrastructure built in Phases 2–4 of this branch was specifically
designed so that each new inline scope adds constant work, not
quadratic.
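
Step 1 might look roughly like the sketch below. The enum is cut down to three variants, the token-kind strings passed to `scope_for_token` are assumptions about what the grammar emits, and `corpus_name` is a hypothetical helper showing how a variant lines up with the "unclosedScope" annotation from step 2.

```rust
/// Cut-down scope enum with the new variant added.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OuterScope {
    SingleQuote,
    DoubleQuote,
    /// New: code span.
    CodeSpan,
}

impl OuterScope {
    /// Name expected in the corpus file's `unclosedScope` annotation.
    fn corpus_name(self) -> &'static str {
        match self {
            OuterScope::SingleQuote => "single_quote",
            OuterScope::DoubleQuote => "double_quote",
            OuterScope::CodeSpan => "code_span",
        }
    }
}

/// Map a token kind to the scope it delimits. The kind strings are
/// assumed for illustration.
fn scope_for_token(kind: &str) -> Option<OuterScope> {
    match kind {
        "single_quote" => Some(OuterScope::SingleQuote),
        "double_quote" => Some(OuterScope::DoubleQuote),
        // The one new arm needed for walker recognition.
        "code_span_delimiter" => Some(OuterScope::CodeSpan),
        _ => None,
    }
}
```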

Possible parser-side heuristics worth checking for the remaining inline
kinds, in case they need step 3:

- Span vs. attribute specifier {..}: Pandoc's syntax overlaps;
  pampa may treat { after a span specially. Worth checking the
  scanner for context-sensitive cases.
- Math $..$: $ followed by a digit-or-letter is math; $ in
  prose contexts is just a dollar sign. Like the apostrophe heuristic,
  this is a parser-level distinction the walker would need to mirror
  if it pushed naively on every $.
- Superscript ^..^ vs footnote prefix ^[..]: distinguishable
  by lookahead, but the walker would need to consult the next token
  before deciding which variant of ^ to push.
- Strikeout ~~..~~ vs subscript ~..~: distinguishable by
  token kind in tree-sitter, so the walker handles them with separate
  scope variants.

Each of these is a self-contained follow-up issue. The framework
established here scales to all of them without architectural change.

Coverage

Test counts

- test_emphasis_in_quote.rs: 61 tests covering all six "Unclosed X"
  shapes across single/double quote, emphasis-star/underscore, and
  strong-star/underscore outer wrappers, plus controls for
  apostrophe-in-emphasis, deep nesting, multi-paragraph regression
  shapes, and the new anchor-wording disambiguation.
- Workspace: 9,072 tests pass.

CI parity

cargo xtask verify --skip-hub-build --skip-hub-tests --skip-trace-viewer-build --skip-trace-viewer-tests passes: lint,
fmt, build with -D warnings, tree-sitter tests, workspace nextest.

Known limitations

These cases are documented here rather than addressed in this PR
because each requires independent design work larger than the scope of
this branch.

Multiple unclosed inlines in a single block

printf 'a *b _c\n' | cargo run --bin pampa --

Before and after both produce only one diagnostic (for the innermost
_), even though both * and _ are unclosed:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:8 ]
   │
 1 │ a *b _c
   │      ┬ ┬
   │      ╰──── This is the opening '_' mark.
   │        │
   │        ╰── Reached the end of the block before finding a closing '_' for the emphasis.
───╯

The walker correctly sees both on its stack, but the dispatcher emits
one diagnostic per parser error state and currently picks the
innermost frame. Fixable in the dispatch layer (~20 LOC): when
find_outermost_close returns None, emit one diagnostic per stack
frame outermost-first.
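
The proposed change is not implemented on this branch; the sketch below only illustrates the rule as stated, under the assumption that the walker's stack is available as (scope, opener offset) pairs ordered outermost-first. All type and function names are hypothetical.

```rust
/// Scope kinds needed for the example input `a *b _c`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scope {
    EmphStar,
    EmphUnderscore,
}

/// Minimal stand-in for a diagnostic: which scope is unclosed and where
/// its opening delimiter sits.
#[derive(Debug, PartialEq, Eq)]
struct UnclosedDiagnostic {
    scope: Scope,
    opener_offset: usize,
}

/// If the outermost scope has a closer, only the innermost frame is the
/// real failure (current behavior). If there is no closer at all, every
/// frame on the stack is unclosed, so report each one outermost-first.
fn diagnostics_for_stack(
    stack: &[(Scope, usize)],
    outermost_close: Option<usize>,
) -> Vec<UnclosedDiagnostic> {
    let to_diag = |&(scope, opener_offset): &(Scope, usize)| UnclosedDiagnostic {
        scope,
        opener_offset,
    };
    match outermost_close {
        Some(_) => stack.last().map(to_diag).into_iter().collect(),
        None => stack.iter().map(to_diag).collect(),
    }
}
```

For `a *b _c` the stack holds the star at offset 2 and the underscore at offset 5 with no closer, so this would yield two diagnostics, star first.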

Errors in subsequent paragraphs after the first failure

printf 'a *b\n\nc _d\n' | cargo run --bin pampa --

Only the first paragraph's error is reported; the second paragraph's
unclosed _ is silent. The verbose trace shows tree-sitter hits
detect_error once, then enters recover_to_previous and
skip_token-s through the rest of the document without re-entering a
productive state — so no second parse-error event ever fires.

This is a parser-level limitation, not a dispatch-layer one. Possible
approaches for a follow-up:

1. Walk ERROR nodes in the final tree to find blocks with unreported
   failures, then re-run the scope-walker on each.
2. Re-invoke the parser per block on failure, syncing at block
   boundaries.
3. Customise tree-sitter recovery so the parser resyncs at block-start
   instead of skipping to EOF.

The first option is the most plausible without touching tree-sitter
internals and would reuse the existing walker.

…uote"

When a paragraph contains an unmatched emphasis marker between matching double quotes (e.g. `The "_blank" word.`), the parser reported the double quote as unclosed (Q-2-11) instead of the emphasis. The "unclosed quote" message sent users hunting for a missing `"` that was already present.

Add four Merr-style corpus cases (one per emphasis marker) that redirect the (LR state, lookahead) pair to the matching emphasis diagnostic:

    The "_blank" word. → Q-2-5 (Unclosed Underscore Emphasis)
    The "__blank" word. → Q-2-15 (Unclosed Strong Underscore Emphasis)
    The "*blank" word. → Q-2-12 (Unclosed Star Emphasis)
    The "**blank" word. → Q-2-13 (Unclosed Strong Star Emphasis)

Each cite also gets a hint telling the user to escape the literal marker (`\_` / `\*`). The hint bumps the diagnostic score so the emphasis-aware mapping wins the (state, sym) tie against the older Q-2-11 entry.

Out of scope: single-quote contexts (`'_blank'`) collapse to the same LR (state, sym) as `qmd-syntax-helper`'s apostrophe-quotes test shapes (`*a' b.*`, `**a' b.**`, `_a' b._`, `__a' b.__`). Redirecting those would silently break the auto-fix path, so they're left at Q-2-7/Q-2-10. Double-quote shapes only.

Side effect of the (state, sym) sharing: rare shapes that previously fell under Q-2-11 like `*a" b.*` (an inch-mark inside emphasis) now emit Q-2-12. No regression test covers those, but the matrix is documented in the plan.

Tests
-----

New regression file `crates/pampa/tests/test_emphasis_in_quote.rs` covers all four emphasis variants plus a pipe-table fixture. Modeled on `tests/test_link_destination_linebreak.rs`.

Verified `cargo nextest run --workspace` → 8993 passed, 0 failed. No snapshot files were affected (the snapshot test iterates over `resources/error-corpus/*.qmd`, which is empty in the new format).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previously the parser misclassified errors when emphasis/quote markers nested inside other emphasis/quote scopes. For example, `The '_blank' word.` reported Q-2-10 (Closed Quote Without Matching Open) instead of Q-2-5 (Unclosed Underscore Emphasis), because tree-sitter's LR state minimisation collapsed otherwise-distinct (state, sym) parse-error keys between inputs like `_a' b._` and `The '_blank' word.`. See inline_issue.md for the full bug write-up.

Fix: extend the Merr error-table key from (state, sym) to (state, sym, outer_scope), where outer_scope is the innermost open inline scope at the error position. A walker in quarto-parse-errors reconstructs the scope stack from the parser's token log (all_tokens + consumed_tokens), pushing on each scope-open token, popping on each top-of-stack matching close, and clearing on every block-boundary token. The same walker runs at corpus-build time so each test case records its natural outer_scope alongside (state, sym).

Additional fixes folded in:

- crates/pampa/src/readers/qmd.rs: when had_errors() is true but the resulting tree has no ERROR nodes, treat the parse as successful. GLR speculation hits detect_error in dead branches; if another branch reaches accept cleanly, those speculative errors shouldn't surface. Fixes `*a\" b.\"*` parsing as <em>a<dq>b.</dq></em>.
- Reverts the Option 1 scanner-gate experiment from the WIP version of this commit. LR state minimisation merges post-OPEN and post-CLOSE destination states, so scanner-level gating was invisible at the parser level; the experiment also broke valid nested-emphasis parses (e.g. `*_*triple*_*`).
- New corpus cases:
  - Q-2-5/12/13/15 in-single-quote: `The '_blank' word.` family
  - Q-2-15 in-single-quoted-emphasis: `The ' *__blank*' word.` family
- 22 tests in test_emphasis_in_quote cover the original 5 shipped cases, the 4 single-quoted variants, the 4 unclosed-double-quote-in-emphasis variants, the 4 apostrophe-in-emphasis controls, the multi-paragraph regression case, three-level-nesting Q-2-15 cases, and the clean-parse case for nested double-quote-in-emphasis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

inline_issue.md and options.md were committed at the repo root in a01e3bc but belong nowhere in tree per the claude-notes/ convention. The implementation they describe is now fully captured in the code and tests, so they are obsolete as forward plans. Three code-comment back-references to inline_issue.md are rewritten to stand on their own. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Refactor the inner-in-outer corpus cases for Q-2-5/12/13/15 to a single `inside-paired-outer` case each, using prefixesAndSuffixes to cover all six outer wrappers (four emphasis flavours plus single and double quote) with both bare and trailing-text variants. The diagnostic content doesn't depend on the outer scope, so the dedicated per-outer cases added earlier were redundant.

Add Q-2-9 "Unclosed Single Quote". Previously no diagnostic existed for this shape because qmd's apostrophe heuristic treats a whitespace-prefixed `'` as an opener that can be unclosed (distinct from Q-2-10's stray-apostrophe-close interpretation). The new code uses the same unified prefixesAndSuffixes pattern.

Add trailing-text variants (`* x\n`-style suffixes) to every unified case so realistic prose inputs like `*a _b c* word.` map to the right Q-code instead of falling through to generic Parse error.

Add `trimLeadingSpace`/`trimTrailingSpace` to Q-2-5/13/15 notes so the opening-delimiter indicator renders 1-wide instead of including the parser's leading-whitespace prefix.

Fix prior mislabeling: the test asserting Q-2-10 for `The "a 'b c" word.` was wrong: that input has a whitespace-prefixed `'` (Q-2-9), not a stray-close apostrophe (Q-2-10). Updated the test and removed the corresponding mislabeled case from Q-2-10.json.

The trailing "I reached the end of the block..." indicator still anchors at the parser-failure position (one past the closing outer delimiter for these cases). A follow-up will reroute it to the closing outer delimiter itself.

Test suite: 51 tests covering all bare and long-form combinations, including the original same-marker location regression and the new Q-2-9 cases. Full workspace nextest run (9055 tests) passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Route the primary diagnostic location (header and "Reached the end of the outer inline scope..." indicator) to the closing delimiter of the outermost open inline scope when one exists, rather than letting it float at the parser-failure position just past the end of the block.

Add `find_outermost_close` in `outer_scope.rs`: walks the parse log with the same push/pop and block-boundary semantics as `compute_outer_scope`, identifies the bottom of the scope stack, and returns the position of the latest token of that scope's kind appearing after the opener. The returned span is trimmed to the consecutive delimiter bytes so the indicator renders 1-wide on `*` (not 2-wide on whitespace + `*`).

Wire this into `error_generation.rs`: when `outer_scope != None`, use `find_outermost_close` to override the parse_state-derived source_info; fall back to parse_state when no closer exists (e.g. `*hello`).

Reword the unclosed-inline diagnostic messages (Q-2-5/9/11/12/13/15): drop the first-person "I reached" wording and replace "block" with "outer inline scope" so the message describes where the indicator actually lands.

For `a *b _c* jeloasd asdasd`, the indicator now sits on the closing `*` at col 7 instead of floating past the end of the line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…stics

Replace the combinatorial `prefixesAndSuffixes` corpus entries with a single fallback dispatch path in error_generation.rs. The previous approach required one corpus variant per (state, sym, outer_scope) triple, which doesn't scale to arbitrary nesting depth: 3-level nestings like `'*b _c*'` already need new variants per outer-outer combination, and N-level nestings explode further.

Key insight: for "unclosed inline scope" diagnostics, the diagnostic content depends only on `outer_scope` (the innermost open scope at error time). The state and sym don't add information; they're noise from tree-sitter's LR state minimisation.

Add `find_innermost_open_position` in outer_scope.rs: returns the innermost open scope's opener span (trimmed to the consecutive delimiter bytes) so the "opening mark" indicator lands on the actual delimiter regardless of nesting depth.

Add `OuterScope::to_qcode`: maps each open inline scope to its default Q-code (single_quote → Q-2-9, emph_star → Q-2-12, etc.).

In error_generation.rs, after the Merr lookup fails, check `outer_scope != None && sym in {_close_block, _whitespace}`. If so, look up the Q-code via `to_qcode`, find any existing corpus entry with that Q-code to source the title/message/notes/hints content, and build a synthetic diagnostic whose positions come from the scope-stack walker (opener) and from find_outermost_close (closing-outer indicator, already added in the previous commit). This handles arbitrary-depth nesting uniformly.

Specific Merr entries still take priority where they exist, preserving distinctions like Q-2-10 (apostrophe-close interpretation) vs Q-2-12 (unclosed `*`).

Remove the now-redundant corpus cases (`inside-paired-outer`, `same-marker`) from Q-2-5/12/13/15 and Q-2-11. Simplify Q-2-9 to a single minimal case used only as the fallback content template.

Net: ~150 LOC of new code, ~2,750 lines of corpus/autogen-table removed, 78 generated case files deleted.

Add 4 regression tests for 3-level nesting shapes that previously emitted generic "Parse error" and now route to the correct Q-code via the fallback path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
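The priority order in this commit (a specific Merr entry first, the scope-derived Q-code second) might look something like the sketch below. The `Sym` and `Scope` enums, the `pick_qcode` name, and the shape of the Merr table are hypothetical stand-ins; only the Q-code pairings named in the commit messages are used, and Q-2-15 is left out because the text does not say which scope owns it.

```rust
use std::collections::HashMap;

/// Hypothetical lookahead symbols; only the two named in the commit matter here.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Sym {
    CloseBlock,
    Whitespace,
    Other,
}

/// Hypothetical stand-in for the open inline scope kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Scope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    StrongStar,
    EmphUnderscore,
}

impl Scope {
    /// Default Q-code for an unclosed scope of this kind.
    fn to_qcode(self) -> &'static str {
        match self {
            Scope::SingleQuote => "Q-2-9",
            Scope::DoubleQuote => "Q-2-11",
            Scope::EmphStar => "Q-2-12",
            Scope::StrongStar => "Q-2-13",
            Scope::EmphUnderscore => "Q-2-5",
        }
    }
}

/// A specific Merr entry wins; otherwise fall back to the innermost open
/// scope, but only for the lookaheads that signal "ran out of input".
fn pick_qcode(
    merr: &HashMap<(u32, Sym), &'static str>,
    state: u32,
    sym: Sym,
    outer_scope: Option<Scope>,
) -> Option<&'static str> {
    if let Some(code) = merr.get(&(state, sym)) {
        return Some(*code);
    }
    match (outer_scope, sym) {
        (Some(scope), Sym::CloseBlock | Sym::Whitespace) => Some(scope.to_qcode()),
        _ => None, // the caller would emit the generic "Parse error"
    }
}
```

Keeping the Merr lookup first is what preserves the Q-2-10 vs Q-2-12 distinction mentioned above: the fallback only fires when no specific entry matches.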

Exercise the generic fallback dispatch with shapes where the scope stack at error time isn't trivially "everything still open":

- `a '*b _c_ jeloasd' asdasd`: 3-level nesting where the middle `*` is unmatched but the inner `_c_` pairs. Stack at error is depth 2 (single_quote + emph_star); fallback emits Q-2-12.
- `a "**b _c_ jeloasd" asdasd`: same shape with strong-star middle; Q-2-13.
- `a '*"_b c"*' x`: 4-level deep nest, innermost `_` unmatched. Stack at error is depth 4; fallback walks down to the innermost emph_underscore and emits Q-2-5.
- `a '*_b "c d_*' x`: 4-level deep, innermost is a double quote; Q-2-11.

These verify the walker correctly tracks push/pop across arbitrary depths and that the to_qcode mapping fires off the innermost open scope regardless of how many scopes are stacked above the actual unmatched marker.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
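To make the stack depths in these cases concrete, here is a toy walker that reproduces them. It is a deliberately simplified assumption, not the real walker: it scans raw bytes rather than the parse log, ignores flanking rules, block boundaries and the apostrophe heuristic, and treats a delimiter whose kind is already open deeper in the stack as a candidate closer that leaves the stack unchanged. The `Scope`, `tokenize` and `open_scopes` names are hypothetical.

```rust
/// Hypothetical stand-in for the open inline scope kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Scope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    StrongStar,
    EmphUnderscore,
}

/// Toy tokenizer: every delimiter byte is a token, and `**` is one strong-star token.
fn tokenize(src: &str) -> Vec<Scope> {
    let b = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < b.len() {
        let (scope, width) = match b[i] {
            b'*' if b.get(i + 1) == Some(&b'*') => (Some(Scope::StrongStar), 2),
            b'*' => (Some(Scope::EmphStar), 1),
            b'_' => (Some(Scope::EmphUnderscore), 1),
            b'\'' => (Some(Scope::SingleQuote), 1),
            b'"' => (Some(Scope::DoubleQuote), 1),
            _ => (None, 1),
        };
        out.extend(scope); // pushes the token if there is one
        i += width;
    }
    out
}

/// Return the scopes still open at the end of the input, outermost first.
fn open_scopes(src: &str) -> Vec<Scope> {
    let mut stack: Vec<Scope> = Vec::new();
    for tok in tokenize(src) {
        if stack.last() == Some(&tok) {
            stack.pop(); // closes the innermost open scope
        } else if !stack.contains(&tok) {
            stack.push(tok); // opens a new scope
        }
        // Otherwise the token could only close a deeper scope: leave the stack as is.
    }
    stack
}
```

Under these simplifications, `open_scopes("a '*b _c_ jeloasd' asdasd")` leaves single_quote and emph_star open, so the innermost scope is emph_star, which is the scope the commit says maps to Q-2-12.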

…atch

The bug fixes on this branch (`The "_blank" word.` and friends mapping to the correct Q-code) had grown to extend the Merr error-table key from `(state, sym)` to `(state, sym, outer_scope)` across `ErrorTableEntry`, both lookup helpers, the `include_error_table!` proc-macro, the build script, every Q-*.json corpus file, every row of `_autogen-table.json`, and the QMD-side wrappers. Of 115 `(state, sym)` collision buckets, only 8 actually needed `outer_scope` to disambiguate, and the same approach did not scale to the other inline scopes (span, code span, math, etc.) that share the same bug class.

Revert the schema change and replace it with a smaller mechanism that keeps the corpus as the single source of truth for which Q-code owns which scope kind.

Schema revert

- Drop `outer_scope` from `ErrorTableEntry`, `lookup_error_message`, `lookup_error_entry`, the `include_error_table!` macro, the `qmd_error_message_table.rs` wrappers, the build script's autogen-row emission, and `produce_error_message_json` (back to taking only the observer; the `input_bytes` argument is gone).
- Regenerate `_autogen-table.json` (~700 entries, no per-row `outerScope` field).

Corpus-driven scope-owner table

- Each "Unclosed X" Q-*.json gains one top-level annotation: `"unclosedScope": "<scope-name>"`.
- The build script collects those into `resources/error-corpus/_autogen-scope-owners.json` and a new sibling proc-macro `include_scope_owner_table!` embeds it at compile time.
- `OuterScope::to_qcode` now reads from this table instead of a hardcoded `match`, so renaming a Q-code in the corpus propagates automatically, the same trust model as the Merr table.

Outer-scope dispatch ahead of Merr lookup

- For innermost scopes whose walker view matches the parser's interpretation (emphasis flavors + double-quote), the dispatcher in `error_diagnostic_from_parse_state` builds the diagnostic directly from the scope's owning Q-code template and skips the Merr lookup.
- SingleQuote falls through to Merr by default (the apostrophe-as-close heuristic means a `'` after a letter is Q-2-10 / Q-2-7 territory, which the existing Merr entries already handle). When Merr's match doesn't carry one of those apostrophe-heuristic codes (i.e. the match is a state-collision false positive), fall through to the synthetic Q-2-9 path instead.

Walker unification

- Collapse `compute_outer_scope`, `find_outermost_close`, and `find_innermost_open_position` in `outer_scope.rs` onto a single internal `scope_walk` helper that returns the sorted token list plus the final stack. The three public functions become thin readers of the snapshot.

Corpus cleanup

- Remove the now-redundant `in-double-quote` / `in-single-quote` / `in-single-quoted-emphasis` corpus cases from Q-2-5/12/13/15 (the schema extension was their only consumer; the fallback dispatch subsumes them).

Snapshot test changes (per repo policy)

- 9 generated case-files (`Q-2-{5,12,13,15}-in-*.qmd`) deleted.
- 1 new generated case-file (`Q-2-9-simple.qmd`).
- `_autogen-table.json` regenerated. Row count unchanged at ~691; per-row `outerScope` field removed (~1100 lines smaller).
- `_autogen-scope-owners.json` new (6 entries).

Verification

- `cargo nextest run -p pampa --test test_emphasis_in_quote`: 59/59 pass (same as before this commit).
- End-to-end CLI runs verify Q-codes and indicator positions match the pre-refactor branch for all canonical bug shapes (`The "_blank" word.` → Q-2-5, `_a' b._` → Q-2-10, `*a 'b c*` → Q-2-9, `'foo bar` → Q-2-9, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
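The scope-owner table and the single-quote special case could be sketched as below. Everything here is illustrative: the scope-name strings, the `scope_owners` and `dispatch` functions, and the in-memory table (the real one is generated into `_autogen-scope-owners.json` and embedded by a proc-macro) are assumptions. In particular, treating "no Merr match at all" the same as a false-positive match for single quotes is a guess, not something the commit states.

```rust
use std::collections::HashMap;

/// Q-codes the commit attributes to the apostrophe-as-close heuristic.
const APOSTROPHE_CODES: [&str; 2] = ["Q-2-10", "Q-2-7"];

/// Invert (Q-code, unclosedScope) annotations into a scope -> Q-code table.
fn scope_owners(
    annotations: &[(&'static str, &'static str)],
) -> HashMap<&'static str, &'static str> {
    annotations.iter().map(|&(qcode, scope)| (scope, qcode)).collect()
}

/// Choose a Q-code from the innermost open scope and the Merr match, if any.
fn dispatch(
    owners: &HashMap<&'static str, &'static str>,
    innermost: Option<&str>,
    merr_code: Option<&'static str>,
) -> Option<&'static str> {
    match innermost {
        // No open inline scope: plain Merr behaviour.
        None => merr_code,
        // Single quotes defer to Merr, but only trust apostrophe-heuristic codes.
        Some("single_quote") => match merr_code {
            Some(code) if APOSTROPHE_CODES.contains(&code) => Some(code),
            _ => owners.get("single_quote").copied(),
        },
        // Emphasis flavors and double quotes skip Merr entirely.
        Some(scope) => owners.get(scope).copied(),
    }
}
```

Because the table is built from the corpus annotations, renaming a Q-code there changes what `dispatch` returns without touching the dispatcher itself.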

…nclosed-X messages

The unclosed-inline diagnostics (Q-2-5/9/11/12/13/15) previously read "Reached the end of the outer inline scope before finding a closing ..." in every context. For a block-level input like `a * b` the parser actually gave up at the end of the block, not inside any enclosing scope, and the wording read awkwardly.

Switch the corpus messages to a `{anchor}` placeholder and substitute at emit time based on where the diagnostic's primary indicator landed:

- "the block" when the anchor falls back to the parser-failure position (no enclosing scope, or no candidate outer closer exists)
- "the inline scope" when the anchor lands on the closing delimiter of an enclosing inline scope

The selection is driven by `find_outermost_close` returning `Some` / `None`, exactly the same logic that already decides the indicator's column position, so the wording and the indicator stay in lockstep.

Verification examples

- `a * b` → Q-2-12, "Reached the end of the block before ..."
- `'a * b'` → Q-2-12, "Reached the end of the inline scope before ..."
- `'foo bar` → Q-2-9, "Reached the end of the block before ..." (truly unclosed)
- `The "_blank" word.` → Q-2-5, "Reached the end of the inline scope before ..."

Snapshot test changes (per repo policy)

- `_autogen-table.json` regenerated: the `message` field of every Unclosed-X entry now carries the `{anchor}` placeholder rather than the literal "outer inline scope" phrase. Row count (691) and key positions are unchanged.

Tests

- Two new regression tests pin the wording for both branches on otherwise-identical shapes:
  - `block_level_unclosed_star_says_block` ("a * b")
  - `single_quoted_unclosed_star_says_inline_scope` ("'a * b'")
- All 61 tests in `test_emphasis_in_quote.rs` pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
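The `{anchor}` substitution amounts to a single string replacement keyed on whether an outer closer was found. A minimal sketch, assuming a hypothetical `render_message` helper and an `Option<usize>` standing in for whatever `find_outermost_close` really returns:

```rust
/// Fill the `{anchor}` placeholder in a corpus message template.
/// `outermost_close` is `Some` when the indicator landed on the closing
/// delimiter of an enclosing inline scope, `None` when it fell back to
/// the parser-failure position.
fn render_message(template: &str, outermost_close: Option<usize>) -> String {
    let anchor = match outermost_close {
        Some(_) => "the inline scope",
        None => "the block",
    };
    template.replace("{anchor}", anchor)
}
```

Driving both the wording and the indicator position from the same `Option` is what keeps them from drifting apart.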

…s-word

gordonwoodhull · 2026-05-19T23:25:15Z

😮

cscheid · 2026-05-19T23:37:16Z

👀

Approach: scope-stack walker plus context-aware dispatch

The core idea is to give the dispatcher a third piece of context beyond
the LR state and lookahead: which inline scope was open at error time.

This is a little terrifying. I... I don't know if I want to merge this before understanding this very thoroughly!

cscheid · 2026-05-19T23:48:24Z

The "_blank" word

Yeah, this is all a bit cursed (yay Markdown). Consider the following:

The "_blank "word" oh no_".

Should this parse correctly? I can make arguments in both directions. CommonMark doesn't appear to say anything about which way it would go, but it does have opinions about nested links, namely that they're not allowed (https://spec.commonmark.org/0.31.2/#links). So one could argue that the spirit of CommonMark would require us to disallow nested quotations. This in principle can be solved by creating a (much) more complex version of the grammar that restricts all inlines inside of quotations to be non-quotation inlines.

But there's an argument to be made that nested quotations should be allowed, actually. For example,

"'this' is defensible"

And if that's an acceptable Markdown document, why shouldn't this be?

"'"This" is' pushing it."

And from there, the only reason we don't support truly nested quotes is typographical. But _ introduces emphasis, and so why not allow our original?

The "_blank "word" oh no_".

I note in passing that this example parses successfully in pampa:

$ cat c2.qmd
The "_blank "word" oh no_".%
$ pampa -i c2.qmd -t native
Warning [Q-7-1]: Missing Newline at End of File
File `c2.qmd` does not end with a newline

[ Para [Str "The", Space, Quoted DoubleQuote [Emph [Str "blank", Space, Quoted DoubleQuote [Str "word"], Space, Str "oh", Space, Str "no"]], Str "."] ]

cscheid · 2026-05-19T23:56:41Z

More generally:

> How hard could that possibly be to fix?

:lolsob: I get it, I think you're going through the same process I went through when I learned of this parsing technique: "oh my god it solves everything" three weeks later "oh no, this is kind of stiflingly limited actually".

In general, I think your proposal is a version of something I thought of: we could benefit from using a stack of context "unreduced shifts" in a direct generalization of the Jeffery technique. At the point where we read the token after blank in "_blank, we know we're inside a double quotation, and inside an underscore emphasis. In full generality, we could have a system that matches on regular expressions of suffixes of those inline scopes.

But, to be quite frank, I don't want to spend time trying to study this. I was an academic for a decade. I just want to build tools for people to use :) I genuinely think the risks outweigh the benefits here. We're already playing with fire by doing this trick inside tree-sitter and needing to get around its error correction. I don't want to compound that risk.

rundel · 2026-05-20T00:28:32Z

Totally fair - I in no way can claim to understand a fraction of the code generated, this was the result of many iterations where the proposed solution either did not work or got way more invasive in terms of the necessary changes in the corpuses or scanner.c or even worse.

This was the closest I got to something that felt at least like a semi-plausible solution.

rundel · 2026-05-20T00:33:32Z

> In general, I think your proposal is a version of something I thought of: we could benefit from using a stack of context "unreduced shifts" in a direct generalization of the Jeffery technique. At the point where we read the token after blank in "_blank, we know we're inside a double quotation, and inside an underscore emphasis. In full generality, we could have a system that matches on regular expressions of suffixes of those inline scopes.

This was the direction I was originally interested in taking. Claude was able to modify scanner.c so that it built a context stack tracking open and closed inlines via pushing and popping, but I wasn't able to get that passed into tree-sitter / the Jeffery technique in such a way that it could be surfaced as a useful error.

rundel and others added 10 commits May 18, 2026 20:37

Merge remote-tracking branch 'origin/main' into bugfix/quoted-emphasi…

f7224e5

…s-word

rundel changed the title ~~Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis~~ Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis) May 19, 2026

rundel closed this May 20, 2026

rundel mentioned this pull request May 20, 2026

fix(pampa): treat parse as clean when tree has no ERROR nodes #218

Closed

rundel wants to merge 10 commits into quarto-dev:main from rundel:bugfix/quoted-emphasis-word
