Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis)#217

Closed
rundel wants to merge 10 commits into
quarto-dev:mainfrom
rundel:bugfix/quoted-emphasis-word

@rundel rundel commented May 19, 2026

Apologies in advance for all of the nonsense that is about to follow - everything started with good intentions and a minor issue around error locations for unclosed emphasis markers:

$ printf -- 'The "_blank" word.\n' | cargo run --bin pampa -- 2>&1

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The "_blank" word.
   │     ┬┬
   │     ╰─── This is the opening quote mark.
   │      │
   │      ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

How hard could that possibly be to fix?

It got very ugly very quickly - I'm not sure whether what follows is a good solution, but given that it took a lot of iteration to even get close to something that mostly worked, I think it's worthwhile documenting it all and passing it off. At a minimum it will hopefully provide a partial starting point for a better / cleaner solution.

The commits document a lot (but not all) of the iteration (e.g. adding an additional key to the corpus lookup table) that was eventually scrapped. So it may also be useful to have Claude summarize those for you individually.


Claude's write-up (w/ light editing)

Summary

This branch fixes a class of misleading parse errors where an unclosed
emphasis or quote marker nested inside another inline scope was reported as
if the outer scope were unclosed. The fix evolves through nine commits
that progressively generalize the mechanism, refactor it for
maintainability, and improve the user-facing diagnostic. Along the way it
picks up two unrelated parser fixes that surfaced during the work.

Final shape of the change:

  • A scope-stack walker reconstructs the parser's nesting context at error
    time so the diagnostic can target the correct inline element regardless
    of how deeply it was nested.
  • The "Unclosed X" diagnostics now anchor their indicator at the closing
    delimiter of the enclosing scope (when one exists) instead of floating at
    the parser-failure position past end of line.
  • The message wording adjusts to context: "end of the block" at block
    level, "end of the inline scope" when nested.
  • Dispatch is driven by a corpus-annotated unclosedScope field rather
    than any hardcoded Rust mapping or per-(state, sym) corpus authoring.
    Adding a new inline scope kind is a one-line corpus edit plus a walker
    enum variant.

Diff size: 20 files, +2,883 / −466. Most of the insertion volume is the
new outer_scope.rs module (740 LOC, ~half of it tests), the expanded
test_emphasis_in_quote.rs regression suite (61 tests), and the
regenerated _autogen-table.json. The actual implementation logic is ~400
LOC.

Before / after on concrete inputs

Case 1 — Unclosed emphasis between paired double quotes

printf 'The "_blank" word.\n' | cargo run --bin pampa --

Before (on main):

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The "_blank" word.
   │            ┬┬
   │            ╰─── This is the opening quote mark.
   │             │
   │             ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:12 ]
   │
 1 │ The "_blank" word.
   │      ┬     ┬
   │      ╰──────── This is the opening '_' mark.
   │            │
   │            ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

The diagnostic now points at the unclosed _ rather than sending the user
hunting for a missing " that's already in the input. The indicator is
anchored at the closing " because the parser actually gave up at that
point (the _ couldn't close before the " arrived). A hint suggests the
escape if the user meant the marker literally.

Case 2 — Same shape, single quotes

printf "The '_blank' word.\n" | cargo run --bin pampa --

Before:

Error: [Q-2-10] Closed Quote Without Matching Open Quote
   ╭─[ <stdin>:1:13 ]
   │
 1 │ The '_blank' word.
   │            ┬┬
   │            ╰─── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │             │
   │             ╰── A space is causing a quote mark to be interpreted as a quotation close.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:12 ]
   │
 1 │ The '_blank' word.
   │      ┬     ┬
   │      ╰──────── This is the opening '_' mark.
   │            │
   │            ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

The single-quote case was even more confusing on main — the diagnostic
talked about apostrophe interpretation while the actual problem was the
same unclosed _.

Case 3 — Unclosed emphasis inside single quotes (with whitespace flanks)

printf "'a * b'\n" | cargo run --bin pampa --

Before:

Error: [Q-2-7] Unclosed Single Quote
   ╭─[ <stdin>:1:8 ]
   │
 1 │ 'a * b'
   │       ┬┬
   │       ╰─── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │        │
   │        ╰── I reached the end of the block before finding a closing "'" for the quote.
───╯

After:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:7 ]
   │
 1 │ 'a * b'
   │    ┬  ┬
   │    ╰───── This is the opening '*' mark.
   │       │
   │       ╰── Reached the end of the inline scope before finding a closing '*' for the emphasis.
───╯
ℹ If you want a literal asterisk, escape it with `\*`.

main insisted the ' was unclosed even though both ' markers are
present and the actual problem is the orphan *. The new diagnostic
points at the * and uses "the inline scope" wording because the
indicator lands on the closing '.

Case 4 — Same emphasis at block level (no enclosing scope)

printf 'a * b\n' | cargo run --bin pampa --

Before:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:6 ]
   │
 1 │ a * b
   │ ┬    ┬
   │ ╰─────── This is the opening '*' mark.
   │      │
   │      ╰── I reached the end of the block before finding a closing '*' for the emphasis.
───╯

After:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:6 ]
   │
 1 │ a * b
   │   ┬  ┬
   │   ╰───── This is the opening '*' mark.
   │      │
   │      ╰── Reached the end of the block before finding a closing '*' for the emphasis.
───╯
ℹ If you want a literal asterisk, escape it with `\*`.

Two small improvements visible here even though the Q-code is unchanged:

  1. The opening-mark indicator now points at the * (col 2) rather than
    the start of the line (col 0). On main, the opener anchor was off
    because the parser's leading-whitespace token was reported untrimmed;
    the new walker trims to the actual delimiter byte.
  2. The wording stays "Reached the end of the block before…" — context-
    appropriate for a block-level failure with no enclosing inline scope.

Compare this against Case 3 above: the same Q-2-12 template, but the
wording switches between "the block" and "the inline scope" depending on
whether an outer closer anchors the indicator.
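
The opener trimming mentioned in item 1 can be pictured with a small sketch. This is not the branch's code: the function name, its parameters, and the choice to treat only `*` and `_` as delimiter bytes are assumptions made for illustration.

```rust
/// Narrow a token span to the run of delimiter bytes it contains.
/// `input`, `start` and `end` are hypothetical names; `end` is exclusive.
pub fn trim_to_delimiter(input: &[u8], start: usize, end: usize) -> (usize, usize) {
    let end = end.min(input.len());
    let is_delim = |b: u8| b == b'*' || b == b'_';

    // Skip anything before the first delimiter byte (e.g. leading whitespace).
    let mut lo = start.min(end);
    while lo < end && !is_delim(input[lo]) {
        lo += 1;
    }

    // Extend over the consecutive delimiter bytes only.
    let mut hi = lo;
    while hi < end && is_delim(input[hi]) {
        hi += 1;
    }
    (lo, hi)
}
```

For `a * b`, a token spanning bytes 1..3 (whitespace plus `*`) would narrow to 2..3, which is why the opener indicator renders one column wide at col 2.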

Case 5 — Unclosed inner emphasis inside paired outer emphasis

printf '*a _b c* trailing\n' | cargo run --bin pampa --

Before:

Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:1:18 ]
   │
 1 │ *a _b c* trailing
   │                  ┬
   │                  ╰── I reached the end of the block before finding a closing '*' for the emphasis.
───╯

After:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:8 ]
   │
 1 │ *a _b c* trailing
   │    ┬   ┬
   │    ╰────── This is the opening '_' mark.
   │        │
   │        ╰── Reached the end of the inline scope before finding a closing '_' for the emphasis.
───╯
ℹ If you want a literal underscore, escape it with `\_`.

main insisted the outer * was unclosed (and let the indicator float at
end-of-line, far from any source feature). The actual issue is the inner
_, with the outer *..* correctly closing at col 7. The new diagnostic
identifies the right marker and anchors at the closing *.

Case 6 — Truly unclosed single quote at block level

printf "'foo bar\n" | cargo run --bin pampa --

Before:

Error: [Q-2-7] Unclosed Single Quote
   ╭─[ <stdin>:1:9 ]
   │
 1 │ 'foo bar
   │ ┬       ┬
   │ ╰────────── This is the opening quote. If you need an apostrophe, escape it with a backslash.
   │         │
   │         ╰── I reached the end of the block before finding a closing "'" for the quote.
───╯

After:

Error: [Q-2-9] Unclosed Single Quote
   ╭─[ <stdin>:1:9 ]
   │
 1 │ 'foo bar
   │ ┬       ┬
   │ ╰────────── This is the opening single quote.
   │         │
   │         ╰── Reached the end of the block before finding a closing single quote.
───╯
ℹ If you meant a literal apostrophe, escape it with `\'`.

The behavior is essentially unchanged here — both branches correctly
identify the unclosed ' at block level. The Q-code has shifted from
Q-2-7 to Q-2-9 (a new code introduced earlier in the branch to
distinguish whitespace-prefixed openers from the apostrophe-as-text-close
interpretation that qmd-syntax-helper relies on), but the diagnostic
content is equivalent.

The bug

The misclassification is a consequence of how tree-sitter's LR state
generator minimises distinct states with identical outgoing transitions.
Multiple input shapes that semantically fail for different reasons
collapse onto the same (LR state, lookahead symbol) parse-error pair,
and the Merr-style error table can only see that pair. For example, both
of these reach (state=704, sym=_whitespace) at error time:

| Input | Real failure | Wanted |
| --- | --- | --- |
| _a' b._ | apostrophe-as-text-close collides with _..._ | Q-2-10 "Closed Quote Without Matching Open" |
| The '_blank' word. | unclosed _ inside paired '..' | Q-2-5 "Unclosed Underscore Emphasis" |

The same shape repeats across all six "Unclosed X" inline scopes (single
quote, double quote, single/double emphasis with * or _).
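
A toy model of the collision, assuming a table keyed only on the pair. The state number and symbol come from the example above; the function names and the hard-coded key mapping are hypothetical stand-ins, not the real Merr table.

```rust
use std::collections::HashMap;

/// Hard-coded stand-in for "run the parser and record where it failed".
/// Both inputs from the write-up are recorded as reaching the same pair.
pub fn recorded_error_key(input: &str) -> Option<(u32, &'static str)> {
    match input {
        "_a' b._" | "The '_blank' word." => Some((704, "_whitespace")),
        _ => None,
    }
}

/// Hypothetical table keyed only by (LR state, lookahead symbol).
pub fn build_table() -> HashMap<(u32, &'static str), &'static str> {
    let mut table = HashMap::new();
    // Corpus case authored from `_a' b._`.
    table.insert((704, "_whitespace"), "Q-2-10");
    table
}

/// The lookup sees only the pair, never the input that produced it.
pub fn lookup(
    table: &HashMap<(u32, &'static str), &'static str>,
    key: (u32, &'static str),
) -> Option<&'static str> {
    table.get(&key).copied()
}
```

Because both inputs produce the same key, any lookup on that key alone has to return the same Q-code for both, which is the misclassification described here.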

Approach: scope-stack walker plus context-aware dispatch

The core idea is to give the dispatcher a third piece of context beyond
the LR state and lookahead: which inline scope was open at error time.
This is recovered by walking the parser's token log.

outer_scope.rs walker

crates/quarto-parse-errors/src/outer_scope.rs (new, 740 LOC) provides
one internal scope_walk helper plus three public readers:

  • compute_outer_scope — returns the innermost open inline scope at the
    error position (the top of the stack).
  • find_outermost_close — returns the position of the would-be closer
    for the outermost open scope (the bottom of the stack), or None if
    no candidate closer exists in the remaining tokens.
  • find_innermost_open_position — returns the position of the innermost
    open scope's opening delimiter.

The walker combines the parser's all_tokens and consumed_tokens logs,
sorts them by source position, and iterates applying the same push /
pop-if-top / block-boundary-clear rules the external scanner uses. The
result is the scope-stack snapshot at the moment the parser failed.

Recognised scope kinds: single_quote, double_quote, emphasis
delimiters (* and _, distinguished by reading the actual byte from
the input because emphasis tokens can include leading whitespace), and
strong emphasis delimiters.
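
A minimal sketch of the push / pop-if-top / block-boundary-clear walk, under several assumptions: the `Token` type is invented, tokens arrive already classified by scope kind (the real walker reads the delimiter byte), only tokens before the error position count toward the snapshot, and the readers take the computed stack rather than the parser logs. The function names follow the write-up, but the signatures are guesses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Hypothetical simplified token: a scope delimiter or a block boundary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    Delim { scope: Scope, pos: usize },
    BlockBoundary { pos: usize },
}

impl Token {
    fn pos(&self) -> usize {
        match *self {
            Token::Delim { pos, .. } | Token::BlockBoundary { pos } => pos,
        }
    }
}

/// Returns the stack of (scope, opener position), bottom first.
pub fn scope_walk(
    all_tokens: &[Token],
    consumed_tokens: &[Token],
    error_pos: usize,
) -> Vec<(Scope, usize)> {
    // Merge both logs, order by source position, drop exact duplicates.
    let mut merged: Vec<Token> = all_tokens.iter().chain(consumed_tokens).copied().collect();
    merged.sort_by_key(Token::pos);
    merged.dedup();

    let mut stack: Vec<(Scope, usize)> = Vec::new();
    for token in merged.into_iter().filter(|t| t.pos() < error_pos) {
        match token {
            // A block boundary clears every open inline scope.
            Token::BlockBoundary { .. } => stack.clear(),
            Token::Delim { scope, pos } => {
                if stack.last().map(|&(top, _)| top) == Some(scope) {
                    stack.pop(); // closes the scope on top of the stack
                } else {
                    stack.push((scope, pos)); // opens a new scope
                }
            }
        }
    }
    stack
}

/// Innermost open scope (top of the stack).
pub fn compute_outer_scope(stack: &[(Scope, usize)]) -> Option<Scope> {
    stack.last().map(|&(scope, _)| scope)
}

/// Position of the innermost open scope's opening delimiter.
pub fn find_innermost_open_position(stack: &[(Scope, usize)]) -> Option<usize> {
    stack.last().map(|&(_, pos)| pos)
}

/// Latest token of the bottom scope's kind that appears after its opener.
pub fn find_outermost_close(stack: &[(Scope, usize)], tokens: &[Token]) -> Option<usize> {
    let &(bottom, open_pos) = stack.first()?;
    tokens
        .iter()
        .filter_map(|t| match *t {
            Token::Delim { scope, pos } if scope == bottom && pos > open_pos => Some(pos),
            _ => None,
        })
        .max()
}
```

For `The "_blank" word.` with delimiters at bytes 4, 5 and 11 and the failure at byte 11, this yields a stack of double quote then underscore emphasis, so the innermost scope is the `_` opened at 5 and the outermost would-be closer is the `"` at 11.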

Corpus-driven OuterScope → Q-code table

Each "Unclosed X" corpus file carries a top-level annotation:

{
  "code": "Q-2-5",
  "title": "Unclosed Underscore Emphasis",
  "unclosedScope": "emph_underscore",
  ...
}

build_error_table.ts collects these into a generated artifact
_autogen-scope-owners.json (6 entries), and a sibling proc-macro
include_scope_owner_table! embeds it at compile time.
OuterScope::to_qcode reads from this generated table rather than a
hardcoded match, so renaming a Q-code in the corpus propagates
automatically — the same trust model the Merr table itself uses.
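
A rough sketch of a table-driven `to_qcode`, assuming the generated table is a slice of (scope name, Q-code) rows. The `ScopeOwnerEntry` field names and the `strong_star` / `strong_underscore` scope names are guesses; the Q-codes are the ones quoted elsewhere in this write-up and commit log.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// One row of the generated scope-owner table (field names are assumptions).
#[derive(Debug, Clone, Copy)]
pub struct ScopeOwnerEntry {
    pub unclosed_scope: &'static str,
    pub code: &'static str,
}

impl OuterScope {
    /// Name used by the corpus `unclosedScope` annotation (partly assumed).
    fn corpus_name(self) -> &'static str {
        match self {
            OuterScope::SingleQuote => "single_quote",
            OuterScope::DoubleQuote => "double_quote",
            OuterScope::EmphStar => "emph_star",
            OuterScope::EmphUnderscore => "emph_underscore",
            OuterScope::StrongStar => "strong_star",
            OuterScope::StrongUnderscore => "strong_underscore",
        }
    }

    /// Look the Q-code up in the table instead of hardcoding a match.
    pub fn to_qcode(self, owners: &[ScopeOwnerEntry]) -> Option<&'static str> {
        owners
            .iter()
            .find(|entry| entry.unclosed_scope == self.corpus_name())
            .map(|entry| entry.code)
    }
}

/// Stand-in for the six generated entries.
pub fn sample_scope_owners() -> Vec<ScopeOwnerEntry> {
    vec![
        ScopeOwnerEntry { unclosed_scope: "single_quote", code: "Q-2-9" },
        ScopeOwnerEntry { unclosed_scope: "double_quote", code: "Q-2-11" },
        ScopeOwnerEntry { unclosed_scope: "emph_star", code: "Q-2-12" },
        ScopeOwnerEntry { unclosed_scope: "emph_underscore", code: "Q-2-5" },
        ScopeOwnerEntry { unclosed_scope: "strong_star", code: "Q-2-13" },
        ScopeOwnerEntry { unclosed_scope: "strong_underscore", code: "Q-2-15" },
    ]
}
```

The point of the indirection is that the Rust side only knows scope names; which Q-code owns a scope is decided by the corpus.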

Dispatch in error_generation.rs

For each parse-error state, error_diagnostic_from_parse_state:

  1. Computes the outer scope and the would-be closer of the outermost
    enclosing scope.
  2. Decides whether to route the diagnostic through outer-scope dispatch
    directly or through the Merr (state, sym) lookup:
    • Emphasis-flavored or double_quote innermost scope →
      outer-scope dispatch: build the diagnostic from the scope's owning
      Q-code template and skip the Merr table entirely. This handles
      The "_blank" word.-style cases as well as arbitrary nesting
      depth (3+ levels) without needing per-shape corpus entries.
    • single_quote innermost scope → fall through to Merr by
      default, because the apostrophe-as-close heuristic (a '
      preceded by a letter is text, not an opener) means the walker's
      view can diverge from the parser's view here. Pre-existing
      Q-2-7 / Q-2-10 entries handle the parser-interpretation
      overrides correctly. When Merr's match does not carry one of
      those apostrophe-heuristic codes — i.e. the match is a
      state-collision false positive from some other input that
      happened to reach the same (state, sym) — fall back to the
      synthetic Q-2-9 path.
    • No outer scope → Merr lookup as usual.
  3. Anchors the primary indicator at the closing delimiter of the
    outermost open scope (via find_outermost_close) when one exists;
    otherwise falls back to the parser-failure position.
  4. Substitutes the {anchor} placeholder in the corpus message with
    "the inline scope" or "the block" depending on which anchor branch
    fired in step 3.

The asymmetric handling of single-quote vs. the other scopes is
principled, not a hack: single-quote is the one scope kind in qmd where
the parser applies a heuristic that diverges from a naive push/pop
model of the source characters. Every other inline delimiter has
unambiguous open/close semantics, so the walker's stack and the parser's
interpretation agree.
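
The routing decision in step 2 can be condensed into a small sketch. The `Route` enum and `choose_route` function are hypothetical names, and `merr_codes` stands in for the Q-codes carried by the Merr entries matched for the current (state, sym); only the rule itself is taken from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Where the diagnostic content comes from (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    /// Build from the scope's owning Q-code template, skipping Merr.
    OuterScopeDispatch,
    /// Use the (state, sym) Merr lookup as before.
    MerrLookup,
    /// Single-quote fallback: the synthetic Q-2-9 path.
    SyntheticQ29,
}

pub fn choose_route(outer: Option<OuterScope>, merr_codes: &[&str]) -> Route {
    match outer {
        // No open inline scope: nothing for the walker to add.
        None => Route::MerrLookup,
        // Single quote: trust Merr only if it carries an apostrophe-heuristic code.
        Some(OuterScope::SingleQuote) => {
            let has_override = merr_codes
                .iter()
                .any(|code| matches!(*code, "Q-2-7" | "Q-2-10"));
            if has_override {
                Route::MerrLookup
            } else {
                Route::SyntheticQ29
            }
        }
        // Emphasis flavors and double quote: walker and parser agree.
        Some(_) => Route::OuterScopeDispatch,
    }
}
```

Keeping the decision in one pure function like this would make the asymmetry easy to unit-test separately from diagnostic rendering.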

Q-2-9 vs. Q-2-10: the apostrophe-heuristic split

Single-quote is the one scope kind that needs two distinct
"unclosed-X" diagnostics rather than one, because qmd's apostrophe
heuristic gives ' two parser-side interpretations. The walker pushes
SingleQuote on every ' token it sees; the parser decides per-'
whether the character is an opener or a close-apostrophe based on
what precedes it.

  • Q-2-9 "Unclosed Single Quote" is for shapes where the ' is
    a real opener — typically whitespace-prefixed — and the user forgot
    the matching close. Example: 'foo bar (Case 6 above), or
    *a 'b c* (an unclosed inner quote inside paired emphasis). The
    walker's stack matches the parser's view here, so the dispatcher
    hits the synthetic Q-2-9 path through OuterScope::to_qcode.

  • Q-2-10 "Closed Quote Without Matching Open Quote" is for shapes
    where the ' follows a letter, so the parser treats it as text
    (an apostrophe-as-close), while some surrounding scope is what
    actually failed. Example: _a' b._ — the parser's view is
    "stray apostrophe close inside emphasis"; the user's actual
    problem is the unmatched _. (qmd-syntax-helper's auto-fix
    pipeline depends on Q-2-10 firing for these shapes so it can
    suggest escaping the apostrophe.)

The two interpretations frequently land on the same (LR state, lookahead) parse-error bucket, so the table-only approach can't
distinguish them. The dispatch rule that resolves this is in the
single-quote arm of the gate:

When the innermost open scope is SingleQuote, look at the Merr
entries for the current (state, sym). If any entry carries
Q-2-7 or Q-2-10, take it — those are the parser-interpretation
overrides authored deliberately into the corpus. If neither code
is present, the matched entry is a state-collision false positive
from some unrelated shape that happened to reach the same bucket;
fall through to the synthetic Q-2-9 path that the walker's view
suggests.

In effect, Q-2-7 and Q-2-10 act as explicit overrides for
parser-side reinterpretations of ', and Q-2-9 is the default
for everything else with a SingleQuote on the stack. The dispatch
gate codes this check inline (matches!(e.error_info.code, Some("Q-2-7") | Some("Q-2-10"))) rather than via another corpus annotation; with
only two override codes and no path to growth (apostrophe is the only
heuristic of its kind in qmd), the hardcoded set is more readable than
a generated table would be. If a future inline delimiter grows a
similar parser-side heuristic, the same pattern can be lifted into a
corpus annotation at that point.

Unrelated fixes folded in

Scanner ::: peek gate (in commit a01e3bc6)

crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c had a
regression where the pipe-table-line-ending detection's :::
lookahead peek ran in any context, not just when
PIPE_TABLE_LINE_ENDING was a valid symbol. The peek calls advance()
to step past the colons; in non-pipe-table contexts this corrupted
the lexer's position. Subsequent SOFT_LINE_ENDING emissions then
mark_end-ed at the wrong byte, breaking fenced divs whose body
contained an inline-bearing block followed by :::{attrs} on the next
line.

Fix: gate the peek on valid_symbols[PIPE_TABLE_LINE_ENDING]. 19-line
change with a long comment explaining the failure mode in detail.

GLR speculative-error filter (in commit a01e3bc6)

When had_errors() is true but the final tree has no ERROR nodes, the
parse is treated as successful, because those errors came from dead
GLR speculation branches while another branch reached accept cleanly.
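
As an illustration only, the filter amounts to a recursive check over the final tree. The `Node` type below is a hypothetical stand-in for the tree-sitter tree, and `parse_succeeded` is an invented name; the actual change lives in crates/pampa/src/readers/qmd.rs.

```rust
/// Hypothetical minimal syntax-tree node.
pub struct Node {
    pub kind: &'static str,
    pub children: Vec<Node>,
}

/// True if this node or any descendant is an ERROR node.
fn contains_error(node: &Node) -> bool {
    node.kind == "ERROR" || node.children.iter().any(contains_error)
}

/// Errors recorded during parsing only count if they left a mark in the tree.
pub fn parse_succeeded(had_errors: bool, root: &Node) -> bool {
    !had_errors || !contains_error(root)
}
```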

What this branch changes structurally

New files

  • crates/quarto-parse-errors/src/outer_scope.rs — scope-stack walker
    and OuterScope enum.
  • crates/pampa/resources/error-corpus/_autogen-scope-owners.json:
    generated OuterScope → Q-code mapping.
  • crates/pampa/tests/test_emphasis_in_quote.rs — 61-test regression
    suite covering all the bug shapes, control cases, and deep nesting.

Modified APIs

  • produce_diagnostic_messages — new scope_owners: &[ScopeOwnerEntry] parameter. Pampa's wrapper supplies
    get_scope_owner_table().
  • OuterScope::to_qcode — now takes a &[ScopeOwnerEntry] parameter.
  • The include_scope_owner_table! proc-macro is a new sibling to
    include_error_table!, sharing the same input syntax.

Unchanged APIs

  • lookup_error_message, lookup_error_entry, ErrorTableEntry — all
    back to their pre-branch signatures (no outer_scope parameter or
    field).
  • produce_error_message_json — back to its pre-branch signature (no
    input_bytes parameter).

Corpus contract

Each "Unclosed X" Q-*.json file now declares:

{
  "code": "Q-2-X",
  "title": "Unclosed X",
  "unclosedScope": "<scope-name>",
  "message": "Reached the end of {anchor} before finding ...",
  ...
}
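
A minimal sketch of how the {anchor} placeholder in such a message might be filled in, assuming the dispatcher knows whether the indicator was anchored at an outer closing delimiter. The function and parameter names are hypothetical; the two substituted phrases are the ones shown in the before/after cases.

```rust
/// Fill the `{anchor}` placeholder of a corpus message.
/// `anchored_at_outer_close` is true when an enclosing scope's closer was found.
pub fn render_message(template: &str, anchored_at_outer_close: bool) -> String {
    let anchor = if anchored_at_outer_close {
        "the inline scope"
    } else {
        "the block"
    };
    template.replace("{anchor}", anchor)
}
```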

Extending to other inline scopes

The walker today recognises only emphasis and quote scopes. The same
bug class exists for every other inline kind in qmd: span [..],
image ![..], inline footnote ^[..], code span `..`, inline
math $..$, superscript ^..^, subscript ~..~, strikeout
~~..~~, editorial markers ({+..+}, {-..-}, {>..<}, {!..!},
{=..=}), and inline attribute specifiers {..}.

A quick probe shows these are still misclassified on both main and
this branch:

printf 'The "`code foo" word.\n' | cargo run --bin pampa --
Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:6 ]
   │
 1 │ The "`code foo" word.
   │     ┬┬
   │     ╰─── This is the opening quote mark.
   │      │
   │      ╰── Reached the end of the block before finding a closing '"' for the quote.
───╯

The diagnostic should say [Q-2-24] Unclosed Code Span and point at the
backtick. The walker doesn't track code-span tokens, so the dispatcher
sees outer_scope = double_quote and emits the (incorrect) Q-2-11
template.

This is now a mechanical fix, not a design problem. Adding a new
inline scope kind is exactly three steps:

  1. Walker recognition. Add a new variant on OuterScope (e.g.
    CodeSpan) and a matching arm in scope_for_token mapping the
    tree-sitter token kind (e.g. code_span_delimiter) to that
    variant. This is the only parser-bound piece — it requires knowing
    what tokens the grammar emits, which is inherently in code.
  2. Corpus annotation. Add "unclosedScope": "code_span" to the
    corresponding Q-*.json corpus file (e.g. Q-2-24.json for
    unclosed code span). Regenerate via
    ./scripts/build_error_table.ts. The build script automatically
    pulls the annotation into _autogen-scope-owners.json, and
    OuterScope::to_qcode will find the new mapping at compile time.
  3. Dispatch gate (if needed). Check whether the new scope kind has
    any parser-side heuristic that diverges from the walker's view (the
    way single-quote does with its apostrophe-as-text-close rule). If
    yes — add an arm to the dispatch gate in error_generation.rs that
    routes that scope through Merr. If no — let it use the default
    outer-scope dispatch like emphasis and double-quote already do.
    Most inline scopes have unambiguous open/close semantics, so this
    step is a no-op for them.

No new corpus entries per (LR state, lookahead) bucket. No new schema
fields. No combinatorial expansion across N outer × M inner pairs. The
infrastructure built in Phases 2–4 of this branch was specifically
designed so that each new inline scope adds constant work, not
quadratic.
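
Steps 1 and 2 might look roughly like this for the code-span example. The enum is trimmed to three variants for brevity, the quote token kind strings are assumptions, and `unclosed_scope_name` is a hypothetical helper; only `code_span_delimiter` and `code_span` are taken from the text above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    /// New variant for step 1.
    CodeSpan,
}

/// Map a tree-sitter token kind to the scope it opens or closes.
pub fn scope_for_token(kind: &str) -> Option<OuterScope> {
    match kind {
        "single_quote" => Some(OuterScope::SingleQuote),
        "double_quote" => Some(OuterScope::DoubleQuote),
        // New arm for step 1.
        "code_span_delimiter" => Some(OuterScope::CodeSpan),
        _ => None,
    }
}

impl OuterScope {
    /// Name that the corpus `unclosedScope` annotation in step 2 would use.
    pub fn unclosed_scope_name(self) -> &'static str {
        match self {
            OuterScope::SingleQuote => "single_quote",
            OuterScope::DoubleQuote => "double_quote",
            OuterScope::CodeSpan => "code_span",
        }
    }
}
```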

Possible parser-side heuristics worth checking for the remaining inline
kinds, in case they need step 3:

  • Span vs. attribute specifier {..} — Pandoc's syntax overlaps;
    pampa may treat { after a span specially. Worth checking the
    scanner for context-sensitive cases.
  • Math $..$: a $ followed by a digit-or-letter is math; $ in
    prose contexts is just a dollar sign. Like the apostrophe heuristic,
    this is a parser-level distinction the walker would need to mirror
    if it pushed naively on every $.
  • Superscript ^..^ vs footnote prefix ^[..] — distinguishable
    by lookahead, but the walker would need to consult the next token
    before deciding which variant of ^ to push.
  • Strikeout ~~..~~ vs subscript ~..~ — distinguishable by
    token kind in tree-sitter, so the walker handles them with separate
    scope variants.

Each of these is a self-contained follow-up issue. The framework
established here scales to all of them without architectural change.

Coverage

Test counts

  • test_emphasis_in_quote.rs: 61 tests covering all six "Unclosed X"
    shapes across single/double quote, emphasis-star/underscore, and
    strong-star/underscore outer wrappers, plus controls for
    apostrophe-in-emphasis, deep nesting, multi-paragraph regression
    shapes, and the new anchor-wording disambiguation.
  • Workspace: 9,072 tests pass.

CI parity

cargo xtask verify --skip-hub-build --skip-hub-tests --skip-trace-viewer-build --skip-trace-viewer-tests passes: lint,
fmt, build with -D warnings, tree-sitter tests, workspace nextest.

Known limitations

These cases are documented here rather than addressed in this PR
because each requires independent design work larger than the scope of
this branch.

Multiple unclosed inlines in a single block

printf 'a *b _c\n' | cargo run --bin pampa --

Before and after both produce only one diagnostic (for the innermost
_), even though both * and _ are unclosed:

Error: [Q-2-5] Unclosed Underscore Emphasis
   ╭─[ <stdin>:1:8 ]
   │
 1 │ a *b _c
   │      ┬ ┬
   │      ╰──── This is the opening '_' mark.
   │        │
   │        ╰── Reached the end of the block before finding a closing '_' for the emphasis.
───╯

The walker correctly sees both on its stack, but the dispatcher emits
one diagnostic per parser error state and currently picks the
innermost frame. Fixable in the dispatch layer (~20 LOC): when
find_outermost_close returns None, emit one diagnostic per stack
frame outermost-first.
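
A sketch of that proposed follow-up, which is not implemented in this branch. The stack is reduced to scope kinds, `default_code` hardcodes the Q-codes quoted in this write-up purely for illustration, and `codes_to_emit` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterScope {
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

/// Hardcoded for the sketch; the branch reads this from the generated table.
fn default_code(scope: OuterScope) -> &'static str {
    match scope {
        OuterScope::SingleQuote => "Q-2-9",
        OuterScope::DoubleQuote => "Q-2-11",
        OuterScope::EmphStar => "Q-2-12",
        OuterScope::EmphUnderscore => "Q-2-5",
        OuterScope::StrongStar => "Q-2-13",
        OuterScope::StrongUnderscore => "Q-2-15",
    }
}

/// `stack` is bottom-first; `outermost_close` is the would-be closer, if any.
pub fn codes_to_emit(stack: &[OuterScope], outermost_close: Option<usize>) -> Vec<&'static str> {
    match outermost_close {
        // Nothing closes the outermost scope: every frame is unclosed,
        // so report all of them, outermost first.
        None => stack.iter().map(|&scope| default_code(scope)).collect(),
        // The outermost scope does close: keep the current behavior and
        // report only the innermost frame.
        Some(_) => stack.last().map(|&scope| default_code(scope)).into_iter().collect(),
    }
}
```

For `a *b _c` this would produce Q-2-12 followed by Q-2-5, while `*a _b c* trailing` would still produce only Q-2-5.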

Errors in subsequent paragraphs after the first failure

printf 'a *b\n\nc _d\n' | cargo run --bin pampa --

Only the first paragraph's error is reported; the second paragraph's
unclosed _ is silent. The verbose trace shows tree-sitter hits
detect_error once, then enters recover_to_previous and
skip_token-s through the rest of the document without re-entering a
productive state — so no second parse-error event ever fires.

This is a parser-level limitation, not a dispatch-layer one. Possible
approaches for a follow-up:

  • Walk ERROR nodes in the final tree to find blocks with unreported
    failures, then re-run the scope-walker on each.
  • Re-invoke the parser per block on failure, syncing at block
    boundaries.
  • Customise tree-sitter recovery so the parser resyncs at block-start
    instead of skipping to EOF.

The first option is the most plausible without touching tree-sitter
internals and would reuse the existing walker.

rundel and others added 10 commits May 18, 2026 20:37
…uote"

When a paragraph contains an unmatched emphasis marker between matching
double quotes — e.g. `The "_blank" word.` — the parser reported the
double quote as unclosed (Q-2-11) instead of the emphasis. The
"unclosed quote" message sent users hunting for a missing `"` that
was already present.

Add four Merr-style corpus cases (one per emphasis marker) that
redirect the (LR state, lookahead) pair to the matching emphasis
diagnostic:

  The "_blank" word.   → Q-2-5  (Unclosed Underscore Emphasis)
  The "__blank" word.  → Q-2-15 (Unclosed Strong Underscore Emphasis)
  The "*blank" word.   → Q-2-12 (Unclosed Star Emphasis)
  The "**blank" word.  → Q-2-13 (Unclosed Strong Star Emphasis)

Each case also gets a hint telling the user to escape the literal
marker (`\_` / `\*`). The hint bumps the diagnostic score so the
emphasis-aware mapping wins the (state, sym) tie against the older
Q-2-11 entry.

Out of scope: single-quote contexts (`'_blank'`) collapse to the same
LR (state, sym) as `qmd-syntax-helper`'s apostrophe-quotes test
shapes (`*a' b.*`, `**a' b.**`, `_a' b._`, `__a' b.__`). Redirecting
those would silently break the auto-fix path, so they're left at
Q-2-7/Q-2-10. Double-quote shapes only.

Side effect of the (state, sym) sharing: rare shapes that previously
fell under Q-2-11 like `*a" b.*` (an inch-mark inside emphasis) now
emit Q-2-12. No regression test covers those, but the matrix is
documented in the plan.

Tests
-----
New regression file `crates/pampa/tests/test_emphasis_in_quote.rs`
covers all four emphasis variants plus a pipe-table fixture. Modeled
on `tests/test_link_destination_linebreak.rs`.

Verified `cargo nextest run --workspace` → 8993 passed, 0 failed.
No snapshot files were affected (the snapshot test iterates over
`resources/error-corpus/*.qmd`, which is empty in the new format).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the parser misclassified errors when emphasis/quote markers
nested inside other emphasis/quote scopes. For example, `The '_blank' word.`
reported Q-2-10 (Closed Quote Without Matching Open) instead of Q-2-5
(Unclosed Underscore Emphasis), because tree-sitter's LR state
minimisation collapsed otherwise-distinct (state, sym) parse-error keys
between inputs like `_a' b._` and `The '_blank' word.`. See
inline_issue.md for the full bug write-up.

Fix: extend the Merr error-table key from (state, sym) to
(state, sym, outer_scope), where outer_scope is the innermost open
inline scope at the error position. A walker in quarto-parse-errors
reconstructs the scope stack from the parser's token log
(all_tokens + consumed_tokens), pushing on each scope-open token, popping
on each top-of-stack matching close, and clearing on every block-boundary
token. The same walker runs at corpus-build time so each test case
records its natural outer_scope alongside (state, sym).

Additional fixes folded in:

- crates/pampa/src/readers/qmd.rs: when had_errors() is true but the
  resulting tree has no ERROR nodes, treat the parse as successful. GLR
  speculation hits detect_error in dead branches; if another branch
  reaches accept cleanly, those speculative errors shouldn't surface.
  Fixes `*a\" b.\"*` parsing as <em>a<dq>b.</dq></em>.

- Reverts the Option 1 scanner-gate experiment from the WIP version of
  this commit. LR state minimisation merges post-OPEN and post-CLOSE
  destination states, so scanner-level gating was invisible at the
  parser level; the experiment also broke valid nested-emphasis parses
  (e.g. `*_*triple*_*`).

- New corpus cases:
  - Q-2-5/12/13/15 in-single-quote: `The '_blank' word.` family
  - Q-2-15 in-single-quoted-emphasis: `The ' *__blank*' word.` family

- 22 tests in test_emphasis_in_quote cover the original 5 shipped cases,
  the 4 single-quoted variants, the 4 unclosed-double-quote-in-emphasis
  variants, the 4 apostrophe-in-emphasis controls, the
  multi-paragraph regression case, three-level-nesting Q-2-15 cases,
  and the clean-parse case for nested double-quote-in-emphasis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
inline_issue.md and options.md were committed at the repo root in
a01e3bc but belong nowhere in tree per the claude-notes/ convention.
The implementation they describe is now fully captured in the code
and tests, so they are obsolete as forward plans.

Three code-comment back-references to inline_issue.md are rewritten
to stand on their own.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactor the inner-in-outer corpus cases for Q-2-5/12/13/15 to a single
`inside-paired-outer` case each, using prefixesAndSuffixes to cover all
six outer wrappers (four emphasis flavours plus single and double quote)
with both bare and trailing-text variants. The diagnostic content
doesn't depend on the outer scope, so the dedicated per-outer cases
added earlier were redundant.

Add Q-2-9 "Unclosed Single Quote" — previously no diagnostic existed for
this shape because qmd's apostrophe heuristic treats a whitespace-
prefixed `'` as an opener that can be unclosed (distinct from Q-2-10's
stray-apostrophe-close interpretation). The new code uses the same
unified prefixesAndSuffixes pattern.

Add trailing-text variants (`* x\n`-style suffixes) to every unified
case so realistic prose inputs like `*a _b c* word.` map to the right
Q-code instead of falling through to generic Parse error.

Add `trimLeadingSpace`/`trimTrailingSpace` to Q-2-5/13/15 notes so the
opening-delimiter indicator renders 1-wide instead of including the
parser's leading-whitespace prefix.

Fix prior mislabeling: the test asserting Q-2-10 for `The "a 'b c"
word.` was wrong — that input has a whitespace-prefixed `'` (Q-2-9),
not a stray-close apostrophe (Q-2-10). Updated the test and removed the
corresponding mislabeled case from Q-2-10.json.

The trailing "I reached the end of the block..." indicator still
anchors at the parser-failure position (one past the closing outer
delimiter for these cases). A follow-up will reroute it to the closing
outer delimiter itself.

Test suite: 51 tests covering all bare and long-form combinations,
including the original same-marker location regression and the new
Q-2-9 cases. Full workspace nextest run (9055 tests) passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route the primary diagnostic location (header and "Reached the end of
the outer inline scope..." indicator) to the closing delimiter of the
outermost open inline scope when one exists, rather than letting it
float at the parser-failure position just past the end of the block.

Add `find_outermost_close` in `outer_scope.rs`: walks the parse log with
the same push/pop and block-boundary semantics as `compute_outer_scope`,
identifies the bottom of the scope stack, and returns the position of
the latest token of that scope's kind appearing after the opener. The
returned span is trimmed to the consecutive delimiter bytes so the
indicator renders 1-wide on `*` (not 2-wide on whitespace + `*`).

Wire this into `error_generation.rs`: when `outer_scope != None`, use
`find_outermost_close` to override the parse_state-derived source_info;
fall back to parse_state when no closer exists (e.g. `*hello`).
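
As a rough illustration of the walker described above, the sketch below replays push/pop actions to get the final scope stack, then looks for the latest token of the outermost scope's kind after its opener. The `Token`, `Action`, and `ScopeKind` types, the function names, and the byte offsets are hypothetical stand-ins (the real code reads the parse log and also handles block boundaries and span trimming, which are omitted here).

```rust
// Hypothetical, simplified model of the parse log; not the types used in outer_scope.rs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScopeKind {
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
}

// What the parser did with a delimiter token (assumed classification).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Action {
    Push,    // token opened a scope
    Pop,     // token closed the innermost scope
    Ignored, // token was seen but neither opened nor closed a scope
}

#[derive(Clone, Copy, Debug)]
struct Token {
    kind: ScopeKind,
    offset: usize,
    action: Action,
}

// Replays the push/pop actions and returns the scopes still open at the
// end of the log, outermost first, each with its opener offset.
fn open_scopes(tokens: &[Token]) -> Vec<(ScopeKind, usize)> {
    let mut stack = Vec::new();
    for t in tokens {
        match t.action {
            Action::Push => stack.push((t.kind, t.offset)),
            Action::Pop => {
                stack.pop();
            }
            Action::Ignored => {}
        }
    }
    stack
}

// Offset of the latest token whose kind matches the outermost open scope
// and which appears after that scope's opener; None if no candidate exists.
fn outermost_close(tokens: &[Token]) -> Option<usize> {
    let stack = open_scopes(tokens);
    let (kind, opener) = *stack.first()?;
    tokens
        .iter()
        .filter(|t| t.kind == kind && t.offset > opener)
        .map(|t| t.offset)
        .max()
}
```

Under this model, `The "_blank" word.` yields a candidate closer at the second `"`, while `*hello` yields `None` and the caller would fall back to the parser-failure position.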

Reword the unclosed-inline diagnostic messages (Q-2-5/9/11/12/13/15):
drop the first-person "I reached" wording and replace "block" with
"outer inline scope" so the message describes where the indicator
actually lands.

For `a *b _c* jeloasd asdasd`, the indicator now sits on the closing
`*` at col 7 instead of floating past the end of the line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…stics

Replace the combinatorial `prefixesAndSuffixes` corpus entries with a
single fallback dispatch path in error_generation.rs. The previous
approach required one corpus variant per (state, sym, outer_scope)
triple, which doesn't scale to arbitrary nesting depth — 3-level
nestings like `'*b _c*'` already need new variants per outer-outer
combination, and N-level nestings explode further.

Key insight: for "unclosed inline scope" diagnostics, the diagnostic
content depends only on `outer_scope` (the innermost open scope at
error time). The state and sym don't add information — they're noise
from tree-sitter's LR state minimisation.

Add `find_innermost_open_position` in outer_scope.rs: returns the
innermost open scope's opener span (trimmed to the consecutive
delimiter bytes) so the "opening mark" indicator lands on the actual
delimiter regardless of nesting depth.

Add `OuterScope::to_qcode`: maps each open inline scope to its default
Q-code (single_quote → Q-2-9, emph_star → Q-2-12, etc.).

In error_generation.rs, after the Merr lookup fails, check
`outer_scope != None && sym in {_close_block, _whitespace}`. If so,
look up the Q-code via `to_qcode`, find any existing corpus entry with
that Q-code to source the title/message/notes/hints content, and
build a synthetic diagnostic whose positions come from the scope-stack
walker (opener) and from find_outermost_close (closing-outer
indicator, already added in the previous commit).

This handles arbitrary-depth nesting uniformly. Specific Merr entries
still take priority where they exist, preserving distinctions like
Q-2-10 (apostrophe-close interpretation) vs Q-2-12 (unclosed `*`).
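
The priority order described above (specific Merr entry first, scope-based fallback second) can be sketched roughly as follows. This is a minimal illustration, not the code in error_generation.rs: `pick_qcode` and its parameters are hypothetical, `OuterScope::None` is an assumed encoding of "no open scope", and the Q-2-15 mapping to strong underscore is a guess based on the other codes mentioned in these commits.

```rust
// Hypothetical model of the innermost open inline scope at error time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OuterScope {
    None,
    SingleQuote,
    DoubleQuote,
    EmphStar,
    EmphUnderscore,
    StrongStar,
    StrongUnderscore,
}

impl OuterScope {
    // Default Q-code for an unclosed scope of this kind.
    fn to_qcode(self) -> Option<&'static str> {
        match self {
            OuterScope::None => None,
            OuterScope::SingleQuote => Some("Q-2-9"),
            OuterScope::DoubleQuote => Some("Q-2-11"),
            OuterScope::EmphStar => Some("Q-2-12"),
            OuterScope::EmphUnderscore => Some("Q-2-5"),
            OuterScope::StrongStar => Some("Q-2-13"),
            // Assumption: the commits never name the scope behind Q-2-15.
            OuterScope::StrongUnderscore => Some("Q-2-15"),
        }
    }
}

// `merr_hit` stands in for the result of the (state, sym) table lookup.
fn pick_qcode(
    merr_hit: Option<&'static str>,
    outer: OuterScope,
    sym: &str,
) -> Option<&'static str> {
    // A specific Merr entry always wins when one exists.
    if let Some(code) = merr_hit {
        return Some(code);
    }
    // Otherwise fall back to the scope's default code, but only for the
    // lookahead symbols that signal "ran out of input inside a scope".
    if matches!(sym, "_close_block" | "_whitespace") {
        outer.to_qcode()
    } else {
        None
    }
}
```

Returning `None` here corresponds to the generic "Parse error" path; the real dispatcher additionally sources title, message, notes, and hints from a corpus entry with the chosen Q-code.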

Remove the now-redundant corpus cases (`inside-paired-outer`,
`same-marker`) from Q-2-5/12/13/15 and Q-2-11. Simplify Q-2-9 to a
single minimal case used only as the fallback content template. Net:
~150 LOC of new code, ~2,750 lines of corpus/autogen-table removed,
78 generated case files deleted.

Add 4 regression tests for 3-level nesting shapes that previously
emitted generic "Parse error" and now route to the correct Q-code via
the fallback path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Exercise the generic fallback dispatch with shapes where the scope
stack at error time isn't trivially "everything still open":

- `a '*b _c_ jeloasd' asdasd` — 3-level nesting where the middle `*`
  is unmatched but the inner `_c_` pairs. Stack at error is depth 2
  (single_quote + emph_star); fallback emits Q-2-12.
- `a "**b _c_ jeloasd" asdasd` — same shape with strong-star middle;
  Q-2-13.
- `a '*"_b c"*' x` — 4-level deep nest, innermost `_` unmatched.
  Stack at error is depth 4; fallback walks down to the innermost
  emph_underscore and emits Q-2-5.
- `a '*_b "c d_*' x` — 4-level deep, innermost is a double quote;
  Q-2-11.

These verify the walker correctly tracks push/pop across arbitrary
depths and that the to_qcode mapping fires off the innermost open
scope regardless of how many scopes are stacked above the actual
unmatched marker.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atch

The bug fixes on this branch (`The "_blank" word.` and friends mapping to
the correct Q-code) had grown to extend the Merr error-table key from
`(state, sym)` to `(state, sym, outer_scope)` across `ErrorTableEntry`,
both lookup helpers, the `include_error_table!` proc-macro, the build
script, every Q-*.json corpus file, every row of `_autogen-table.json`,
and the QMD-side wrappers. Of 115 `(state, sym)` collision buckets, only
8 actually needed `outer_scope` to disambiguate, and the same approach
did not scale to the other inline scopes (span, code span, math, etc.)
that share the same bug class.

Revert the schema change and replace it with a smaller mechanism that
keeps the corpus as the single source of truth for which Q-code owns
which scope kind.

Schema revert
- Drop `outer_scope` from `ErrorTableEntry`, `lookup_error_message`,
  `lookup_error_entry`, the `include_error_table!` macro, the
  `qmd_error_message_table.rs` wrappers, the build script's autogen-row
  emission, and `produce_error_message_json` (back to taking only the
  observer; the `input_bytes` argument is gone).
- Regenerate `_autogen-table.json` (~700 entries, no per-row `outerScope`
  field).

Corpus-driven scope-owner table
- Each "Unclosed X" Q-*.json gains one top-level annotation:
  `"unclosedScope": "<scope-name>"`.
- The build script collects those into
  `resources/error-corpus/_autogen-scope-owners.json` and a new sibling
  proc-macro `include_scope_owner_table!` embeds it at compile time.
- `OuterScope::to_qcode` now reads from this table instead of a
  hardcoded `match`, so renaming a Q-code in the corpus propagates
  automatically — same trust model as the Merr table.

Outer-scope dispatch ahead of Merr lookup
- For innermost scopes whose walker view matches the parser's
  interpretation (emphasis flavors + double-quote), the dispatcher in
  `error_diagnostic_from_parse_state` builds the diagnostic directly
  from the scope's owning Q-code template and skips the Merr lookup.
- SingleQuote falls through to Merr by default (the apostrophe-as-close
  heuristic means a `'` after a letter is Q-2-10 / Q-2-7 territory,
  which the existing Merr entries already handle). When Merr's match
  doesn't carry one of those apostrophe-heuristic codes — i.e. the
  match is a state-collision false positive — fall through to the
  synthetic Q-2-9 path instead.

Walker unification
- Collapse `compute_outer_scope`, `find_outermost_close`, and
  `find_innermost_open_position` in `outer_scope.rs` onto a single
  internal `scope_walk` helper that returns the sorted token list plus
  the final stack. The three public functions become thin readers of
  the snapshot.

Corpus cleanup
- Remove the now-redundant `in-double-quote` / `in-single-quote` /
  `in-single-quoted-emphasis` corpus cases from Q-2-5/12/13/15 (the
  schema extension was their only consumer; the fallback dispatch
  subsumes them).

Snapshot test changes (per repo policy)
- 9 generated case-files (`Q-2-{5,12,13,15}-in-*.qmd`) deleted.
- 1 new generated case-file (`Q-2-9-simple.qmd`).
- `_autogen-table.json` regenerated. Row count unchanged at ~691;
  per-row `outerScope` field removed (~1100 lines smaller).
- `_autogen-scope-owners.json` new (6 entries).

Verification
- `cargo nextest run -p pampa --test test_emphasis_in_quote`: 59/59
  pass (same as before this commit).
- End-to-end CLI runs verify Q-codes and indicator positions match the
  pre-refactor branch for all canonical bug shapes (`The "_blank"
  word.` → Q-2-5, `_a' b._` → Q-2-10, `*a 'b c*` → Q-2-9,
  `'foo bar` → Q-2-9, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nclosed-X messages

The unclosed-inline diagnostics (Q-2-5/9/11/12/13/15) previously read
"Reached the end of the outer inline scope before finding a closing ..."
in every context. For a block-level input like `a * b` the parser
actually gave up at the end of the block, not inside any enclosing
scope, and the wording read awkwardly.

Switch the corpus messages to a `{anchor}` placeholder and substitute at
emit time based on where the diagnostic's primary indicator landed:

  - "the block"        when the anchor falls back to the parser-failure
                       position (no enclosing scope, or no candidate
                       outer closer exists)
  - "the inline scope" when the anchor lands on the closing delimiter
                       of an enclosing inline scope

The selection is driven by `find_outermost_close` returning `Some` /
`None` — exactly the same logic that already decides the indicator's
column position, so the wording and the indicator stay in lockstep.

Verification examples
  a * b      → Q-2-12, "Reached the end of the block before ..."
  'a * b'    → Q-2-12, "Reached the end of the inline scope before ..."
  'foo bar   → Q-2-9,  "Reached the end of the block before ..." (truly unclosed)
  The "_blank" word.
             → Q-2-5,  "Reached the end of the inline scope before ..."
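
A minimal sketch of the substitution described above, assuming the anchor choice is made from the same `Option` that decides the indicator position. The function name `render_message` and the template text used in the example are hypothetical; only the `{anchor}` placeholder and the two replacement phrases come from this commit.

```rust
// Fills the `{anchor}` placeholder in a corpus message.
// `outer_close` stands in for the result of `find_outermost_close`:
// Some(offset) when an enclosing scope's closer was found, None otherwise.
fn render_message(template: &str, outer_close: Option<usize>) -> String {
    let anchor = if outer_close.is_some() {
        "the inline scope"
    } else {
        "the block"
    };
    template.replace("{anchor}", anchor)
}
```

Because the wording and the indicator column read the same value, they cannot disagree about whether an enclosing scope was found.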

Snapshot test changes (per repo policy)
- `_autogen-table.json` regenerated: the `message` field of every
  Unclosed-X entry now carries the `{anchor}` placeholder rather than
  the literal "outer inline scope" phrase. Row count (691) and key
  positions are unchanged.

Tests
- Two new regression tests pin the wording for both branches on
  otherwise-identical shapes:
    - `block_level_unclosed_star_says_block` ("a * b")
    - `single_quoted_unclosed_star_says_inline_scope` ("'a * b'")
- All 61 tests in `test_emphasis_in_quote.rs` pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rundel rundel changed the title Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis Fix unclosed-inline error misclassification + supporting infrastructure (quotes + emphasis) May 19, 2026
@gordonwoodhull
Member

😮

@cscheid
Member

cscheid commented May 19, 2026

👀

Approach: scope-stack walker plus context-aware dispatch

The core idea is to give the dispatcher a third piece of context beyond
the LR state and lookahead: which inline scope was open at error time.

This is a little terrifying. I... I don't know if I want to merge this before understanding this very thoroughly!

@cscheid
Member

cscheid commented May 19, 2026

The "_blank" word

Yeah, this is all a bit cursed (yay Markdown). Consider the following:

The "_blank "word" oh no_".

Should this parse correctly? I can make arguments in both directions. CommonMark doesn't appear to say anything about which way it would go, but it does have opinions about nested links, namely that they're not allowed (https://spec.commonmark.org/0.31.2/#links). So one could argue that the spirit of CommonMark would require us to disallow nested quotations. This in principle can be solved by creating a (much) more complex version of the grammar that restricts all inlines inside of quotations to be non-quotation inlines.

But there's an argument to be made that nested quotations should be allowed, actually. For example,

"'this' is defensible"

And if that's an acceptable Markdown document, why shouldn't this be?

"'"This" is' pushing it."

And from there, the only reason we don't support truly nested quotes is typographical. But _ introduces emphasis, and so why not allow our original?

The "_blank "word" oh no_".

I note in passing that this example parses successfully in pampa:

$ cat c2.qmd
The "_blank "word" oh no_".%
$ pampa -i c2.qmd -t native
Warning [Q-7-1]: Missing Newline at End of File
File `c2.qmd` does not end with a newline

[ Para [Str "The", Space, Quoted DoubleQuote [Emph [Str "blank", Space, Quoted DoubleQuote [Str "word"], Space, Str "oh", Space, Str "no"]], Str "."] ]

@cscheid
Member

cscheid commented May 19, 2026

More generally:

How hard could that possibly be to fix?

:lolsob: I get it, I think you're going through the same process I went through when I learned of this parsing technique: "oh my god it solves everything", then three weeks later, "oh no, this is kind of stiflingly limited actually".

In general, I think your proposal is a version of something I thought of: we could benefit from using a stack of context "unreduced shifts" in a direct generalization of the Jeffery technique. At the point where we read the token after blank in "_blank, we know we're inside a double quotation, and inside an underscore emphasis. In full generality, we could have a system that matches on regular expressions of suffixes of those inline scopes.
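
To make the suffix-matching idea concrete, here is a hypothetical sketch (not code from pampa or from this PR). It uses literal suffix matching on the stack of open scopes, which is a simplification of the regular-expression matching mentioned above; the `Scope`, `Rule`, and `diagnose` names and the messages are invented for illustration.

```rust
// Hypothetical inline scope kinds tracked on the context stack.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Scope {
    DoubleQuote,
    EmphUnderscore,
    EmphStar,
}

// A diagnostic rule keyed on the innermost scopes (a suffix of the stack).
struct Rule {
    suffix: &'static [Scope],
    message: &'static str,
}

// Returns the message of the first rule whose suffix matches the top of
// the stack. Rules are tried in order, so list more specific ones first.
fn diagnose(stack: &[Scope], rules: &[Rule]) -> Option<&'static str> {
    rules
        .iter()
        .find(|r| stack.ends_with(r.suffix))
        .map(|r| r.message)
}
```

For the `"_blank` case the stack would be `[DoubleQuote, EmphUnderscore]`, so a rule with that exact suffix could produce a message that mentions both scopes, while a shorter `[EmphUnderscore]` rule would act as the general fallback.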

But, to be quite frank, I don't want to spend time trying to study this. I was an academic for a decade. I just want to build tools for people to use :) I genuinely think the risks outweigh the benefits here. We're already playing with fire by doing this trick inside tree-sitter and needing to get around its error correction. I don't want to compound that risk.

@rundel
Contributor Author

rundel commented May 20, 2026

Totally fair - I in no way can claim to understand a fraction of the code generated, this was the result of many iterations where the proposed solution either did not work or got way more invasive in terms of the necessary changes in the corpuses or scanner.c or even worse.

This was the closest I got to something that felt at least like a semi-plausible solution.

@rundel
Contributor Author

rundel commented May 20, 2026

In general, I think your proposal is a version of something I thought of: we could benefit from using a stack of context "unreduced shifts" in a direct generalization of the Jeffery technique. At the point where we read the token after blank in "_blank, we know we're inside a double quotation, and inside an underscore emphasis. In full generality, we could have a system that matches on regular expressions of suffixes of those inline scopes.

This was the direction I was originally interested in taking. Claude was able to modify scanner.c such that it successfully built a context stack that kept track of open and closed inlines via pushing and popping, but I wasn't able to get that passed into tree-sitter / the Jeffery technique in such a way that it could be surfaced as a useful error.
