Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

an incremental change can corrupt the parse tree #1444

Closed
the-mikedavis opened this issue Oct 17, 2021 · 8 comments
Closed

an incremental change can corrupt the parse tree #1444

the-mikedavis opened this issue Oct 17, 2021 · 8 comments
Labels

Comments

@the-mikedavis
Copy link
Contributor

the-mikedavis commented Oct 17, 2021

👋 hello!

I'm working with the elixir grammar and I noticed that the parse tree can get into a wonky state with an incremental change.

Starting with some valid Elixir like so:

quote do
  A.B.c()
end

we have a good tree:

source [0, 0] - [3, 0]
  call [0, 0] - [2, 3]
    target: identifier [0, 0] - [0, 5]
    do_block [0, 6] - [2, 3]
      call [1, 2] - [1, 9]
        target: dot [1, 2] - [1, 7]
          left: alias [1, 2] - [1, 5]
          right: identifier [1, 6] - [1, 7]
        arguments [1, 7] - [1, 9]

then we edit our Elixir a bit to something invalid (replace A.B with A,B):

quote do
  A,B.c()
end

and the tree recognizes that this expression is invalid:

source [0, 0] - [3, 0]
  call [0, 0] - [2, 3]
    target: identifier [0, 0] - [0, 5]
    do_block [0, 6] - [2, 3]
      stab_clause [1, 2] - [2, 0]
        left: arguments [1, 2] - [1, 9]
          alias [1, 2] - [1, 3]
          call [1, 4] - [1, 9]
            target: dot [1, 4] - [1, 7]
              left: alias [1, 4] - [1, 5]
              right: identifier [1, 6] - [1, 7]
            arguments [1, 7] - [1, 9]
        operator: MISSING -> [2, 0] - [2, 0]

Now we edit our snippet back to the original valid code, but the tree becomes corrupted:

ERROR [0, 0] - [3, 0]
  identifier [0, 0] - [0, 5]
  call [1, 2] - [1, 9]
    target: dot [1, 2] - [1, 7]
      left: alias [1, 2] - [1, 5]
      right: identifier [1, 6] - [1, 7]
    arguments [1, 7] - [1, 9]

In this particular case we can kick the tree back into valid state by deleting the . between B and c and then typing it back in.

I can reproduce the behavior using the playground building the repo from master:

playground

and in the helix editor (see helix-editor/helix#830 (comment)) which I believe uses the rust bindings.

Most minimally:

$ cat f.exs
quote do
  A,B.c()
end

$ tree-sitter parse f.exs --edit '1,3 1'
(ERROR [0, 0] - [3, 0]
  (identifier [0, 0] - [0, 5])
  (call [1, 2] - [1, 8]
    target: (dot [1, 2] - [1, 6]
      left: (alias [1, 2] - [1, 4])
      right: (identifier [1, 5] - [1, 6]))
    (arguments [1, 6] - [1, 8])))
f.exs	0 ms	(ERROR [0, 0] - [3, 0])

I haven't been able to reproduce the behavior with other grammars yet, but it was a pretty random chance that I found this case here. I'll go and try to mangle some other grammars' trees too 🕵️

@the-mikedavis
Copy link
Contributor Author

I'm pretty sure from the debug logging that this is the block that's causing the wrap in ERROR

tree-sitter/lib/src/parser.c

Lines 1291 to 1300 in 67de943

// If the parser is still in the error state at the end of the file, just wrap everything
// in an ERROR node and terminate.
if (ts_subtree_is_eof(lookahead)) {
LOG("recover_eof");
SubtreeArray children = array_new();
Subtree parent = ts_subtree_new_error_node(&children, false, self->language);
ts_stack_push(self->stack, version, parent, false, 1);
ts_parser__accept(self, version, lookahead);
return;
}

(haven't been able to dedicate much time but I'm still looking into this :)

@milahu
Copy link

milahu commented Nov 16, 2021

reproduce the behavior with other grammars

tree-sitter-rust: EvgeniyPeshkov/syntax-highlighter#46 (comment)

then we edit our Elixir a bit to something invalid, and the tree recognizes that this expression is invalid
Now we edit our snippet back to the original valid code, but the tree becomes corrupted

so tree-sitter is missing one of its goals

Robust enough to provide useful results even in the presence of syntax errors

would be interesting to compare other incremental parsers: lezer, papa-carlo, parsley, sparse, scanner, ...

@alemuller
Copy link
Contributor

alemuller commented Nov 16, 2021

From lezer's docs/guid:

This system's approach is heavily influenced by tree-sitter, a similar system written in C and Rust, and several papers by Tim Wagner and Susan Graham on incremental parsing (1, 2). It exists as a different system because it has different priorities than tree-sitter—as part of a JavaScript system, it is written in JavaScript, with relatively small library and parser table size. It also generates more compact in-memory trees, to avoid putting too much pressure on the user's machine.

From papa-carlo's #error-recovering:

Parsers made with Papa Carlo are able to recover syntax errors. There are two techniques:

Erasing mismatched tokens.
Skipping missed required tokens.

First one is designed to avoid parts of the source code that recognized by parser as completely inappropriate in that place. And the second is to parse code parts that contains syntax errors, but still can be recognized as meaningful. For example a block of code if (a == b) {doSmthng(); has missed closing brace "}", but still can be recognized as operator block.

(...)

To acquire advantages of the second way developer should manually specify which syntax rules can potentially be skipped in case of fail.

From parsley's README

If anyone stumbles here trying to find an incremental parser, here is the 1998 paper (Efficient and Flexible Incremental Parsing)

This is one of several researches cited on tree-sitter home.

I don't think scanner is comparable with tree-sitter.


I don't know what algorithm tree-sitter uses for error recovery. I've been wanting to look carefully at the code, but haven't found time yet. Form a quick glance, the code does look similar to snippets from Efficient and Flexible Incremental Parsing.

By the way, error recovery does not seems to be fast, according to benchmarks from Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, the mean time for recovery are in tens of milliseconds (on a Xeon with 32GB of ram, they ran sequentially, thought).

Since it is not fast, and does fails sometimes, I've wondered if a Graph Structured Network (that seems to be quite good at matching trees) couldn't be used to provide the suggestion faster and with less error.

But implementing those would likely be a premature optimization. I think tree-sitter should try to get stable before implementing any new shiny features.

Addendum.

The time tree-sitter takes to highlight never bothered me. Latency in 10ms isn't perceptible (my head is at another timescale right now, thus the comment about not being fast...).

I do find annoying empty string triggering the error recovery and messing up the highlighting. But this is an issue with the grammar, not tree-sitter itself. Maybe we could draft a standard on how to write grammars?

@maxbrunsfeld
Copy link
Contributor

Thanks very much for the report @the-mikedavis; this is definitely a bug in Tree-sitter.

I'd be very curious if it turns out to be reproducible with any other grammars, and if not, what it is about the Elixir grammar that triggers it. I know that it may be a lot of work to answer both of these questions.

@maxbrunsfeld
Copy link
Contributor

Ok, I investigated this a bit. In the Elixir grammar, there is a token called NEWLINE, which is used both as an "extra" token, and as a structurally-important token in the _terminator rule.

Tree-sitter should allow you to do this, but it is currently not handling it properly. I think the generated C code is good, but the runtime library is not correctly managing the reusability of that token. Specifically, when the NEWLINE is used as an 'extra' in the old syntax tree, Tree-sitter is not correctly re-categorizing it as non-extra after an edit.

Thanks so much for creating such a small, isolated reproduction case @the-mikedavis.

@maxbrunsfeld
Copy link
Contributor

Ok, I think this is fixed on master.

@maxbrunsfeld
Copy link
Contributor

I'm going to cut new releases of the library and CLI, because this was a pretty severe bug.

@maxbrunsfeld
Copy link
Contributor

I published 0.20.1.

hendrikvanantwerpen pushed a commit to hendrikvanantwerpen/tree-sitter that referenced this issue Nov 23, 2021
the-mikedavis added a commit to the-mikedavis/tree-sitter-git-commit that referenced this issue Dec 25, 2021
see helix-editor/helix#1338 (comment)

Leading whitespace is important when injecting diff highlights into
messages trailing the scissors.

Without this change, some adverserial context lines can end up being
mistakenly parsed as non-$.context rules. For example, in the
screenshot of the linked issue comment, a context line is being parsed
by a tricky link that looks like a malformed $.similarity rule. Because
NEWLINE is not a child of $._line but $.source in tree-sitter-git-diff,
part of the line is re-parsed into another valid $._line rule, namely
$.addition in this case.

For an example COMMIT_EDITMSG which has a second diff line 6 characters
to the right of the newline, this commit changes the start column of
the parsed $.message node to include the whitespace:

 (source [0, 0] - [7, 0]
   (subject [0, 0] - [0, 53])
   (comment [2, 0] - [4, 38]
     (scissors [2, 0] - [4, 38]))
   (message [5, 0] - [5, 36])
-  (message [6, 5] - [6, 80]))
+  (message [6, 0] - [6, 80]))

This change probably restricts this grammar to tree-sitter 0.20.1+
because the WHITE_SPACE 'extra' is now used as an extra and within
a rule
(see tree-sitter/tree-sitter#1444 (comment))
but trailing diffs are not meant to be edited anyways, so it's probably
not a big deal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants