an incremental change can corrupt the parse tree #1444

the-mikedavis · 2021-10-17T15:47:05Z

👋 hello!

I'm working with the elixir grammar and I noticed that the parse tree can get into a wonky state with an incremental change.

Starting with some valid Elixir like so:

quote do
  A.B.c()
end

we have a good tree:

source [0, 0] - [3, 0]
  call [0, 0] - [2, 3]
    target: identifier [0, 0] - [0, 5]
    do_block [0, 6] - [2, 3]
      call [1, 2] - [1, 9]
        target: dot [1, 2] - [1, 7]
          left: alias [1, 2] - [1, 5]
          right: identifier [1, 6] - [1, 7]
        arguments [1, 7] - [1, 9]

then we edit our Elixir a bit to something invalid (replace A.B with A,B):

quote do
  A,B.c()
end

and the tree recognizes that this expression is invalid:

source [0, 0] - [3, 0]
  call [0, 0] - [2, 3]
    target: identifier [0, 0] - [0, 5]
    do_block [0, 6] - [2, 3]
      stab_clause [1, 2] - [2, 0]
        left: arguments [1, 2] - [1, 9]
          alias [1, 2] - [1, 3]
          call [1, 4] - [1, 9]
            target: dot [1, 4] - [1, 7]
              left: alias [1, 4] - [1, 5]
              right: identifier [1, 6] - [1, 7]
            arguments [1, 7] - [1, 9]
        operator: MISSING -> [2, 0] - [2, 0]

Now we edit our snippet back to the original valid code, but the tree becomes corrupted:

ERROR [0, 0] - [3, 0]
  identifier [0, 0] - [0, 5]
  call [1, 2] - [1, 9]
    target: dot [1, 2] - [1, 7]
      left: alias [1, 2] - [1, 5]
      right: identifier [1, 6] - [1, 7]
    arguments [1, 7] - [1, 9]

In this particular case we can kick the tree back into valid state by deleting the . between B and c and then typing it back in.

I can reproduce the behavior using the playground building the repo from master:

and in the helix editor (see helix-editor/helix#830 (comment)) which I believe uses the rust bindings.

Most minimally:

$ cat f.exs
quote do
  A,B.c()
end

$ tree-sitter parse f.exs --edit '1,3 1'
(ERROR [0, 0] - [3, 0]
  (identifier [0, 0] - [0, 5])
  (call [1, 2] - [1, 8]
    target: (dot [1, 2] - [1, 6]
      left: (alias [1, 2] - [1, 4])
      right: (identifier [1, 5] - [1, 6]))
    (arguments [1, 6] - [1, 8])))
f.exs	0 ms	(ERROR [0, 0] - [3, 0])

I haven't been able to reproduce the behavior with other grammars yet, but it was a pretty random chance that I found this case here. I'll go and try to mangle some other grammars' trees too 🕵️

The text was updated successfully, but these errors were encountered:

the-mikedavis · 2021-11-03T03:54:28Z

I'm pretty sure from the debug logging that this is the block that's causing the wrap in ERROR

tree-sitter/lib/src/parser.c

Lines 1291 to 1300 in 67de943

    
           // If the parser is still in the error state at the end of the file, just wrap everything 
        
           // in an ERROR node and terminate. 
        
           if (ts_subtree_is_eof(lookahead)) { 
        
             LOG("recover_eof"); 
        
             SubtreeArray children = array_new(); 
        
             Subtree parent = ts_subtree_new_error_node(&children, false, self->language); 
        
             ts_stack_push(self->stack, version, parent, false, 1); 
        
             ts_parser__accept(self, version, lookahead); 
        
             return; 
        
           }

(haven't been able to dedicate much time but I'm still looking into this :)

milahu · 2021-11-16T10:39:56Z

reproduce the behavior with other grammars

tree-sitter-rust: EvgeniyPeshkov/syntax-highlighter#46 (comment)

then we edit our Elixir a bit to something invalid, and the tree recognizes that this expression is invalid
Now we edit our snippet back to the original valid code, but the tree becomes corrupted

so tree-sitter is missing one of its goals

Robust enough to provide useful results even in the presence of syntax errors

would be interesting to compare other incremental parsers: lezer, papa-carlo, parsley, sparse, scanner, ...

alemuller · 2021-11-16T14:10:47Z

From lezer's docs/guid:

This system's approach is heavily influenced by tree-sitter, a similar system written in C and Rust, and several papers by Tim Wagner and Susan Graham on incremental parsing (1, 2). It exists as a different system because it has different priorities than tree-sitter—as part of a JavaScript system, it is written in JavaScript, with relatively small library and parser table size. It also generates more compact in-memory trees, to avoid putting too much pressure on the user's machine.

From papa-carlo's #error-recovering:

Parsers made with Papa Carlo are able to recover syntax errors. There are two techniques:

Erasing mismatched tokens.
Skipping missed required tokens.

First one is designed to avoid parts of the source code that recognized by parser as completely inappropriate in that place. And the second is to parse code parts that contains syntax errors, but still can be recognized as meaningful. For example a block of code if (a == b) {doSmthng(); has missed closing brace "}", but still can be recognized as operator block.

(...)

To acquire advantages of the second way developer should manually specify which syntax rules can potentially be skipped in case of fail.

From parsley's README

If anyone stumbles here trying to find an incremental parser, here is the 1998 paper (Efficient and Flexible Incremental Parsing)

This is one of several researches cited on tree-sitter home.

I don't think scanner is comparable with tree-sitter.

I don't know what algorithm tree-sitter uses for error recovery. I've been wanting to look carefully at the code, but haven't found time yet. Form a quick glance, the code does look similar to snippets from Efficient and Flexible Incremental Parsing.

By the way, error recovery does not seems to be fast, according to benchmarks from Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, the mean time for recovery are in tens of milliseconds (on a Xeon with 32GB of ram, they ran sequentially, thought).

Since it is not fast, and does fails sometimes, I've wondered if a Graph Structured Network (that seems to be quite good at matching trees) couldn't be used to provide the suggestion faster and with less error.

But implementing those would likely be a premature optimization. I think tree-sitter should try to get stable before implementing any new shiny features.

Addendum.

The time tree-sitter takes to highlight never bothered me. Latency in 10ms isn't perceptible (my head is at another timescale right now, thus the comment about not being fast...).

I do find annoying empty string triggering the error recovery and messing up the highlighting. But this is an issue with the grammar, not tree-sitter itself. Maybe we could draft a standard on how to write grammars?

maxbrunsfeld · 2021-11-19T19:35:05Z

Thanks very much for the report @the-mikedavis; this is definitely a bug in Tree-sitter.

I'd be very curious if it turns out to be reproducible with any other grammars, and if not, what it is about the Elixir grammar that triggers it. I know that it may be a lot of work to answer both of these questions.

maxbrunsfeld · 2021-11-19T19:50:23Z

Ok, I investigated this a bit. In the Elixir grammar, there is a token called NEWLINE, which is used both as an "extra" token, and as a structurally-important token in the _terminator rule.

Tree-sitter should allow you to do this, but it is currently not handling it properly. I think the generated C code is good, but the runtime library is not correctly managing the reusability of that token. Specifically, when the NEWLINE is used as an 'extra' in the old syntax tree, Tree-sitter is not correctly re-categorizing it as non-extra after an edit.

Thanks so much for creating such a small, isolated reproduction case @the-mikedavis.

maxbrunsfeld · 2021-11-19T20:44:27Z

Ok, I think this is fixed on master.

maxbrunsfeld · 2021-11-19T20:46:00Z

I'm going to cut new releases of the library and CLI, because this was a pretty severe bug.

maxbrunsfeld · 2021-11-21T21:57:08Z

I published 0.20.1.

Fixes tree-sitter#1444

see helix-editor/helix#1338 (comment) Leading whitespace is important when injecting diff highlights into messages trailing the scissors. Without this change, some adverserial context lines can end up being mistakenly parsed as non-$.context rules. For example, in the screenshot of the linked issue comment, a context line is being parsed by a tricky link that looks like a malformed $.similarity rule. Because NEWLINE is not a child of $._line but $.source in tree-sitter-git-diff, part of the line is re-parsed into another valid $._line rule, namely $.addition in this case. For an example COMMIT_EDITMSG which has a second diff line 6 characters to the right of the newline, this commit changes the start column of the parsed $.message node to include the whitespace: (source [0, 0] - [7, 0] (subject [0, 0] - [0, 53]) (comment [2, 0] - [4, 38] (scissors [2, 0] - [4, 38])) (message [5, 0] - [5, 36]) - (message [6, 5] - [6, 80])) + (message [6, 0] - [6, 80])) This change probably restricts this grammar to tree-sitter 0.20.1+ because the WHITE_SPACE 'extra' is now used as an extra and within a rule (see tree-sitter/tree-sitter#1444 (comment)) but trailing diffs are not meant to be edited anyways, so it's probably not a big deal.

milahu mentioned this issue Nov 16, 2021

Make use of tree-sitter's incremental parsing EvgeniyPeshkov/syntax-highlighter#46

Open

maxbrunsfeld added the bug label Nov 19, 2021

maxbrunsfeld closed this as completed in 4e2e059 Nov 19, 2021

hendrikvanantwerpen pushed a commit to hendrikvanantwerpen/tree-sitter that referenced this issue Nov 23, 2021

Ensure 'extra' bit is set correctly when reusing a node

2ed95b3

Fixes tree-sitter#1444

the-mikedavis mentioned this issue Apr 28, 2022

Loosen tree-sitter library bounds. elixir-lang/tree-sitter-elixir#36

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

an incremental change can corrupt the parse tree #1444

an incremental change can corrupt the parse tree #1444

the-mikedavis commented Oct 17, 2021 •

edited

Loading

the-mikedavis commented Nov 3, 2021

milahu commented Nov 16, 2021 •

edited

Loading

alemuller commented Nov 16, 2021 •

edited

Loading

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 21, 2021

an incremental change can corrupt the parse tree #1444

an incremental change can corrupt the parse tree #1444

Comments

the-mikedavis commented Oct 17, 2021 • edited Loading

the-mikedavis commented Nov 3, 2021

milahu commented Nov 16, 2021 • edited Loading

alemuller commented Nov 16, 2021 • edited Loading

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 19, 2021

maxbrunsfeld commented Nov 21, 2021

the-mikedavis commented Oct 17, 2021 •

edited

Loading

milahu commented Nov 16, 2021 •

edited

Loading

alemuller commented Nov 16, 2021 •

edited

Loading