Fix limitations of the parse state merging algorithm to produce tables with fewer states #354
I extracted this out of PR #334, since it's an orthogonal change, and unlike #334, it doesn't require any changes to the runtime library or the generated parser ABI.
Problem
Tree-sitter generates LR(1) parsing tables, as opposed to LALR(1) tables, which are much smaller but can introduce conflicts that make grammars harder to write.
In order to reduce code size, there's some functionality in the minimize_parse_table module that tries to shrink the LR(1) parse table as much as possible by merging states after-the-fact.
I recently found a major deficiency in this parse state merging algorithm - it would fail to merge large groups of states that formed identical cyclic structures in the table.
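To illustrate how cycles can defeat a merging pass, here is a minimal sketch. It assumes a merge rule that only combines states whose table rows are literally identical, which is an assumption made for illustration and not necessarily how the old minimize_parse_table code worked. The tables, state names, and the merge_identical function are all hypothetical.

```python
# Hypothetical toy tables: state -> (action, {symbol: successor state}).
CYCLIC = {
    "A1": ("r1", {"x": "B1"}),
    "B1": ("r2", {"y": "A1"}),
    "A2": ("r1", {"x": "B2"}),
    "B2": ("r2", {"y": "A2"}),
}
ACYCLIC = {
    "A1": ("r1", {"x": "B1"}),
    "B1": ("r2", {}),
    "A2": ("r1", {"x": "B2"}),
    "B2": ("r2", {}),
}


def merge_identical(table):
    """Merge states only when their rows are literally identical."""
    table = dict(table)
    while True:
        seen = {}
        duplicate = None
        for state in sorted(table):
            action, gotos = table[state]
            key = (action, tuple(sorted(gotos.items())))
            if key in seen:
                duplicate = (state, seen[key])
                break
            seen[key] = state
        if duplicate is None:
            return table
        drop, keep = duplicate
        del table[drop]
        # Redirect every transition that pointed at the dropped state.
        table = {
            s: (a, {sym: keep if t == drop else t for sym, t in g.items()})
            for s, (a, g) in table.items()
        }


print(len(merge_identical(ACYCLIC)))  # the two chains collapse into one
print(len(merge_identical(CYCLIC)))   # the two cycles stay separate
```

In the acyclic table, merging the leaf states makes their predecessors identical, so the merge cascades. In the cyclic table, A1 and A2 differ only in pointing at B1 versus B2, and B1 and B2 differ only in pointing at A1 versus A2, so neither pair ever becomes identical and nothing merges.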
A False Start
There are several papers written about how to generate a smaller LR(1) parse table more directly, but the IELR(1) algorithm is the only one that I'm aware of that accounts for grammars that contain conflicts, to be resolved either with precedence or with the GLR algorithm.
I spent some time yesterday and the day before trying to rework Tree-sitter's table construction to work more like the algorithm from the IELR paper. Their approach is fairly complex: there are six phases to the table construction algorithm, and they still don't account for some issues that Tree-sitter deals with because of its context-aware lexing.
Solution
I ended up giving up on the IELR approach, but I came away from the exercise with a couple of insights:
I ended up totally reworking Tree-sitter's after-the-fact parse state merging algorithm based on these insights. Now, I attempt to merge all the LR(1) parse states with common item set cores, and then re-split the states only as necessary to avoid introducing lexical or syntactic conflicts. The new algorithm merges states much more thoroughly and is actually much faster as well.
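The merge-then-split idea can be sketched roughly as follows. This is a simplified illustration, not the actual Tree-sitter implementation: the state representation, the compatible and merge_states functions, and the toy STATES table are all hypothetical, and only syntactic conflicts are modeled (lexical conflicts are ignored).

```python
from collections import defaultdict

# Hypothetical toy states: a "core" label, actions per lookahead, and gotos.
STATES = {
    1: {"core": "K", "actions": {"c": "reduce E", "d": "reduce F"}, "gotos": {}},
    2: {"core": "K", "actions": {"c": "reduce F", "d": "reduce E"}, "gotos": {}},
    3: {"core": "P", "actions": {"x": "shift"}, "gotos": {"x": 4}},
    4: {"core": "Q", "actions": {"y": "shift"}, "gotos": {"y": 3}},
    5: {"core": "P", "actions": {"x": "shift"}, "gotos": {"x": 6}},
    6: {"core": "Q", "actions": {"y": "shift"}, "gotos": {"y": 5}},
    7: {"core": "R", "actions": {"e": "shift"}, "gotos": {"e": 1}},
    8: {"core": "R", "actions": {"e": "shift"}, "gotos": {"e": 2}},
}


def compatible(a, b, group_of):
    """True if merging a and b adds no conflict under the current grouping."""
    for lookahead in a["actions"].keys() & b["actions"].keys():
        if a["actions"][lookahead] != b["actions"][lookahead]:
            return False  # two different actions on the same lookahead
    for symbol in a["gotos"].keys() & b["gotos"].keys():
        if group_of[a["gotos"][symbol]] != group_of[b["gotos"][symbol]]:
            return False  # successors ended up in different groups
    return True


def merge_states(states):
    # Step 1: optimistically merge every state sharing an item set core.
    by_core = defaultdict(set)
    for state_id, state in states.items():
        by_core[state["core"]].add(state_id)
    groups = list(by_core.values())

    # Step 2: re-split groups until no further split is needed.
    while True:
        group_of = {sid: i for i, group in enumerate(groups) for sid in group}
        new_groups = []
        for group in groups:
            subgroups = []
            for sid in sorted(group):
                for sub in subgroups:
                    if all(compatible(states[sid], states[o], group_of) for o in sub):
                        sub.add(sid)
                        break
                else:
                    subgroups.append({sid})
            new_groups.extend(subgroups)
        if len(new_groups) == len(groups):
            return new_groups
        groups = new_groups


print(sorted(sorted(g) for g in merge_states(STATES)))
```

Because the grouping starts merged and is only split on evidence of a conflict, the two identical cycles (3, 4) and (5, 6) stay merged. States 1 and 2 are split because merging them would create a reduce/reduce conflict, and that split then propagates to their predecessors 7 and 8 on the next pass.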
I feel this after-the-fact merging of the LR(1) table is still much simpler conceptually than the algorithm presented in the IELR paper, and is pretty fast in practice. The Ruby parser now generates in 7 seconds, and all the other ones generate in less than a second.
Result
This significantly improved all the parsers that I've tested it against:
Next Steps
Together with #334, this will improve the binary sizes a lot.