Fix limitations of the parse state merging algorithm to produce tables with fewer states #354

Merged
merged 1 commit into from Jun 6, 2019

maxbrunsfeld
Contributor

I extracted this out of PR #334, since it's an orthogonal change, and unlike #334, it doesn't require any changes to the runtime library or the generated parser ABI.

Problem

Tree-sitter generates LR(1) parse tables rather than LALR(1) tables. LALR(1) tables are much smaller, but they can introduce conflicts that make grammars harder to write.

In order to reduce code size, there's some functionality in the minimize_parse_table module that tries to shrink the LR(1) parse table as much as possible by merging states after-the-fact.

I recently found a major deficiency in this parse state merging algorithm - it would fail to merge large groups of states that formed identical cyclic structures in the table.
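
As a toy illustration of how identical cycles can defeat a merging pass (this is a hypothetical sketch, not Tree-sitter's actual old implementation), suppose two states are merged only when their transitions already lead to the same states. The table layout and the `conservative_merge` function below are assumptions made for the example.

```python
# Toy parse table: state id -> {symbol: successor state id}.
# States 0/1 and 2/3 form two structurally identical cycles.
TABLE = {
    0: {"a": 1},
    1: {"b": 0},
    2: {"a": 3},
    3: {"b": 2},
}


def conservative_merge(table):
    """Merge two states only when their transitions are already identical.

    Returns the number of states left once no more merges apply.
    """
    # rep maps every state to the representative of its merged group.
    rep = {s: s for s in table}
    changed = True
    while changed:
        changed = False
        states = sorted(set(rep.values()))
        for i, a in enumerate(states):
            for b in states[i + 1:]:
                row_a = {sym: rep[t] for sym, t in table[a].items()}
                row_b = {sym: rep[t] for sym, t in table[b].items()}
                if row_a == row_b:
                    # Fold group b into group a, then rescan from scratch.
                    for s in rep:
                        if rep[s] == b:
                            rep[s] = a
                    changed = True
                    break
            if changed:
                break
    return len(set(rep.values()))


print(conservative_merge(TABLE))  # 4: nothing merges
```

State 0 could merge with state 2 only if 1 and 3 were already merged, and 1 could merge with 3 only if 0 and 2 were already merged, so a pass of this kind never gets started on the cycle.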

A False Start

There are several papers about how to generate a smaller LR(1) parse table more directly, but the IELR(1) algorithm is the only one that I'm aware of that accounts for grammars that contain conflicts, to be resolved either with precedence or with the GLR algorithm.

I spent some time yesterday and the day before trying to rework Tree-sitter's table construction to work more like the algorithm from the IELR paper. Their approach is fairly complex: there are six phases to the table construction algorithm, and they still don't account for some issues that Tree-sitter deals with because of its context-aware lexing.

Solution

I ended up giving up on the IELR approach, but I came away from the exercise with a couple of insights:

  1. To ensure that the parse table is minimal, it's better to eagerly merge states and then split them as necessary, rather than to merge conservatively.
  2. The LALR algorithm's isocore merging criterion is a good way of initially merging states.
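
For readers unfamiliar with the term, two LR(1) states are isocores when their item sets are equal once lookaheads are ignored. A minimal sketch of grouping states that way follows; the item representation (`(production id, dot position, lookahead)`) and the names `core` and `group_by_core` are assumptions for illustration, not Tree-sitter's data structures.

```python
from collections import defaultdict

# A hypothetical LR(1) item: (production id, dot position, lookahead token).
STATES = {
    0: {(1, 1, "+"), (2, 0, "+")},
    1: {(1, 1, ")"), (2, 0, ")")},
    2: {(3, 2, "+")},
}


def core(item_set):
    """Drop the lookaheads, keeping only (production, dot position) pairs."""
    return frozenset((prod, dot) for prod, dot, _lookahead in item_set)


def group_by_core(states):
    """Group state ids whose item sets share the same core."""
    groups = defaultdict(list)
    for state_id, items in states.items():
        groups[core(items)].append(state_id)
    return sorted(groups.values())


print(group_by_core(STATES))  # [[0, 1], [2]]
```

States 0 and 1 differ only in their lookaheads, so they land in the same initial group; whether they can stay merged is decided afterwards.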

I ended up totally reworking Tree-sitter's after-the-fact parse state merging algorithm based on these insights. Now, I attempt to merge all the LR(1) parse states with common item set cores, and then re-split the states only as necessary to avoid introducing lexical or syntactic conflicts. The new algorithm merges states much more thoroughly and is actually much faster as well.
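
The merge-then-split idea can be sketched as a partition refinement. This is a simplified illustration under stated assumptions, not the code in this PR: `core_of` is a hypothetical precomputed core label per state, and `conflicts` is a hypothetical predicate standing in for the lexical and syntactic conflict checks.

```python
# Toy parse table: state id -> {symbol: successor state id}.
# States 0/1 and 2/3 form two structurally identical cycles.
TABLE = {0: {"a": 1}, 1: {"b": 0}, 2: {"a": 3}, 3: {"b": 2}}
CORE_OF = {0: "A", 1: "B", 2: "A", 3: "B"}  # hypothetical core labels


def merge_then_split(table, core_of, conflicts):
    """Eagerly merge states by core, then split groups until stable."""
    # Step 1: optimistic merge, one group per item set core.
    by_core = {}
    for s in table:
        by_core.setdefault(core_of[s], []).append(s)
    partition = list(by_core.values())

    # Step 2: split any group whose members disagree or conflict.
    changed = True
    while changed:
        changed = False
        group_of = {s: i for i, g in enumerate(partition) for s in g}

        def signature(state):
            # Which group each symbol leads to, under the current partition.
            return {sym: group_of[t] for sym, t in table[state].items()}

        new_partition = []
        for group in partition:
            buckets = []
            for s in group:
                for bucket in buckets:
                    same_targets = signature(s) == signature(bucket[0])
                    if same_targets and not any(conflicts(m, s) for m in bucket):
                        bucket.append(s)
                        break
                else:
                    buckets.append([s])
            if len(buckets) > 1:
                changed = True
            new_partition.extend(buckets)
        partition = new_partition
    return partition


# With no conflicts, the two cycles collapse into one.
print(len(merge_then_split(TABLE, CORE_OF, lambda a, b: False)))  # 2
```

If merging states 1 and 3 would introduce a conflict, that group is split, which in turn forces states 0 and 2 apart on the next pass because their successors now sit in different groups. Starting merged and splitting on demand is what lets the cyclic case collapse when nothing forbids it.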

I feel this after-the-fact merging of the LR(1) table is still much simpler conceptually than the algorithm presented in the IELR paper, and is pretty fast in practice. The Ruby parser now generates in 7 seconds, and all the other ones generate in less than a second.

Result

This significantly improved all the parsers that I've tested it against:

| language   | old state count | new state count |
| ---------- | --------------- | --------------- |
| JavaScript | 2139            | 1143            |
| C          | 1637            | 1108            |
| Ruby       | 12478           | 5624            |
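
To put those counts in relative terms, the reductions can be computed directly from the table. The snippet below uses only the numbers above; the names `COUNTS` and `reduction_percent` are made up for the example.

```python
# (old state count, new state count) per language, copied from the table.
COUNTS = {
    "JavaScript": (2139, 1143),
    "C": (1637, 1108),
    "Ruby": (12478, 5624),
}


def reduction_percent(old, new):
    """Percentage of states eliminated by the new merging algorithm."""
    return 100.0 * (old - new) / old


for lang, (old, new) in COUNTS.items():
    pct = reduction_percent(old, new)
    print(f"{lang}: {old - new} fewer states ({pct:.1f}% reduction)")
# JavaScript: 996 fewer states (46.6% reduction)
# C: 529 fewer states (32.3% reduction)
# Ruby: 6854 fewer states (54.9% reduction)
```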

Next Steps

Together with #334, this will improve the binary sizes a lot.
