Fix limitations of the parse state merging algorithm to produce tables with fewer states #354

maxbrunsfeld · 2019-06-06T22:16:55Z

I extracted this out of PR #334, since it's an orthogonal change, and unlike #334, it doesn't require any changes to the runtime library or the generated parser ABI.

Problem

Tree-sitter generates LR(1) parse tables rather than LALR(1) tables. LALR(1) tables are much smaller, but they can introduce conflicts that make grammars harder to write.

In order to reduce code size, there's some functionality in the minimize_parse_table module that tries to shrink the LR(1) parse table as much as possible by merging states after-the-fact.

I recently found a major deficiency in this parse state merging algorithm - it would fail to merge large groups of states that formed identical cyclic structures in the table.

A False Start

There are several papers about how to generate a smaller LR(1) parse table more directly, but the IELR(1) algorithm is the only one I'm aware of that accounts for grammars that contain conflicts, to be resolved either with precedence or with the GLR algorithm.

I spent some time yesterday and the day before trying to rework Tree-sitter's table construction to work more like the algorithm from the IELR paper. Their approach is fairly complex: there are six phases to the table construction algorithm, and they still don't account for some issues that Tree-sitter deals with because of its context-aware lexing.

Solution

I ended up giving up on the IELR approach, but I came away from the exercise with a couple of insights:

To ensure that the parse table is minimal, it's better to eagerly merge states and then split them as necessary, rather than to merge conservatively.
The LALR algorithm's isocore merging criterion is a good way of initially merging states.

I ended up totally reworking Tree-sitter's after-the-fact parse state merging algorithm based on these insights. Now, I attempt to merge all the LR(1) parse states with common item set cores, and then re-split the states only as necessary to avoid introducing lexical or syntactic conflicts. The new algorithm merges states much more thoroughly and is actually much faster as well.
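The merge-then-split idea can be sketched roughly as follows. This is a simplified illustration in Python, not the code in minimize_parse_table: the state representation (`core`, `reduce`, `next`) and the function names are hypothetical, and the only conflicts it checks are two different reductions on the same lookahead token and successors that ended up in different groups. Lexical conflicts, precedence, and shift/reduce handling are left out.

```python
from collections import defaultdict


def compatible(states, a, b, group_of):
    """Return True if states a and b can share a merged state (simplified check)."""
    sa, sb = states[a], states[b]
    # Merging must not put two different reductions on the same lookahead token.
    for token, rule in sa["reduce"].items():
        if sb["reduce"].get(token, rule) != rule:
            return False
    # Successors on the same symbol must currently belong to the same group.
    for symbol, target in sa["next"].items():
        other = sb["next"].get(symbol)
        if other is not None and group_of[other] != group_of[target]:
            return False
    return True


def merge_states(states):
    """Group states by item set core, then split groups until nothing changes."""
    # Step 1: eagerly merge every state that shares a core.
    by_core = defaultdict(list)
    for state_id, state in states.items():
        by_core[state["core"]].append(state_id)
    partition = list(by_core.values())

    # Step 2: re-split groups whose members turn out to be incompatible.
    changed = True
    while changed:
        changed = False
        group_of = {sid: i for i, group in enumerate(partition) for sid in group}
        new_partition = []
        for group in partition:
            subgroups = []
            for sid in group:
                for sub in subgroups:
                    if all(compatible(states, sid, other, group_of) for other in sub):
                        sub.append(sid)
                        break
                else:
                    subgroups.append([sid])
            if len(subgroups) > 1:
                changed = True
            new_partition.extend(subgroups)
        partition = new_partition
    return partition


# Hypothetical toy table: A1/B1 and A2/B2 form two identical cycles,
# C1 and C2 reduce different rules on the same token, and D1/D2 lead into them.
states = {
    "A1": {"core": "A", "reduce": {}, "next": {"x": "B1"}},
    "B1": {"core": "B", "reduce": {"$": "r1"}, "next": {"y": "A1"}},
    "A2": {"core": "A", "reduce": {}, "next": {"x": "B2"}},
    "B2": {"core": "B", "reduce": {";": "r1"}, "next": {"y": "A2"}},
    "C1": {"core": "C", "reduce": {"$": "r2"}, "next": {}},
    "C2": {"core": "C", "reduce": {"$": "r3"}, "next": {}},
    "D1": {"core": "D", "reduce": {}, "next": {"z": "C1"}},
    "D2": {"core": "D", "reduce": {}, "next": {"z": "C2"}},
}
merged = merge_states(states)
```

In this toy table the two cycles collapse into the groups A1/A2 and B1/B2 because the groups start out merged and nothing forces them apart. A conservative approach that only merges two states once their successors are already identical would never merge them, since each pair's successors are the other pair. C1 and C2 are split because of the clashing reductions, and that split then forces D1 and D2 apart on the next pass.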

I feel this after-the-fact merging of the LR(1) table is still much simpler conceptually than the algorithm presented in the IELR paper, and is pretty fast in practice. The Ruby parser now generates in 7 seconds, and all the other ones generate in less than a second.

Result

This significantly improved all the parsers that I've tested it against:

language	old state count	new state count
JavaScript	2139	1143
C	1637	1108
Ruby	12478	5624
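
For a sense of scale, the relative reduction per language can be computed directly from the table above. The snippet below only restates those numbers; the variable names are illustrative.

```python
# (old state count, new state count), copied from the table above.
counts = {
    "JavaScript": (2139, 1143),
    "C": (1637, 1108),
    "Ruby": (12478, 5624),
}

# Percentage of states eliminated, rounded to one decimal place.
reductions = {
    language: round(100 * (old - new) / old, 1)
    for language, (old, new) in counts.items()
}

for language, percent in reductions.items():
    print(f"{language}: {percent}% fewer states")
```

That works out to roughly 46.6% fewer states for JavaScript, 32.3% for C, and 54.9% for Ruby.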

Next Steps

Together with #334, this will improve the binary sizes a lot.

Fix missed opportunities to merge parse states

f7d25a5

maxbrunsfeld force-pushed the eager-state-merging branch from 752b3bf to f7d25a5 Compare June 6, 2019 22:29

maxbrunsfeld merged commit c2e1f68 into master Jun 6, 2019

maxbrunsfeld deleted the eager-state-merging branch June 6, 2019 22:51

maxbrunsfeld mentioned this pull request Jun 7, 2019

Reduce parsers' static memory footprint by storing "small" parse states more compactly #334

Merged
