Reduce parsers' static memory footprint by storing "small" parse states more compactly #334
To help debug where the parse states originate from, I added a --report-states flag:
$ tree-sitter generate --report-states
And then there's a second flag that allows you to view each state's parse item set, and the symbol sequence that would lead to that state:
$ tree-sitter generate --report-states-for-rule for_in_statement
which produces a bunch of output like this, showing the item sets with their lookahead context:
This output eventually helped track down the bug fixed in #354. Using this command, I can see that in many languages, a lot of states currently cannot be merged because of sub-optimal structures in the grammars.
Ok, I think this is in good shape. I'm not 100% ready to commit to this ABI though, so I've added a new flag to the
By merging this, I can stop rebasing this PR against all of the changes on master, and I can experiment with the new ABI more easily. Prior to enabling the new ABI by default, I may still make some structural changes to optimize the parser file size further.
Background
Currently, Tree-sitter's parse table is represented as a two-dimensional array of uint16_t values (which represent either action ids or successor state ids), indexed by parse state and by lookahead symbol.
Problem
With many parsers having around 2000 states and 200 symbols, this array occupies a significant amount of statically-allocated memory (roughly 2000 × 200 × 2 bytes, or about 800 kB). I wrote a script to display the size of each symbol in a Tree-sitter parser binary:
Solution
A lot of states in this table are sparse - they have very few valid lookahead symbols. This means that we can save space by representing them in a different way.
In this PR, I've introduced the notion of small parse states. These states are represented as arrays of (lookahead, value) pairs instead of as arrays of size SYMBOL_COUNT, indexed by lookahead.
The small parse states are all stored in a single 1-D array. The starting index of each small state is stored in a separate array:
So the procedure for looking up a value in the parse table now has a little bit of conditional logic.
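The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names (large_table, small_table, small_table_map, table_lookup), the toy sizes, the convention that large states have the lowest state ids, and the per-state layout of a pair count followed by (symbol, value) pairs are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy sizes; real parsers have thousands of states. */
enum { SYMBOL_COUNT = 6, LARGE_STATE_COUNT = 2, STATE_COUNT = 4 };

/* Large states: one dense row per state, indexed directly by lookahead.
 * A value of 0 stands for "no entry" in this sketch. */
static const uint16_t large_table[LARGE_STATE_COUNT][SYMBOL_COUNT] = {
  {0, 7, 8, 9, 10, 11},
  {0, 12, 0, 13, 14, 15},
};

/* Small states: all packed into a single 1-D array. Assumed layout per state:
 * pair_count, then (symbol, value) pairs with symbols in ascending order. */
static const uint16_t small_table[] = {
  /* state 2 */ 2,  1, 20,  4, 21,
  /* state 3 */ 1,  3, 30,
};

/* Starting index of each small state within small_table. */
static const uint32_t small_table_map[STATE_COUNT - LARGE_STATE_COUNT] = {0, 5};

static uint16_t table_lookup(uint16_t state, uint16_t symbol) {
  assert(state < STATE_COUNT && symbol < SYMBOL_COUNT);

  /* Large state: plain 2-D array access, as before. */
  if (state < LARGE_STATE_COUNT) {
    return large_table[state][symbol];
  }

  /* Small state: find its slice, then scan its (symbol, value) pairs. */
  const uint16_t *entry = &small_table[small_table_map[state - LARGE_STATE_COUNT]];
  uint16_t pair_count = *entry++;
  for (uint16_t i = 0; i < pair_count; i++, entry += 2) {
    if (entry[0] == symbol) return entry[1];
    if (entry[0] > symbol) break; /* sorted, so the symbol cannot appear later */
  }
  return 0;
}
```

In this toy example the two small states occupy 8 uint16_t entries (plus two index entries in small_table_map), where a dense layout would need 2 × SYMBOL_COUNT = 12. The gap grows with SYMBOL_COUNT, since a small state's cost depends only on how many lookaheads are valid in it.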
Results
This reduces the size of language binaries (and their static memory footprint) by 50% to 75%:
In Python, the majority of parse states are actually small, so there's more than a 50% savings:
$ ./script/show-symbol-sizes ~/.tree-sitter/bin/python.so
total                  339.5 kb
_ts_small_parse_table  217.3 kb
_ts_parse_actions       39.3 kb
_ts_parse_table         24.9 kb
Notes
ABI versioning - This PR entails another backward-compatible ABI change, so I've bumped TREE_SITTER_LANGUAGE_VERSION up to 11. But the library will still be able to load parsers that were compiled with ABI version 9 and up, so the transition will be easy to manage.
Runtime cost - This change doesn't make a measurable difference in parsing speed. The symbols within a small parse state are ordered, so we could search them using a binary search, but for small arrays, I'm not sure it's worth it. Currently I just use a linear search with an early break based on the ordering.
WASM - Unfortunately, the gzipped size of the binaries is not really affected by this change. Some binaries gzip slightly smaller, and some have actually gotten slightly larger. On average, they remain around 70k gzipped. Still, I think the memory savings are worthwhile in their own right.
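For reference, the binary search mentioned under "Runtime cost" as a possible alternative could look something like the sketch below. It is not what this PR does (the PR uses a linear search with an early break); the SmallEntry type, its field names, and the use of 0 for "not found" are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair type; the field names are illustrative only. */
typedef struct {
  uint16_t symbol;
  uint16_t value;
} SmallEntry;

/* Binary search over one small state's entries, which must be sorted by
 * ascending symbol. Returns 0 when the symbol has no entry. */
static uint16_t find_binary(const SmallEntry *entries, size_t count, uint16_t symbol) {
  assert(entries != NULL || count == 0);
  size_t lo = 0, hi = count; /* search the half-open range [lo, hi) */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (entries[mid].symbol == symbol) return entries[mid].value;
    if (entries[mid].symbol < symbol) {
      lo = mid + 1; /* target, if present, is to the right of mid */
    } else {
      hi = mid;     /* target, if present, is to the left of mid */
    }
  }
  return 0;
}
```

With only a handful of entries per small state, the linear scan touches a few adjacent array elements and has simpler control flow, so any gain from halving the number of comparisons is likely to be small. That is consistent with the hesitation expressed in the note.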
🎩 to @marijnh for pointing out how much room for optimization exists due to sparse parse states.