Reduce parsers' static memory footprint by storing "small" parse states more compactly #334

Merged: 6 commits, Aug 30, 2019


maxbrunsfeld commented May 17, 2019

Background

Currently, Tree-sitter's parse table is represented as a two-dimensional array of uint16_t values (which represent either action ids or successor state ids), indexed by parse state and by lookahead symbol.

static uint16_t ts_parse_table[STATE_COUNT][SYMBOL_COUNT] = {
  [0] = {
    [sym_identifier] = ACTIONS(250),
    [anon_sym_LBRACE] = ACTIONS(251),
    [sym_expression] = STATE(102),

    // ...
  },

  // ...
};

Problem

Many parsers have around 2000 states and 200 symbols, so this array occupies a significant amount of statically allocated memory. I wrote a script to display the size of each symbol in a Tree-sitter parser binary:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/javascript.so

total                                                            	 1034.8 kb
_ts_parse_table                                                  	 931.6 kb
_ts_parse_actions                                                	 40.1 kb
_ts_lex                                                          	 34.6 kb
_ts_lex_modes                                                    	 8.4 kb
_ts_lex_keywords                                                 	 7.4 kb
_ts_alias_sequences                                              	 4.7 kb

# ...

Solution

Many states in this table are sparse: they have very few valid lookahead symbols, so most of their row is empty. This means that we can save space by representing them in a different way.

In this PR, I've introduced the notion of small parse states. These states are represented as arrays of (lookahead, value) pairs instead of as arrays of size SYMBOL_COUNT, indexed by lookahead.

static uint16_t ts_small_parse_table[] = {
  [0] = 3,
    sym_identifier, ACTIONS(1306),
    anon_sym_extern, ACTIONS(27),
    anon_sym_static, ACTIONS(27),
  [7] = 2,
    sym_comment, ACTIONS(3),
    sym_parenthesized_expression, STATE(102),

  // ...
};
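As a rough back-of-the-envelope check on when the compact form pays off, the sketch below compares the bytes used by one dense row with the bytes used by one small state. It assumes the layout shown above (an entry count followed by (lookahead, value) pairs, all uint16_t). The function names are hypothetical and do not come from the generated parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Bytes used by one state in the dense 2-D table: one uint16_t per symbol.
static size_t dense_state_bytes(size_t symbol_count) {
  return symbol_count * sizeof(uint16_t);
}

// Bytes used by one state in the compact form sketched above:
// one uint16_t entry count, then one (lookahead, value) pair per entry.
static size_t small_state_bytes(size_t entry_count) {
  return (1 + 2 * entry_count) * sizeof(uint16_t);
}

// Largest entry count for which the compact form is strictly smaller
// than a dense row (1 + 2n < symbol_count). Hypothetical helper.
static size_t max_profitable_entries(size_t symbol_count) {
  assert(symbol_count >= 2);
  return (symbol_count - 2) / 2;
}
```

With 200 symbols, a dense row costs 400 bytes, while a state with three entries costs 14 bytes (7 uint16_t slots, which is why the second state in the listing above starts at index 7). Under these assumptions the compact form stops being smaller once a state has 100 or more entries.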

The small parse states are all stored in a single 1-D array. The starting index of each small state is stored in a separate array:

static uint32_t ts_small_parse_table_map[] = {
  [SMALL_STATE(385)] = 0,
  [SMALL_STATE(386)] = 97,
  [SMALL_STATE(387)] = 194,

  // ...
};

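So looking up a value in the parse table now involves a little conditional logic: large states are still indexed directly by lookahead symbol, while small states are located through the map and then searched.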
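The lookup can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the tiny tables, the `lookup` function, the convention that small states are numbered after the large ones, and the use of 0 for "no entry" are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

// Toy dimensions, chosen only for this example.
enum { SYMBOL_COUNT = 4, LARGE_STATE_COUNT = 2, STATE_COUNT = 4 };

// Dense rows for the "large" states, indexed by [state][symbol].
static const uint16_t parse_table[LARGE_STATE_COUNT][SYMBOL_COUNT] = {
  [0] = {[0] = 10, [1] = 11, [2] = 12, [3] = 13},
  [1] = {[0] = 20, [2] = 22, [3] = 23},
};

// Compact entries for the "small" states: an entry count, then
// (lookahead, value) pairs sorted by lookahead symbol.
static const uint16_t small_parse_table[] = {
  [0] = 2,  // state 2 has two valid lookaheads
    1, 31,
    3, 33,
  [5] = 1,  // state 3 has one valid lookahead
    2, 42,
};

// Starting offset of each small state within small_parse_table.
static const uint32_t small_parse_table_map[] = {
  [2 - LARGE_STATE_COUNT] = 0,
  [3 - LARGE_STATE_COUNT] = 5,
};

// Returns the table value for (state, symbol), or 0 if there is none.
static uint16_t lookup(uint16_t state, uint16_t symbol) {
  assert(state < STATE_COUNT && symbol < SYMBOL_COUNT);
  if (state < LARGE_STATE_COUNT) {
    return parse_table[state][symbol];  // large state: direct index
  }
  uint32_t offset = small_parse_table_map[state - LARGE_STATE_COUNT];
  const uint16_t *entry = &small_parse_table[offset];
  uint16_t count = *entry++;
  for (uint16_t i = 0; i < count; i++, entry += 2) {
    if (entry[0] == symbol) return entry[1];
    if (entry[0] > symbol) break;  // entries are sorted: symbol is absent
  }
  return 0;
}
```

Here `lookup(0, 2)` reads the dense row directly, while `lookup(2, 3)` goes through the map and scans the two pairs stored for state 2.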

Results

This reduces the size of language binaries (and their static memory footprint) by roughly 45% to 75%. For JavaScript, the total drops from 1034.8 kb to 570.9 kb:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/javascript.so
total                                                            	 570.9 kb
_ts_parse_table                                                  	 362.4 kb
_ts_small_parse_table                                            	 99.7 kb
_ts_parse_actions                                                	 40.1 kb
# ...

In Python, the majority of parse states are actually small, so there's more than a 50% savings:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/python.so
total                                                            	 339.5 kb
_ts_small_parse_table                                            	 217.3 kb
_ts_parse_actions                                                	 39.3 kb
_ts_parse_table                                                  	 24.9 kb

Notes

  • ABI versioning - This PR entails another backward-compatible ABI change, so I've bumped TREE_SITTER_LANGUAGE_VERSION up to 11. But the library will still be able to load parsers that were compiled with ABI version 9 and up, so the transition will be easy to manage.

  • Runtime cost - This change doesn't make a measurable difference in parsing speed. The symbols within a small parse state are ordered, so we could search them using a binary search, but for small arrays, I'm not sure it's worth it. Currently I just use a linear search with an early break based on the ordering.

  • WASM - Unfortunately, the gzipped size of the binaries is not really affected by this change. Some binaries gzip slightly smaller, and some have actually gotten slightly larger. On average, they remain around 70k gzipped. Still, I think the memory savings are worthwhile in their own right.
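For comparison with the linear search mentioned under "Runtime cost", a binary search over one small state's sorted pairs might look like the sketch below. It is not part of this PR; the function name, the sample data, and the use of 0 for "no entry" are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Sample (lookahead, value) pairs, sorted by lookahead. Made-up data.
static const uint16_t example_pairs[] = {1, 31, 3, 33, 7, 77};

// Binary search over `count` (lookahead, value) pairs sorted by lookahead.
// Returns the matching value, or 0 if `symbol` has no entry.
static uint16_t find_entry_binary(const uint16_t *pairs, uint16_t count,
                                  uint16_t symbol) {
  assert(pairs != NULL || count == 0);
  uint32_t lo = 0, hi = count;  // search the half-open range [lo, hi)
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    uint16_t key = pairs[2 * mid];
    if (key == symbol) return pairs[2 * mid + 1];
    if (key < symbol) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return 0;
}
```

For states with only a handful of entries, the extra branching here is unlikely to beat a short linear scan with an early break, which is consistent with the choice described above.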

🎩 to @marijnh for pointing out how much room for optimization exists due to sparse parse states.

@maxbrunsfeld

To help debug where parse states originate, I added a --report-states flag to tree-sitter generate that outputs something like this:

$ tree-sitter generate --report-states
binary_expression              	175
member_expression              	161
call_expression                	132
subscript_expression           	128
ternary_expression             	125
update_expression              	124
for_of_statement               	101
object                         	83
for_in_statement               	74
assignment_expression          	68
_expression                    	65
arrow_function                 	54
jsx_self_closing_element       	50
augmented_assignment_expression	48
function                       	41
formal_parameters              	40
array                          	38
class                          	37
jsx_opening_element            	36
method_definition              	32
sequence_expression            	32
_property_name                 	30
assignment_pattern             	29
jsx_closing_element            	28
jsx_fragment                   	27
new_expression                 	27
object_repeat1                 	24
string                         	24
statement_block                	23
generator_function             	21
anonymous_class                	20

...

And then there's a second flag that lets you view each state's parse item set and the symbol sequence that would lead to that state:

$ tree-sitter generate --report-states-for-rule for_in_statement

which produces a bunch of output like this, showing the item sets with their lookahead context:

state index: 525
state id: 3668
symbol sequence: for ( let array in _expression
items:
call_expression → _expression • (12) template_string	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
call_expression → _expression • (12) arguments	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
member_expression → _expression • (14) . identifier@property_identifier	[*, ,, (, ), in, [, <, >, /, ., =, +=, -=, *=, /=, %=, ^=, &=, |=, >>=, >>>=, <<=, **=, ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
subscript_expression → _expression • (14 Right) [ _expressions ]	[*, ,, (, ), in, [, <, >, /, ., =, +=, -=, *=, /=, %=, ^=, &=, |=, >>=, >>>=, <<=, **=, ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
ternary_expression → _expression • (1 Right) ? _expression : _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) || _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) ^ _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) | _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (3 Left) && _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (3 Left) & _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) in _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) < _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) > _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) <= _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) == _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) === _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) != _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) !== _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) >= _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) instanceof _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (5 Left) + _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (5 Left) - _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) * _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) / _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) >> _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) >>> _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) << _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) % _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (7 Left) ** _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
update_expression → _expression • (11 Left) ++	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
update_expression → _expression • (11 Left) --	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
sequence_expression → _expression • (-1) , _expression	[)]
sequence_expression → _expression • (-1) , sequence_expression	[)]
for_in_statement → for ( let array@array_pattern in _expression • ) _statement	[export, {, import, var, let, const, if, switch, for, (, await, while, do, try, with, break, continue, debugger, return, throw, ;, yield, [, <, /, class, async, function, =>, new, +, -, !, ~, typeof, void, delete, ++, --, ", ', `, number, identifier, this, super, true, false, null, undefined, @, get, set, program]

This output eventually helped track down the bug fixed in #354.

Using this command, I can see that in many languages, a lot of states currently cannot be merged because of sub-optimal structures in the grammars.


maxbrunsfeld commented Aug 30, 2019

Ok, I think this is in good shape. I'm not 100% ready to commit to this ABI though, so I've added a new flag to the generate command called --next-abi. By default, Tree-sitter will continue to generate the parsers with the same ABI that it did before. It will only generate the new structure if you run tree-sitter generate --next-abi.

By merging this, I can stop having to rebase this PR against all of the changes on master, and I can experiment with the new ABI more easily. Prior to enabling the new ABI by default, I may still make some structural changes to optimize the parser file size further.
