Reduce parsers' static memory footprint by storing "small" parse states more compactly #334

maxbrunsfeld · 2019-05-17T00:08:20Z

Background

Currently, Tree-sitter's parse table is represented as a two-dimensional array of uint16_t values (which represent either action ids or successor state ids), indexed by parse state and by lookahead symbol.

static uint16_t ts_parse_table[STATE_COUNT][SYMBOL_COUNT] = {
  [0] = {
    [sym_identifier] = ACTIONS(250),
    [anon_sym_LBRACE] = ACTIONS(251),
    [sym_expression] = STATE(102),

    // ...
  },

  // ...
};

Problem

Many parsers have around 2000 states and 200 symbols, so this array occupies a significant amount of statically allocated memory. I wrote a script to display the size of each symbol in a Tree-sitter parser binary:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/javascript.so

total                                                            	 1034.8 kb
_ts_parse_table                                                  	 931.6 kb
_ts_parse_actions                                                	 40.1 kb
_ts_lex                                                          	 34.6 kb
_ts_lex_modes                                                    	 8.4 kb
_ts_lex_keywords                                                 	 7.4 kb
_ts_alias_sequences                                              	 4.7 kb

# ...
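
As a rough sanity check on these numbers: with the round figures quoted above (2000 states, 200 symbols, 2 bytes per entry), a dense table needs 2000 × 200 × 2 = 800,000 bytes, roughly 781 KiB, which is the same order of magnitude as the `_ts_parse_table` line in the output. A minimal sketch of that arithmetic follows; `dense_table_bytes` is a hypothetical helper written for illustration, not part of Tree-sitter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes occupied by a dense
 * uint16_t table[state_count][symbol_count]. */
static size_t dense_table_bytes(size_t state_count, size_t symbol_count) {
  /* Guard against overflow of the multiplication below. */
  assert(symbol_count == 0 ||
         state_count <= SIZE_MAX / sizeof(uint16_t) / symbol_count);
  return state_count * symbol_count * sizeof(uint16_t);
}
```

The cost grows with the product of the two counts regardless of how many entries in a row are actually populated, which is why sparse rows are wasteful in this layout.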

Solution

A lot of states in this table are sparse - they have very few valid lookahead symbols. This means that we can save space by representing them in a different way.

In this PR, I've introduced the notion of small parse states. These states are represented as arrays of (lookahead, value) pairs instead of as arrays of size SYMBOL_COUNT, indexed by lookahead.

static uint16_t ts_small_parse_table[] = {
  [0] = 3,
    sym_identifier, ACTIONS(1306),
    anon_sym_extern, ACTIONS(27),
    anon_sym_static, ACTIONS(27),
  [7] = 2,
    sym_comment, ACTIONS(3),
    sym_parenthesized_expression, STATE(102),

  // ...
};

The small parse states are all stored in a single 1-D array. The starting index of each small state is stored in a separate array:

static uint32_t ts_small_parse_table_map[] = {
  [SMALL_STATE(385)] = 0,
  [SMALL_STATE(386)] = 97,
  [SMALL_STATE(387)] = 194,

  // ...
};

So the procedure for looking up a value in the parse table now has a little bit of conditional logic.
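
The conditional lookup might look roughly like the sketch below. It is not the code from this PR: the toy tables, the symbol and state names, the `toy_lookup` function, and the convention that 0 means "no entry" are all assumptions made for illustration. It also assumes that the dense states occupy the lowest state ids, so that a single comparison decides which representation to use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy grammar: 4 symbols, 2 dense states, 2 small states. */
enum { SYM_END, SYM_IDENTIFIER, SYM_LBRACE, SYM_EXPRESSION, SYMBOL_COUNT };
enum { LARGE_STATE_COUNT = 2, STATE_COUNT = 4 };

#define SMALL_STATE(id) ((id) - LARGE_STATE_COUNT)

/* Dense rows, indexed by [state][lookahead]; 0 is assumed to mean "no entry". */
static const uint16_t toy_parse_table[LARGE_STATE_COUNT][SYMBOL_COUNT] = {
  [0] = { [SYM_IDENTIFIER] = 250, [SYM_LBRACE] = 251, [SYM_EXPRESSION] = 102 },
  [1] = { [SYM_END] = 7, [SYM_IDENTIFIER] = 250 },
};

/* Small states: an entry count, then (lookahead, value) pairs sorted by lookahead. */
static const uint16_t toy_small_parse_table[] = {
  [0] = 2,
    SYM_IDENTIFIER, 27,
    SYM_LBRACE, 31,
  [5] = 1,
    SYM_EXPRESSION, 3,
};

/* Starting index of each small state within toy_small_parse_table. */
static const uint32_t toy_small_parse_table_map[] = {
  [SMALL_STATE(2)] = 0,
  [SMALL_STATE(3)] = 5,
};

static uint16_t toy_lookup(uint16_t state, uint16_t symbol) {
  assert(state < STATE_COUNT && symbol < SYMBOL_COUNT);
  if (state < LARGE_STATE_COUNT) {
    /* Dense state: direct two-dimensional index. */
    return toy_parse_table[state][symbol];
  }
  /* Small state: find its slice, then scan the sorted pairs. */
  const uint16_t *entry =
      &toy_small_parse_table[toy_small_parse_table_map[SMALL_STATE(state)]];
  uint16_t count = *entry++;
  for (uint16_t i = 0; i < count; i++, entry += 2) {
    if (entry[0] == symbol) return entry[1];
    if (entry[0] > symbol) break; /* sorted, so the symbol cannot appear later */
  }
  return 0;
}
```

For dense states the cost is unchanged. For small states, the extra work is one indirection through the map plus a scan over a handful of pairs.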

Results

This reduces the size of language binaries (and their static memory footprint) by roughly 45% to 75%. For JavaScript, the total drops from 1034.8 kb to 570.9 kb:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/javascript.so
total                                                            	 570.9 kb
_ts_parse_table                                                  	 362.4 kb
_ts_small_parse_table                                            	 99.7 kb
_ts_parse_actions                                                	 40.1 kb
# ...

In Python, the majority of parse states are actually small, so the savings exceed 50%:

$ ./script/show-symbol-sizes ~/.tree-sitter/bin/python.so
total                                                            	 339.5 kb
_ts_small_parse_table                                            	 217.3 kb
_ts_parse_actions                                                	 39.3 kb
_ts_parse_table                                                  	 24.9 kb

Notes

ABI versioning - This PR entails another backward-compatible ABI change, so I've bumped TREE_SITTER_LANGUAGE_VERSION up to 11. But the library will still be able to load parsers that were compiled with ABI version 9 and up, so the transition will be easy to manage.
Runtime cost - This change doesn't make a measurable difference in parsing speed. The symbols within a small parse state are ordered, so we could search them using a binary search, but for small arrays, I'm not sure it's worth it. Currently I just use a linear search with an early break based on the ordering.
WASM - Unfortunately, the gzipped size of the binaries is not really affected by this change. Some binaries gzip slightly smaller, and some have actually gotten slightly larger. On average, they remain around 70k gzipped. Still, I think the memory savings are worthwhile in their own right.
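
For comparison, the binary-search alternative mentioned under "Runtime cost" could be sketched as follows. This is an illustration only, not something this PR implements: `small_state_bsearch`, the `example_pairs` data, and the use of 0 for "not found" are hypothetical, and the function assumes it receives a pointer to the first (lookahead, value) pair of one small state together with that state's pair count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example data: three (lookahead, value) pairs sorted by lookahead. */
static const uint16_t example_pairs[] = { 3, 30, 5, 50, 9, 90 };

/* Binary search over `count` sorted (lookahead, value) pairs.
 * Returns the value for `symbol`, or 0 if the symbol is absent. */
static uint16_t small_state_bsearch(const uint16_t *pairs, uint16_t count,
                                    uint16_t symbol) {
  assert(pairs != NULL || count == 0);
  uint16_t lo = 0, hi = count; /* search the half-open range [lo, hi) */
  while (lo < hi) {
    uint16_t mid = (uint16_t)(lo + (hi - lo) / 2);
    uint16_t key = pairs[2 * mid];
    if (key == symbol) return pairs[2 * mid + 1];
    if (key < symbol) {
      lo = (uint16_t)(mid + 1);
    } else {
      hi = mid;
    }
  }
  return 0;
}
```

With only a few pairs per state, the linear scan with an early break touches the same small region of memory and has simpler branching, so any gain from binary search is likely to be small. That is consistent with the hesitation expressed above, though it would need measuring.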

🎩 to @marijnh for pointing out how much room for optimization exists due to sparse parse states.

maxbrunsfeld · 2019-06-07T00:42:40Z

To help debug where parse states originate, I added a --report-states flag to tree-sitter generate. It outputs the number of states attributed to each rule, something like this:

$ tree-sitter generate --report-states

binary_expression              	175
member_expression              	161
call_expression                	132
subscript_expression           	128
ternary_expression             	125
update_expression              	124
for_of_statement               	101
object                         	83
for_in_statement               	74
assignment_expression          	68
_expression                    	65
arrow_function                 	54
jsx_self_closing_element       	50
augmented_assignment_expression	48
function                       	41
formal_parameters              	40
array                          	38
class                          	37
jsx_opening_element            	36
method_definition              	32
sequence_expression            	32
_property_name                 	30
assignment_pattern             	29
jsx_closing_element            	28
jsx_fragment                   	27
new_expression                 	27
object_repeat1                 	24
string                         	24
statement_block                	23
generator_function             	21
anonymous_class                	20

...

And then there's a second flag that allows you to view each state's parse item set, and the symbol sequence that would lead to that state:

$ tree-sitter generate --report-states-for-rule for_in_statement

which produces a bunch of output like this, showing the item sets with their lookahead context:

state index: 525
state id: 3668
symbol sequence: for ( let array in _expression
items:
call_expression → _expression • (12) template_string	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
call_expression → _expression • (12) arguments	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
member_expression → _expression • (14) . identifier@property_identifier	[*, ,, (, ), in, [, <, >, /, ., =, +=, -=, *=, /=, %=, ^=, &=, |=, >>=, >>>=, <<=, **=, ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
subscript_expression → _expression • (14 Right) [ _expressions ]	[*, ,, (, ), in, [, <, >, /, ., =, +=, -=, *=, /=, %=, ^=, &=, |=, >>=, >>>=, <<=, **=, ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
ternary_expression → _expression • (1 Right) ? _expression : _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) || _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) ^ _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (2 Left) | _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (3 Left) && _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (3 Left) & _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) in _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) < _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) > _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) <= _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) == _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) === _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) != _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) !== _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) >= _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (4 Left) instanceof _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (5 Left) + _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (5 Left) - _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) * _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) / _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) >> _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) >>> _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) << _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (6 Left) % _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
binary_expression → _expression • (7 Left) ** _expression	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
update_expression → _expression • (11 Left) ++	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
update_expression → _expression • (11 Left) --	[*, ,, (, ), in, [, <, >, /, ., ?, &&, ||, >>, >>>, <<, &, ^, |, +, -, %, **, <=, ==, ===, !=, !==, >=, instanceof, ++, --, `]
sequence_expression → _expression • (-1) , _expression	[)]
sequence_expression → _expression • (-1) , sequence_expression	[)]
for_in_statement → for ( let array@array_pattern in _expression • ) _statement	[export, {, import, var, let, const, if, switch, for, (, await, while, do, try, with, break, continue, debugger, return, throw, ;, yield, [, <, /, class, async, function, =>, new, +, -, !, ~, typeof, void, delete, ++, --, ", ', `, number, identifier, this, super, true, false, null, undefined, @, get, set, program]

This output eventually helped track down the bug fixed in #354.

Using this command, I can see that in many languages, a lot of states currently cannot be merged because of sub-optimal structures in the grammars.

maxbrunsfeld · 2019-08-30T03:29:35Z

Ok, I think this is in good shape. I'm not 100% ready to commit to this ABI though, so I've added a new flag to the generate command called --next-abi. By default, Tree-sitter will continue to generate the parsers with the same ABI that it did before. It will only generate the new structure if you run tree-sitter generate --next-abi.

By merging this, I can stop having to rebase this PR against all of the changes on master, and I can experiment with the new ABI more easily. Prior to enabling the new ABI by default, I may still make some structural changes to optimize the parser file size further.

maxbrunsfeld force-pushed the small-parse-states branch 2 times, most recently from c9349e5 to 72ca437 Compare May 22, 2019 17:42

maxbrunsfeld force-pushed the small-parse-states branch 3 times, most recently from 4ea6984 to 7e18c22 Compare June 6, 2019 19:28

maxbrunsfeld mentioned this pull request Jun 6, 2019

Fix limitations of the parse state merging algorithm to produce tables with fewer states #354

Merged

maxbrunsfeld force-pushed the small-parse-states branch 2 times, most recently from ff8a64d to a1bfdcf Compare June 7, 2019 00:14

maxbrunsfeld force-pushed the small-parse-states branch 2 times, most recently from efd523b to 6a08b4f Compare June 16, 2019 18:18

maxbrunsfeld mentioned this pull request Jun 17, 2019

build-wasm: generated .wasm file excessively large and not working tree-sitter/tree-sitter-ocaml#30

Closed

maxbrunsfeld force-pushed the small-parse-states branch 2 times, most recently from a92572e to d4ca2c8 Compare June 20, 2019 23:00

maxbrunsfeld mentioned this pull request Aug 3, 2019

Optimize file size for WASM language builds #410

Closed

maxbrunsfeld force-pushed the small-parse-states branch from 3e2349f to 1bdc661 Compare August 28, 2019 22:54

maxbrunsfeld added 6 commits August 29, 2019 15:28

Reorder parse states by descending symbol count

759c1d6

Move external token state id computation out of render module

48a883c

Store parse states with few lookahead symbols in a more compact way

09a2755

Appease MSVC by avoiding empty arrays

82ff542

Add --report-states flag for reporting state counts for each rule

aeb2f89

Only generate the new parse table format if --next-abi flag is used

8037607

maxbrunsfeld force-pushed the small-parse-states branch from d94009c to 8037607 Compare August 30, 2019 00:37

maxbrunsfeld merged commit 94ca4dc into master Aug 30, 2019

maxbrunsfeld deleted the small-parse-states branch August 30, 2019 03:30

This was referenced Oct 21, 2019

Allow non-terminal extras #469

Merged

Update tree-sitter to 0.15.13 atom/atom#20061

Merged

maxbrunsfeld mentioned this pull request Oct 31, 2019

Don't assume that null characters mean EOF #475

Merged

maxbrunsfeld mentioned this pull request Dec 2, 2019

Simplify the grammar with the goal of shrinking the parser tree-sitter/tree-sitter-c-sharp#61

Merged

maxbrunsfeld mentioned this pull request Dec 6, 2019

Store a mapping to ensure no two symbols map to the same metadata #500

Merged

ahlinc mentioned this pull request Jan 7, 2023

Investigate memory usage of parser generator #1890

Open
