Rework AST memory layout for better memory usage and performance #7920

andrewrk · 2021-01-31T03:47:59Z

This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

smaller memory footprint
faster performance for tokenization and parsing
most importantly, a proof-of-concept that can be also applied to ZIR
and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

Underlying premise: using less memory will make things faster, because
of fewer allocations and better cache utilization. Also using less
memory is valuable in and of itself.
Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of
padding between the enum tag (which kind of token is it; which kind
of AST node is it) and the next fields in the struct. It also improves
cache coherence, since one can peek ahead in the tokens array without
having to load the source locations of tokens.
Token memory can be conserved by only having the tag (1 byte) and byte
offset (4 bytes) for a total of 5 bytes per token. It is not necessary
to store the token ending byte offset because one can always re-tokenize
later, but also most tokens the length can be trivially determined from
the tag alone, and for ones where it doesn't, string literals for
example, one must parse the string literal again later anyway in
astgen, making it free to re-tokenize.
AST nodes do not actually need to store more than 1 token index because
one can poke left and right in the tokens array very cheaply.

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

tag (1 byte)
- which AST node is it
main_token (4 bytes, index into tokens array)
- the tag determines which token this points to
struct{lhs: u32, rhs: u32}
- enough to store 2 indexes to other AST nodes, the tag determines
  how to interpret this data

You can see how a binary operation, such as a * b would fit into this
structure perfectly. A unary operation, such as *a would also fit,
and leave rhs unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.

I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means all the data that nodes store are only sets of u32
indexes to other nodes. The other tokens can be found later by the compiler,
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays, so you often only need to look at the token tags array,
which is an array of bytes, very cache friendly.

So for nearly every kind of AST node, you can store it in 13 bytes. For the
rarer AST nodes that have 3 or more indexes to other nodes to store, either
the lhs or the rhs will be repurposed to be an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table needed,
it's just 1 big ArrayList with the extra data for AST Nodes.

Final observation, no need to have a canonical tag for a given AST. For example:
The expression foo(bar) is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called FunctionCallOnlyOneParam, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was foo(bar, baz) then the AST node would have to be FunctionCall
with lhs still being the function expr, but rhs being the index into
extra_data. Then because the tag is FunctionCall it means
extra_data[rhs] is the "start" and extra_data[rhs+1] is the "end".
Now the range extra_data[start..end] describes the list of parameters
to the function.

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.

Results of parsing only:

✔️ 15% fewer cache-misses
✔️ 28% fewer total instructions executed
✔️ 26% fewer total CPU cycles
✔️ 22% faster wall clock time

Checklist before this can actually be merged:

parser
render (zig fmt)
astgen
translate-c

Also there are a couple common forms of AST that I did not implement yet:

FnCall with 1 or 0 args
Builtin call with 0, 1, or 2 args
block with 0, 1, or 2 statements
container decl with 0, 1, or 2 fields/decls
extern fn foo() void; should be a FnProto, not FnDecl

Also known as "Struct-Of-Arrays" or "SOA". The purpose of this data structure is to provide a similar API to ArrayList but instead of the element type being a struct, the fields of the struct are in N different arrays, all with the same length and capacity. Having this abstraction means we can put them in the same allocation, avoiding overhead with the allocator. It also saves a tiny bit of overhead from the redundant capacity and length fields, since each struct element shares the same value. This is an alternate implementation to #7854.

It now uses the log scope "gpa" instead of "std". Additionally, there is a new config option `verbose_log` which enables info log messages for every allocation. Can be useful when debugging. This option is off by default.

This is a proof-of-concept of switching to a new memory layout for tokens and AST nodes. The goal is threefold: * smaller memory footprint * faster performance for tokenization and parsing * most importantly, a proof-of-concept that can be also applied to ZIR and TZIR to improve the entire compiler pipeline in this way. I had a few key insights here: * Underlying premise: using less memory will make things faster, because of fewer allocations and better cache utilization. Also using less memory is valuable in and of itself. * Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of padding between the enum tag (which kind of token is it; which kind of AST node is it) and the next fields in the struct. It also improves cache coherence, since one can peek ahead in the tokens array without having to load the source locations of tokens. * Token memory can be conserved by only having the tag (1 byte) and byte offset (4 bytes) for a total of 5 bytes per token. It is not necessary to store the token ending byte offset because one can always re-tokenize later, but also most tokens the length can be trivially determined from the tag alone, and for ones where it doesn't, string literals for example, one must parse the string literal again later anyway in astgen, making it free to re-tokenize. * AST nodes do not actually need to store more than 1 token index because one can poke left and right in the tokens array very cheaply. So far we are left with one big problem though: how can we put AST nodes into an array, since different AST nodes are different sizes? This is where my key observation comes in: one can have a hash table for the extra data for the less common AST nodes! But it gets even better than that: I defined this data that is always present for every AST Node: * tag (1 byte) - which AST node is it * main_token (4 bytes, index into tokens array) - the tag determines which token this points to * struct{lhs: u32, rhs: u32} - enough to store 2 indexes to other AST nodes, the tag determines how to interpret this data You can see how a binary operation, such as `a * b` would fit into this structure perfectly. A unary operation, such as `*a` would also fit, and leave `rhs` unused. So this is a total of 13 bytes per AST node. And again, we don't have to pay for the padding to round up to 16 because we store in struct-of-arrays format. I made a further observation: the only kind of data AST nodes need to store other than the main_token is indexes to sub-expressions. That's it. The only purpose of an AST is to bring a tree structure to a list of tokens. This observation means all the data that nodes store are only sets of u32 indexes to other nodes. The other tokens can be found later by the compiler, by poking around in the tokens array, which again is super fast because it is struct-of-arrays, so you often only need to look at the token tags array, which is an array of bytes, very cache friendly. So for nearly every kind of AST node, you can store it in 13 bytes. For the rarer AST nodes that have 3 or more indexes to other nodes to store, either the lhs or the rhs will be repurposed to be an index into an extra_data array which contains the extra AST node indexes. In other words, no hash table needed, it's just 1 big ArrayList with the extra data for AST Nodes. Final observation, no need to have a canonical tag for a given AST. For example: The expression `foo(bar)` is a function call. Function calls can have any number of parameters. However in this example, we can encode the function call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs for the function expr and rhs for the only parameter expr. Meanwhile if the code was `foo(bar, baz)` then the AST node would have to be `FunctionCall` with lhs still being the function expr, but rhs being the index into `extra_data`. Then because the tag is `FunctionCall` it means `extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end". Now the range `extra_data[start..end]` describes the list of parameters to the function. Point being, you only have to pay for the extra bytes if the AST actually requires it. There's no limit to the number of different AST tag encodings. Preliminary results: * 15% improvement on cache-misses * 28% improvement on total instructions executed * 26% improvement on total CPU cycles * 22% improvement on wall clock time This is 1/4 items on the checklist before this can actually be merged: * [x] parser * [ ] render (zig fmt) * [ ] astgen * [ ] translate-c

Vexu · 2021-01-31T09:28:03Z

Could you make the Token and Node Tags snake_case now while you're breaking the API anyways?

andrewrk · 2021-01-31T09:38:05Z

Yep

ifreund

I like the MultiArrayList implementation with a single allocation. It seems the main disadvantage compared to a version with multiple allocations is that shrinking requires copying memory around and the toOwnedSlice() API is a little less useful because of that.

ifreund · 2021-01-31T10:38:58Z

lib/std/multi_array_list.zig

+                self.len = new_len;
+                // TODO memset the invalidated items to undefined


Suggested change

self.len = new_len;

// TODO memset the invalidated items to undefined

inline for (fields) |field_info, i| {

const field = @intToEnum(Field, i);

mem.set(field_info.field_type, self.slice().items(field)[new_len..], undefined);

}

self.len = new_len;

ifreund · 2021-01-31T11:17:38Z

lib/std/multi_array_list.zig

+        pub const Field = meta.FieldEnum(S);
+
+        pub const Slice = struct {
+            /// The index corresponds to sizes.bytes, not in field order.


This comment seems to be incorrect, the ptrs array is indeed in field order.

ifreund · 2021-01-31T11:45:42Z

lib/std/multi_array_list.zig

+
+        const fields = meta.fields(S);
+        /// `sizes.bytes` is an array of @sizeOf each S field. Sorted by alignment, descending.
+        /// `sizes.indexes` is an array mapping from field to its index in the `sizes.bytes` array.


sizes.indexes is never used, we should probably get rid of it.

andrewrk · 2021-02-01T04:38:19Z

Heads up to @alexnask I think this is going to be a major breaking change to ZLS that may require quite some effort to update. Sorry about that. The good news, however, is that it should in theory improve perf & memory usage of ZLS.

only std.zig.render cares about these, and it can find them in the original source easily enough.

trishume · 2021-02-01T23:14:23Z

One thing you might want to consider is writing a quick benchmark of a simple traversal over the whole AST. I would expect this change to make producing the AST faster but it might make traversing slower due to potentially taking misses on 3 cache lines when visiting a node instead of one. Similarly if you implement the two types of function calls you might add more branch mispredictions.

It would also maybe help double-check the ergonomics of traversal. Although maybe you can just wait to port zig fmt to do all of this since presumably it's a tree traversal.

Edit: The extra cache misses may only be an issue if in practice the order you generate the tree differs substantially from the order you traverse it. This may depend on details of your parser, like if you (pre)allocate the + node before parsing either argument you might end up with the same depth-first order you traverse in and thus minimize cache misses, but if you allocate it after or in-between the two children you won't (although it may still be quite similar and thus cache-efficient).

* start implementation of ast.Tree.firstToken and lastToken * clarify some ast.Node doc comments * reimplement renderToken

Avokadoen · 2021-02-02T19:33:27Z

Shouldn't the Allocator arguments in multi_array_list.zig simply be called "allocator" and not "gpa"?

andrewrk · 2021-02-02T19:54:46Z

Shouldn't the Allocator arguments in multi_array_list.zig simply be called "allocator" and not "gpa"?

I've been experimenting with a new convention: if the parameter is named gpa it indicates the allocator will be used in a way that would be better suited for a general purpose allocator - e.g. one that reclaims memory on free(). It's not a requirement, but it's a hint to callsites that, if they have the choice, a gpa would be preferable to an arena.

…/for

This approach properly handles nesting unlike the approach in the previous commit.

andrewrk · 2021-02-24T19:41:35Z

Nope.
Wanna see a magic trick? Apply this patch:

Wow, thanks for pointing this out. That's a footgun if I've ever seen one. Wait so this has nothing to do with SIMD vectors, instead it has to do with using std.mem.copy on a u7? I do not yet understand why this patch would change anything. The ABI size of u7 is 1 byte, so writing a u7 should clear the extra bit.

@LemonBoy

Reverts bf64220 and uses a different workaround, suggested by @LemonBoy. There is either a compiler bug or a design flaw somewhere around here. It does not have to block this branch, but I need to understand exactly what's going on here and make it so that nobody ever has to run into this problem again.

LemonBoy · 2021-02-24T20:00:07Z

Long story short, the optimizations are kicking in and are vectorizing (exceptionally well, I must add) the mem.copy loop by turning the single i7 loads into <4 x i7> ones (and coalescing more than a single read/write when possible, the aarch64 asm accelerates the copy by reading/writing 8 bytes at once).

Now, the problem with LLVM's vector is that they are tight, there's no padding in between the elements, unlike in a array...
You probably understand the problem now, the generated aarch64 code is slicing and dicing the bits in the wrong way.

andrewrk · 2021-02-24T20:04:07Z

the generated aarch64 code is slicing and dicing the bits in the wrong way.

Ah, I see. So it is an LLVM bug in vectorization, but not with @reduce. Is that your conclusion? Or have we made some horrific language design flaw?

LemonBoy · 2021-02-24T22:01:09Z

Ah, I see. So it is an LLVM bug in vectorization, but not with @reduce. Is that your conclusion?

Yes, the @reduce thing was a red herring, all it did was shift the code around to hide the problem.
It remains to be seen if the problem is limited to the loop vectorizer or is extended to other regular loads/stores.

andrewrk · 2021-02-25T00:16:53Z

Benchmark for running zig fmt on the whole std lib:

master branch

peak memory usage: 84.7 MiB
zig executable binary size (without -Denable-llvm): 3.37 MiB

[nix-shell:~/dev/zig/build-release]$ perf stat -e cache-misses -e instructions -e cycles ./zig fmt ~/tmp/std

 Performance counter stats for './zig fmt /home/andy/tmp/std':

         2,803,440      cache-misses                                                
     4,592,565,483      instructions              #    2.70  insn per cycle         
     1,702,969,462      cycles                                                      

       0.389872088 seconds time elapsed

       0.370892000 seconds user
       0.018994000 seconds sys

ast-memory-layout branch

peak memory usage: 68.4 MiB
zig executable binary size (without -Denable-llvm): 3.17 MiB

[nix-shell:~/Downloads/zig/build-release]$ perf stat -e cache-misses -e instructions -e cycles ./zig fmt ~/tmp/std2

 Performance counter stats for './zig fmt /home/andy/tmp/std2':

         3,771,657      cache-misses                                                
     3,941,606,705      instructions              #    2.62  insn per cycle         
     1,502,290,493      cycles                                                      

       0.345329438 seconds time elapsed

       0.332314000 seconds user
       0.013012000 seconds sys

Deltas:

✔️ 19.3% less peak memory usage
✔️ 10.8% reduction in binary size
❌ 34.5% more cache-misses
✔️ 14.2% fewer instructions
✔️ 11.8% fewer CPU cycles
✔️ 11.4% faster wall clock time

Keep in mind, zig fmt does a lot of extra work that astgen from the main pipeline does not have to do. So I think these are very promising results!

LemonBoy · 2021-02-25T14:00:06Z

Link to the LLVM patch, keep all your fingers crossed and hopefully it'll make into LLVM 12

Fixes a regression from ziglang#7920.

This behavior was changed in ziglang/zig#7920

andrewrk added 3 commits January 30, 2021 20:12

std.GeneralPurposeAllocator: logging improvements

766b315

It now uses the log scope "gpa" instead of "std". Additionally, there is a new config option `verbose_log` which enables info log messages for every allocation. Can be useful when debugging. This option is off by default.

andrewrk mentioned this pull request Jan 31, 2021

Explicit support for AoS/SoA #2634

Closed

ifreund reviewed Jan 31, 2021

View reviewed changes

andrewrk mentioned this pull request Jan 31, 2021

Make translate-c use intermediate AST #7479

Merged

stage2: tokenizer does not emit line comments anymore

bf8fafc

only std.zig.render cares about these, and it can find them in the original source easily enough.

andrewrk added 3 commits February 1, 2021 17:23

zig fmt: start reworking with new memory layout

20554d3

* start implementation of ast.Tree.firstToken and lastToken * clarify some ast.Node doc comments * reimplement renderToken

zig fmt: implement "line comment followed by top-level comptime"

272a0ab

zig fmt: implement simple test with doc comments

0c6b98b

andrewrk added the optimization label Feb 2, 2021

andrewrk and others added 6 commits February 2, 2021 21:05

zig fmt: implement if, call, field access, assignment

1a83b29

zig fmt: implement top-level fields

f5279cb

zig fmt: builtin calls and array access

725adf8

zig fmt: deref, unwrap optional

4428acf

zig fmt: implement fn protos and defers

8e46d06

zig fmt: implement struct init

7069459

ifreund mentioned this pull request Feb 5, 2021

zig fmt: help finish the rework #7958

Merged

ifreund added 3 commits February 5, 2021 10:51

std.MultiArrayList: implement review comments

0b4bb9b

zig fmt: float literal with exponent

3e960cf

zig fmt: struct and anon array initialization

6f3b93e

ifreund added 2 commits February 24, 2021 13:46

zig fmt: fix comment indent after multiline single statement if/while…

371b21b

…/for

zig fmt: rework single statement if/while/for indentation

52c45bf

This approach properly handles nesting unlike the approach in the previous commit.

andrewrk added 3 commits February 24, 2021 15:08

Merge remote-tracking branch 'origin/master' into ast-memory-layout

8e6c2b7

build.zig: expose a strip option

c79ee1f

zig fmt: function with labeled block as return type

9ada763

andrewrk mentioned this pull request Feb 25, 2021

improve C macro translation of R_CAST from Ruby #8062

Closed

andrewrk merged commit d7049fc into master Feb 25, 2021

andrewrk deleted the ast-memory-layout branch February 25, 2021 02:49

Luukdegram mentioned this pull request Feb 26, 2021

Update to Zig std's new AST format zigtools/zls#238

Merged

14 tasks

andrewrk mentioned this pull request Feb 27, 2021

Zig fmt introduces newline on for-around-if where it formerly didn't #8088

Closed

andrewrk added translate-c C to Zig source translation feature (@cImport) zig fmt labels Feb 27, 2021

andrewrk mentioned this pull request Mar 3, 2021

translate-c panic on ffmpeg headers #8129

Closed

andrewrk added a commit to g-w1/zig that referenced this pull request Mar 3, 2021

astgen: fix crash looking for wrong token in error sets

3995783

Fixes a regression from ziglang#7920.

ifreund added a commit to ifreund/zig-spec that referenced this pull request Mar 7, 2021

Container doc comments only valid at start of containers

c542258

This behavior was changed in ziglang/zig#7920

ifreund mentioned this pull request Mar 7, 2021

Container doc comments only valid at start of containers ziglang/zig-spec#27

Merged

SpexGuy pushed a commit to ziglang/zig-spec that referenced this pull request Mar 7, 2021

Container doc comments only valid at start of containers

5a7b8d8

This behavior was changed in ziglang/zig#7920

This was referenced Mar 7, 2021

zig fmt: Mixing comments and doc comments has weird results #5483

Closed

zig fmt doesn't like being disabled for inner blocks. #5548

Closed

Bug: zig fmt removes empty comments and collapses additional whitespace #7722

Closed

Luukdegram mentioned this pull request Mar 8, 2021

container 'std.zig.ast.Node' has no member called 'Use' zigtools/zls#242

Closed

andrewrk mentioned this pull request Mar 16, 2021

rework ZIR memory layout; overhaul source locations #8266

Merged

andrewrk mentioned this pull request Jul 10, 2021

stage2: rework AIR memory layout #9353

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework AST memory layout for better memory usage and performance #7920

Rework AST memory layout for better memory usage and performance #7920

andrewrk commented Jan 31, 2021 •

edited

Loading

Vexu commented Jan 31, 2021

andrewrk commented Jan 31, 2021

ifreund left a comment

ifreund Jan 31, 2021

ifreund Jan 31, 2021

ifreund Jan 31, 2021

andrewrk commented Feb 1, 2021

trishume commented Feb 1, 2021 •

edited

Loading

Avokadoen commented Feb 2, 2021

andrewrk commented Feb 2, 2021

andrewrk commented Feb 24, 2021

LemonBoy commented Feb 24, 2021

andrewrk commented Feb 24, 2021

LemonBoy commented Feb 24, 2021

andrewrk commented Feb 25, 2021

LemonBoy commented Feb 25, 2021

		self.len = new_len;
		// TODO memset the invalidated items to undefined

Rework AST memory layout for better memory usage and performance #7920

Rework AST memory layout for better memory usage and performance #7920

Conversation

andrewrk commented Jan 31, 2021 • edited Loading

Vexu commented Jan 31, 2021

andrewrk commented Jan 31, 2021

ifreund left a comment

Choose a reason for hiding this comment

ifreund Jan 31, 2021

Choose a reason for hiding this comment

ifreund Jan 31, 2021

Choose a reason for hiding this comment

ifreund Jan 31, 2021

Choose a reason for hiding this comment

andrewrk commented Feb 1, 2021

trishume commented Feb 1, 2021 • edited Loading

Avokadoen commented Feb 2, 2021

andrewrk commented Feb 2, 2021

andrewrk commented Feb 24, 2021

LemonBoy commented Feb 24, 2021

andrewrk commented Feb 24, 2021

LemonBoy commented Feb 24, 2021

andrewrk commented Feb 25, 2021

master branch

ast-memory-layout branch

Deltas:

LemonBoy commented Feb 25, 2021

andrewrk commented Jan 31, 2021 •

edited

Loading

trishume commented Feb 1, 2021 •

edited

Loading