Rework AST memory layout for better memory usage and performance #7920
Conversation
Also known as "Struct-Of-Arrays" or "SOA". The purpose of this data structure is to provide a similar API to ArrayList, but instead of storing an array of structs, the fields of the struct are stored in N different arrays, all with the same length and capacity. Having this abstraction means we can put them in the same allocation, avoiding per-array allocator overhead. It also saves a tiny bit of memory by eliminating redundant capacity and length fields, since all the field arrays share a single length and capacity. This is an alternate implementation to #7854.
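For reference, a minimal usage sketch of the API described above. This is only an illustration: `Elem` and `example` are made-up names, and the code follows the Zig/std conventions of this branch (allocators passed as `*Allocator`), so details may have shifted since.

```zig
const std = @import("std");

// A made-up element type. Stored in a MultiArrayList, the 1-byte tags and the
// 8-byte payloads live in two separate, densely packed arrays that share one
// allocation and one len/capacity, instead of one array of padded 16-byte structs.
const Elem = struct {
    tag: u8,
    data: u64,
};

// Minimal sketch of the ArrayList-like API.
fn example(gpa: *std.mem.Allocator) !void {
    var list = std.MultiArrayList(Elem){};
    defer list.deinit(gpa);

    try list.append(gpa, .{ .tag = 1, .data = 42 });
    try list.append(gpa, .{ .tag = 2, .data = 43 });

    // Each field is its own contiguous array; scanning just the tags touches
    // one byte per element.
    for (list.slice().items(.tag)) |tag| {
        std.debug.print("tag={}\n", .{tag});
    }
}
```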
The general purpose allocator now uses the log scope "gpa" instead of "std". Additionally, there is a new config option `verbose_log` which enables info log messages for every allocation. This can be useful when debugging. The option is off by default.
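A sketch of opting in to the option, assuming it is set through GeneralPurposeAllocator's comptime config struct as usual (Zig of this era, where the allocator interface is reached via `&gpa.allocator`):

```zig
const std = @import("std");

// Sketch: verbose_log is off by default; enabling it makes every allocation
// emit an info-level log message under the "gpa" log scope.
var gpa = std.heap.GeneralPurposeAllocator(.{ .verbose_log = true }){};

pub fn main() !void {
    defer _ = gpa.deinit();
    const allocator = &gpa.allocator;
    const buf = try allocator.alloc(u8, 16);
    defer allocator.free(buf);
}
```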
This is a proof-of-concept of switching to a new memory layout for tokens and AST nodes. The goal is threefold:

* smaller memory footprint
* faster performance for tokenization and parsing
* most importantly, a proof-of-concept that can also be applied to ZIR and TZIR to improve the entire compiler pipeline in this way

I had a few key insights here:

* Underlying premise: using less memory will make things faster, because of fewer allocations and better cache utilization. Also, using less memory is valuable in and of itself.
* Using a struct-of-arrays for tokens and AST nodes saves the bytes of padding between the enum tag (which kind of token is it; which kind of AST node is it) and the next fields in the struct. It also improves cache coherence, since one can peek ahead in the tokens array without having to load the source locations of tokens.
* Token memory can be conserved by storing only the tag (1 byte) and byte offset (4 bytes), for a total of 5 bytes per token. It is not necessary to store the token's ending byte offset because one can always re-tokenize later. Besides, for most tokens the length can be trivially determined from the tag alone, and for the ones where it can't, string literals for example, one must parse the string literal again later anyway in astgen, making it free to re-tokenize.
* AST nodes do not actually need to store more than 1 token index, because one can poke left and right in the tokens array very cheaply.

So far we are left with one big problem though: how can we put AST nodes into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for the extra data for the less common AST nodes! But it gets even better than that. I defined this data that is always present for every AST node:

* tag (1 byte) - which AST node is it
* main_token (4 bytes, index into tokens array) - the tag determines which token this points to
* `struct{lhs: u32, rhs: u32}` - enough to store 2 indexes to other AST nodes; the tag determines how to interpret this data

You can see how a binary operation, such as `a * b`, would fit into this structure perfectly. A unary operation, such as `*a`, would also fit, leaving `rhs` unused. So this is a total of 13 bytes per AST node. And again, we don't have to pay for the padding to round up to 16, because we store in struct-of-arrays format.

I made a further observation: the only kind of data AST nodes need to store other than the main_token is indexes to sub-expressions. That's it. The only purpose of an AST is to bring a tree structure to a list of tokens. This observation means all the data that nodes store are only sets of u32 indexes to other nodes. The other tokens can be found later by the compiler, by poking around in the tokens array, which again is super fast because it is struct-of-arrays, so you often only need to look at the token tags array, which is an array of bytes, very cache friendly.

So for nearly every kind of AST node, you can store it in 13 bytes. For the rarer AST nodes that have 3 or more indexes to other nodes to store, either the lhs or the rhs will be repurposed to be an index into an extra_data array which contains the extra AST node indexes. In other words, no hash table needed; it's just 1 big ArrayList with the extra data for AST nodes.
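To make the layout concrete, here is a rough sketch in Zig of the always-present per-node data described above. The names (`Node`, `Tag`, `Data`, and the example tags) are illustrative, not necessarily what this branch uses:

```zig
// Rough sketch of the fixed-size per-node data. In the tree these three
// fields each live in their own array (struct-of-arrays), so the 13 bytes
// per node are not padded up to 16.
pub const Node = struct {
    tag: Tag,        // 1 byte: which kind of AST node this is
    main_token: u32, // 4 bytes: index into the token arrays
    data: Data,      // 8 bytes: two u32s whose meaning depends on tag

    pub const Data = struct {
        lhs: u32,
        rhs: u32,
    };

    pub const Tag = enum(u8) {
        mul,      // binary:  lhs * rhs
        negation, // unary:   lhs only, rhs unused
        call_one, // call with exactly one argument: lhs = callee, rhs = arg
        call,     // general call: lhs = callee, rhs = index into extra_data
    };
};
```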
Final observation: there is no need to have a canonical tag for a given AST construct. For example, the expression `foo(bar)` is a function call, and function calls can have any number of parameters. However, in this example we can encode the function call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs for the function expr and rhs for the only parameter expr. Meanwhile, if the code was `foo(bar, baz)` then the AST node would have to be `FunctionCall`, with lhs still being the function expr, but rhs being the index into `extra_data`. Then, because the tag is `FunctionCall`, it means `extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end". Now the range `extra_data[start..end]` describes the list of parameters to the function (a rough sketch of reading these two encodings back out is included after the checklist below). Point being, you only have to pay for the extra bytes if the AST actually requires it. There's no limit to the number of different AST tag encodings.

Preliminary results (parsing only):

* 15% improvement on cache-misses
* 28% improvement on total instructions executed
* 26% improvement on total CPU cycles
* 22% improvement on wall clock time

This is 1 of 4 items on the checklist before this can actually be merged:

* [x] parser
* [ ] render (zig fmt)
* [ ] astgen
* [ ] translate-c
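As promised above, here is a hedged sketch of how a consumer could read the argument list back out under the two call encodings. The `Tree` type, field names, and tag names are hypothetical stand-ins, not code from this branch:

```zig
// Hypothetical names for illustration only.
const Tag = enum(u8) { call_one, call };
const Data = struct { lhs: u32, rhs: u32 };

const Tree = struct {
    node_tags: []const Tag,
    node_data: []const Data,
    extra_data: []const u32,

    // Returns the argument node indexes of a call node, for either encoding.
    fn callArgs(tree: Tree, node: u32, buf: *[1]u32) []const u32 {
        const data = tree.node_data[node];
        switch (tree.node_tags[node]) {
            // `foo(bar)`: lhs = callee expr, rhs = the single argument node.
            .call_one => {
                buf[0] = data.rhs;
                return buf[0..1];
            },
            // `foo(bar, baz)`: lhs = callee expr, rhs indexes a start/end pair
            // in extra_data; extra_data[start..end] is the argument list.
            .call => {
                const start = tree.extra_data[data.rhs];
                const end = tree.extra_data[data.rhs + 1];
                return tree.extra_data[start..end];
            },
        }
    }
};
```

Only calls with more than one argument pay for the extra_data entries, which is the "pay only for what you use" property described above.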
Could you make the Token and Node Tags snake_case now while you're breaking the API anyways?
Yep
I like the MultiArrayList implementation with a single allocation. It seems the main disadvantage compared to a version with multiple allocations is that shrinking requires copying memory around and the toOwnedSlice() API is a little less useful because of that.
lib/std/multi_array_list.zig
```zig
self.len = new_len;
// TODO memset the invalidated items to undefined
```
Suggested change:

```diff
-self.len = new_len;
-// TODO memset the invalidated items to undefined
+inline for (fields) |field_info, i| {
+    const field = @intToEnum(Field, i);
+    mem.set(field_info.field_type, self.slice().items(field)[new_len..], undefined);
+}
+self.len = new_len;
```
lib/std/multi_array_list.zig
```zig
pub const Field = meta.FieldEnum(S);

pub const Slice = struct {
    /// The index corresponds to sizes.bytes, not in field order.
```
This comment seems to be incorrect, the `ptrs` array is indeed in field order.
lib/std/multi_array_list.zig
```zig
const fields = meta.fields(S);
/// `sizes.bytes` is an array of @sizeOf each S field. Sorted by alignment, descending.
/// `sizes.indexes` is an array mapping from field to its index in the `sizes.bytes` array.
```
`sizes.indexes` is never used, we should probably get rid of it.
Heads up to @alexnask I think this is going to be a major breaking change to ZLS that may require quite some effort to update. Sorry about that. The good news, however, is that it should in theory improve perf & memory usage of ZLS.
only std.zig.render cares about these, and it can find them in the original source easily enough.
One thing you might want to consider is writing a quick benchmark of a simple traversal over the whole AST. I would expect this change to make producing the AST faster, but it might make traversing slower due to potentially taking misses on 3 cache lines when visiting a node instead of one. Similarly, if you implement the two types of function calls you might add more branch mispredictions. It would also maybe help double-check the ergonomics of traversal. Although maybe you can just wait to port …

Edit: The extra cache misses may only be an issue if in practice the order you generate the tree differs substantially from the order you traverse it. This may depend on details of your parser, like if you (pre)allocate the …
* start implementation of ast.Tree.firstToken and lastToken
* clarify some ast.Node doc comments
* reimplement renderToken
Shouldn't the Allocator arguments in multi_array_list.zig simply be called "allocator" and not "gpa"?
I've been experimenting with a new convention: if the parameter is named `gpa`, it signals that a general purpose allocator is expected there (as opposed to, say, an arena).
This approach properly handles nesting unlike the approach in the previous commit.
Wow, thanks for pointing this out. That's a footgun if I've ever seen one. Wait, so this has nothing to do with SIMD vectors; instead it has to do with using …
Reverts bf64220 and uses a different workaround, suggested by @LemonBoy. There is either a compiler bug or a design flaw somewhere around here. It does not have to block this branch, but I need to understand exactly what's going on here and make it so that nobody ever has to run into this problem again.
Long story short, the optimizations are kicking in and are vectorizing (exceptionally well, I must add) the … Now, the problem with LLVM's vectors is that they are tight: there's no padding in between the elements, unlike in an array…
Ah, I see. So it is an LLVM bug in vectorization, but not with …
Yes, the …
Benchmark for running the master branch vs. the ast-memory-layout branch, with deltas (the result tables are not preserved here).
Keep in mind, …
Link to the LLVM patch; keep all your fingers crossed and hopefully it'll make it into LLVM 12.
Fixes a regression from ziglang#7920.
This behavior was changed in ziglang/zig#7920
Also there are a couple of common forms of AST that I did not implement yet: `extern fn foo() void;` should be a FnProto, not FnDecl.