Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework AST memory layout for better memory usage and performance #7920

Merged
merged 176 commits into from Feb 25, 2021

Conversation

andrewrk
Copy link
Member

@andrewrk andrewrk commented Jan 31, 2021

This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

  • smaller memory footprint
  • faster performance for tokenization and parsing
  • most importantly, a proof-of-concept that can be also applied to ZIR
    and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

  • Underlying premise: using less memory will make things faster, because
    of fewer allocations and better cache utilization. Also using less
    memory is valuable in and of itself.
  • Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of
    padding between the enum tag (which kind of token is it; which kind
    of AST node is it) and the next fields in the struct. It also improves
    cache coherence, since one can peek ahead in the tokens array without
    having to load the source locations of tokens.
  • Token memory can be conserved by only having the tag (1 byte) and byte
    offset (4 bytes) for a total of 5 bytes per token. It is not necessary
    to store the token ending byte offset because one can always re-tokenize
    later, but also most tokens the length can be trivially determined from
    the tag alone, and for ones where it doesn't, string literals for
    example, one must parse the string literal again later anyway in
    astgen, making it free to re-tokenize.
  • AST nodes do not actually need to store more than 1 token index because
    one can poke left and right in the tokens array very cheaply.

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

  • tag (1 byte)
    • which AST node is it
  • main_token (4 bytes, index into tokens array)
    • the tag determines which token this points to
  • struct{lhs: u32, rhs: u32}
    • enough to store 2 indexes to other AST nodes, the tag determines
      how to interpret this data

You can see how a binary operation, such as a * b would fit into this
structure perfectly. A unary operation, such as *a would also fit,
and leave rhs unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.

I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means all the data that nodes store are only sets of u32
indexes to other nodes. The other tokens can be found later by the compiler,
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays, so you often only need to look at the token tags array,
which is an array of bytes, very cache friendly.

So for nearly every kind of AST node, you can store it in 13 bytes. For the
rarer AST nodes that have 3 or more indexes to other nodes to store, either
the lhs or the rhs will be repurposed to be an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table needed,
it's just 1 big ArrayList with the extra data for AST Nodes.

Final observation, no need to have a canonical tag for a given AST. For example:
The expression foo(bar) is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called FunctionCallOnlyOneParam, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was foo(bar, baz) then the AST node would have to be FunctionCall
with lhs still being the function expr, but rhs being the index into
extra_data. Then because the tag is FunctionCall it means
extra_data[rhs] is the "start" and extra_data[rhs+1] is the "end".
Now the range extra_data[start..end] describes the list of parameters
to the function.

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.

Results of parsing only:

  • ✔️ 15% fewer cache-misses
  • ✔️ 28% fewer total instructions executed
  • ✔️ 26% fewer total CPU cycles
  • ✔️ 22% faster wall clock time

Checklist before this can actually be merged:

  • parser
  • render (zig fmt)
  • astgen
  • translate-c

Also there are a couple common forms of AST that I did not implement yet:

  • FnCall with 1 or 0 args
  • Builtin call with 0, 1, or 2 args
  • block with 0, 1, or 2 statements
  • container decl with 0, 1, or 2 fields/decls
  • extern fn foo() void; should be a FnProto, not FnDecl

Also known as "Struct-Of-Arrays" or "SOA". The purpose of this data
structure is to provide a similar API to ArrayList but instead of
the element type being a struct, the fields of the struct are in N
different arrays, all with the same length and capacity.

Having this abstraction means we can put them in the same allocation,
avoiding overhead with the allocator. It also saves a tiny bit of
overhead from the redundant capacity and length fields, since each
struct element shares the same value.

This is an alternate implementation to #7854.
It now uses the log scope "gpa" instead of "std".

Additionally, there is a new config option `verbose_log` which enables
info log messages for every allocation. Can be useful when debugging.
This option is off by default.
This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

 * smaller memory footprint
 * faster performance for tokenization and parsing
 * most importantly, a proof-of-concept that can be also applied to ZIR
   and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

 * Underlying premise: using less memory will make things faster, because
   of fewer allocations and better cache utilization. Also using less
   memory is valuable in and of itself.
 * Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of
   padding between the enum tag (which kind of token is it; which kind
   of AST node is it) and the next fields in the struct. It also improves
   cache coherence, since one can peek ahead in the tokens array without
   having to load the source locations of tokens.
 * Token memory can be conserved by only having the tag (1 byte) and byte
   offset (4 bytes) for a total of 5 bytes per token. It is not necessary
   to store the token ending byte offset because one can always re-tokenize
   later, but also most tokens the length can be trivially determined from
   the tag alone, and for ones where it doesn't, string literals for
   example, one must parse the string literal again later anyway in
   astgen, making it free to re-tokenize.
 * AST nodes do not actually need to store more than 1 token index because
   one can poke left and right in the tokens array very cheaply.

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

 * tag (1 byte)
   - which AST node is it
 * main_token (4 bytes, index into tokens array)
   - the tag determines which token this points to
 * struct{lhs: u32, rhs: u32}
   - enough to store 2 indexes to other AST nodes, the tag determines
     how to interpret this data

You can see how a binary operation, such as `a * b` would fit into this
structure perfectly. A unary operation, such as `*a` would also fit,
and leave `rhs` unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.

I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means all the data that nodes store are only sets of u32
indexes to other nodes. The other tokens can be found later by the compiler,
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays, so you often only need to look at the token tags array,
which is an array of bytes, very cache friendly.

So for nearly every kind of AST node, you can store it in 13 bytes. For the
rarer AST nodes that have 3 or more indexes to other nodes to store, either
the lhs or the rhs will be repurposed to be an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table needed,
it's just 1 big ArrayList with the extra data for AST Nodes.

Final observation, no need to have a canonical tag for a given AST. For example:
The expression `foo(bar)` is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was `foo(bar, baz)` then the AST node would have to be `FunctionCall`
with lhs still being the function expr, but rhs being the index into
`extra_data`. Then because the tag is `FunctionCall` it means
`extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end".
Now the range `extra_data[start..end]` describes the list of parameters
to the function.

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.

Preliminary results:

 * 15% improvement on cache-misses
 * 28% improvement on total instructions executed
 * 26% improvement on total CPU cycles
 * 22% improvement on wall clock time

This is 1/4 items on the checklist before this can actually be merged:

 * [x] parser
 * [ ] render (zig fmt)
 * [ ] astgen
 * [ ] translate-c
@andrewrk andrewrk added enhancement Solving this issue will likely involve adding new logic or components to the codebase. breaking Implementing this issue could cause existing code to no longer compile or have different behavior. frontend Tokenization, parsing, AstGen, Sema, and Liveness. labels Jan 31, 2021
@Vexu
Copy link
Member

Vexu commented Jan 31, 2021

Could you make the Token and Node Tags snake_case now while you're breaking the API anyways?

@andrewrk
Copy link
Member Author

Yep

Copy link
Member

@ifreund ifreund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the MultiArrayList implementation with a single allocation. It seems the main disadvantage compared to a version with multiple allocations is that shrinking requires copying memory around and the toOwnedSlice() API is a little less useful because of that.

Comment on lines 186 to 187
self.len = new_len;
// TODO memset the invalidated items to undefined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.len = new_len;
// TODO memset the invalidated items to undefined
inline for (fields) |field_info, i| {
const field = @intToEnum(Field, i);
mem.set(field_info.field_type, self.slice().items(field)[new_len..], undefined);
}
self.len = new_len;

pub const Field = meta.FieldEnum(S);

pub const Slice = struct {
/// The index corresponds to sizes.bytes, not in field order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems to be incorrect, the ptrs array is indeed in field order.


const fields = meta.fields(S);
/// `sizes.bytes` is an array of @sizeOf each S field. Sorted by alignment, descending.
/// `sizes.indexes` is an array mapping from field to its index in the `sizes.bytes` array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizes.indexes is never used, we should probably get rid of it.

@andrewrk
Copy link
Member Author

andrewrk commented Feb 1, 2021

Heads up to @alexnask I think this is going to be a major breaking change to ZLS that may require quite some effort to update. Sorry about that. The good news, however, is that it should in theory improve perf & memory usage of ZLS.

only std.zig.render cares about these, and it can find them in the
original source easily enough.
@trishume
Copy link

trishume commented Feb 1, 2021

One thing you might want to consider is writing a quick benchmark of a simple traversal over the whole AST. I would expect this change to make producing the AST faster but it might make traversing slower due to potentially taking misses on 3 cache lines when visiting a node instead of one. Similarly if you implement the two types of function calls you might add more branch mispredictions.

It would also maybe help double-check the ergonomics of traversal. Although maybe you can just wait to port zig fmt to do all of this since presumably it's a tree traversal.

Edit: The extra cache misses may only be an issue if in practice the order you generate the tree differs substantially from the order you traverse it. This may depend on details of your parser, like if you (pre)allocate the + node before parsing either argument you might end up with the same depth-first order you traverse in and thus minimize cache misses, but if you allocate it after or in-between the two children you won't (although it may still be quite similar and thus cache-efficient).

 * start implementation of ast.Tree.firstToken and lastToken
 * clarify some ast.Node doc comments
 * reimplement renderToken
@Avokadoen
Copy link

Shouldn't the Allocator arguments in multi_array_list.zig simply be called "allocator" and not "gpa"?

@andrewrk
Copy link
Member Author

andrewrk commented Feb 2, 2021

Shouldn't the Allocator arguments in multi_array_list.zig simply be called "allocator" and not "gpa"?

I've been experimenting with a new convention: if the parameter is named gpa it indicates the allocator will be used in a way that would be better suited for a general purpose allocator - e.g. one that reclaims memory on free(). It's not a requirement, but it's a hint to callsites that, if they have the choice, a gpa would be preferable to an arena.

This approach properly handles nesting unlike the approach in the
previous commit.
@andrewrk
Copy link
Member Author

Nope.
Wanna see a magic trick? Apply this patch:

Wow, thanks for pointing this out. That's a footgun if I've ever seen one. Wait so this has nothing to do with SIMD vectors, instead it has to do with using std.mem.copy on a u7? I do not yet understand why this patch would change anything. The ABI size of u7 is 1 byte, so writing a u7 should clear the extra bit.

Reverts bf64220 and uses a different
workaround, suggested by @LemonBoy.

There is either a compiler bug or a design flaw somewhere around here.
It does not have to block this branch, but I need to understand exactly
what's going on here and make it so that nobody ever has to run into
this problem again.
@LemonBoy
Copy link
Contributor

Long story short, the optimizations are kicking in and are vectorizing (exceptionally well, I must add) the mem.copy loop by turning the single i7 loads into <4 x i7> ones (and coalescing more than a single read/write when possible, the aarch64 asm accelerates the copy by reading/writing 8 bytes at once).

Now, the problem with LLVM's vector is that they are tight, there's no padding in between the elements, unlike in a array...
You probably understand the problem now, the generated aarch64 code is slicing and dicing the bits in the wrong way.

@andrewrk
Copy link
Member Author

the generated aarch64 code is slicing and dicing the bits in the wrong way.

Ah, I see. So it is an LLVM bug in vectorization, but not with @reduce. Is that your conclusion? Or have we made some horrific language design flaw?

@LemonBoy
Copy link
Contributor

Ah, I see. So it is an LLVM bug in vectorization, but not with @reduce. Is that your conclusion?

Yes, the @reduce thing was a red herring, all it did was shift the code around to hide the problem.
It remains to be seen if the problem is limited to the loop vectorizer or is extended to other regular loads/stores.

@andrewrk
Copy link
Member Author

Benchmark for running zig fmt on the whole std lib:

master branch

  • peak memory usage: 84.7 MiB
  • zig executable binary size (without -Denable-llvm): 3.37 MiB
[nix-shell:~/dev/zig/build-release]$ perf stat -e cache-misses -e instructions -e cycles ./zig fmt ~/tmp/std

 Performance counter stats for './zig fmt /home/andy/tmp/std':

         2,803,440      cache-misses                                                
     4,592,565,483      instructions              #    2.70  insn per cycle         
     1,702,969,462      cycles                                                      

       0.389872088 seconds time elapsed

       0.370892000 seconds user
       0.018994000 seconds sys

ast-memory-layout branch

  • peak memory usage: 68.4 MiB
  • zig executable binary size (without -Denable-llvm): 3.17 MiB
[nix-shell:~/Downloads/zig/build-release]$ perf stat -e cache-misses -e instructions -e cycles ./zig fmt ~/tmp/std2

 Performance counter stats for './zig fmt /home/andy/tmp/std2':

         3,771,657      cache-misses                                                
     3,941,606,705      instructions              #    2.62  insn per cycle         
     1,502,290,493      cycles                                                      

       0.345329438 seconds time elapsed

       0.332314000 seconds user
       0.013012000 seconds sys

Deltas:

  • ✔️ 19.3% less peak memory usage
  • ✔️ 10.8% reduction in binary size
  • ❌ 34.5% more cache-misses
  • ✔️ 14.2% fewer instructions
  • ✔️ 11.8% fewer CPU cycles
  • ✔️ 11.4% faster wall clock time

Keep in mind, zig fmt does a lot of extra work that astgen from the main pipeline does not have to do. So I think these are very promising results!

@andrewrk andrewrk merged commit d7049fc into master Feb 25, 2021
@andrewrk andrewrk deleted the ast-memory-layout branch February 25, 2021 02:49
@LemonBoy
Copy link
Contributor

Link to the LLVM patch, keep all your fingers crossed and hopefully it'll make into LLVM 12

@andrewrk andrewrk added translate-c C to Zig source translation feature (@cImport) zig fmt labels Feb 27, 2021
andrewrk added a commit to g-w1/zig that referenced this pull request Mar 3, 2021
ifreund added a commit to ifreund/zig-spec that referenced this pull request Mar 7, 2021
SpexGuy pushed a commit to ziglang/zig-spec that referenced this pull request Mar 7, 2021
@andrewrk andrewrk mentioned this pull request Jul 10, 2021
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Implementing this issue could cause existing code to no longer compile or have different behavior. enhancement Solving this issue will likely involve adding new logic or components to the codebase. frontend Tokenization, parsing, AstGen, Sema, and Liveness. optimization translate-c C to Zig source translation feature (@cImport) zig fmt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet