libexpr: Reduce the size of Value down to 16 bytes (on 64 bit systems) #13407
Conversation
Cool stuff. I have only gone over it briefly, but I am interested in a review from @NaN-git.
Also fixed
FYI, some prior discussion about slimming down
@xokdvium Does this PR slow down the garbage collection due to the tagged/misaligned pointers?
Considering that the GC has to do much less work (significantly less memory to scan), it works out that the fraction of time spent in GC is even reduced (~5% difference). For reference, here's how Boehm does the marking with specified offsets: https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/new_hblk.c#L203-L206, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L246-L249, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L272-L274
I'm actually seeing a modest speedup across the board, both when exercising the GC and with large initial heap sizes. I also wouldn't really worry about the compute overhead. All of the pointer tagging and untagging operations are the cheapest possible instructions, each taking a single clock cycle, and given the huge out-of-order windows and plentiful ALUs of modern superscalar processors this works out to be negligible, at least compared to the wins from doing fewer memory allocations (which involve syscalls). So the generated code is basically just several bitshifts and branches.

One thing I'm really inclined to do is just get rid of the assertions, which would really simplify the code generation for all of these functions. That way we'd get even more performance wins on top of the reduced memory usage. In my measurements the runtime with assertions is slightly better, but without them the improvement (compared to current master) is in the ballpark of 3.3%:
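The tagging and untagging mentioned above can be sketched with plain bit operations. This is a hypothetical illustration, not the PR's actual code; names like `tagPointer` are invented. It relies only on the fact that an 8-byte-aligned pointer always has its low 3 bits clear:

```cpp
#include <cassert>
#include <cstdint>

// Low 3 bits of an 8-byte-aligned pointer are always zero, so they
// can carry a small tag "for free".
constexpr std::uintptr_t tagMask = 0b111;

inline std::uintptr_t tagPointer(void * p, unsigned tag)
{
    auto bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & tagMask) == 0); // pointer must be at least 8-byte aligned
    assert(tag <= tagMask);        // tag must fit in 3 bits
    return bits | tag;             // a single OR instruction
}

inline void * untagPointer(std::uintptr_t tagged)
{
    return reinterpret_cast<void *>(tagged & ~tagMask); // a single AND
}

inline unsigned getTag(std::uintptr_t tagged)
{
    return static_cast<unsigned>(tagged & tagMask); // a single AND
}
```

Each accessor is one ALU instruction, which is why the overhead disappears into the out-of-order pipeline.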
(force-pushed from 6b99241 to 6315294)
Fantastic!
Just minor comments. This is close to done.
// These should be removed eventually, by putting the functionality that's
// needed by callers into methods of this type

// type() == nThunk
Didn't see that it was about the group at first. Doesn't seem like a particularly valuable comment.
// type() == nThunk
Pre-existing. Can be done separately.
The following commits will touch this file significantly, so it's better to get the formatting out of the way first.
…alueStorage`

This factors out most of the value representation into a mixin class. `finishValue` is now gone for good and replaced with a simple template function `setStorage`, which derives the type information/discriminator from the type of the argument. Likewise, reading of the value goes through the function template `getStorage`. An empty type `Null` is introduced to make the bijection InternalType <-> C++ type complete.
This also makes it possible to make the `payload` field private in the `ValueStorage` class template.
This shaves off a very significant amount of memory used for evaluation as well as reduces the GC-managed heap. Previously the union discriminator (InternalType) was stored as a separate field in the Value, which takes up a whole 8 bytes due to the padding needed for member alignment. This effectively wasted 7 whole bytes of memory. Instead, InternalType is now packed into pointer alignment niches. As it turns out, there are more than enough unused bits there for the bit packing to be effective. See the doxygen comment in the ValueStorage specialization for more details. This does not add any performance overhead, even though we now consistently assert the InternalType in all getters. This can also be made atomic with a double-width compare-and-swap instruction on x86_64 (the CMPXCHG16B instruction) for parallel evaluation.
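As a sketch of the double-width compare-and-swap idea mentioned above (illustrative only; `PackedValue` and `tryPublish` are invented names, not the PR's API): a 16-byte value can be updated atomically via `std::atomic::compare_exchange_strong`, which compilers can lower to `CMPXCHG16B` on x86_64 (typically requiring `-mcx16` and possibly `-latomic`, depending on the toolchain):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two machine words: discriminator bits packed with the payload.
// Trivially copyable, so std::atomic<PackedValue> is well-formed.
struct PackedValue
{
    std::uintptr_t word0; // discriminator bits + first payload word
    std::uintptr_t word1; // second payload word
};

// Double-width CAS: succeeds only if nobody else updated the slot first,
// which is exactly the primitive needed for racing thunk updates.
inline bool tryPublish(std::atomic<PackedValue> & slot, PackedValue expected, PackedValue desired)
{
    return slot.compare_exchange_strong(expected, desired);
}
```

Whether this is actually lock-free depends on the target and compiler flags; `slot.is_lock_free()` can be checked at runtime.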
It would have been nice not to use the phrase `dword`: Microsoft denotes 32-bit integers as `dword`.
   for storing the type information. This way tagged pointers are considered
   to be valid, even when they are not aligned. */
if constexpr (detail::useBitPackedValueStorage<sizeof(void *)>)
    for (std::size_t i = 1; i < sizeof(std::uintptr_t); ++i)
I wouldn't introduce std::uintptr_t here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.
here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.
Yeah, that's true. Though I can't think of a platform where that is actually the case.
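One cheap way to guard against such a platform would be a `static_assert`, so the byte loop fails at compile time instead of silently iterating over the wrong number of bytes (a sketch, not part of the PR):

```cpp
#include <cstdint>

// std::uintptr_t is only guaranteed to be *at least* as wide as a pointer;
// fail the build outright on any platform where the sizes diverge.
static_assert(sizeof(std::uintptr_t) >= sizeof(void *),
              "uintptr_t must be able to hold a pointer");
static_assert(sizeof(std::uintptr_t) == sizeof(void *),
              "the tag-byte loop assumes uintptr_t and void* have the same size");
```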
 * discriminator is stored in the first double word.
 *
 * The layout of discriminator bits is determined by the 3 bits of PrimaryDiscriminator,
 * which are always stored in the lower 3 bits of the first dword of the payload.
Using `dword` is ambiguous. My expectation is that the word size on a 64-bit machine is 64 bits, although on x86_64 a 64-bit arithmetic instruction might not have the most efficient encoding. Historically, Microsoft uses `dword` for 32-bit values.
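To make the layout discussion concrete, here is a minimal sketch of packing a 3-bit primary discriminator into the low bits of the first payload word. The identifiers and enumerators are illustrative, not the PR's actual ones:

```cpp
#include <cassert>
#include <cstdint>

// Up to 8 primary variants fit in the 3 alignment-niche bits.
enum class PrimaryDiscriminator : std::uint8_t {
    Thunk = 0,
    Pointer = 1,
    Integer = 2,
    // ...
};

constexpr std::uint64_t discriminatorMask = 0b111;

// The discriminator lives in the lowest 3 bits of the first 64-bit word;
// the remaining bits are payload (e.g. an 8-byte-aligned pointer).
constexpr PrimaryDiscriminator getPrimary(std::uint64_t firstWord)
{
    return static_cast<PrimaryDiscriminator>(firstWord & discriminatorMask);
}

constexpr std::uint64_t packPrimary(std::uint64_t payloadBits, PrimaryDiscriminator d)
{
    // payloadBits must have its low 3 bits clear.
    return payloadBits | static_cast<std::uint64_t>(d);
}
```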
        return nFloat;
    case tThunk:
    case tApp:
        return nThunk;
Please introduce formatting changes in another PR.
/* Whether to use a specialization of ValueStorage that does bitpacking into
   alignment niches. */
template<std::size_t ptrSize>
inline constexpr bool useBitPackedValueStorage = (ptrSize == 8) && (__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8);
Probably the GC isn't affected by __STDCPP_DEFAULT_NEW_ALIGNMENT__, i.e. this makes sense when the default allocator is used.
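For context on why the default allocator matters here: since C++17, `operator new` guarantees alignment of at least `__STDCPP_DEFAULT_NEW_ALIGNMENT__`, so pointers it returns have free low bits on typical 64-bit platforms. A small sketch of that assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// If the default new-alignment is at least 8, any pointer from the default
// allocator has its low 3 bits clear and they are usable for tags.
static_assert(__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8,
              "need at least 3 free low bits in default-allocated pointers");

inline bool lowBitsFree(void * p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0b111) == 0;
}
```

A GC allocator makes its own alignment promises, which is the reviewer's point: this macro only speaks for the default `operator new`.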
> Probably the GC isn't affected by __STDCPP_DEFAULT_NEW_ALIGNMENT__, i.e. this makes sense when the default allocator is used.
I've added that because Value also stores pointers that are not owned by Boehm (like SourceAccessor), and those are most likely allocated with the default allocator.
argh, I did a review, but I forgot to publish it.
Motivation
This shaves off a very significant amount of memory used for evaluation (a ~20% reduction in maximum heap size and ~17% in total bytes) as well as reduces the GC-managed heap. See the results below. The fraction of collected memory is not affected.
Memory usage comparison

Below are the comparisons for:

NIX_SHOW_STATS=1 nix-env --query --available --out-path --file ../nixpkgs --eval-system x86_64-linux > /dev/null

Before

After
So the savings are around 4.9 GiB on a full nixpkgs eval.

Previously the union discriminator (InternalType) was stored as a separate field in the Value, which takes up a whole 8 bytes due to the padding needed for member alignment. This effectively wasted 7 whole bytes of memory. Instead, InternalType is packed into pointer alignment niches. As it turns out, there are more than enough unused bits there for the bit packing to be effective. See the doxygen comment in the ValueStorage specialization for more details.

This does not add any performance overhead, even though we now consistently assert the InternalType in all getters. This can also be made atomic with a double-width compare-and-swap instruction on x86_64 (the CMPXCHG16B instruction) for parallel evaluation.

Context
This is the best we can do with the current data model to address #8621.
The improvement is very significant and should give more leeway to the ever growing nixpkgs.
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.