libexpr: Reduce the size of Value down to 16 bytes (on 64 bit systems) #13407
Conversation
Cool stuff. I have only gone over it briefly, but I am interested in a review from @NaN-git.
Also fixed
FYI, some prior discussion about slimming down
@xokdvium Does this PR slow down the garbage collection due to the tagged/misaligned pointers?
Considering that the GC has to do much less work (significantly less memory to scan), it works out that the fraction of time spent in GC is even reduced (~5% difference). For reference, here's how Boehm does the marking with specified offsets: https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/new_hblk.c#L203-L206, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L246-L249, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L272-L274
I'm actually seeing a modest speedup across the board, both when exercising the GC and with large initial heap sizes. I also wouldn't really worry about the compute overhead. All of the pointer tagging and untagging operations are the cheapest possible instructions, each taking a single clock cycle, and given the huge out-of-order windows and plentiful ALUs of modern superscalar processors this works out to be negligible, at least compared to the wins from doing fewer memory allocations (which involve syscalls). So the generated code is basically just several bitshifts and branches.

One thing I'm really inclined to do is just get rid of the assertions, which would really simplify the code generation for all of these functions. That way we'd get even more performance wins on top of the reduced memory usage. In my measurements the runtime with assertions is slightly better, but without them the improvement (compared to current master) is in the ballpark of 3.3%:
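The tagging and untagging mentioned above can be sketched with plain bit operations. This is a hypothetical illustration, not the PR's actual code; names like `tagPointer` are invented. It relies only on the fact that an 8-byte-aligned pointer always has its low 3 bits clear:

```cpp
#include <cassert>
#include <cstdint>

// Low 3 bits of an 8-byte-aligned pointer are always zero, so they
// can carry a small tag "for free".
constexpr std::uintptr_t tagMask = 0b111;

inline std::uintptr_t tagPointer(void * p, unsigned tag)
{
    auto bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & tagMask) == 0); // pointer must be at least 8-byte aligned
    assert(tag <= tagMask);        // tag must fit in 3 bits
    return bits | tag;             // a single OR instruction
}

inline void * untagPointer(std::uintptr_t tagged)
{
    return reinterpret_cast<void *>(tagged & ~tagMask); // a single AND
}

inline unsigned getTag(std::uintptr_t tagged)
{
    return static_cast<unsigned>(tagged & tagMask); // a single AND
}
```

Each accessor is one ALU instruction, which is why the overhead disappears into the out-of-order pipeline.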
(force-pushed from 6b99241 to 6315294)
Fantastic!
Just minor comments. This is close to done.
// These should be removed eventually, by putting the functionality that's
// needed by callers into methods of this type

// type() == nThunk
Didn't see that it was about the group at first. Doesn't seem like a particularly valuable comment.
// type() == nThunk
Pre-existing. Can be done separately.
The following commits will touch this file significantly, so it's better to get the formatting out of the way first.
…alueStorage`

This factors out most of the value representation into a mixin class. `finishValue` is now gone for good and replaced with a simple template function `setStorage`, which derives the type information/discriminator from the type of the argument. Likewise, reading of the value goes through the function template `getStorage`. An empty type `Null` is introduced to make the bijection InternalType <-> C++ type complete.
This also makes it possible to make the `payload` field private in the `ValueStorage` class template.
This shaves off a very significant amount of memory used for evaluation as well as reduces the GC-managed heap. Previously the union discriminator (InternalType) was stored as a separate field in the Value, which takes up a whole 8 bytes due to the padding needed for member alignment. This effectively wasted 7 whole bytes of memory. Instead, InternalType is now packed into pointer alignment niches. As it turns out, there are more than enough unused bits there for the bit packing to be effective. See the doxygen comment in the ValueStorage specialization for more details. This does not add any performance overhead, even though we now consistently assert the InternalType in all getters. This can also be made atomic with a double-width compare-and-swap instruction on x86_64 (the CMPXCHG16B instruction) for parallel evaluation.
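As a sketch of the double-width compare-and-swap idea mentioned above (illustrative only; `PackedValue` and `tryPublish` are invented names, not the PR's API): a 16-byte value can be updated atomically via `std::atomic::compare_exchange_strong`, which compilers can lower to `CMPXCHG16B` on x86_64 (typically requiring `-mcx16` and possibly `-latomic`, depending on the toolchain):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two machine words: discriminator bits packed with the payload.
// Trivially copyable, so std::atomic<PackedValue> is well-formed.
struct PackedValue
{
    std::uintptr_t word0; // discriminator bits + first payload word
    std::uintptr_t word1; // second payload word
};

// Double-width CAS: succeeds only if nobody else updated the slot first,
// which is exactly the primitive needed for racing thunk updates.
inline bool tryPublish(std::atomic<PackedValue> & slot, PackedValue expected, PackedValue desired)
{
    return slot.compare_exchange_strong(expected, desired);
}
```

Whether this is actually lock-free depends on the target and compiler flags; `slot.is_lock_free()` can be checked at runtime.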
It would have been nice not to use the phrase `dword`: Microsoft denotes 32-bit integers as `dword`.
   for storing the type information. This way tagged pointers are considered
   to be valid, even when they are not aligned. */
if constexpr (detail::useBitPackedValueStorage<sizeof(void *)>)
    for (std::size_t i = 1; i < sizeof(std::uintptr_t); ++i)
I wouldn't introduce std::uintptr_t here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.
here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.
Yeah, that's true. Though I can't think of a platform where that is actually the case.
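One cheap way to guard against such a platform would be a `static_assert`, so the byte loop fails at compile time instead of silently iterating over the wrong number of bytes (a sketch, not part of the PR):

```cpp
#include <cstdint>

// std::uintptr_t is only guaranteed to be *at least* as wide as a pointer;
// fail the build outright on any platform where the sizes diverge.
static_assert(sizeof(std::uintptr_t) >= sizeof(void *),
              "uintptr_t must be able to hold a pointer");
static_assert(sizeof(std::uintptr_t) == sizeof(void *),
              "the tag-byte loop assumes uintptr_t and void* have the same size");
```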
 * discriminator is stored in the first double word.
 *
 * The layout of discriminator bits is determined by the 3 bits of PrimaryDiscriminator,
 * which are always stored in the lower 3 bits of the first dword of the payload.
Using `dword` is ambiguous. My expectation is that the word size on a 64-bit machine is 64 bits, although on x86_64 a 64-bit arithmetic instruction might not have the most efficient encoding. Historically, Microsoft uses `dword` for 32-bit values.
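To make the layout discussion concrete, here is a minimal sketch of packing a 3-bit primary discriminator into the low bits of the first payload word. The identifiers and enumerators are illustrative, not the PR's actual ones:

```cpp
#include <cassert>
#include <cstdint>

// Up to 8 primary variants fit in the 3 alignment-niche bits.
enum class PrimaryDiscriminator : std::uint8_t {
    Thunk = 0,
    Pointer = 1,
    Integer = 2,
    // ...
};

constexpr std::uint64_t discriminatorMask = 0b111;

// The discriminator lives in the lowest 3 bits of the first 64-bit word;
// the remaining bits are payload (e.g. an 8-byte-aligned pointer).
constexpr PrimaryDiscriminator getPrimary(std::uint64_t firstWord)
{
    return static_cast<PrimaryDiscriminator>(firstWord & discriminatorMask);
}

constexpr std::uint64_t packPrimary(std::uint64_t payloadBits, PrimaryDiscriminator d)
{
    // payloadBits must have its low 3 bits clear.
    return payloadBits | static_cast<std::uint64_t>(d);
}
```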
        return nFloat;
    case tThunk:
    case tApp:
        return nThunk;
Please introduce formatting changes in another PR.
/* Whether to use a specialization of ValueStorage that does bitpacking into
   alignment niches. */
template<std::size_t ptrSize>
inline constexpr bool useBitPackedValueStorage = (ptrSize == 8) && (__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8);
Probably the GC isn't affected by __STDCPP_DEFAULT_NEW_ALIGNMENT__, i.e. this makes sense when the default allocator is used.
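For context on why the default allocator matters here: since C++17, `operator new` guarantees alignment of at least `__STDCPP_DEFAULT_NEW_ALIGNMENT__`, so pointers it returns have free low bits on typical 64-bit platforms. A small sketch of that assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// If the default new-alignment is at least 8, any pointer from the default
// allocator has its low 3 bits clear and they are usable for tags.
static_assert(__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8,
              "need at least 3 free low bits in default-allocated pointers");

inline bool lowBitsFree(void * p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0b111) == 0;
}
```

A GC allocator makes its own alignment promises, which is the reviewer's point: this macro only speaks for the default `operator new`.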
> Probably the GC isn't affected by __STDCPP_DEFAULT_NEW_ALIGNMENT__, i.e. this makes sense when the default allocator is used.
I've added that because Value also stores pointers that are not owned by Boehm (like SourceAccessor), and those are most likely allocated with the default allocator.
argh, I did a review, but I forgot to publish it.
Motivation
This shaves off a very significant amount of memory used for evaluation (a ~20% reduction in maximum heap size and ~17% in total bytes) as well as reduces the GC-managed heap. See the results below. The fraction of collected memory is not affected.
Memory usage comparison

Below are the comparisons for:

NIX_SHOW_STATS=1 nix-env --query --available --out-path --file ../nixpkgs --eval-system x86_64-linux > /dev/null

Before

After
So the savings are around 4.9 GiB on a full nixpkgs eval.

Previously the union discriminator (InternalType) was stored as a separate field in the Value, which takes up a whole 8 bytes due to the padding needed for member alignment. This effectively wasted 7 whole bytes of memory. Instead, InternalType is packed into pointer alignment niches. As it turns out, there are more than enough unused bits there for the bit packing to be effective. See the doxygen comment in the ValueStorage specialization for more details.

This does not add any performance overhead, even though we now consistently assert the InternalType in all getters. This can also be made atomic with a double-width compare-and-swap instruction on x86_64 (the CMPXCHG16B instruction) for parallel evaluation.

Context
This is the best we can do with the current data model to address #8621.
The improvement is very significant and should give more leeway to the ever growing nixpkgs.
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.