Poor jemalloc performance with zeroed allocations leading to TLB shootdown #27275

Open
alessandrod opened this issue Aug 19, 2022 · 4 comments · May be fixed by anza-xyz/agave#1250 or anza-xyz/agave#1364
Labels
validator Issues that relate to the validator

Comments

@alessandrod
Contributor

Problem

While profiling a branch including all the patches needed to bring direct account mapping with abiv1, I noticed a very large number of TLB flushes and page faults caused by the program runtime. Initially I feared that the direct mapping changes were somehow causing the issue, but I've now observed that the problem can happen on master as well. Direct mapping does seem to make it worse, most likely by making the program runtime threads a lot faster (the irony!).

The problem is the following:

[Screenshot: Screen Shot 2022-08-19 at 7 48 38 pm]

It looks like jemalloc always force-purges zeroed extents immediately, instead of implementing two-phase release like it does for non-zeroed allocations. Two-phase cleanup reduces the overhead of allocating/deallocating memory, at the expense of retaining a bit more memory during the decay period. Furthermore, jemalloc purges zeroed extents using madvise(MADV_DONTNEED), which requires a TLB flush and, with our allocation sizes, a full TLB flush (the theory being that a full flush is faster than flushing the individual page entries).

Since we run the program runtime inside rayon, we have a bunch of threads constantly flushing TLBs, and therefore we get a by-the-book TLB shootdown (https://web.njit.edu/~dingxn/papers/ispa20.pdf).

To confirm that the shootdown is caused by the interaction between the rayon thread pool and jemalloc (the default glibc allocator doesn't exhibit the problem), I've written a minimal test case which mimics the CallFrame allocation we do in the program runtime: https://gist.github.com/alessandrod/a80788429873a4b9caa6aa53a82e0b2b
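As a rough illustration of the workload shape described here (this is not the code from the gist), the sketch below has several threads repeatedly allocate and free a buffer, once through the zeroed-allocation path and once through allocate-then-fill. It uses `std::thread` instead of rayon and whatever global allocator the binary is built with, so on its own it will not reproduce the jemalloc behaviour. `FRAME_BYTES`, `ITERATIONS`, `THREADS` and all function names are hypothetical.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical sizes; the real CallFrame size and loop counts are in the gist.
const FRAME_BYTES: usize = 64 * 1024;
const ITERATIONS: usize = 1_000;
const THREADS: usize = 8;

/// Zeroed-allocation path: `vec![0u8; n]` asks the allocator for zeroed
/// memory (calloc-like), then the buffer is freed when it goes out of scope.
fn zeroed_alloc_round() -> u8 {
    let mut frame = vec![0u8; FRAME_BYTES];
    frame[0] = 1;
    std::hint::black_box(&frame)[0]
}

/// Allocate-then-fill path (malloc + memset-like). The optimizer may be able
/// to fold this into a zeroed allocation; inspect the generated code if the
/// distinction matters for a measurement.
fn memset_alloc_round() -> u8 {
    let mut frame: Vec<u8> = Vec::with_capacity(FRAME_BYTES);
    frame.resize(FRAME_BYTES, 0);
    frame[0] = 1;
    std::hint::black_box(&frame)[0]
}

/// Run `round` concurrently on THREADS threads and return the wall time.
fn run(round: fn() -> u8) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            thread::spawn(move || {
                let mut acc = 0u64;
                for _ in 0..ITERATIONS {
                    acc += round() as u64;
                }
                acc
            })
        })
        .collect();
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    // Every round writes and reads back a single 1.
    assert_eq!(total, (THREADS * ITERATIONS) as u64);
    start.elapsed()
}
```

Calling `run(zeroed_alloc_round)` and `run(memset_alloc_round)` from a `main` and printing the two durations gives a crude comparison; comparing allocators would additionally require building with jemalloc as the global allocator.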

Here are perf numbers on a 64 vCPU gcloud VM:

$ hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     122.5 ms ±  22.5 ms    [User: 1566.1 ms, System: 113.2 ms]
  Range (min … max):    59.3 ms … 176.6 ms    23 runs
 
Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     260.2 ms ±  28.7 ms    [User: 370.3 ms, System: 5734.7 ms]
  Range (min … max):   207.0 ms … 293.6 ms    10 runs
 
Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):      94.5 ms ±  10.0 ms    [User: 85.6 ms, System: 237.9 ms]
  Range (min … max):    64.8 ms … 123.1 ms    28 runs
 
Summary
  'target/release/examples/mem calloc_slab' ran
    1.30 ± 0.27 times faster than 'target/release/examples/mem malloc_memset'
    2.75 ± 0.42 times faster than 'target/release/examples/mem calloc'

You can see that calloc is about 2x slower than malloc_memset and spends roughly 50x more time in the kernel (5734.7 ms vs 113.2 ms of system time), even though the latter causes nearly twice as many page faults, since it pages in the whole allocation to zero it.

calloc_slab works around the problem by pre-allocating large zero extents and then purging in one go, therefore doing only one TLB flush when the whole slab is deallocated. This confirms that the problem is caused by releasing many small calloc allocations. I've prototyped this for the program runtime, with one slab per transaction execution. Unfortunately, since we don't have a hard max number of instructions that can be executed per transaction, the slab needs to be quite large, and while it improves perf, it also increases peak virtual memory usage significantly (although actual paged-in memory stays lower than with malloc_memset).
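The slab idea can be sketched as a bump allocator over a single pre-zeroed buffer: frames are carved off the front, and the backing memory is released once when the buffer is dropped. This is a minimal sketch under assumptions, not the prototype mentioned above; `ZeroSlab`, `run_transaction` and their parameters are hypothetical names.

```rust
/// Bump allocator over one pre-zeroed buffer (hypothetical type).
/// The caller is expected to pass a freshly zeroed backing slice.
struct ZeroSlab<'a> {
    rest: &'a mut [u8],
}

impl<'a> ZeroSlab<'a> {
    fn new(backing: &'a mut [u8]) -> Self {
        ZeroSlab { rest: backing }
    }

    /// Carve `len` bytes off the front of the slab, or return `None`
    /// when the slab is exhausted. Nothing is freed per frame.
    fn alloc(&mut self, len: usize) -> Option<&'a mut [u8]> {
        if len > self.rest.len() {
            return None;
        }
        // Temporarily replace `rest` with an empty slice so it can be split.
        let rest = std::mem::take(&mut self.rest);
        let (frame, tail) = rest.split_at_mut(len);
        self.rest = tail;
        Some(frame)
    }

    fn remaining(&self) -> usize {
        self.rest.len()
    }
}

/// Hypothetical per-transaction use: one zeroed allocation up front,
/// up to `depth` frames served from it, one deallocation at the end.
fn run_transaction(frame_len: usize, depth: usize) -> usize {
    let mut backing = vec![0u8; frame_len * depth];
    let mut slab = ZeroSlab::new(&mut backing);
    let mut served = 0;
    for _ in 0..depth {
        match slab.alloc(frame_len) {
            Some(frame) => {
                // Each frame starts out zeroed because the backing buffer was.
                assert!(frame.iter().all(|&b| b == 0));
                if let Some(first) = frame.first_mut() {
                    *first = 1; // stand-in for real frame use
                }
                served += 1;
            }
            None => break,
        }
    }
    served
    // `backing` is dropped here, releasing the whole slab in one go.
}
```

The sketch also shows the sizing problem described above: the backing buffer has to be sized for the worst case up front, because `alloc` simply returns `None` once it runs out.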

Jemalloc implements two levels of caching: a small lock-free, per-thread cache and then larger arenas shared among threads. It turns out that one way to avoid this particular issue is to make sure that the allocation fits in the per-thread cache (the default limit is 32k; here I bumped it to 256k):

$ MALLOC_CONF=tcache_max:262144 hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     131.8 ms ±   9.1 ms    [User: 1346.1 ms, System: 113.6 ms]
  Range (min … max):   119.6 ms … 149.5 ms    22 runs
 
Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     135.6 ms ±   8.5 ms    [User: 1404.5 ms, System: 127.1 ms]
  Range (min … max):   124.7 ms … 154.3 ms    21 runs
 
Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):     100.5 ms ±   8.6 ms    [User: 104.2 ms, System: 308.4 ms]
  Range (min … max):    88.0 ms … 132.5 ms    30 runs
 
Summary
  'target/release/examples/mem calloc_slab' ran
    1.31 ± 0.14 times faster than 'target/release/examples/mem malloc_memset'
    1.35 ± 0.14 times faster than 'target/release/examples/mem calloc'

Proposed Solution

Has anyone looked into tuning jemalloc for the validator? This issue aside, I see that there's quite a bit of memory churn, so I'm tempted to fix this issue (and possibly more) by running the jemalloc profiler and making sure that more allocations get cached.

@alessandrod
Contributor Author

Btw for the lols: if you look at the stack trace, there's a _rjem_je_ehooks_default_zero_impl callback. Great, I thought, I'll implement my own callback and make it not purge so often. Then I found this: https://github.com/jemalloc/jemalloc/blob/deb8e62a837b6dd303128a544501a7dc9677e47a/include/jemalloc/internal/ehooks.h#L367

@ryoqun
Member

ryoqun commented Aug 20, 2022

hehe, nice finding.

i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

@alessandrod
Contributor Author

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

I thought about that and it'd be fairly easy to implement. Max frame size is fixed and CPIs are nested in the host stack too so we don't even need alloca. But it would merge the SBF stack with the host stack, which from a security perspective isn't worth the tradeoff I think.
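For illustration only, the host-stack variant discussed here could look roughly like the sketch below: each nested call keeps a fixed-size, zero-initialised frame as a local array, and the worker thread is spawned with a stack large enough for the maximum nesting depth. `STACK_FRAME_BYTES`, `MAX_DEPTH`, `invoke` and `run_on_big_stack` are hypothetical, and the sketch does nothing about the concern that guest frames end up on the host stack.

```rust
use std::thread;

// Hypothetical limits; the real max frame size and CPI depth are not shown here.
const STACK_FRAME_BYTES: usize = 16 * 1024;
const MAX_DEPTH: usize = 8;

/// Each nested call owns a zeroed frame on the host stack, so there is
/// no heap allocation (and no purge) per call. Returns the depth reached.
fn invoke(depth: usize) -> usize {
    let mut frame = [0u8; STACK_FRAME_BYTES];
    frame[0] = 1; // stand-in for real frame use
    // Keep the compiler from optimizing the frame away.
    let _ = std::hint::black_box(&mut frame);
    if depth + 1 < MAX_DEPTH {
        invoke(depth + 1)
    } else {
        depth + 1
    }
}

/// Run the nested calls on a thread whose stack is sized for the
/// worst-case nesting plus some headroom for everything else.
fn run_on_big_stack() -> usize {
    thread::Builder::new()
        .stack_size(MAX_DEPTH * STACK_FRAME_BYTES + 1024 * 1024)
        .spawn(|| invoke(0))
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

Because the frame size and nesting limit are compile-time constants in this sketch, the required stack size can be computed up front, which is why no alloca-style dynamic stack allocation is needed.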

@sakridge sakridge added the validator Issues that relate to the validator label Oct 21, 2022
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Oct 23, 2023
@github-actions github-actions bot closed this as not planned Oct 31, 2023
@alessandrod alessandrod reopened this Nov 9, 2023
@behzadnouri behzadnouri removed the stale [bot only] Added to stale content; results in auto-close after a week. label Nov 9, 2023
@ryoqun
Member

ryoqun commented May 15, 2024

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

after almost 2 years, i finally got my hands on this: anza-xyz#1364
