-
Notifications
You must be signed in to change notification settings - Fork 1k
Switch to tcmalloc #5101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to tcmalloc #5101
Conversation
dmkozh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory/perf numbers looks great, thanks for the research!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR switches stellar-core from the glibc allocator to tcmalloc to address memory arena issues that were causing OOM kills when the state snapshot invariant was enabled. The PR adds tcmalloc as a required dependency on Linux, configures it with optimized parameters for stellar-core's allocation patterns, and updates documentation and build configuration accordingly.
Changes:
- Adds TcmallocConfig.cpp to configure tcmalloc parameters at startup
- Updates build system (configure.ac, Makefile.am) to detect and link tcmalloc on Linux
- Updates INSTALL.md to document the new libgoogle-perftools-dev dependency
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/util/TcmallocConfig.cpp | New file that configures tcmalloc parameters (memory release rate, thread cache size, allocation thresholds) using a constructor attribute |
| configure.ac | Adds pkg-config check for libtcmalloc_minimal >= 2.0 on Linux, makes it mandatory with clear error message |
| src/Makefile.am | Adds libtcmalloc_LIBS to stellar_core_LDADD for linking |
| INSTALL.md | Documents new libgoogle-perftools-dev dependency for Linux builds |
|
Nice work indeed! We might have to add this in a few other places like Dockerfiles etc. I thought of asking copilot to do it but we can just search manually and include them in this PR. |
ab092df to
eb688a9
Compare
I've updated the testing dockerfiles in the core repo and opened a PR for quickstart: stellar/quickstart#890. I believe the last failing github action should pass after the quickstart PR merges. |
ab092df to
01bd2f4
Compare
anupsdf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Switches core grom glibc allocator to tcmalloc.
After adding the state snapshot invariant, we've seen nodes with it enabled get OOM killed. This was due to an issue with glibc memory arenas. TLDR: we'd make a large allocation on thread A (the in-memory snapshot copy), pass it over to thread B, then have thread B free the allocation. In glibc, there is a per-thread memory arena. Each allocation made by a thread takes capacity from this arena. The issue is, whatever thread frees an allocation gets the capacity added back to its own arena regardless of the original allocating thread. This means that the snapshot invariant copy allocation caused us to "drain" memory from the apply thread, resulting in the apply thread continually requesting more heap until we get OOM killed. For more info, check out this blog.
The solution is to switch to an allocator more friendly to our multithreaded allocation pattern. After testing mimalloc, jemalloc, glibc, and tcmalloc, I've landed on tcmalloc as our winner.
tcmalloc is developed by Google, packaged in default Ubuntu repos, and is quite stable. It's memory overhead and runtime performance are both excellent for our use case. In order to test Resident Set Size (RSS, or allocator memory overhead), I've been running different allocator setting over the last couple days on test-core-003. After tuning tcmalloc, I've seen a memory decrease of about 40%, from 7.5 GB to 5 GB. Surprisingly, I've also seen a very significant runtime improvement.
In a multithreaded, high throughput environment (measured by indexing the BucketList in parallel), we see the following allocation measurements:
Deallocation time is omitted, as it is basically the same for both. The key takeway is that tcmalloc is faster on the mdeian case, but also much more consistent due to better handling of thread contention.
To test more realistic patterns, I ran catchup on 200 ledgers starting from 60757320 locally on my laptop. tcmalloc resulted in an 11% reduction of total apply time compared to glibc:
I also tested max-sac-tps test on
user-dev-001. I tested with 4 threads, batch size of 1, with disk writes counting towards overall apply time. tcmalloc consistently showed ~18% improvement in apply time and overall max sac TPS (from 6400 -> 7680 TPS). This seems like a pretty significant and easy win!I don't think CI will pass, as we need to add a new dependency to our build. It looks like we have to create the runner with the new dependency inside namespace directly, so maybe @graydon can help out with this. I'm also not sure if I correctly changed the build system, especially with the way I've set up the config options. In my testing I was just linking everything directly, and the actual compile time changes are AI generated, so some help would be appreciated.
Checklist
clang-formatv8.0.0 (viamake formator the Visual Studio extension)