Reduce memory allocation overhead #15096
🤖 Augment PR Summary
Summary: This PR reduces allocation/free overhead in Redis’ command execution hot paths (notably workloads with many small args and pipelining).
Changes:
Technical Notes: Overflow threads share hashed accounting slots (atomic RMW) as before; increased tcache capacity may raise worst-case memory footprint.
CE Performance Automation : step 1 of 2 (build) DONE. This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link
CE Performance Automation : step 2 of 2 (benchmark) FINISHED. This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2026-05-03 20:34:44.628983 and took 57577.83804 seconds to finish. In total will run 378 benchmarks.
Preliminary benchmark results — fleet run
| Platform | % change | Note |
|---|---|---|
| x86-aws-m7i.metal-24xl | -3.8% | 22,770 → 21,909 ops/s |
| x86-aws-m8i.24xlarge | -3.2% | |
| x86-aws-m8a.metal-24xl | -7.2% | 33,658 → 31,224 ops/s |
Interestingly, the hgetex-persist variant of the same test improves (+3.6% to +12.7%) — so it's specific to read-only HGETEX, not HGETEX-as-a-write.
TMA funnel on x86-aws-m8a.metal-24xl-profiler (level-2, 30s windows):
| | unstable | faster-alloc | Δ |
|---|---|---|---|
| Retiring | 40.8% | 39.3% | -1.5pp |
| Frontend_Bound | 44.9% | 49.5% | +4.6pp |
| ↳ Fetch_Bandwidth | 15.3% | 19.3% | +4.0pp |
| ↳ Fetch_Latency | 29.6% | 30.3% | +0.7pp |
| Memory_Bound | 10.9% | 8.0% | -2.9pp |
| Bad_Speculation | 2.3% | 2.2% | -0.1pp |
The allocator work delivered the expected Memory_Bound win (-2.9pp). But the combined effect of -flto into jemalloc + the always_inline createStringObjectInline() appears to bloat the instruction footprint enough that Fetch_Bandwidth jumps +4.0pp on reply-heavy paths (HGETEX-50f runs createStringObject 50× per request, so icache / DSB pressure shows up). The icache cost outweighs the Memory_Bound win on this test, so Retiring drops 1.5pp → throughput -3 to -7%.
The hgetex-persist variant improves because it has a write side (TTL clear) where the allocator fastpaths dominate; plain read-only HGETEX pays the icache cost without the write-path benefit.
This tracks the concern you already called out: "misuse [of always_inline] can cause code bloat." Might be worth either scoping createStringObjectInline more narrowly, or gating on __attribute__((hot)) / PGO rather than always_inline — otherwise the read-path regression is likely to surface on more reply-heavy workloads.
Happy to share the raw funnels or try a variant that reverts just the forced-inline.
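For reference, the two attribute styles discussed in this comment can be contrasted with a small sketch. The helpers `copy_forced` and `copy_hinted` are hypothetical stand-ins (they are not `createStringObjectInline()` or any other Redis function), and the attributes are GCC/Clang extensions. The first copies its body into every call site regardless of cost; the second only marks the function as hot and leaves the inlining decision to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, forced inline: the body is duplicated at every
 * call site, which grows the instruction footprint of hot callers. */
static inline __attribute__((always_inline))
size_t copy_forced(char *dst, size_t cap, const char *src) {
    assert(cap > 0);
    size_t n = strlen(src);
    if (n >= cap) n = cap - 1;      /* truncate to fit, keep room for NUL */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}

/* Same logic with only a hotness hint: the compiler may still inline it,
 * but it is free to keep a single out-of-line copy instead. */
static __attribute__((hot))
size_t copy_hinted(char *dst, size_t cap, const char *src) {
    assert(cap > 0);
    size_t n = strlen(src);
    if (n >= cap) n = cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Both variants behave identically; the difference is only in generated code size and layout, which is why it shows up in Fetch_Bandwidth rather than in functional tests.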
Final benchmark results — fleet run
| Platform | ISA | # improvements ≥3% | # regressions ≤-3% | Top win |
|---|---|---|---|---|
| x86-aws-m7i.metal-24xl (Sapphire Rapids) | x86 | 7 | 2 | load-hash-50f-100B +9.3% |
| x86-aws-m8i.24xlarge (Granite Rapids) | x86 | 2 | 0 | hgetex-persist-50f +12.4% |
| x86-aws-m8a.metal-24xl (AMD Turin) | x86 | 7 | 1 | load-hash-20f-1B-pipeline-30 +20.9% |
| arm-aws-m8g.metal-24xl (Graviton4) | ARM | 6 | 1 | load-hash-50f-100B +14.2% |
| arm-gcp-c4a-standard-48 (Axion / Neoverse V2) | ARM | 7 | 0 | load-hash-50f-100B +12.5% |
Two additional x86 runners (gcp-c4-standard-48 and m8a.metal-24xl-profiler) also completed (or are 94% complete), but their unstable baselines are older (9 days and 3 days respectively), so deltas there bake in 3–9 days of unrelated commits and are not directly comparable. Directional signal on those runners is consistent with the clean-baseline set: broad hash/stream-load improvements.
Cross-platform consistent improvements (observed on ≥3 platforms, ≥3%)
| Test | x86 m7i | x86 m8i | x86 m8a | ARM m8g | ARM c4a |
|---|---|---|---|---|---|
| load-hash-50-fields-100B | +9.3% | - | - | +14.2% | +12.5% |
| load-hash-50-fields-1000B | +8.0% | - | - | +4.8% | +6.5% |
| load-hash-50-fields-1000B-expiration | +3.9% | - | - | +3.9% | +3.2% |
| 1000streams-xreadgroup-count-100 | - | - | +3.6% | +3.3% | +4.3% |
| hgetex-persist-50-fields-10B | - | +12.4% | +3.6% | - | +3.2% |
| hash-htll-50-fields-10B (x86 only) | +6.6% | +3.6% | +5.3% | −3.8% | +0.8% |
The HSET/load-hash pipelined wins you reported in the PR description reproduce across both ISAs and all hardware generations. Best observed delta is +20.9% on load-hash-20-fields-1B-pipeline-30 on m8a.metal-24xl.
Cross-platform regressions
1. hash-hgetex-50-fields-10B — x86 only (covered in the earlier comment)
| Platform | Δ | ops/s |
|---|---|---|
| x86-aws-m7i.metal-24xl | −3.8% | 22,770 → 21,909 |
| x86-aws-m8i.24xlarge | −3.2% | (sub-threshold) |
| x86-aws-m8a.metal-24xl | −7.2% | 33,658 → 31,224 |
| x86-aws-m8a.metal-24xl-profiler | −2.5% | 33,679 → 32,846 |
| arm-aws-m8g.metal-24xl | +1.0% | 22,885 → 23,106 |
| arm-gcp-c4a-standard-48 | +1.0% | 23,776 → 24,005 |
The x86-only regression pattern matches the TMA diagnosis in the earlier comment: Memory_Bound shrinks −2.9pp (expected allocator win) but Fetch_Bandwidth grows +4.0pp from the always_inline createStringObjectInline bloating the hot path on reply-heavy workloads. ARM is flat — consistent with the absence of a uop cache / DSB-equivalent bottleneck on Neoverse-class cores.
2. hash-htll-50-fields-10B — ARM m8g only (−3.8%)
| Platform | Δ |
|---|---|
| x86-aws-m7i.metal-24xl | +6.6% |
| x86-aws-m8i.24xlarge | +3.6% |
| x86-aws-m8a.metal-24xl | +5.3% |
| arm-aws-m8g.metal-24xl | −3.8% |
| arm-gcp-c4a-standard-48 | +0.8% |
Opposite sign on Graviton4 vs the rest of the fleet. Only one data point on one ARM platform, and the test shows 0.2% std.dev so it is not noise — but we can't fully attribute without ARM TMA (no ARM topdown runner in our fleet). Might be worth a second look in case the zmalloc single-writer-slot reshuffle interacts badly with Graviton4's cache layout. Can share the raw timeseries if useful.
Test failures (context)
All the test failures observed on profiler and c4-standard-48 runners (string-auth-reconnect-10B, pubsub-mixed-*, replica-only-parallel-fullsync-*, 3Mkeys-string-mixed-50-50-with-expiration-pipeline-10-400_conns, zremrangebyscore-pipeline-10) reproduce on both platforms and on both branches — they are test-framework/infrastructure issues, not caused by this PR.
Bottom line
- Write-path wins are confirmed and consistent across x86 (Intel + AMD) and ARM (Graviton4 + Axion).
- The `hgetex-50-fields-10B` regression on x86 (up to −7.2%) is real and TMA-attributable to `always_inline`-driven icache pressure; ARM is not affected.
- The `htll-50-fields-10B` regression on Graviton4 (−3.8%) is narrow, within-noise-margin, but worth a sanity check.
- The allocator-accounting + `sdallocx` parts of the PR look like clear wins across the board. The part worth scoping more carefully is the forced-inline (see earlier comment).
Happy to share raw CSVs, individual timeseries, or a scoped re-run if you want to validate a fix.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ea42cc6.
Re-run on
| Platform | exp #1 (988806e, with inline) | exp #3 (5fc11317, post-revert) |
|---|---|---|
| x86-aws-m8a.metal-24xl (AMD Turin) | −7.2% | +0.3% ✅ |
| x86-aws-m8a.metal-24xl-profiler (AMD Turin + TMA) | −2.5% | −2.7% (under 3% threshold) |
| x86-aws-m7i.metal-24xl (Sapphire Rapids) | −3.8% | −4.1% |
| x86-aws-m8i.24xlarge (Granite Rapids) | −3.2% | −4.3% |
The AMD regression is fully resolved. There's a small Intel residual at ∼−4% on hgetex-50f-10B that hovers near the noise floor on both runs — not introduced by the inline-revert.
TMA on x86-aws-m8a.metal-24xl-profiler — hgetex-50f-10B, level-2
| | unstable | exp #1 (988806e) | exp #3 (5fc11317) |
|---|---|---|---|
| Retiring | 41.5% | 39.3% (−2.2pp) | 40.5% (−1.0pp) |
| Frontend_Bound | 44.1% | 49.5% (+5.4pp) | 46.5% (+2.4pp) |
| ↳ Fetch_Bandwidth | 15.5% | 19.3% (+3.8pp) | 16.0% (+0.5pp) ✅ |
| ↳ Fetch_Latency | 28.6% | 30.3% (+1.7pp) | 30.5% (+1.9pp) |
| Memory_Bound | 10.9% | 8.0% (−2.9pp) | 9.8% (−1.1pp) |
Confirms the diagnosis: forced-inline icache pressure (Fetch_Bandwidth +3.8pp on Turin) is essentially gone (+0.5pp). Cost: the allocator + sdallocx + accounting Memory_Bound win shrunk from −2.9pp → −1.1pp because the un-inlined functions are less optimizable by the compiler. Net Retiring loss reduced from −2.2pp → −1.0pp.
Write-path wins (the headline of the PR) — retained on x86
| Test | m7i (SPR) | m8a (Turin) | m8i (GranR) | m8a-profiler |
|---|---|---|---|---|
| load-hash-50f-100B | +10.9% | +10.3% | +1.6% | +12.5% |
| load-hash-50f-10B | +7.4% | +10.2% | +4.4% | +14.0% |
| load-hash-20f-1B-pipeline-30 | +8.4% | +10.8% | +5.3% | +11.2% |
| load-hash-50f-1000B | +5.7% | +5.8% | −1.8% | +6.8% |
| 1000streams-xreadgroup-count-100 | +3.6% | −0.1% | +0.8% | +6.1% |
| htll-50f-10B | +1.5% (was +6.6%) | +1.5% (was +5.3%) | +10.7% | +0.7% |
Hash-load wins still dominate (+5 to +14%). htll-50f-10B lost most of the SPR/Turin boost (matches the "5–10% Intel boost from inlining we're leaving out" call), but Granite Rapids grew it to +10.7%.
Trade paid: hgetex-persist-50f-10B regresses on x86
The +12.4% Granite-Rapids boost in exp #1 was inline-driven; the revert turns it into a small regression on x86:
| Platform | exp #1 | exp #3 |
|---|---|---|
| m7i (SPR) | flat | −2.6% |
| m8a (Turin) | +3.6% | −4.6% |
| m8i (GranR) | +12.4% | −4.2% |
| m8a-profiler | flat | −0.7% |
This is the most visible cost of the revert. hgetex-persist is a narrower workload than plain HSET, so on-balance the trade (close AMD hgetex-50f −7.2% / lose ∼5% on hgetex-persist) looks reasonable, but flagging it explicitly.
Bottom line
- Original AMD `hgetex-50f` regression resolved (−7.2% → +0.3% on m8a) ✅
- All write-path hash-load wins retained (+5 to +14% on x86)
- Most of the allocator improvements survive (TMA Memory_Bound win shrunk −2.9pp → −1.1pp but didn't disappear)
- Trade paid: `hgetex-persist-50f-10B` now regresses −4 to −5% on x86
- Intel `hgetex-50f-10B` continues to hover at the noise floor (∼−4%, present in both runs, unrelated to the inline change)
PGO follow-up sounds like the right path for recovering the Intel boosts without the AMD icache cost. Happy to share raw CSVs / individual TMA funnels.
ARM (Graviton4 + Axion) is still running — m8g was held up behind unrelated long-tail work and the scoped run is now queued; will follow up with the ARM side as soon as it lands.
ARM follow-up — Graviton4 (
| Test | exp #1 (988806e, with inline) | exp #3 (5fc11317, post-revert) |
|---|---|---|
| load-hash-50f-100B | +14.2% | +8.2% |
| load-hash-50f-1000B-expiration | +3.9% | +4.5% |
| load-hash-50f-1000B | +4.8% | +3.3% |
| xreadgroup-count-100 | +3.3% | +1.5% |
| hgetex-persist-50f-10B | (flat) | +1.6% |
| hgetex-50f-10B | +1.0% | +0.3% |
| load-hash-50f-10B-short-expiration | (flat) | +0.1% |
| load-hash-50f-10B-long-expiration | (flat) | −1.0% |
| load-hash-50f-10B-expiration | (flat) | −1.0% |
| load-hash-20f-1B-pipeline-30 | (flat) | −3.1% |
| htll-50f-10B | −3.8% | −3.4% |
| load-hash-50f-10B | (flat) | −3.7% |
Read on Graviton4
- `load-hash-50f-100B` win held but shrunk meaningfully (+14.2% → +8.2%)
- `htll-50f-10B` remains the one consistent ARM regression we'd already flagged in exp #1, roughly unchanged (−3.8% → −3.4%)
- Two new small regressions appeared in exp #3: `load-hash-50f-10B` −3.7% and `load-hash-20f-1B-pipeline-30` −3.1%
- `hgetex-50f-10B` stays effectively flat on m8g across both runs (+1.0% → +0.3%); Graviton4 wasn't impacted by the AMD-driven revert in either direction
The inline revert traded more on Graviton4 than I'd anticipated. Without ARM TMA in our fleet I can't decompose Retiring/Frontend/Memory the way I did on Turin — but the pattern (load-hash wins shrunk, two new small regressions on the smallest-value variants) is consistent with Graviton4 having been benefiting from the inlining similarly to Sapphire Rapids on those workloads.
Net for ARM: the headline load-hash-50f-100B win is still solid (+8.2%) and the persistent htll-50f regression isn't worse — but the smaller-value variants regressed slightly. Probably worth a second look on Graviton4 if PGO recovers some of this.
ShooterIT left a comment
Maybe separate PRs would be better, so we can know the effect of each part.
@ShooterIT It's a bit harder to benchmark changes separately when each change contributes a few percent. With the help of AI, I tried to color the relevant boxes: For the above test, gains in terms of cpu percentage:
oranagra left a comment
i've reviewed the PR description and commented about a couple of concerns.
@tezc — apologies for the delayed follow-up. Two confounders:
Re-ran today on the current PR head
| Test | Baseline 7ecc04f59d | PR c4abf7f4 | Δ | (was on 988806e) |
|---|---|---|---|---|
| 100Kkeys-hash-hgetex-50-fields-10B-values | 22 385.06 | 21 972.97 | −1.84 % | (−3.2 to −7.2 %) |
| 100Kkeys-hash-hgetex-persist-50-fields-10B-values | 38 063.94 | 37 550.94 | −1.35 % | (−4 to −5 %) |
| 100Kkeys-hash-htll-50-fields-10B-values | 54 713.68 | 57 929.55 | +5.88 % | (clean on x86 originally) |
arm-aws-m8g.metal-24xl-2 (Graviton4)
| Test | Baseline 7ecc04f59d | PR c4abf7f4 | Δ | (was on 988806e) |
|---|---|---|---|---|
| 100Kkeys-hash-hgetex-50-fields-10B-values | 22 386.94 | 22 740.17 | +1.58 % | (clean on m8g originally) |
| 100Kkeys-hash-hgetex-persist-50-fields-10B-values | 38 335.13 | 39 086.45 | +1.96 % | (clean on m8g originally) |
| 100Kkeys-hash-htll-50-fields-10B-values | 60 280.59 | 59 849.56 | −0.72 % | (−3.8 %) |
Five of the 6 deltas are within ±2 % noise (single dp each, consistent direction across both arches); the exception is htll-50f on x86, which improves by +5.88 %. The two prior regressions are no longer reproducible on the current head. Matches your manual observation on m8i / m8g.
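The Δ column can be re-derived from the ops/s figures quoted in the two tables. The sketch below is only a sanity check on that arithmetic: the helper names and the array layout are made up for illustration, and the ±2 % threshold is the noise band mentioned in this comment.

```c
#include <stddef.h>

/* ops/s figures copied from the two tables: three x86 rows, then three
 * Graviton4 rows (hgetex, hgetex-persist, htll in each group). */
static const double BASELINE[6] = {22385.06, 38063.94, 54713.68,
                                   22386.94, 38335.13, 60280.59};
static const double PR_HEAD[6]  = {21972.97, 37550.94, 57929.55,
                                   22740.17, 39086.45, 59849.56};

/* Relative change of the PR run against the baseline, in percent. */
static double pct_delta(double baseline, double pr) {
    return (pr - baseline) / baseline * 100.0;
}

/* Number of tests whose absolute delta exceeds threshold_pct. */
static size_t count_outside(const double *base, const double *pr,
                            size_t n, double threshold_pct) {
    size_t outside = 0;
    for (size_t i = 0; i < n; i++) {
        double d = pct_delta(base[i], pr[i]);
        if (d < 0) d = -d;          /* absolute value without <math.h> */
        if (d > threshold_pct) outside++;
    }
    return outside;
}
```

With a 2 % threshold, `count_outside(BASELINE, PR_HEAD, 6, 2.0)` flags only the x86 htll-50f row, and that row is an improvement rather than a regression.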
From our side: nothing blocking the merge on these tests. Sorry for the noise from the stale snapshot.
Fixes:
1. After #15096, we pass -flto to jemalloc. On Azure Linux, the resulting jemalloc library cannot be handled at link time and the build fails. Adding -ffat-lto-objects so the compiler also emits regular object code that the linker can fall back to when it cannot handle the LTO-compiled library.
2. Fixed a warning about `path` being NULL in `moduleLoadInternalModules()`.
3. Fixed compile warnings on older GCC versions introduced by #15162 (reported on Ubuntu 20.04).
Co-authored-by: debing.sun <debing.sun@redis.com>

While profiling command execution, I noticed that command argv object alloc/free overhead is quite high for workloads with many small arguments (e.g. `HSET` with many fields). The effect is much more visible with pipelining when Redis becomes CPU bound.
I experimented with replacing argv object alloc/free with a simple object pool and saw significant speedups.
(Note: related effort around this topic: #13726)
In this PR, I tried to improve the main hotspots in the memory allocation path (focusing on command arg allocations) to close the gap with custom pool performance, so we can avoid having dedicated memory pools and let the whole codebase benefit from these optimizations.
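To make the comparison point concrete, here is a minimal sketch of the kind of "simple object pool" referred to above. It is not the pool used in the experiment (that code is not part of this PR); the type `pool`, the functions `pool_get` / `pool_put`, and the cap of 64 cached objects are all hypothetical. The point is only that both paths are a few instructions long and trivially inlinable, which is the bar the allocator path is being measured against.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAP 64                 /* hypothetical cap on cached objects */

typedef struct pool {
    void  *slots[POOL_CAP];         /* LIFO stack of free objects */
    size_t count;                   /* number of cached objects */
    size_t obj_size;                /* every object has the same size */
} pool;

static void pool_init(pool *p, size_t obj_size) {
    p->count = 0;
    p->obj_size = obj_size;
}

/* Fast path: pop a cached object. Slow path: fall back to malloc(). */
static void *pool_get(pool *p) {
    if (p->count > 0) return p->slots[--p->count];
    return malloc(p->obj_size);
}

/* Fast path: push onto the stack. Slow path: pool is full, really free. */
static void pool_put(pool *p, void *obj) {
    if (p->count < POOL_CAP) p->slots[p->count++] = obj;
    else free(obj);
}

/* Release everything still cached. */
static void pool_destroy(pool *p) {
    while (p->count > 0) free(p->slots[--p->count]);
}
```

A pool like this knows the object size up front and never touches shared state, which corresponds to the size-hint and single-writer changes listed below.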
Changes
1) Faster dealloc via passing size hint to jemalloc (separate PR #15071)
Jemalloc does more work than an object pool on free (a lookup on a tree to find the allocation's size class). For some deallocations, we can reduce free path overhead by passing a size hint to jemalloc (i.e. `sdallocx()`) which can skip metadata lookup in the common case. This PR introduces `zfree_with_size()` and uses it where we can know the allocation size i.e. `OBJ_ENCODING_EMBSTR` objects in `decrRefCount()` and SDS free path.
2) Reduce atomic operation cost for stat updates
`update_zmalloc_stat_alloc()` / `update_zmalloc_stat_free()` previously used atomic read-modify-write (RMW) operations (`atomicIncrGet` / `atomicDecr`) which can emit expensive locked instructions on x86.
When we can guarantee a single writer to a counter, we can use a cheaper load+add+store sequence instead of a locked RMW. This PR gives the first 16 threads dedicated slots for used_memory stats (intended to cover the main thread / I/O threads) so they can use this single writer fast path. Threads beyond that fall back to a shared pool and continue to use full atomic RMW.
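The difference between the two update paths can be sketched with C11 atomics. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, and relaxed memory ordering is assumed because the counters are statistics only.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Shared-slot path: a real read-modify-write. Safe with any number of
 * writers, but typically compiles to a locked instruction on x86. */
static inline size_t counter_add_rmw(_Atomic size_t *c, size_t n) {
    return atomic_fetch_add_explicit(c, n, memory_order_relaxed) + n;
}

/* Dedicated-slot path: separate load, add and store. Only correct when
 * exactly one thread ever writes this counter; other threads may still
 * read it concurrently because the load and store are themselves atomic. */
static inline size_t counter_add_single_writer(_Atomic size_t *c, size_t n) {
    size_t v = atomic_load_explicit(c, memory_order_relaxed) + n;
    atomic_store_explicit(c, v, memory_order_relaxed);
    return v;
}

/* Hypothetical dispatcher: threads that own a dedicated slot take the
 * cheap path, everyone else keeps the full RMW. */
static inline size_t counter_add(_Atomic size_t *c, size_t n, int dedicated) {
    return dedicated ? counter_add_single_writer(c, n)
                     : counter_add_rmw(c, n);
}
```

If two threads ever used the single-writer path on the same slot, updates could be lost, which is why slot ownership has to be established before the fast path is taken.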
3) Improve jemalloc tcache hit rate
With the default `lookahead=16` config, a pipelined HSET with ~20 fields does ~40 small allocations per command (fields + values), so you can get 16 x 40 = ~640 allocations. When args are small, many of these land in the 32 byte size class (often `EMBSTR`). Jemalloc’s default per-bin tcache cap is 200, so this kind of burst overflows the cache and causes frequent flushes. I raised the small-bin tcache limits (lg_tcache_nslots_mul:3, tcache_nslots_small_max:1000) to handle these bursts better. In the worst case, tcache may have a higher memory usage due to this change. Another option would have been lowering `lookahead` to tune it differently.
4) Inlining
When you have a simple pool, it has a few small functions and it is easy for the compiler to inline them. Compared to that, the jemalloc alloc/free path has a deeper call stack. Also, jemalloc was not compiled with `-flto`, which was preventing inlining of jemalloc functions. As part of this PR, I added the `-flto` flag to jemalloc when it is enabled for Redis.
The compiler also chooses not to inline some hot path functions in Redis. This suggests PGO (profile-guided optimization) could provide additional wins and perhaps we can start experimenting with it sometime. We could try to force inlining with attributes like `always_inline` but it is hard to apply across a deep call stack and misuse can cause code bloat. So, rather than going in this direction, I added the `inline` keyword to some functions for now. This doesn't make the compiler inline all hot path functions but at least it is a step ahead. (If we can further improve this in future, performance gets very close to the custom memory pool implementation.)
Benchmark results
Commands were like:
Note
Medium Risk
Touches allocator configuration and per-thread `used_memory` accounting, including new thread-slot reservation logic for IO threads; mistakes could skew memory stats or introduce subtle threading issues. Build-system LTO changes may also affect portability/diagnostics across toolchains.
Overview
Reduces allocator and accounting overhead by adding compile-time jemalloc tuning (`je_malloc_conf`) to increase small-bin tcache limits, and by introducing a single-writer fast path for per-thread `used_memory` updates via new `atomicIncrGetSingleWriter` plus dedicated accounting slots for the main + IO threads.
Updates startup and IO thread initialization to reserve/claim these dedicated slots (`zmalloc_reserve_thread_slots`, `zmalloc_register_reserved_slot`), adds an embstr-specific fast free path in `decrRefCount()` using `zfree_with_size`, and marks a few hot functions as `static inline` to encourage inlining.
Build changes propagate Redis LTO settings into dependency builds, including compiling `jemalloc` with `ENABLE_LTO` when LTO is enabled, and adds a debug-only unit test to assert the jemalloc tuning is active.
Reviewed by Cursor Bugbot for commit ed45523. Bugbot is set up for automated code reviews on this repo.
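The motivation for raising the small-bin tcache limits comes down to the burst arithmetic in the PR description, which can be written out as a small check. The constants are the figures quoted there; the helper names are hypothetical, and treating the cap as the effective capacity of a single bin is a simplification (the real per-bin slot count also depends on the size class and on `lg_tcache_nslots_mul`, and not every allocation in a burst lands in the same bin).

```c
#include <stddef.h>

/* Figures quoted in the PR description. */
enum {
    LOOKAHEAD       = 16,    /* default lookahead config */
    ALLOCS_PER_CMD  = 40,    /* ~20 fields + ~20 values per pipelined HSET */
    DEFAULT_BIN_CAP = 200,   /* jemalloc default per-bin tcache cap */
    RAISED_BIN_CAP  = 1000   /* tcache_nslots_small_max set by this PR */
};

/* Worst-case number of small allocations alive in one batch. */
static size_t burst_allocs(size_t lookahead, size_t allocs_per_cmd) {
    return lookahead * allocs_per_cmd;
}

/* Nonzero when the whole burst can be cached without forcing a flush. */
static int burst_fits(size_t burst, size_t bin_cap) {
    return burst <= bin_cap;
}
```

Under these assumptions the 640-allocation burst exceeds the default cap of 200 but fits under the raised cap of 1000, at the cost of a larger worst-case tcache footprint per thread.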