Reduce memory allocation overhead by tezc · Pull Request #15096 · redis/redis

tezc · 2026-04-23T07:59:39Z

While profiling command execution, I noticed that command argv object alloc/free overhead is quite high for workloads with many small arguments (e.g. HSET with many fields). The effect is much more visible with pipelining when Redis becomes CPU bound.

I experimented with replacing argv object alloc/free with a simple object pool and saw significant speedups.
(Note: related effort around this topic: #13726)

In this PR, I tried to improve the main hotspots in the memory allocation path (focusing on command arg allocations) to close the gap with custom pool performance, so we can avoid having dedicated memory pools and let the whole codebase benefit from these optimizations.

Changes

1) Faster dealloc via passing size hint to jemalloc (separate PR #15071)

Jemalloc does more work than an object pool on free (a lookup on a tree to find the allocation's size class). For some deallocations, we can reduce free path overhead by passing a size hint to jemalloc (i.e. sdallocx()) which can skip metadata lookup in the common case. This PR introduces zfree_with_size() and uses it where we can know the allocation size i.e. OBJ_ENCODING_EMBSTR objects in decrRefCount() and SDS free path.
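To illustrate the sized-free idea described above, here is a minimal sketch. The names example_free_with_size, example_obj and USE_JEMALLOC are hypothetical stand-ins for illustration, not the PR's actual code; only sdallocx() and mallocx() are real jemalloc calls. Without jemalloc, the size hint is simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_JEMALLOC            /* assumption: defined only when linking jemalloc */
#include <jemalloc/jemalloc.h>
#endif

/* Hypothetical allocation helper so that alloc and free use the same allocator. */
void *example_alloc(size_t size) {
#ifdef USE_JEMALLOC
    return mallocx(size, 0);
#else
    return malloc(size);
#endif
}

/* Hypothetical stand-in for a sized free: the caller promises that `size`
 * is the size the block was requested with. */
void example_free_with_size(void *ptr, size_t size) {
    if (ptr == NULL) return;
    assert(size > 0);
#ifdef USE_JEMALLOC
    /* Sized deallocation: jemalloc can derive the size class from `size`
     * rather than looking it up in its metadata. */
    sdallocx(ptr, size, 0);
#else
    /* Other allocators: the hint is unused, fall back to plain free(). */
    free(ptr);
#endif
}

/* Toy fixed-size object, standing in for an object whose allocation size
 * is known at free time (as with embedded strings). */
typedef struct example_obj {
    size_t alloc_size;   /* remembered so the free path can pass it back */
    char payload[24];
} example_obj;

example_obj *example_obj_create(void) {
    example_obj *o = example_alloc(sizeof(*o));
    if (o) o->alloc_size = sizeof(*o);
    return o;
}

/* Frees the object with a size hint and returns the hint that was used. */
size_t example_obj_release(example_obj *o) {
    size_t size = o->alloc_size;
    example_free_with_size(o, size);
    return size;
}
```

The important constraint is that the hint must match the allocation; passing a wrong size to sdallocx() is undefined behavior, which is why the sized path is only used where the size is known for certain.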

2) Reduce atomic operation cost for stat updates

update_zmalloc_stat_alloc() / update_zmalloc_stat_free() previously used atomic read-modify-write (RMW) operations (atomicIncrGet / atomicDecr) which can emit expensive locked instructions on x86.

When we can guarantee a single writer to a counter, we can use a cheaper load+add+store sequence instead of a locked RMW. This PR gives the first 16 threads dedicated slots for used_memory stats (intended to cover the main thread and I/O threads) so they can use this single-writer fast path. Threads beyond that fall back to a shared pool and continue to use full atomic RMW.
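The single-writer fast path can be sketched with C11 atomics as follows. This is an illustration under assumptions, not the PR's implementation: all example_* names are hypothetical, the 64-byte padding assumes a 64-byte cache line, and the overflow path is simplified to one shared slot.

```c
#include <stdatomic.h>
#include <stddef.h>

#define EXAMPLE_DEDICATED_SLOTS 16  /* "first 16 threads" per the description */

/* One counter per slot, padded to an assumed 64-byte cache line to avoid
 * false sharing between threads. */
typedef struct example_slot {
    _Atomic size_t used;
    char pad[64 - sizeof(_Atomic size_t)];
} example_slot;

/* The last slot is the shared fallback for threads beyond the first 16. */
static example_slot example_slots[EXAMPLE_DEDICATED_SLOTS + 1];

/* Single writer: only the owning thread ever stores to this counter, so a
 * relaxed load + add + store is enough; no locked RMW instruction is needed.
 * Other threads may still read the counter concurrently. */
static inline size_t example_add_single_writer(_Atomic size_t *c, size_t n) {
    size_t v = atomic_load_explicit(c, memory_order_relaxed) + n;
    atomic_store_explicit(c, v, memory_order_relaxed);
    return v;
}

/* Multiple writers: a full atomic read-modify-write is required. */
static inline size_t example_add_shared(_Atomic size_t *c, size_t n) {
    return atomic_fetch_add_explicit(c, n, memory_order_relaxed) + n;
}

/* Adds n bytes to the caller's counter and returns that counter's new value. */
size_t example_stat_alloc(int thread_index, size_t n) {
    if (thread_index >= 0 && thread_index < EXAMPLE_DEDICATED_SLOTS)
        return example_add_single_writer(&example_slots[thread_index].used, n);
    return example_add_shared(&example_slots[EXAMPLE_DEDICATED_SLOTS].used, n);
}

/* Readers sum all slots to get the total. */
size_t example_used_memory(void) {
    size_t total = 0;
    for (int i = 0; i <= EXAMPLE_DEDICATED_SLOTS; i++)
        total += atomic_load_explicit(&example_slots[i].used, memory_order_relaxed);
    return total;
}
```

The fast path is only correct if each dedicated slot really has exactly one writer; two threads sharing a slot would lose updates, which is why threads past the reserved range must use the RMW path.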

3) Improve jemalloc tcache hit rate

With the default lookahead=16 config, a pipelined HSET with ~20 fields does ~40 small allocations per command (fields + values), so a single batch can produce 16 x 40 = ~640 allocations. When args are small, many of these land in the 32 byte size class (often EMBSTR). Jemalloc’s default per-bin tcache cap is 200, so this kind of burst overflows the cache and causes frequent flushes. I raised the small-bin tcache limits (lg_tcache_nslots_mul:3, tcache_nslots_small_max:1000) to handle these bursts better. In the worst case, the tcache may use more memory due to this change. Another option would be to lower lookahead instead.
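The burst arithmetic above can be checked with a small sketch. The function names are hypothetical, and the calculation assumes the worst case where every allocation in the burst lands in the same size class. The option string is copied from the description; how it is wired into the build (via je_malloc_conf) is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Option string as quoted in the PR description (illustration only). */
const char *example_malloc_conf =
    "lg_tcache_nslots_mul:3,tcache_nslots_small_max:1000";

/* Allocations in one lookahead batch: each field contributes one
 * allocation for the field name and one for the value. */
size_t example_burst_allocs(size_t lookahead, size_t fields) {
    return lookahead * fields * 2;
}

/* True if the whole burst fits in one tcache bin without overflowing. */
bool example_burst_fits(size_t burst, size_t per_bin_cap) {
    return burst <= per_bin_cap;
}
```

With lookahead=16 and 20 fields the burst is 640 allocations, which exceeds a cap of 200 but fits under a cap of 1000.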

4) Inlining

A simple pool has only a few small functions, which are easy for the compiler to inline. By comparison, the jemalloc alloc/free path has a deeper call stack. Also, jemalloc was not compiled with -flto, which prevented inlining of jemalloc functions. As part of this PR, I added the -flto flag to the jemalloc build when LTO is enabled for Redis.

The compiler also chooses not to inline some hot path functions in Redis. This suggests PGO (profile-guided optimization) could provide additional wins, and perhaps we can start experimenting with it sometime. We could try to force inlining with attributes like always_inline, but that is hard to apply across a deep call stack and misuse can cause code bloat. So, rather than going in this direction, I added the inline keyword to some functions for now. This doesn't make the compiler inline all hot path functions, but it is at least a step forward. (If we can improve this further in the future, performance gets very close to the custom memory pool implementation.)
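For readers less familiar with the two options discussed above, this sketch contrasts a plain inline hint with the GCC/Clang always_inline attribute. The function and macro names are hypothetical and unrelated to the functions changed in the PR.

```c
#include <stddef.h>

/* Plain hint: the compiler is free to ignore it and emit a call. */
static inline size_t example_len_hint(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Forced inlining via a GCC/Clang extension. Every call site gets its own
 * copy of the body, so overuse grows the instruction footprint. */
#if defined(__GNUC__)
#define EXAMPLE_FORCE_INLINE static inline __attribute__((always_inline))
#else
#define EXAMPLE_FORCE_INLINE static inline
#endif

EXAMPLE_FORCE_INLINE size_t example_len_forced(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Two callers with identical results; only the inlining policy differs. */
size_t example_total_len_hint(const char **argv, int argc) {
    size_t total = 0;
    for (int i = 0; i < argc; i++) total += example_len_hint(argv[i]);
    return total;
}

size_t example_total_len_forced(const char **argv, int argc) {
    size_t total = 0;
    for (int i = 0; i < argc; i++) total += example_len_forced(argv[i]);
    return total;
}
```

Both variants compute the same value. The difference shows up only in generated code size and call overhead, which is why the effect has to be measured per workload.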

Benchmark results

Commands were like:

memtier_benchmark   --command="HSET __key__ username john_doe email john@example.com password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding"   --command-ratio=1   --command-key-pattern=P   --key-prefix="hsetkey"   --key-minimum=1   --key-maximum=100000   -n 1000000   -c 50   -t 2   --hide-histogram --pipeline 50

Benchmark	Improvement
SET	+0%
SET (pipeline)	+8%
HSET 15 fields	+2%
HSET 15 fields (pipeline)	+17%
ZADD 15 elements	+3%
ZADD 15 elements (pipeline)	+15%

Note

Medium Risk
Touches allocator configuration and per-thread used_memory accounting, including new thread-slot reservation logic for IO threads; mistakes could skew memory stats or introduce subtle threading issues. Build-system LTO changes may also affect portability/diagnostics across toolchains.

Overview
Reduces allocator and accounting overhead by adding compile-time jemalloc tuning (je_malloc_conf) to increase small-bin tcache limits, and by introducing a single-writer fast path for per-thread used_memory updates via new atomicIncrGetSingleWriter plus dedicated accounting slots for the main + IO threads.

Updates startup and IO thread initialization to reserve/claim these dedicated slots (zmalloc_reserve_thread_slots, zmalloc_register_reserved_slot), adds an embstr-specific fast free path in decrRefCount() using zfree_with_size, and marks a few hot functions as static inline to encourage inlining.

Build changes propagate Redis LTO settings into dependency builds, including compiling jemalloc with ENABLE_LTO when LTO is enabled, and adds a debug-only unit test to assert the jemalloc tuning is active.

Reviewed by Cursor Bugbot for commit ed45523.

augmentcode · 2026-04-23T08:09:28Z

🤖 Augment PR Summary

Summary: This PR reduces allocation/free overhead in Redis’ command execution hot paths (notably workloads with many small args and pipelining).

Changes:

Adds sized-free support via zfree_with_size() and uses it for embedded-string object deallocation and the SDS free path when the usable size is known.
Introduces atomicAddSingleWriter() and a dedicated-slot scheme for the first 16 threads’ used_memory counters to avoid atomic RMW in common cases.
Tunes jemalloc small-bin tcache limits via je_malloc_conf to improve hit rate during allocation bursts.
Propagates LTO flags to dependency builds (jemalloc) when LTO is enabled for Redis.
Adds createStringObjectInline() and switches RESP parsing to use it on a hot path.
Improves CI/test detection of jemalloc sized-deallocation misuse (size-mismatch reporting + opt size checks in daily CI).

Technical Notes: Overflow threads share hashed accounting slots (atomic RMW) as before; increased tcache capacity may raise worst-case memory footprint.

augmentcode

Review completed. 2 suggestions posted.

fcostaoliveira · 2026-04-23T08:13:16Z

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-05-09 08:48:31.668969 and took 56 seconds.
You can check each build/benchmark progress in grafana:

git hash: ed45523
git branch: tezc:faster-alloc
commit date and time: n/a
commit summary: n/a
test filters:
- command priority lower limit: 0
- command priority upper limit: 10000
- test name regex: .*
- command group regex: .*

You can check a comparison in detail via the grafana link

fcostaoliveira · 2026-04-23T08:14:13Z

CE Performance Automation : step 2 of 2 (benchmark) FINISHED.

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2026-05-03 20:34:44.628983 and took 57577.83804 seconds to finish.
Status: [################################################################################] 100.0% completed.

In total will run 378 benchmarks.
- 0 pending.
- 378 completed:
- 358 successful.
- 20 failed.
You can check the status in detail via the grafana link

fcostaoliveira · 2026-04-23T14:20:37Z

Preliminary benchmark results — fleet run `988806e` vs `unstable`

Ran the PR on 3 completed x86 platforms so far (m7i.metal-24xl, m8i.24xlarge, m8a.metal-24xl). The HSET-load and HTTL wins track the direction you reported — up to +20.9% on load-hash-20-fields-1B-pipeline-30 (m8a). Full-suite runs on the profiler + gcp-c4-standard-48 still in flight.

One cross-platform regression worth flagging: HGETEX on 50-field hashes

Platform	% change	Note
`x86-aws-m7i.metal-24xl`	-3.8%	22,770 → 21,909 ops/s
`x86-aws-m8i.24xlarge`	-3.2%
`x86-aws-m8a.metal-24xl`	-7.2%	33,658 → 31,224 ops/s

Interestingly, the hgetex-persist variant of the same test improves (+3.6% to +12.7%) — so it's specific to read-only HGETEX, not HGETEX-as-a-write.
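The percentage column in the table above follows directly from the ops/s pairs. A minimal helper (the name is hypothetical) that reproduces the figures:

```c
/* Relative change in percent from a baseline throughput to a new one.
 * Negative values mean the new build is slower. */
double example_pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```

For 22,770 → 21,909 ops/s this gives roughly -3.8%, and for 33,658 → 31,224 ops/s roughly -7.2%, matching the m7i and m8a rows.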

TMA funnel on x86-aws-m8a.metal-24xl-profiler (level-2, 30s windows):

	unstable	faster-alloc	Δ
Retiring	40.8%	39.3%	-1.5pp
Frontend_Bound	44.9%	49.5%	+4.6pp
↳ Fetch_Bandwidth	15.3%	19.3%	+4.0pp
↳ Fetch_Latency	29.6%	30.3%	+0.7pp
Memory_Bound	10.9%	8.0%	-2.9pp
Bad_Speculation	2.3%	2.2%	-0.1pp

The allocator work delivered the expected Memory_Bound win (-2.9pp). But the combined effect of -flto into jemalloc + the always_inline createStringObjectInline() appears to bloat the instruction footprint enough that Fetch_Bandwidth jumps +4.0pp on reply-heavy paths (HGETEX-50f runs createStringObject 50× per request, so icache / DSB pressure shows up). The icache cost outweighs the Memory_Bound win on this test, so Retiring drops 1.5pp → throughput -3 to -7%.

The hgetex-persist variant improves because it has a write side (TTL clear) where the allocator fastpaths dominate; plain read-only HGETEX pays the icache cost without the write-path benefit.

This tracks the concern you already called out: "misuse [of always_inline] can cause code bloat." Might be worth either scoping createStringObjectInline more narrowly, or gating on __attribute__((hot)) / PGO rather than always_inline — otherwise the read-path regression is likely to surface on more reply-heavy workloads.

Happy to share the raw funnels or try a variant that reverts just the forced-inline.

fcostaoliveira · 2026-04-24T09:00:09Z

Final benchmark results — fleet run `988806e` vs `unstable` (5 x86 + 2 ARM platforms)

Follow-up to the TMA-based preliminary comment above. Now that the run has completed (or is ≥94% complete) on every runner, here is the consolidated picture.

Per-platform summary — clean baselines (both branches from 2026-04-23 to 2026-04-24)

Platform	ISA	# improvements ≥3%	# regressions ≤-3%	Top win
`x86-aws-m7i.metal-24xl` (Sapphire Rapids)	x86	7	2	`load-hash-50f-100B` +9.3%
`x86-aws-m8i.24xlarge` (Granite Rapids)	x86	2	0	`hgetex-persist-50f` +12.4%
`x86-aws-m8a.metal-24xl` (AMD Turin)	x86	7	1	`load-hash-20f-1B-pipeline-30` +20.9%
`arm-aws-m8g.metal-24xl` (Graviton4)	ARM	6	1	`load-hash-50f-100B` +14.2%
`arm-gcp-c4a-standard-48` (Axion / Neoverse V2)	ARM	7	0	`load-hash-50f-100B` +12.5%

Two additional x86 runners (gcp-c4-standard-48 and m8a.metal-24xl-profiler) also completed (or are 94% complete), but their unstable baselines are older (9 days and 3 days respectively), so deltas there bake in 3–9 days of unrelated commits and are not directly comparable. Directional signal on those runners is consistent with the clean-baseline set: broad hash/stream-load improvements.

Cross-platform consistent improvements (observed on ≥3 platforms, ≥3%)

Test	x86 m7i	x86 m8i	x86 m8a	ARM m8g	ARM c4a
`load-hash-50-fields-100B`	+9.3%	—	—	+14.2%	+12.5%
`load-hash-50-fields-1000B`	+8.0%	—	—	+4.8%	+6.5%
`load-hash-50-fields-1000B-expiration`	+3.9%	—	—	+3.9%	+3.2%
`1000streams-xreadgroup-count-100`	—	—	+3.6%	+3.3%	+4.3%
`hgetex-persist-50-fields-10B`	—	+12.4%	+3.6%	—	+3.2%
`hash-htll-50-fields-10B` (x86 only)	+6.6%	+3.6%	+5.3%	−3.8%	+0.8%

The HSET/load-hash pipelined wins you reported in the PR description reproduce across both ISAs and all hardware generations. Best observed delta is +20.9% on load-hash-20-fields-1B-pipeline-30 on m8a.metal-24xl.

Cross-platform regressions

1. hash-hgetex-50-fields-10B — x86 only (covered in the earlier comment)

Platform	Δ	ops/s
`x86-aws-m7i.metal-24xl`	−3.8%	22,770 → 21,909
`x86-aws-m8i.24xlarge`	−3.2%	(sub-threshold)
`x86-aws-m8a.metal-24xl`	−7.2%	33,658 → 31,224
`x86-aws-m8a.metal-24xl-profiler`	−2.5%	33,679 → 32,846
`arm-aws-m8g.metal-24xl`	+1.0%	22,885 → 23,106
`arm-gcp-c4a-standard-48`	+1.0%	23,776 → 24,005

The x86-only regression pattern matches the TMA diagnosis in the earlier comment: Memory_Bound shrinks −2.9pp (expected allocator win) but Fetch_Bandwidth grows +4.0pp from the always_inline createStringObjectInline bloating the hot path on reply-heavy workloads. ARM is flat — consistent with the absence of a uop cache / DSB-equivalent bottleneck on Neoverse-class cores.

2. hash-htll-50-fields-10B — ARM m8g only (−3.8%)

Platform	Δ
`x86-aws-m7i.metal-24xl`	+6.6%
`x86-aws-m8i.24xlarge`	+3.6%
`x86-aws-m8a.metal-24xl`	+5.3%
`arm-aws-m8g.metal-24xl`	−3.8%
`arm-gcp-c4a-standard-48`	+0.8%

Opposite sign on Graviton4 vs the rest of the fleet. Only one data point on one ARM platform, and the test shows 0.2% std.dev so it is not noise — but we can't fully attribute without ARM TMA (no ARM topdown runner in our fleet). Might be worth a second look in case the zmalloc single-writer-slot reshuffle interacts badly with Graviton4's cache layout. Can share the raw timeseries if useful.

Test failures (context)

All the test failures observed on profiler and c4-standard-48 runners (string-auth-reconnect-10B, pubsub-mixed-*, replica-only-parallel-fullsync-*, 3Mkeys-string-mixed-50-50-with-expiration-pipeline-10-400_conns, zremrangebyscore-pipeline-10) reproduce on both platforms and on both branches — they are test-framework/infrastructure issues, not caused by this PR.

Bottom line

Write-path wins are confirmed and consistent across x86 (Intel + AMD) and ARM (Graviton4 + Axion).
The hgetex-50-fields-10B regression on x86 (up to −7.2%) is real and TMA-attributable to always_inline-driven icache pressure; ARM is not affected.
The htll-50-fields-10B regression on Graviton4 (−3.8%) is narrow, within-noise-margin, but worth a sanity check.
The allocator-accounting + sdallocx parts of the PR look like clear wins across the board. The part worth scoping more carefully is the forced-inline — see earlier comment.

Happy to share raw CSVs, individual timeseries, or a scoped re-run if you want to validate a fix.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ea42cc6.

fcostaoliveira · 2026-04-27T23:14:00Z

Re-run on `5fc11317` — scoped 12-test subset (post-inline-revert)

Re-ran the targeted subset after the "revert most of inline" and "avoid forcing inlining for amd processors" commits. Used a focused subset (the cross-platform wins from the prior run plus hash-hgetex-50-fields-10B-values to validate that the AMD regression is closed).

Headline answer: AMD `hgetex-50f` regression is closed

Platform	exp #1 (`988806e`, with inline)	exp #3 (`5fc11317`, post-revert)
`x86-aws-m8a.metal-24xl` (AMD Turin)	−7.2%	+0.3% ✅
`x86-aws-m8a.metal-24xl-profiler` (AMD Turin + TMA)	−2.5%	−2.7% (under 3% threshold)
`x86-aws-m7i.metal-24xl` (Sapphire Rapids)	−3.8%	−4.1%
`x86-aws-m8i.24xlarge` (Granite Rapids)	−3.2%	−4.3%

The AMD regression is fully resolved. There's a small Intel residual at ∼−4% on hgetex-50f-10B that hovers near the noise floor on both runs — not introduced by the inline-revert.

TMA on `x86-aws-m8a.metal-24xl-profiler` — `hgetex-50f-10B`, level-2

	unstable	exp #1 (`988806e`)	exp #3 (`5fc11317`)
Retiring	41.5%	39.3% (−2.2pp)	40.5% (−1.0pp)
Frontend_Bound	44.1%	49.5% (+5.4pp)	46.5% (+2.4pp)
↳ Fetch_Bandwidth	15.5%	19.3% (+3.8pp)	16.0% (+0.5pp) ✅
↳ Fetch_Latency	28.6%	30.3% (+1.7pp)	30.5% (+1.9pp)
Memory_Bound	10.9%	8.0% (−2.9pp)	9.8% (−1.1pp)

Confirms the diagnosis: forced-inline icache pressure (Fetch_Bandwidth +3.8pp on Turin) is essentially gone (+0.5pp). Cost: the allocator + sdallocx + accounting Memory_Bound win shrunk from −2.9pp → −1.1pp because the un-inlined functions are less optimizable by the compiler. Net Retiring loss reduced from −2.2pp → −1.0pp.

Write-path wins (the headline of the PR) — retained on x86

Test	m7i (SPR)	m8a (Turin)	m8i (GranR)	m8a-profiler
`load-hash-50f-100B`	+10.9%	+10.3%	+1.6%	+12.5%
`load-hash-50f-10B`	+7.4%	+10.2%	+4.4%	+14.0%
`load-hash-20f-1B-pipeline-30`	+8.4%	+10.8%	+5.3%	+11.2%
`load-hash-50f-1000B`	+5.7%	+5.8%	−1.8%	+6.8%
`1000streams-xreadgroup-count-100`	+3.6%	−0.1%	+0.8%	+6.1%
`htll-50f-10B`	+1.5% (was +6.6%)	+1.5% (was +5.3%)	+10.7%	+0.7%

Hash-load wins still dominate (+5 to +14%). htll-50f-10B lost most of the SPR/Turin boost (matches the "5–10% Intel boost from inlining we're leaving out" call), but Granite Rapids grew it to +10.7%.

Trade paid: `hgetex-persist-50f-10B` regresses on x86

The +12.4% Granite-Rapids boost in exp #1 was inline-driven; the revert turns it into a small regression on x86:

Platform	exp #1	exp #3
m7i (SPR)	flat	−2.6%
m8a (Turin)	+3.6%	−4.6%
m8i (GranR)	+12.4%	−4.2%
m8a-profiler	flat	−0.7%

This is the most visible cost of the revert. hgetex-persist is a narrower workload than plain HSET, so on-balance the trade (close AMD hgetex-50f −7.2% / lose ∼5% on hgetex-persist) looks reasonable, but flagging it explicitly.

Bottom line

Original AMD hgetex-50f regression resolved (−7.2% → +0.3% on m8a) ✅
All write-path hash-load wins retained (+5 to +14% on x86)
Most of the allocator improvements survive (TMA Memory_Bound win shrunk −2.9pp → −1.1pp but didn't disappear)
Trade paid: hgetex-persist-50f-10B now regresses −4 to −5% on x86
Intel hgetex-50f-10B continues to hover at the noise floor (∼−4%, present in both runs, unrelated to the inline change)

PGO follow-up sounds like the right path for recovering the Intel boosts without the AMD icache cost. Happy to share raw CSVs / individual TMA funnels.

ARM (Graviton4 + Axion) is still running — m8g was held up behind unrelated long-tail work and the scoped run is now queued; will follow up with the ARM side as soon as it lands.

fcostaoliveira · 2026-04-27T23:43:52Z

ARM follow-up — Graviton4 (`arm-aws-m8g.metal-24xl`)

Re-ran the same 12-test subset on 5fc11317 against unstable. Axion (arm-gcp-c4a-standard-48) was tied up on unrelated work, so this is m8g-only for now — will tack Axion numbers on later if it frees up before merge.

Test	exp #1 (`988806e`, with inline)	exp #3 (`5fc11317`, post-revert)
`load-hash-50f-100B`	+14.2%	+8.2%
`load-hash-50f-1000B-expiration`	+3.9%	+4.5%
`load-hash-50f-1000B`	+4.8%	+3.3%
`xreadgroup-count-100`	+3.3%	+1.5%
`hgetex-persist-50f-10B`	(flat)	+1.6%
`hgetex-50f-10B`	+1.0%	+0.3%
`load-hash-50f-10B-short-expiration`	(flat)	+0.1%
`load-hash-50f-10B-long-expiration`	(flat)	−1.0%
`load-hash-50f-10B-expiration`	(flat)	−1.0%
`load-hash-20f-1B-pipeline-30`	(flat)	−3.1%
`htll-50f-10B`	−3.8%	−3.4%
`load-hash-50f-10B`	(flat)	−3.7%

Read on Graviton4

load-hash-50f-100B win held but shrunk meaningfully (+14.2% → +8.2%)
htll-50f-10B remains the one consistent ARM regression we'd already flagged in exp #1, roughly unchanged (−3.8% → −3.4%)
Two new small regressions appeared in exp #3: load-hash-50f-10B −3.7% and load-hash-20f-1B-pipeline-30 −3.1%
hgetex-50f-10B stays effectively flat on m8g across both runs (+1.0% → +0.3%) — Graviton4 wasn't impacted by the AMD-driven revert in either direction

The inline revert traded more on Graviton4 than I'd anticipated. Without ARM TMA in our fleet I can't decompose Retiring/Frontend/Memory the way I did on Turin — but the pattern (load-hash wins shrunk, two new small regressions on the smallest-value variants) is consistent with Graviton4 having been benefiting from the inlining similarly to Sapphire Rapids on those workloads.

Net for ARM: the headline load-hash-50f-100B win is still solid (+8.2%) and the persistent htll-50f regression isn't worse — but the smaller-value variants regressed slightly. Probably worth a second look on Graviton4 if PGO recovers some of this.

skaslev

lgtm with a nit inline

ShooterIT

maybe separate PRs would be better, we can know the effect of each part

tezc · 2026-04-29T19:34:04Z

@ShooterIT It's a bit harder to benchmark changes separately when each change contributes a few percent. With the help of AI, I tried to color the relevant boxes:

memtier_benchmark   --command="HSET __key__ username john_doe email john@example.com password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding"   --command-ratio=1   --command-key-pattern=P   --key-prefix="hsetkey"   --key-minimum=1   --key-maximum=100000   -n 1000000   -c 50   -t 2   --hide-histogram --pipeline 100

Before: [image not preserved]

After: [image not preserved]

For the above test, gains in terms of cpu percentage:

avoiding zfree size lookup: ~1%
avoiding jemalloc tcache flush: ~3%
zmalloc stat update: ~7%
The impact of inlining is hard to show here.

oranagra

i've reviewed the PR description and commented about a couple of concerns.

fcostaoliveira · 2026-05-06T10:39:05Z

@tezc — apologies for the delayed follow-up. Two confounders:

The two earlier comments (1, 2) were against commit 988806e, the initial PR head. Since then there have been ≥5 commits — most importantly ca5de1fb (avoid forcing inlining for AMD) and ea42cc65 (revert most of inline) — both directly addressing the Fetch_Bandwidth +4.0pp shift the TMA flagged at the time.
Our queued confirmation re-trigger sat behind a 495-item fleet backlog and aged out before producing fresh comparable numbers — sorry for not surfacing that earlier.

Re-ran today on the current PR head c4abf7f4 against unstable 7ecc04f59d, scoped to the previously-regressing tests, on the -2 siblings of the same CPUs (Sapphire Rapids on m7i-2, Graviton4 on m8g-2):

`x86-aws-m7i.metal-24xl-2` (Sapphire Rapids)

Test	Baseline `7ecc04f59d`	PR `c4abf7f4`	Δ	(was on `988806e`)
`100Kkeys-hash-hgetex-50-fields-10B-values`	22 385.06	21 972.97	−1.84 %	(−3.2 to −7.2 %)
`100Kkeys-hash-hgetex-persist-50-fields-10B-values`	38 063.94	37 550.94	−1.35 %	(−4 to −5 %)
`100Kkeys-hash-htll-50-fields-10B-values`	54 713.68	57 929.55	+5.88 %	(clean on x86 originally)

`arm-aws-m8g.metal-24xl-2` (Graviton4)

Test	Baseline `7ecc04f59d`	PR `c4abf7f4`	Δ	(was on `988806e`)
`100Kkeys-hash-hgetex-50-fields-10B-values`	22 386.94	22 740.17	+1.58 %	(clean on m8g originally)
`100Kkeys-hash-hgetex-persist-50-fields-10B-values`	38 335.13	39 086.45	+1.96 %	(clean on m8g originally)
`100Kkeys-hash-htll-50-fields-10B-values`	60 280.59	59 849.56	−0.72 %	(−3.8 %)

Five of the 6 deltas are within ±2 % noise (single data point each); the exception is htll-50f at +5.88 % on x86, which is an improvement. The two prior regressions are no longer reproducible on the current head. Matches your manual observation on m8i / m8g.

From our side: nothing blocking the merge on these tests. Sorry for the noise from the stale snapshot.

Fixes:
1. After #15096, we pass -flto to jemalloc. On Azure Linux, the resulting jemalloc library cannot be handled at link time and the build fails. Adding -ffat-lto-objects so the compiler also emits regular object code that the linker can fall back to when it cannot handle the LTO-compiled library.
2. Fixed a warning about `path` being NULL in `moduleLoadInternalModules()`.
3. Fixed compile warnings on older GCC versions introduced by #15162 (reported on Ubuntu 20.04).

Co-authored-by: debing.sun <debing.sun@redis.com>

tezc added 2 commits April 20, 2026 09:34

Pass size hint to jemalloc for faster deallocation

5524e56

Reduce memory allocation overhead

988806e

augmentcode Bot reviewed Apr 23, 2026

Comment thread src/object.c Outdated

Comment thread src/sdsalloc.h

fcostaoliveira added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Apr 23, 2026

tezc added 3 commits April 27, 2026 10:14

avoid forcing inlining for amd processors

ca5de1f

fix macos

338a5f1

revert most of inline

ea42cc6

cursor Bot reviewed Apr 27, 2026

Comment thread src/networking.c

minor

5fc1131

tezc requested review from ShooterIT, oranagra and sundb April 28, 2026 09:49

skaslev approved these changes Apr 29, 2026

Comment thread src/zmalloc.c Outdated

ShooterIT reviewed Apr 29, 2026

Comment thread src/atomicvar.h Outdated

Comment thread src/object.c Outdated

tezc added 2 commits May 2, 2026 13:19

larger cap 1000

1fa5908

comments

55672b0

oranagra reviewed May 3, 2026

Comment thread src/zmalloc.c

Comment thread src/zmalloc.c

add test for je_malloc_conf

c4abf7f

tezc added 2 commits May 6, 2026 15:32

reserve dedicated slots for iothreads

a567559

Merge branch 'unstable' of https://github.com/redis/redis into faster…

8ce3979

…-alloc

ShooterIT reviewed May 7, 2026

Comment thread src/object.c

Comment thread src/zmalloc.c Outdated

ShooterIT approved these changes May 7, 2026

tezc added 2 commits May 7, 2026 16:23

comment

d9f0080

Merge branch 'unstable' of https://github.com/redis/redis into faster…

ed45523

…-alloc

sundb approved these changes May 9, 2026

tezc added the release-notes indication that this issue needs to be mentioned in the release notes label May 9, 2026

tezc merged commit 7bdab45 into redis:unstable May 9, 2026
18 checks passed

tezc deleted the faster-alloc branch May 9, 2026 08:48

tezc mentioned this pull request May 18, 2026

Fix compile on some linux distros #15225

Merged
