
Reduce memory allocation overhead #15096

Merged
tezc merged 13 commits into redis:unstable from tezc:faster-alloc
May 9, 2026

Conversation

@tezc
Collaborator

@tezc tezc commented Apr 23, 2026

While profiling command execution, I noticed that command argv object alloc/free overhead is quite high for workloads with many small arguments (e.g. HSET with many fields). The effect is much more visible with pipelining when Redis becomes CPU bound.

I experimented with replacing argv object alloc/free with a simple object pool and saw significant speedups.
(Note: related effort around this topic: #13726)

In this PR, I tried to improve the main hotspots in the memory allocation path (focusing on command arg allocations) to close the gap with custom pool performance, so we can avoid having dedicated memory pools and let the whole codebase benefit from these optimizations.

Changes

1) Faster dealloc via passing size hint to jemalloc (separate PR #15071)

Jemalloc does more work than an object pool on free (a lookup on a tree to find the allocation's size class). For some deallocations, we can reduce free path overhead by passing a size hint to jemalloc (i.e. sdallocx()), which can skip the metadata lookup in the common case. This PR introduces zfree_with_size() and uses it where the allocation size is known, i.e. for OBJ_ENCODING_EMBSTR objects in decrRefCount() and in the SDS free path.
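A minimal sketch of the sized-free idea, not the PR's actual code. `zfree_with_size_sketch()` and the `SKETCH_USE_JEMALLOC` switch are hypothetical names for illustration; `sdallocx(ptr, size, flags)` is jemalloc's sized deallocation call. The sketch assumes an unprefixed jemalloc that also provides `malloc()` when the switch is on, and falls back to plain `free()` otherwise.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#if defined(SKETCH_USE_JEMALLOC) /* hypothetical build switch */
#include <jemalloc/jemalloc.h>
#endif

/* Illustrative stand-in for a sized free: release `ptr` and tell the
 * allocator how many bytes were requested when it was allocated. */
static void zfree_with_size_sketch(void *ptr, size_t size) {
    if (ptr == NULL) return;
    assert(size > 0);
#if defined(SKETCH_USE_JEMALLOC)
    /* With the size known, jemalloc can derive the size class directly
     * instead of looking it up. The size must match the allocation. */
    sdallocx(ptr, size, 0);
#else
    (void)size; /* the libc allocator has no sized free here */
    free(ptr);
#endif
}

/* Copy a string, read it back, and free it with the size we already know. */
static size_t roundtrip(const char *s) {
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return 0;
    memcpy(copy, s, len);
    size_t n = strlen(copy);
    zfree_with_size_sketch(copy, len);
    return n;
}
```

The caller already tracks the length (as an embedded string object or an SDS header does), so passing it costs nothing extra. Passing a wrong size is undefined behavior in jemalloc, which is why the PR also adds size-mismatch checks in CI.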

2) Reduce atomic operation cost for stat updates

update_zmalloc_stat_alloc() / update_zmalloc_stat_free() previously used atomic read-modify-write (RMW) operations (atomicIncrGet / atomicDecr) which can emit expensive locked instructions on x86.

When we can guarantee a single writer to a counter, we can use a cheaper load+add+store sequence instead of a locked RMW. This PR gives the first 16 threads dedicated slots for used_memory stats (intended to cover the main thread and I/O threads) so they can use this single-writer fast path. Threads beyond that fall back to a shared pool and continue to use full atomic RMW.
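The difference between the two update paths can be sketched with C11 atomics. This is an illustration under assumptions, not the PR's implementation: the function names, the size of the shared pool, and the modulo mapping for overflow threads are all hypothetical; only the count of 16 dedicated slots comes from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEDICATED_SLOTS 16 /* from the PR description */
#define SHARED_SLOTS 16    /* assumption for this sketch */

static _Atomic size_t used_memory[DEDICATED_SLOTS + SHARED_SLOTS];

/* Single-writer path: only one thread ever writes this slot, so a relaxed
 * load, a plain add, and a relaxed store are enough. Readers on other
 * threads still see whole values because the accesses are atomic. */
static size_t add_single_writer(_Atomic size_t *c, size_t n) {
    assert(c != NULL);
    size_t v = atomic_load_explicit(c, memory_order_relaxed) + n;
    atomic_store_explicit(c, v, memory_order_relaxed);
    return v;
}

/* Shared path: several threads may write, so a real RMW is required. */
static size_t add_shared(_Atomic size_t *c, size_t n) {
    return atomic_fetch_add_explicit(c, n, memory_order_relaxed) + n;
}

/* Pick the path from the (hypothetical) thread index. */
static size_t stat_alloc(size_t thread_index, size_t n) {
    if (thread_index < DEDICATED_SLOTS)
        return add_single_writer(&used_memory[thread_index], n);
    size_t slot = DEDICATED_SLOTS + thread_index % SHARED_SLOTS;
    return add_shared(&used_memory[slot], n);
}

/* Total used memory is the sum over all slots. */
static size_t total_used(void) {
    size_t sum = 0;
    for (size_t i = 0; i < DEDICATED_SLOTS + SHARED_SLOTS; i++)
        sum += atomic_load_explicit(&used_memory[i], memory_order_relaxed);
    return sum;
}
```

The fast path is only correct while the one-writer-per-slot guarantee holds, which is why slots have to be reserved and claimed explicitly rather than derived from a hash of the thread ID.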

3) Improve jemalloc tcache hit rate

With the default lookahead=16 config, a pipelined HSET with ~20 fields does ~40 small allocations per command (fields + values), so a single batch can reach 16 x 40 = ~640 allocations. When args are small, many of these land in the 32-byte size class (often EMBSTR). Jemalloc's default per-bin tcache cap is 200, so this kind of burst overflows the cache and forces frequent flushes. I raised the small-bin tcache limits (lg_tcache_nslots_mul:3, tcache_nslots_small_max:1000) to handle these bursts better. In the worst case, tcache memory usage may be higher due to this change. Another option would be to lower lookahead instead.
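The burst arithmetic above can be checked with a small sketch. The helper names are hypothetical, and the option string is stored under an illustrative name so the snippet compiles on its own; in the PR the tuning is applied at compile time through `je_malloc_conf`. The worst case assumed here is that every allocation in the batch lands in the same size class.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two options named in the description, as one option string. */
const char *sketch_malloc_conf =
    "lg_tcache_nslots_mul:3,tcache_nslots_small_max:1000";

/* Allocations in one lookahead batch: each field has a name and a value. */
static size_t burst_allocs(size_t lookahead, size_t fields) {
    assert(lookahead > 0 && fields > 0);
    return lookahead * fields * 2;
}

/* True when the burst cannot fit in one per-bin thread cache. */
static bool overflows_bin(size_t burst, size_t per_bin_cap) {
    return burst > per_bin_cap;
}
```

With lookahead 16 and 20 fields the burst is 640 allocations, which exceeds a cap of 200 but fits under a cap of 1000.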

4) Inlining

A simple pool has a few small functions, and it is easy for the compiler to inline them. By comparison, the jemalloc alloc/free path has a deeper call stack. Also, jemalloc was not compiled with -flto, which prevented jemalloc functions from being inlined. As part of this PR, I added the -flto flag to jemalloc when LTO is enabled for Redis.

The compiler also chooses not to inline some hot path functions in Redis. This suggests PGO (profile-guided optimization) could provide additional wins, and perhaps we can start experimenting with it sometime. We could try to force inlining with attributes like always_inline, but that is hard to apply across a deep call stack and misuse can cause code bloat. So, rather than going in this direction, I added the inline keyword to some functions for now. This doesn't make the compiler inline all hot path functions, but it is at least a step forward. (If we can improve this further in the future, performance gets very close to a custom memory pool implementation.)
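For readers unfamiliar with the two options, here is a small sketch of the difference between the `inline` keyword and a forced-inline attribute. The function and macro names are hypothetical and unrelated to the Redis sources; `__attribute__((always_inline))` is the GCC/Clang spelling, and other compilers fall back to the plain hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hint only: the compiler may inline this, but its own heuristics decide. */
static inline size_t with_header(size_t len) {
    return len + sizeof(uint32_t);
}

/* Forced: GCC and Clang inline this at every call site, which can grow
 * the instruction footprint if the function is called from many places. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_FORCE_INLINE static inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE static inline
#endif

SKETCH_FORCE_INLINE size_t with_header_forced(size_t len) {
    return len + sizeof(uint32_t);
}

/* Both variants compute the same value; only code generation differs. */
static int same_result(size_t len) {
    assert(len <= SIZE_MAX - sizeof(uint32_t));
    return with_header(len) == with_header_forced(len);
}
```

The behavior is identical either way, so the choice is purely a code-size versus call-overhead trade-off that has to be measured per workload.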

Benchmark results

Commands were like:

memtier_benchmark   --command="HSET __key__ username john_doe email john@example.com password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding"   --command-ratio=1   --command-key-pattern=P   --key-prefix="hsetkey"   --key-minimum=1   --key-maximum=100000   -n 1000000   -c 50   -t 2   --hide-histogram --pipeline 50
| Benchmark | Improvement |
| --- | --- |
| SET | +0% |
| SET (pipeline) | +8% |
| HSET 15 fields | +2% |
| HSET 15 fields (pipeline) | +17% |
| ZADD 15 elements | +3% |
| ZADD 15 elements (pipeline) | +15% |

Note

Medium Risk
Touches allocator configuration and per-thread used_memory accounting, including new thread-slot reservation logic for IO threads; mistakes could skew memory stats or introduce subtle threading issues. Build-system LTO changes may also affect portability/diagnostics across toolchains.

Overview
Reduces allocator and accounting overhead by adding compile-time jemalloc tuning (je_malloc_conf) to increase small-bin tcache limits, and by introducing a single-writer fast path for per-thread used_memory updates via new atomicIncrGetSingleWriter plus dedicated accounting slots for the main + IO threads.

Updates startup and IO thread initialization to reserve/claim these dedicated slots (zmalloc_reserve_thread_slots, zmalloc_register_reserved_slot), adds an embstr-specific fast free path in decrRefCount() using zfree_with_size, and marks a few hot functions as static inline to encourage inlining.

Build changes propagate Redis LTO settings into dependency builds, including compiling jemalloc with ENABLE_LTO when LTO is enabled, and adds a debug-only unit test to assert the jemalloc tuning is active.

Reviewed by Cursor Bugbot for commit ed45523.

@augmentcode

augmentcode Bot commented Apr 23, 2026

🤖 Augment PR Summary

Summary: This PR reduces allocation/free overhead in Redis’ command execution hot paths (notably workloads with many small args and pipelining).

Changes:

  • Adds sized-free support via zfree_with_size() and uses it for embedded-string object deallocation and the SDS free path when the usable size is known.
  • Introduces atomicAddSingleWriter() and a dedicated-slot scheme for the first 16 threads’ used_memory counters to avoid atomic RMW in common cases.
  • Tunes jemalloc small-bin tcache limits via je_malloc_conf to improve hit rate during allocation bursts.
  • Propagates LTO flags to dependency builds (jemalloc) when LTO is enabled for Redis.
  • Adds createStringObjectInline() and switches RESP parsing to use it on a hot path.
  • Improves CI/test detection of jemalloc sized-deallocation misuse (size-mismatch reporting + opt size checks in daily CI).

Technical Notes: Overflow threads share hashed accounting slots (atomic RMW) as before; increased tcache capacity may raise worst-case memory footprint.



@augmentcode augmentcode Bot left a comment


Review completed. 2 suggestions posted.


Comment thread src/object.c Outdated
Comment thread src/sdsalloc.h
@fcostaoliveira fcostaoliveira added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Apr 23, 2026
@fcostaoliveira
Collaborator

fcostaoliveira commented Apr 23, 2026

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-05-09 08:48:31.668969 and took 56 seconds.
You can check each build/benchmark progress in grafana:

  • git hash: ed45523
  • git branch: tezc:faster-alloc
  • commit date and time: n/a
  • commit summary: n/a
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 10000
    • test name regex: .*
    • command group regex: .*

You can check a comparison in detail via the grafana link

@fcostaoliveira
Collaborator

fcostaoliveira commented Apr 23, 2026

CE Performance Automation : step 2 of 2 (benchmark) FINISHED.

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2026-05-03 20:34:44.628983 and took 57577.83804 seconds to finish.
Status: [################################################################################] 100.0% completed.

In total will run 378 benchmarks.
- 0 pending.
- 378 completed:
- 358 successful.
- 20 failed.
You can check the status in detail via the grafana link

@fcostaoliveira
Collaborator

Preliminary benchmark results — fleet run 988806e vs unstable

Ran the PR on 3 completed x86 platforms so far (m7i.metal-24xl, m8i.24xlarge, m8a.metal-24xl). The HSET-load and HTTL wins track the direction you reported — up to +20.9% on load-hash-20-fields-1B-pipeline-30 (m8a). Full-suite runs on the profiler + gcp-c4-standard-48 still in flight.

One cross-platform regression worth flagging: HGETEX on 50-field hashes

| Platform | % change | Note |
| --- | --- | --- |
| x86-aws-m7i.metal-24xl | -3.8% | 22,770 → 21,909 ops/s |
| x86-aws-m8i.24xlarge | -3.2% | |
| x86-aws-m8a.metal-24xl | -7.2% | 33,658 → 31,224 ops/s |
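As a quick check, the percentages in the table above follow from the ops/s pairs. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `pct_change(22770, 21909)` is about -3.78 and `pct_change(33658, 31224)` is about -7.23, matching the rounded -3.8% and -7.2% entries.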

Interestingly, the hgetex-persist variant of the same test improves (+3.6% to +12.7%) — so it's specific to read-only HGETEX, not HGETEX-as-a-write.

TMA funnel on x86-aws-m8a.metal-24xl-profiler (level-2, 30s windows):

| Metric | unstable | faster-alloc | Δ |
| --- | --- | --- | --- |
| Retiring | 40.8% | 39.3% | -1.5pp |
| Frontend_Bound | 44.9% | 49.5% | +4.6pp |
| ↳ Fetch_Bandwidth | 15.3% | 19.3% | +4.0pp |
| ↳ Fetch_Latency | 29.6% | 30.3% | +0.7pp |
| Memory_Bound | 10.9% | 8.0% | -2.9pp |
| Bad_Speculation | 2.3% | 2.2% | -0.1pp |

The allocator work delivered the expected Memory_Bound win (-2.9pp). But the combined effect of -flto into jemalloc + the always_inline createStringObjectInline() appears to bloat the instruction footprint enough that Fetch_Bandwidth jumps +4.0pp on reply-heavy paths (HGETEX-50f runs createStringObject 50× per request, so icache / DSB pressure shows up). The icache cost outweighs the Memory_Bound win on this test, so Retiring drops 1.5pp → throughput -3 to -7%.

The hgetex-persist variant improves because it has a write side (TTL clear) where the allocator fastpaths dominate; plain read-only HGETEX pays the icache cost without the write-path benefit.

This tracks the concern you already called out: "misuse [of always_inline] can cause code bloat." Might be worth either scoping createStringObjectInline more narrowly, or gating on __attribute__((hot)) / PGO rather than always_inline — otherwise the read-path regression is likely to surface on more reply-heavy workloads.

Happy to share the raw funnels or try a variant that reverts just the forced-inline.

@fcostaoliveira
Collaborator

Final benchmark results — fleet run 988806e vs unstable (5 x86 + 2 ARM platforms)

Follow-up to the TMA-based preliminary comment above. Now that the run has completed (or is ≥94% complete) on every runner, here is the consolidated picture.

Per-platform summary — clean baselines (both branches from 2026-04-23 to 2026-04-24)

| Platform | ISA | # improvements ≥3% | # regressions ≤-3% | Top win |
| --- | --- | --- | --- | --- |
| x86-aws-m7i.metal-24xl (Sapphire Rapids) | x86 | 7 | 2 | load-hash-50f-100B +9.3% |
| x86-aws-m8i.24xlarge (Granite Rapids) | x86 | 2 | 0 | hgetex-persist-50f +12.4% |
| x86-aws-m8a.metal-24xl (AMD Turin) | x86 | 7 | 1 | load-hash-20f-1B-pipeline-30 +20.9% |
| arm-aws-m8g.metal-24xl (Graviton4) | ARM | 6 | 1 | load-hash-50f-100B +14.2% |
| arm-gcp-c4a-standard-48 (Axion / Neoverse V2) | ARM | 7 | 0 | load-hash-50f-100B +12.5% |

Two additional x86 runners (gcp-c4-standard-48 and m8a.metal-24xl-profiler) also completed (or are 94% complete), but their unstable baselines are older (9 days and 3 days respectively), so deltas there bake in 3–9 days of unrelated commits and are not directly comparable. Directional signal on those runners is consistent with the clean-baseline set: broad hash/stream-load improvements.

Cross-platform consistent improvements (observed on ≥3 platforms, ≥3%)

Test x86 m7i x86 m8i x86 m8a ARM m8g ARM c4a
load-hash-50-fields-100B +9.3% +14.2% +12.5%
load-hash-50-fields-1000B +8.0% +4.8% +6.5%
load-hash-50-fields-1000B-expiration +3.9% +3.9% +3.2%
1000streams-xreadgroup-count-100 +3.6% +3.3% +4.3%
hgetex-persist-50-fields-10B +12.4% +3.6% +3.2%
hash-htll-50-fields-10B (x86 only) +6.6% +3.6% +5.3% −3.8% +0.8%

The HSET/load-hash pipelined wins you reported in the PR description reproduce across both ISAs and all hardware generations. Best observed delta is +20.9% on load-hash-20-fields-1B-pipeline-30 on m8a.metal-24xl.

Cross-platform regressions

1. hash-hgetex-50-fields-10B — x86 only (covered in the earlier comment)

| Platform | Δ | ops/s |
| --- | --- | --- |
| x86-aws-m7i.metal-24xl | −3.8% | 22,770 → 21,909 |
| x86-aws-m8i.24xlarge | −3.2% | (sub-threshold) |
| x86-aws-m8a.metal-24xl | −7.2% | 33,658 → 31,224 |
| x86-aws-m8a.metal-24xl-profiler | −2.5% | 33,679 → 32,846 |
| arm-aws-m8g.metal-24xl | +1.0% | 22,885 → 23,106 |
| arm-gcp-c4a-standard-48 | +1.0% | 23,776 → 24,005 |

The x86-only regression pattern matches the TMA diagnosis in the earlier comment: Memory_Bound shrinks −2.9pp (expected allocator win) but Fetch_Bandwidth grows +4.0pp from the always_inline createStringObjectInline bloating the hot path on reply-heavy workloads. ARM is flat — consistent with the absence of a uop cache / DSB-equivalent bottleneck on Neoverse-class cores.

2. hash-htll-50-fields-10B — ARM m8g only (−3.8%)

| Platform | Δ |
| --- | --- |
| x86-aws-m7i.metal-24xl | +6.6% |
| x86-aws-m8i.24xlarge | +3.6% |
| x86-aws-m8a.metal-24xl | +5.3% |
| arm-aws-m8g.metal-24xl | −3.8% |
| arm-gcp-c4a-standard-48 | +0.8% |

Opposite sign on Graviton4 vs the rest of the fleet. Only one data point on one ARM platform, and the test shows 0.2% std.dev so it is not noise — but we can't fully attribute without ARM TMA (no ARM topdown runner in our fleet). Might be worth a second look in case the zmalloc single-writer-slot reshuffle interacts badly with Graviton4's cache layout. Can share the raw timeseries if useful.

Test failures (context)

All the test failures observed on profiler and c4-standard-48 runners (string-auth-reconnect-10B, pubsub-mixed-*, replica-only-parallel-fullsync-*, 3Mkeys-string-mixed-50-50-with-expiration-pipeline-10-400_conns, zremrangebyscore-pipeline-10) reproduce on both platforms and on both branches — they are test-framework/infrastructure issues, not caused by this PR.

Bottom line

  • Write-path wins are confirmed and consistent across x86 (Intel + AMD) and ARM (Graviton4 + Axion).
  • The hgetex-50-fields-10B regression on x86 (up to −7.2%) is real and TMA-attributable to always_inline-driven icache pressure; ARM is not affected.
  • The htll-50-fields-10B regression on Graviton4 (−3.8%) is narrow, within-noise-margin, but worth a sanity check.
  • The allocator-accounting + sdallocx parts of the PR look like clear wins across the board. The part worth scoping more carefully is the forced-inline — see earlier comment.

Happy to share raw CSVs, individual timeseries, or a scoped re-run if you want to validate a fix.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit ea42cc6.

Comment thread src/networking.c
@fcostaoliveira
Collaborator

Re-run on 5fc11317 — scoped 12-test subset (post-inline-revert)

Re-ran the targeted subset after the "revert most of inline" and "avoid forcing inlining for AMD" commits. Used a focused subset (the cross-platform wins from the prior run plus hash-hgetex-50-fields-10B-values) to validate that the AMD regression is closed.

Headline answer: AMD hgetex-50f regression is closed

| Platform | exp #1 (988806e, with inline) | exp #3 (5fc11317, post-revert) |
| --- | --- | --- |
| x86-aws-m8a.metal-24xl (AMD Turin) | −7.2% | +0.3% |
| x86-aws-m8a.metal-24xl-profiler (AMD Turin + TMA) | −2.5% | −2.7% (under 3% threshold) |
| x86-aws-m7i.metal-24xl (Sapphire Rapids) | −3.8% | −4.1% |
| x86-aws-m8i.24xlarge (Granite Rapids) | −3.2% | −4.3% |

The AMD regression is fully resolved. There's a small Intel residual at ∼−4% on hgetex-50f-10B that hovers near the noise floor on both runs — not introduced by the inline-revert.

TMA on x86-aws-m8a.metal-24xl-profiler, hgetex-50f-10B, level-2

| Metric | unstable | exp #1 (988806e) | exp #3 (5fc11317) |
| --- | --- | --- | --- |
| Retiring | 41.5% | 39.3% (−2.2pp) | 40.5% (−1.0pp) |
| Frontend_Bound | 44.1% | 49.5% (+5.4pp) | 46.5% (+2.4pp) |
| ↳ Fetch_Bandwidth | 15.5% | 19.3% (+3.8pp) | 16.0% (+0.5pp) |
| ↳ Fetch_Latency | 28.6% | 30.3% (+1.7pp) | 30.5% (+1.9pp) |
| Memory_Bound | 10.9% | 8.0% (−2.9pp) | 9.8% (−1.1pp) |

Confirms the diagnosis: forced-inline icache pressure (Fetch_Bandwidth +3.8pp on Turin) is essentially gone (+0.5pp). Cost: the allocator + sdallocx + accounting Memory_Bound win shrunk from −2.9pp → −1.1pp because the un-inlined functions are less optimizable by the compiler. Net Retiring loss reduced from −2.2pp → −1.0pp.

Write-path wins (the headline of the PR) — retained on x86

| Test | m7i (SPR) | m8a (Turin) | m8i (GranR) | m8a-profiler |
| --- | --- | --- | --- | --- |
| load-hash-50f-100B | +10.9% | +10.3% | +1.6% | +12.5% |
| load-hash-50f-10B | +7.4% | +10.2% | +4.4% | +14.0% |
| load-hash-20f-1B-pipeline-30 | +8.4% | +10.8% | +5.3% | +11.2% |
| load-hash-50f-1000B | +5.7% | +5.8% | −1.8% | +6.8% |
| 1000streams-xreadgroup-count-100 | +3.6% | −0.1% | +0.8% | +6.1% |
| htll-50f-10B | +1.5% (was +6.6%) | +1.5% (was +5.3%) | +10.7% | +0.7% |

Hash-load wins still dominate (+5 to +14%). htll-50f-10B lost most of the SPR/Turin boost (matches the "5–10% Intel boost from inlining we're leaving out" call), but Granite Rapids grew it to +10.7%.

Trade paid: hgetex-persist-50f-10B regresses on x86

The +12.4% Granite-Rapids boost in exp #1 was inline-driven; the revert turns it into a small regression on x86:

| Platform | exp #1 | exp #3 |
| --- | --- | --- |
| m7i (SPR) | flat | −2.6% |
| m8a (Turin) | +3.6% | −4.6% |
| m8i (GranR) | +12.4% | −4.2% |
| m8a-profiler | flat | −0.7% |

This is the most visible cost of the revert. hgetex-persist is a narrower workload than plain HSET, so on-balance the trade (close AMD hgetex-50f −7.2% / lose ∼5% on hgetex-persist) looks reasonable, but flagging it explicitly.

Bottom line

  • Original AMD hgetex-50f regression resolved (−7.2% → +0.3% on m8a) ✅
  • All write-path hash-load wins retained (+5 to +14% on x86)
  • Most of the allocator improvements survive (TMA Memory_Bound win shrunk −2.9pp → −1.1pp but didn't disappear)
  • Trade paid: hgetex-persist-50f-10B now regresses −4 to −5% on x86
  • Intel hgetex-50f-10B continues to hover at the noise floor (∼−4%, present in both runs, unrelated to the inline change)

PGO follow-up sounds like the right path for recovering the Intel boosts without the AMD icache cost. Happy to share raw CSVs / individual TMA funnels.

ARM (Graviton4 + Axion) is still running — m8g was held up behind unrelated long-tail work and the scoped run is now queued; will follow up with the ARM side as soon as it lands.

@fcostaoliveira
Collaborator

ARM follow-up — Graviton4 (arm-aws-m8g.metal-24xl)

Re-ran the same 12-test subset on 5fc11317 against unstable. Axion (arm-gcp-c4a-standard-48) was tied up on unrelated work, so this is m8g-only for now — will tack Axion numbers on later if it frees up before merge.

| Test | exp #1 (988806e, with inline) | exp #3 (5fc11317, post-revert) |
| --- | --- | --- |
| load-hash-50f-100B | +14.2% | +8.2% |
| load-hash-50f-1000B-expiration | +3.9% | +4.5% |
| load-hash-50f-1000B | +4.8% | +3.3% |
| xreadgroup-count-100 | +3.3% | +1.5% |
| hgetex-persist-50f-10B | (flat) | +1.6% |
| hgetex-50f-10B | +1.0% | +0.3% |
| load-hash-50f-10B-short-expiration | (flat) | +0.1% |
| load-hash-50f-10B-long-expiration | (flat) | −1.0% |
| load-hash-50f-10B-expiration | (flat) | −1.0% |
| load-hash-20f-1B-pipeline-30 | (flat) | −3.1% |
| htll-50f-10B | −3.8% | −3.4% |
| load-hash-50f-10B | (flat) | −3.7% |

Read on Graviton4

  • load-hash-50f-100B win held but shrunk meaningfully (+14.2% → +8.2%)
  • htll-50f-10B remains the one consistent ARM regression we'd already flagged in exp #1, roughly unchanged (−3.8% → −3.4%)
  • Two new small regressions appeared in exp #3: load-hash-50f-10B −3.7% and load-hash-20f-1B-pipeline-30 −3.1%
  • hgetex-50f-10B stays effectively flat on m8g across both runs (+1.0% → +0.3%) — Graviton4 wasn't impacted by the AMD-driven revert in either direction

The inline revert traded more on Graviton4 than I'd anticipated. Without ARM TMA in our fleet I can't decompose Retiring/Frontend/Memory the way I did on Turin — but the pattern (load-hash wins shrunk, two new small regressions on the smallest-value variants) is consistent with Graviton4 having been benefiting from the inlining similarly to Sapphire Rapids on those workloads.

Net for ARM: the headline load-hash-50f-100B win is still solid (+8.2%) and the persistent htll-50f regression isn't worse — but the smaller-value variants regressed slightly. Probably worth a second look on Graviton4 if PGO recovers some of this.

@tezc tezc requested review from ShooterIT, oranagra and sundb April 28, 2026 09:49
Collaborator

@skaslev skaslev left a comment


lgtm with a nit inline

Comment thread src/zmalloc.c Outdated
Member

@ShooterIT ShooterIT left a comment


maybe separate PRs would be better, we can know the effect of each part

Comment thread src/atomicvar.h Outdated
Comment thread src/object.c Outdated
@tezc
Collaborator Author

tezc commented Apr 29, 2026

@ShooterIT It's a bit harder to benchmark changes separately when each change contributes a few percent. With the help of AI, I tried to color the relevant boxes:

memtier_benchmark   --command="HSET __key__ username john_doe email john@example.com password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding"   --command-ratio=1   --command-key-pattern=P   --key-prefix="hsetkey"   --key-minimum=1   --key-maximum=100000   -n 1000000   -c 50   -t 2   --hide-histogram --pipeline 100

Before: [image: orig annotated]

After: [image: new annotated]

For the above test, gains in terms of cpu percentage:

  1. avoiding zfree size lookup: ~1%
  2. avoiding jemalloc tcache flush: ~3%
  3. zmalloc stat update: ~7%
  4. the impact of inlining is hard to show here.

Member

@oranagra oranagra left a comment


i've reviewed the PR description and commented about a couple of concerns.

Comment thread src/zmalloc.c
Comment thread src/zmalloc.c
@fcostaoliveira
Collaborator

@tezc — apologies for the delayed follow-up. Two confounders:

  1. The two earlier comments (1, 2) were against commit 988806e, the initial PR head. Since then there have been ≥5 commits — most importantly ca5de1fb (avoid forcing inlining for AMD) and ea42cc65 (revert most of inline) — both directly addressing the Fetch_Bandwidth +4.0pp shift the TMA flagged at the time.
  2. Our queued confirmation re-trigger sat behind a 495-item fleet backlog and aged out before producing fresh comparable numbers — sorry for not surfacing that earlier.

Re-ran today on the current PR head c4abf7f4 against unstable 7ecc04f59d, scoped to the previously-regressing tests, on the -2 siblings of the same CPUs (Sapphire Rapids on m7i-2, Graviton4 on m8g-2):

x86-aws-m7i.metal-24xl-2 (Sapphire Rapids)

| Test | Baseline 7ecc04f59d | PR c4abf7f4 | Δ | (was on 988806e) |
| --- | --- | --- | --- | --- |
| 100Kkeys-hash-hgetex-50-fields-10B-values | 22 385.06 | 21 972.97 | −1.84 % | (−3.2 to −7.2 %) |
| 100Kkeys-hash-hgetex-persist-50-fields-10B-values | 38 063.94 | 37 550.94 | −1.35 % | (−4 to −5 %) |
| 100Kkeys-hash-htll-50-fields-10B-values | 54 713.68 | 57 929.55 | +5.88 % | (clean on x86 originally) |

arm-aws-m8g.metal-24xl-2 (Graviton4)

| Test | Baseline 7ecc04f59d | PR c4abf7f4 | Δ | (was on 988806e) |
| --- | --- | --- | --- | --- |
| 100Kkeys-hash-hgetex-50-fields-10B-values | 22 386.94 | 22 740.17 | +1.58 % | (clean on m8g originally) |
| 100Kkeys-hash-hgetex-persist-50-fields-10B-values | 38 335.13 | 39 086.45 | +1.96 % | (clean on m8g originally) |
| 100Kkeys-hash-htll-50-fields-10B-values | 60 280.59 | 59 849.56 | −0.72 % | (−3.8 %) |

Five of the 6 deltas are within ±2 % noise (single dp each); the exception is htll-50f, which is +5.88 % on x86. The two prior regressions are no longer reproducible on the current head. Matches your manual observation on m8i / m8g.

From our side: nothing blocking the merge on these tests. Sorry for the noise from the stale snapshot.

Comment thread src/object.c
Comment thread src/zmalloc.c Outdated
@tezc tezc added the release-notes indication that this issue needs to be mentioned in the release notes label May 9, 2026
@tezc tezc merged commit 7bdab45 into redis:unstable May 9, 2026
18 checks passed
@tezc tezc deleted the faster-alloc branch May 9, 2026 08:48
tezc added a commit that referenced this pull request May 18, 2026
Fixes: 

1. After #15096, we pass -flto to jemalloc. On Azure Linux, the
resulting jemalloc library cannot be handled at link time and the build
fails. Adding -ffat-lto-objects so the compiler also emits regular
object code that the linker can fall back to when it cannot handle the
LTO-compiled library.
2. Fixed a warning about `path` being NULL in
`moduleLoadInternalModules()`.
3. Fixed compile warnings on older GCC versions introduced by #15162  
    (reported on Ubuntu 20.04)

Co-authored-by: debing.sun <debing.sun@redis.com>

Labels

action:run-benchmark Triggers the benchmark suite for this Pull Request release-notes indication that this issue needs to be mentioned in the release notes


7 participants