Memory usage improvements #9565

@mykaul

Description

While trying to improve Temporal performance for ScyllaDB, I (OpenCode really, but with my direction) stumbled across some mostly unrelated memory-optimization opportunities. Is the team interested in them? I don't wish to overload the team if there's little interest or value.
They can be cherry-picked independently. Here's the proposed content:

commit 0db4ad05c4601796c1c8f3f3675ca350ab8cda8b (HEAD -> main)
Author: Yaniv Michael Kaul <yaniv.kaul@scylladb.com>
Date:   Tue Mar 17 23:57:05 2026 +0200

    perf: cache Tagged scope lookups in tallyMetricsHandler to reduce allocations
    
    Add a scopeCache (sync.Map) to tallyMetricsHandler that caches the
    result of scope.Tagged() calls per unique tag combination. This avoids
    repeated map allocations and tally scope registry lookups on every
    Counter/Gauge/Timer/Histogram Record call with inline tags.
    
    The scope cache builds on the WithTags handler cache from the previous
    commit: WithTags caches entire handler subtrees, while scopeCache
    targets the per-metric-emission path where tags are passed inline.
    
    The cache is bounded to 1024 entries (approximate, may slightly overshoot
    under high concurrency due to check-then-store races). Tags are
    normalized through excludeTags before cache key computation so that
    different raw values which map to the same excluded placeholder share
    a single cache entry.
    
    Combined with the WithTags cache, the two layers eliminated the top
    allocation sources from tally (verified via pprof alloc_space, ScyllaDB
    omes throughput_stress 5-min run, 60 iterations):
    
      Allocation site                     Pre-metrics    Scope-cache    Delta
      tagsToMap.func1 (map inserts)       1,101 MB       0 MB           -100%
      tally.scopeRegistry.Subscope        1,012 MB       0 MB           -100%
      tagsToMap (make map)                  177 MB       0 MB           -100%
      tagsCacheKey (new: cache keys)          0 MB     369 MB           +369 MB
      WithTags (cum)                      1,972 MB     326 MB           -83.5%
    
      Total allocations:                 18,465 MB  16,511 MB    -1,954 MB (-10.6%)
      Net metrics savings: ~1,921 MB eliminated, 369 MB new cost
    
    Benchmark: omes throughput_stress, 5 min, 128 shards, QPS unlimited,
    GOMAXPROCS=4, GOGC=200, 4-core pinning per component (i7-1270P).
    
    UpdateWorkflowExecution latency (avg) and total persistence ops:
    
      ScyllaDB:
        prev commit (WithTags cache): 2.33ms, 85,885 ops, 60 iters
        this commit (+ scope cache):  2.27ms, 85,886 ops, 60 iters
        delta: -0.06ms (-2.6%)
    
      Cassandra:
        tuned baseline (no metrics opts): 2.64ms, 85,739 ops, 60 iters
        this commit (WithTags + scope):   2.61ms, 85,922 ops, 60 iters
        delta: -0.03ms (-1.2%)
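
To illustrate the idea, here is a minimal, self-contained sketch of the bounded per-tag-combination cache described in the commit. The names (`scopeCache`, `tagsCacheKey`) come from the commit message, but the types and signatures below are assumptions for illustration, not the actual Temporal code; a plain string stands in for the tally scope:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
	"sync/atomic"
)

// scopeCache is a hypothetical bounded cache keyed by a normalized tag string.
// The bound is approximate: under concurrent check-then-store it may slightly
// overshoot, exactly as the commit message notes.
type scopeCache struct {
	entries sync.Map // cache key -> cached scope (string stand-in here)
	size    atomic.Int64
	limit   int64
}

// tagsCacheKey builds a deterministic key from a tag map by sorting keys,
// so the same tag combination always maps to the same cache entry.
func tagsCacheKey(tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(tags[k])
		b.WriteByte(';')
	}
	return b.String()
}

// get returns the cached scope for tags, or computes and stores it.
// Once the approximate size limit is reached, new entries are computed
// but not stored, so the cache stays bounded.
func (c *scopeCache) get(tags map[string]string, compute func() string) string {
	key := tagsCacheKey(tags)
	if v, ok := c.entries.Load(key); ok {
		return v.(string) // cache hit: no map allocation, no registry lookup
	}
	v := compute() // stands in for scope.Tagged(tags)
	if c.size.Load() < c.limit {
		actual, loaded := c.entries.LoadOrStore(key, v)
		if !loaded {
			c.size.Add(1)
		}
		return actual.(string)
	}
	return v
}

func main() {
	cache := &scopeCache{limit: 1024}
	scope := cache.get(map[string]string{"operation": "UpdateWorkflowExecution"}, func() string {
		return "tagged-scope"
	})
	fmt.Println(scope)
}
```

In the real handler the cached value would be the `tally.Scope` returned by `scope.Tagged()`, and tags would pass through `excludeTags` normalization before `tagsCacheKey` so that excluded values share one entry.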

And:

commit fb46e34a00742cf18f659c72c4a1674b312251fd
Author: Yaniv Michael Kaul <yaniv.kaul@scylladb.com>
Date:   Tue Mar 17 21:21:31 2026 +0200

    perf: cache WithTags handlers in tallyMetricsHandler to reduce allocations
    
    Add sync.Map-based caching of child handlers in WithTags(). On cache
    hit, zero allocations — skips tagsToMap(), scope.Tagged(), and handler
    struct allocation entirely.
    
    Benchmark (omes throughput_stress, 5min, ScyllaDB, standard 4-core
    layout: Temporal cores 0-3, DB cores 4-7, GOMAXPROCS=4 GOGC=200,
    128 shards, QPS=0):
    
      WithTags cache:  avg=2.33ms  ops=85,885  iters=60
      Tuned baseline:  avg=2.27ms  ops=86,133  iters=60
      Delta:           ops -0.29% (within run-to-run noise)
    
    Memory profiling (pprof alloc_space, 5min ScyllaDB workload):
    
      Total allocations:   16,481 MB (vs 18,030 MB baseline = -8.6%)
    
      Per-function breakdown:
      | Allocation site              | Baseline   | WithTags   | Delta  |
      |------------------------------|------------|------------|--------|
      | tagsToMap.func1 (map insert) | 1,106 MB   | 215 MB     | -80.6% |
      | tally Subscope               | 1,014 MB   | 164 MB     | -83.8% |
      | tagsCacheKey (new: keys)     | 0 MB       | 317 MB     | +317MB |
      | WithTags (cumulative)        | 1,930 MB   | 316 MB     | -83.6% |
    
    The throughput delta is not measurable at this concurrency level.
    Metrics overhead is ~3.2% of total server CPU; eliminating WithTags
    allocations frees ~3% of CPU which is below the noise floor of the
    end-to-end benchmark. The allocation reduction is real but the system
    is throughput-capped by aggregate server compute across many subsystems
    (syscalls 13%, GC 10%, gRPC 9%, proto 11%), not by any single one.
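
The WithTags caching layer can be sketched the same way. Again, `metricsHandler` and `Tag` below are simplified stand-ins for Temporal's types, not the real API; the point is that a cache hit returns an existing child handler with zero allocations:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Tag is a stand-in for Temporal's metrics tag type.
type Tag struct{ Key, Value string }

// metricsHandler is a simplified stand-in for tallyMetricsHandler.
type metricsHandler struct {
	tags     map[string]string
	children sync.Map // cache key -> *metricsHandler
}

// cacheKey builds a deterministic key by sorting tags by key.
func cacheKey(tags []Tag) string {
	sorted := append([]Tag(nil), tags...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })
	var b strings.Builder
	for _, t := range sorted {
		b.WriteString(t.Key)
		b.WriteByte('=')
		b.WriteString(t.Value)
		b.WriteByte(';')
	}
	return b.String()
}

// WithTags returns a child handler for the given tags, reusing a cached one
// when the same tag combination was seen before. On a hit this skips the
// tag-map merge, the scope.Tagged() call, and the handler struct allocation.
func (h *metricsHandler) WithTags(tags ...Tag) *metricsHandler {
	key := cacheKey(tags)
	if v, ok := h.children.Load(key); ok {
		return v.(*metricsHandler) // cache hit: zero allocations
	}
	child := &metricsHandler{tags: make(map[string]string, len(h.tags)+len(tags))}
	for k, v := range h.tags {
		child.tags[k] = v
	}
	for _, t := range tags {
		child.tags[t.Key] = t.Value
	}
	// LoadOrStore keeps exactly one child per key under concurrency.
	actual, _ := h.children.LoadOrStore(key, child)
	return actual.(*metricsHandler)
}

func main() {
	root := &metricsHandler{}
	h1 := root.WithTags(Tag{Key: "operation", Value: "UpdateWorkflowExecution"})
	h2 := root.WithTags(Tag{Key: "operation", Value: "UpdateWorkflowExecution"})
	fmt.Println(h1 == h2) // prints true: second call is a cache hit
}
```

Unlike the scope cache, this sketch is unbounded; the real handler would want a size limit here too if tag cardinality is not controlled by the caller.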

Let me know and I'll be happy to submit. I only briefly reviewed the code (this is certainly not an area I focus on, or one I have much experience or knowledge in).
