
perf: cache tally metrics handler scopes and WithTags handlers to reduce allocations #9620

Merged
rodrigozhou merged 1 commit into temporalio:main from mykaul:perf/cache-metrics-handler-tags on May 19, 2026
Conversation

Contributor

@mykaul mykaul commented Mar 23, 2026

Summary

  • Cache WithTags() child handlers via sync.Map to eliminate repeated tagsToMap(), scope.Tagged(), and handler struct allocations on the hot path
  • Cache scope.Tagged() results per unique inline tag combination in cachedTaggedScope(), bounded to 1024 entries with graceful degradation
  • Normalize excluded tags before cache key computation so high-cardinality excluded values (e.g. activityType) share a single cache entry, preventing unbounded cache growth

Design

Two complementary caching layers in tallyMetricsHandler:

  1. childCache (sync.Map): caches entire handler subtrees returned by WithTags(). On cache hit: zero allocations.
  2. scopeCache (sync.Map + atomic.Int64 size bound): caches tally.Scope objects returned by scope.Tagged() for inline tags passed to Counter/Gauge/Timer/Histogram Record() calls. Bounded to 1024 entries; beyond that, scopes are created but not cached.

Both caches use LoadOrStore for safe concurrent access. Tag normalization via normalizeTagsForCaching() ensures excluded tag variants collapse to the same cache key. The normalization has a zero-alloc fast path when no tags need substitution.
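
The bounding strategy described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `boundedCache`, `getOrCreate`, and the use of plain strings in place of `tally.Scope` values are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// boundedCache is a hypothetical stand-in for the scope cache described above:
// a sync.Map plus an atomic size counter that stops admitting entries at max.
type boundedCache struct {
	entries sync.Map
	size    atomic.Int64
	max     int64
}

// getOrCreate returns the cached value for key, building it with create on a
// miss. Once the bound is reached, new values are still built but not stored.
func (c *boundedCache) getOrCreate(key string, create func() string) string {
	if v, ok := c.entries.Load(key); ok {
		return v.(string)
	}
	v := create()
	if c.size.Load() >= c.max {
		return v // graceful degradation: the value is returned uncached
	}
	// LoadOrStore keeps exactly one winner if several goroutines race here.
	// The check above and this store are not atomic together, so the bound
	// can be exceeded slightly under contention.
	actual, loaded := c.entries.LoadOrStore(key, v)
	if !loaded {
		c.size.Add(1)
	}
	return actual.(string)
}

func main() {
	c := &boundedCache{max: 2}
	for _, k := range []string{"a", "b", "c", "a"} {
		c.getOrCreate(k, func() string { return "scope:" + k })
	}
	fmt.Println(c.size.Load()) // prints 2: "c" arrived after the bound was hit
}
```

A cache like this never evicts, so once it is full every new key stays on the slow path, which is the concern raised later in the review.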

Allocation Reduction (pprof alloc_space, 5min ScyllaDB workload)

Commit 1: WithTags handler cache

Metric                Before      After       Reduction
WithTags cumulative   1,930 MB    316 MB      -83.6%
Total server allocs   18,030 MB   16,481 MB   -8.6%

Commit 2: Scope cache for inline tags

Metric                Before      After       Reduction
tagsToMap.func1       1,101 MB    0 MB        -100%
tally Subscope        1,012 MB    0 MB        -100%
Total server allocs   18,465 MB   16,511 MB   -10.6%

Benchmark (omes throughput_stress, mc150, 5 min)

Host networking, i7-1270P 4 cores/component, inter-run data resets:

Database    Baseline   After commit 1   After commit 2
Cassandra   280        294 (+5.0%)      270 (-3.6%)
ScyllaDB    290        296 (+2.1%)      298 (+2.8%)

Note: Throughput variance at mc150 is ~5-10%. The allocation reduction is confirmed by pprof but throughput gains are within noise at this concurrency level.

Testing

  • Unit tests for all 4 metric types (Counter, Gauge, Timer, Histogram) with inline tags
  • Concurrency tests with race detector (32 goroutines × 100 iterations)
  • Cache bound enforcement test
  • Exclude-tag normalization tests (merge, allowed values, zero-alloc fast path)
  • Independent per-handler scope cache verification
  • All existing tests continue to pass

v2 — addressing review feedback

All 6 review comments addressed. Rebased on origin/main, squashed into a single commit.

Changes

  1. tagsCacheKey: Use strings.Builder with Grow pre-allocation — replaced manual []byte construction with strings.Builder. A sizing pass pre-computes the exact capacity via Grow() to avoid internal reallocation (1 alloc/op).

  2. tagsCacheKey: Remove single-tag special case, uniform \x00 separator — removed the len(tags) == 1 branch. Every tag pair now unconditionally appends \x00 after both key and value, making the format uniform and the code simpler.

  3. normalizeTagsForCaching: Use slices.Clone — replaced make([]Tag, len(tags)) + copy(tags[:i]) with slices.Clone(tags), which copies the entire slice upfront. This eliminates the if normalized != nil { normalized[i] = t } guard for unchanged tags after the clone point.

  4. Extract shared normalizeTag function — the exclude-tag check was duplicated between normalizeTagsForCaching and the convert closure in tagsToMap. Extracted normalizeTag(tag Tag, excl excludeTags) (Tag, bool) used by both, removing the duplication.

  5. Bound childCache to scopeCacheMaxSize: childCache (used by WithTags) was previously unbounded. Applied the same bounding strategy as scopeCache: atomic counter + stop caching beyond 1024 entries. Added childCacheSize atomic.Int64 field and TestWithTags_BoundedChildCacheSize test.
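
For readers following along, changes 1 and 2 might look roughly like the sketch below. The `Tag` struct here is a simplified assumption (the real type may differ), and the function body is a reconstruction from the description above, not the merged code.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag is a simplified, hypothetical stand-in for the metrics Tag type.
type Tag struct {
	Key, Value string
}

// tagsCacheKey joins tags as key\x00value\x00..., sizing the builder up front
// so that building the key needs a single allocation.
func tagsCacheKey(tags []Tag) string {
	// Sizing pass: each tag contributes its key, its value, and two separators.
	n := 0
	for _, t := range tags {
		n += len(t.Key) + len(t.Value) + 2
	}
	var sb strings.Builder
	sb.Grow(n)
	for _, t := range tags {
		sb.WriteString(t.Key)
		sb.WriteByte(0)
		sb.WriteString(t.Value)
		sb.WriteByte(0)
	}
	return sb.String()
}

func main() {
	a := tagsCacheKey([]Tag{{"namespace", "default"}, {"operation", "StartWorkflow"}})
	b := tagsCacheKey([]Tag{{"operation", "StartWorkflow"}, {"namespace", "default"}})
	fmt.Printf("%q\n", a)
	fmt.Println(a == b) // prints false: the key is order-sensitive
}
```

The second line of output illustrates the order sensitivity discussed under "Not changed (deliberate)": reordered but equivalent tag sets produce different keys.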

Micro-benchmark results (cached vs uncached)

goos: linux, goarch: amd64, cpu: 12th Gen Intel(R) Core(TM) i7-1270P

                                    ns/op     B/op   allocs/op
CounterRecord_Uncached              344       592    3
CounterRecord_CachedScope            93        48    2           ← 3.7x faster, 12x less memory
WithTags_Uncached                   325       592    3
WithTags_CacheHit                    50        16    1           ← 6.5x faster, 37x less memory
TagsCacheKey_SingleTag               29        24    1
TagsCacheKey_ThreeTags               49        64    1

Not changed (deliberate)

  • Cache key is order-sensitive / does not deduplicate keys — Tally internally canonicalizes tag maps (sorted keys, rightmost precedence), so the cache key could theoretically miss on reordered-but-equivalent tag sets. Verified across 100+ call sites: tag ordering is fully consistent and duplicate keys never appear in the codebase. Adding sort+dedup to the hot path would add cost without real-world benefit.

@mykaul mykaul requested review from a team as code owners March 23, 2026 07:57
@mykaul mykaul force-pushed the perf/cache-metrics-handler-tags branch from 5cd1af6 to 0069456 Compare March 27, 2026 08:13
@mykaul mykaul requested a review from a team as a code owner March 27, 2026 08:13
Contributor Author

mykaul commented Mar 27, 2026

v2 — rebased on main, fix unbounded childCache growth

Rebased on current main and added a fix for unbounded childCache memory growth with high-cardinality excluded tags.

What changed

WithTags now normalizes excluded tags before computing the childCache key (using the same normalizeTagsForCaching that scopeCache already uses).

Previously, childCache keyed on raw tag values. Since excluded tags like activityType have an empty allow-list, every distinct activity type name (e.g., TypeA, TypeB, ...) created a separate entry in childCache — a sync.Map with no eviction or size bound. In workloads with many distinct activity types, this leaked memory monotonically.

After this fix, all excluded-tag variants normalize to the same key (e.g., activityType\x00__excluded__), so they share a single cached child handler.
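
A rough sketch of how this normalization could work is shown below. The `normalizeTag` signature and the `__excluded__` placeholder come from the descriptions in this thread; the layout of `excludeTags` (tag key mapped to a set of allowed values) and the simplified `Tag` struct are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"slices"
)

// Tag is a simplified, hypothetical stand-in for the metrics Tag type.
type Tag struct{ Key, Value string }

// excludeTags maps a tag key to its allowed values (assumed layout).
// An empty set means every value of that key is excluded.
type excludeTags map[string]map[string]struct{}

const excludedValue = "__excluded__"

// normalizeTag returns the tag to use for caching and reports whether the
// value had to be substituted.
func normalizeTag(t Tag, excl excludeTags) (Tag, bool) {
	allowed, ok := excl[t.Key]
	if !ok {
		return t, false // key is not subject to exclusion
	}
	if _, ok := allowed[t.Value]; ok {
		return t, false // value is explicitly allowed
	}
	return Tag{Key: t.Key, Value: excludedValue}, true
}

// normalizeTagsForCaching returns tags unchanged (no allocation) when nothing
// needs substitution, and clones the slice only on the first substitution.
func normalizeTagsForCaching(tags []Tag, excl excludeTags) []Tag {
	var normalized []Tag
	for i, t := range tags {
		nt, changed := normalizeTag(t, excl)
		if !changed {
			continue
		}
		if normalized == nil {
			normalized = slices.Clone(tags)
		}
		normalized[i] = nt
	}
	if normalized == nil {
		return tags // fast path: caller's slice is returned as-is
	}
	return normalized
}

func main() {
	excl := excludeTags{"activityType": {}}
	a := normalizeTagsForCaching([]Tag{{"activityType", "TypeA"}}, excl)
	b := normalizeTagsForCaching([]Tag{{"activityType", "TypeB"}}, excl)
	fmt.Println(a[0] == b[0]) // prints true: both collapse to one cache key
}
```

Cloning lazily keeps the common case (no excluded tags present) free of allocations while never mutating the caller's slice.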

Test added

TestWithTags_ExcludedTagsShareChildHandler — verifies that WithTags(ActivityTypeTag("TypeA")), WithTags(ActivityTypeTag("TypeB")), and WithTags(ActivityTypeTag("TypeC")) all return the same handler pointer and correctly accumulate metrics.

@mykaul mykaul force-pushed the perf/cache-metrics-handler-tags branch from 0069456 to 7a986ca Compare April 24, 2026 09:16
@mykaul mykaul changed the title perf: cache metrics handler WithTags and Tagged scope lookups perf: cache tally metrics handler scopes and WithTags handlers to reduce allocations Apr 24, 2026
Comment thread common/metrics/tally_metrics_handler.go Outdated
Comment on lines +31 to +32
childCache sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler
scopeCache sync.Map // tagsCacheKey(normalized tags) -> tally.Scope

childCache seems to be an unbounded cache, and scopeCache seems to be bounded at scopeCacheMaxSize (set to 1024) which will stop caching afterwards.

In both cases, this could be bad if the cache grows too much, especially since the tag value is part of the key. Wondering if an LRU cache would be better.
cc: @yycptt

On second thought, this cache is per handler, i.e., we get a new cache every time a handler is created (e.g., from calling WithTags). I wonder how this will perform at very large scale, with thousands of namespaces, workflow types, etc., which are common tags. I'm especially curious about the additional memory usage.
cc: @yycptt

Agreed with @rodrigozhou that the cache should be more strictly bounded.

I'd suggest dropping childCache in favor of passing a shared scopeCache down to child handlers. Then introduce a scopeKey alongside the scope field so that child handlers can generate the correct cache key. Something like:

        tallyMetricsHandler struct {
                scope          tally.Scope
+               scopeKey       string
                perUnitBuckets map[MetricUnit]tally.Buckets
                excludeTags    excludeTags
-               childCache     sync.Map
-               childCacheSize atomic.Int64
-               scopeCache     sync.Map
+               scopeCache     *sync.Map  // shared by all handlers
-               scopeCacheSize atomic.Int64
+               scopeCacheSize *atomic.Int64 
        }
 )

Incorporate scopeKey into the cache key as a prefix.

-func tagsCacheKey(tags []Tag) string {
+func tagsCacheKey(prefix string, tags []Tag) string {
  // ....
  sb.WriteString(prefix)
  // ...
}

And propagate scopeCache to descendants.

func (tmh *tallyMetricsHandler) WithTags(tags ...Tag) Handler {
-       key := tagsCacheKey(normalizeTagsForCaching(tags, tmh.excludeTags))
+       key := tagsCacheKey(tmh.scopeKey, normalizeTagsForCaching(tags, tmh.excludeTags))

        child := &tallyMetricsHandler{
                scope:          tmh.scope.Tagged(tagsToMap(tags, tmh.excludeTags)),
+               scopeKey:       key,
                perUnitBuckets: tmh.perUnitBuckets,
                excludeTags:    tmh.excludeTags,
+               scopeCache:     tmh.scopeCache,
+               scopeCacheSize: tmh.scopeCacheSize,
        }
}
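
To make the suggestion above concrete, here is a small self-contained sketch of a cache shared by a handler tree and keyed by a per-handler prefix. Everything in it is hypothetical: `handler`, `newRoot`, and `withTag` are illustrative names, scopes are plain strings rather than `tally.Scope` values, and the size bound is left out to keep the focus on key prefixing.

```go
package main

import (
	"fmt"
	"sync"
)

// handler is a hypothetical, simplified stand-in for tallyMetricsHandler.
type handler struct {
	scope    string    // stands in for a tally.Scope
	scopeKey string    // cache-key prefix identifying this handler's scope
	cache    *sync.Map // shared by the root handler and all its descendants
}

func newRoot() *handler {
	return &handler{scope: "root", cache: &sync.Map{}}
}

// withTag derives a child handler. The child's cache key is the parent's key
// plus the new tag, so identical tags under different parents do not collide.
func (h *handler) withTag(key, value string) *handler {
	childKey := h.scopeKey + key + "\x00" + value + "\x00"
	scope, _ := h.cache.LoadOrStore(childKey, h.scope+"/"+key+"="+value)
	return &handler{
		scope:    scope.(string),
		scopeKey: childKey,
		cache:    h.cache, // propagate the shared cache, do not create a new one
	}
}

func main() {
	root := newRoot()
	a := root.withTag("namespace", "a").withTag("operation", "x")
	b := root.withTag("namespace", "b").withTag("operation", "x")
	fmt.Println(a.scopeKey != b.scopeKey) // prints true
	fmt.Println(a.scope, b.scope)
}
```

With one cache for the whole tree, a single size limit covers every handler, which addresses the "cache per handler" memory concern raised earlier in this thread.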

@rodrigozhou rodrigozhou requested a review from yycptt May 1, 2026 22:38
@mykaul mykaul force-pushed the perf/cache-metrics-handler-tags branch from 7a986ca to 297c342 Compare May 2, 2026 09:50
Comment thread common/metrics/tally_metrics_handler.go Outdated
Comment thread common/metrics/tally_metrics_handler.go Outdated
// scopeCacheMaxSize is the approximate upper bound on cached scope entries.
// The bound may be slightly exceeded under high concurrency due to
// check-then-store races, which is acceptable.
const scopeCacheMaxSize = 1024

Rename maxCacheSize

This limit could probably be more generous, maybe 10k. There are a lot of metrics and once the handler hits the limit, all additional metrics are permanently uncached.

Consider clearing the cache periodically to compensate for working set shifts over time and reduce performance volatility.

Contributor Author

What is periodically? Once a ... ? Based on usage perhaps? Hit/Miss ratio?

@BenEddy BenEddy left a comment

The PR description mentions micro-benchmarks -- could you please include the benchmarks alongside the tests or in a separate bench file?

@mykaul mykaul force-pushed the perf/cache-metrics-handler-tags branch from 297c342 to 06419dd Compare May 6, 2026 19:57
Contributor Author

mykaul commented May 6, 2026

v2 — Review Cycle 4/5 Update

Force-pushed a single squashed commit replacing the previous 2-commit history.

Changes since v1:

  1. Length-prefixed cache keys — replaced \x00 delimiter with binary.PutUvarint + raw bytes encoding. Prevents collisions when tag keys/values contain NUL bytes.
  2. histogramCacheKey struct — replaced name + "\x00" + unit string concatenation with a typed struct as the sync.Map key. Eliminates ambiguity.
  3. Shared bounded cache — single sharedScopeCache with sync.RWMutex + double-check locking (replaces per-handler sync.Map from v1 for scope/handler caching).
  4. Per-handler closure caching: sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1).
  5. normalizeTagsForCaching — zero-alloc fast path when no tags are excluded; slices.Clone only when substitution is needed.
  6. Clear-on-overflow — configurable TagsCacheMaxSize (default 10000) with full cache clear when limit is reached.
  7. Comprehensive tests — 20+ test cases covering concurrency, cache bounds, excluded tag merging, key collision regression.
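
Items 1 and 2 can be illustrated with the following sketch. It is an assumption-laden reconstruction, not the merged code: `appendLenPrefixed`, `lengthPrefixedKey`, `nulJoinedKey`, the simplified `Tag` struct, and the field names of `histogramCacheKey` are all chosen for the example. Only `binary.PutUvarint` is taken from the description.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Tag is a simplified, hypothetical stand-in for the metrics Tag type.
type Tag struct{ Key, Value string }

// appendLenPrefixed writes len(s) as a uvarint followed by the raw bytes of s.
func appendLenPrefixed(dst []byte, s string) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], uint64(len(s)))
	dst = append(dst, buf[:n]...)
	return append(dst, s...)
}

// lengthPrefixedKey encodes every key and value with its length, so the
// decoding is unambiguous whatever bytes the strings contain.
func lengthPrefixedKey(tags []Tag) string {
	var key []byte
	for _, t := range tags {
		key = appendLenPrefixed(key, t.Key)
		key = appendLenPrefixed(key, t.Value)
	}
	return string(key)
}

// nulJoinedKey is the older delimiter-based format, kept for comparison.
func nulJoinedKey(tags []Tag) string {
	key := ""
	for _, t := range tags {
		key += t.Key + "\x00" + t.Value + "\x00"
	}
	return key
}

// histogramCacheKey is a comparable struct, so it can be used directly as a
// map or sync.Map key without concatenating its fields (field names assumed).
type histogramCacheKey struct {
	name string
	unit string
}

func main() {
	one := []Tag{{"a", "b\x00c\x00d"}}  // a single tag whose value contains NULs
	two := []Tag{{"a", "b"}, {"c", "d"}} // two ordinary tags

	fmt.Println(nulJoinedKey(one) == nulJoinedKey(two))           // prints true: collision
	fmt.Println(lengthPrefixedKey(one) == lengthPrefixedKey(two)) // prints false

	k1 := histogramCacheKey{name: "a\x00b", unit: "c"}
	k2 := histogramCacheKey{name: "a", unit: "b\x00c"}
	fmt.Println(k1 == k2) // prints false, although name+"\x00"+unit is identical
}
```

The length prefix costs a byte or two per string but removes any dependence on which bytes may legally appear in tag keys and values.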

Benchmark Results (6 server cores, 4 ScyllaDB cores, mc1200, 3 min):

Metric                 Baseline (origin/main)   With cache     Delta
Throughput             168.4 iter/s             176.9 iter/s   +5.0%
alloc_objects (60s)    160.9M                   150.7M         -6.3%
alloc_space (60s)      13,563 MB                11,957 MB      -11.8%
Metrics layer allocs   1,742 MB                 410 MB         -76.5%

Review Cycle 5 (self-review):

  • No correctness issues found
  • Double-check locking pattern verified sound
  • Length-prefix encoding verified collision-free
  • create() outside write lock is intentional (avoids holding lock during scope.Tagged())
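
The locking pattern referred to in this self-review might be sketched as below. This is only an approximation under stated assumptions: `sharedCache`, `newSharedCache`, and `loadOrStore` are illustrative names, values are strings rather than `tally.Scope`, and the exact overflow check in the real code may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedCache is a hypothetical sketch of a bounded cache guarded by an
// RWMutex, with double-checked insertion and clear-on-overflow.
type sharedCache struct {
	mu      sync.RWMutex
	entries map[string]string
	maxSize int
}

func newSharedCache(maxSize int) *sharedCache {
	return &sharedCache{entries: make(map[string]string), maxSize: maxSize}
}

func (c *sharedCache) loadOrStore(key string, create func() string) string {
	// First check: read lock only, so concurrent hits do not block each other.
	c.mu.RLock()
	v, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return v
	}

	// Build the value without holding any lock. Two goroutines may both reach
	// this point for the same key; one of the results is then discarded.
	created := create()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Second check: another goroutine may have stored the key in the meantime.
	if v, ok := c.entries[key]; ok {
		return v
	}
	if len(c.entries) >= c.maxSize {
		// Clear-on-overflow: drop everything and let the working set refill.
		c.entries = make(map[string]string)
	}
	c.entries[key] = created
	return created
}

func main() {
	c := newSharedCache(4)
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				key := fmt.Sprintf("tag-%d", i%10)
				c.loadOrStore(key, func() string { return "scope:" + key })
			}
		}()
	}
	wg.Wait()

	c.mu.RLock()
	fmt.Println(len(c.entries) <= 4) // prints true: the bound holds
	c.mu.RUnlock()
}
```

Calling create outside the write lock trades an occasional wasted construction for shorter critical sections; clearing the whole map on overflow is cruder than LRU eviction but keeps the hit path free of bookkeeping.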

Comment thread common/metrics/tally_metrics_handler.go
@rodrigozhou rodrigozhou requested a review from BenEddy May 15, 2026 18:56
Introduce a shared bounded scope cache with sync.RWMutex and double-check
locking to eliminate repeated tally.Scope.Tagged() allocations. Cache
Counter/Timer/Histogram/Gauge closures per handler via sync.Map.

Key design choices:
- Length-prefixed cache keys (binary.PutUvarint) prevent collisions when
  tag keys/values contain NUL bytes
- histogramCacheKey struct avoids string concatenation ambiguity
- normalizeTagsForCaching applies excludeTags before key computation so
  high-cardinality excluded values share a single cache entry
- Clear-on-overflow bounds memory (configurable TagsCacheMaxSize, default 10000)
- Zero-alloc fast path when no tags need normalization

Benchmark results (6 server cores, ScyllaDB 2026.2.0-rc0, mc1200):
- Throughput: 176.9 iter/s (+5.0% vs baseline 168.4)
- alloc_objects: 150.7M (-6.3% vs baseline 160.9M)
- alloc_space: 11,957 MB (-11.8% vs baseline 13,563 MB)
- Metrics layer allocs: 410 MB (-76.5% vs baseline 1,742 MB)
@mykaul mykaul force-pushed the perf/cache-metrics-handler-tags branch from 06419dd to 84136be Compare May 17, 2026 12:51
@BenEddy BenEddy left a comment

> Per-handler closure caching: sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1)

Hmm maybe obscured by force pushing, but this change is net new since I last reviewed the PR (I might have missed v1). The closure caches are per-handler and have no eviction and no bound, so memory usage scales with handlers x metrics and undermines the sharedScopeCache bound. We should either remove them or share them across handler instances.

Contributor Author

mykaul commented May 18, 2026

> > Per-handler closure caching: sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1)
>
> Hmm maybe obscured by force pushing, but this change is net new since I last reviewed the PR (I might have missed v1). The closure caches are per-handler and have no eviction and no bound, so memory usage scales with handlers x metrics and undermines the sharedScopeCache bound. We should either remove them or share them across handler instances.

ARGH - did I break something? Or confuse branches? Let me check. If I remember correctly, I only wanted to fix the formatting CI failures.

Contributor Author

mykaul commented May 18, 2026

> > Per-handler closure caching: sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1)
>
> Hmm maybe obscured by force pushing, but this change is net new since I last reviewed the PR (I might have missed v1). The closure caches are per-handler and have no eviction and no bound, so memory usage scales with handlers x metrics and undermines the sharedScopeCache bound. We should either remove them or share them across handler instances.

OK, checked - only formatting changes.

  • To your concern: cachedTaggedScope is bounded
  • Keys use the metric name, so in practice they are bounded by the metric catalog (about 200 names?)
  • sharedScopeCache is limited to 10K entries (loadOrStoreScope(), for example, clears it once it is over the max, which is 10K by default)

So overall, 200 * 10K - is that acceptable?

@BenEddy BenEddy left a comment

@mykaul Thanks for your patience here -- and good point, the closure cache is effectively bounded by the metric catalog.

The reason I’m pushing on the cache bound is that we’ve seen large heaps with millions of long-lived, repeatedly scanned objects drive significant CPU volatility in production. Longer mark phases also tend to increase GC assist overhead. 200 * 10K is not problematic in isolation, but we’re generally cautious about increasing heap pointer density, and since callers can leak handlers, any bound is ultimately soft.

That said, I’m happy to land this as-is, and I can follow up with a separate PR to share the closure caches across handlers. Appreciate the work on this!

Contributor Author

mykaul commented May 19, 2026

> @mykaul Thanks for your patience here -- and good point, the closure cache is effectively bounded by the metric catalog.
>
> The reason I’m pushing on the cache bound is that we’ve seen large heaps with millions of long-lived, repeatedly scanned objects drive significant CPU volatility in production. Longer mark phases also tend to increase GC assist overhead. 200 * 10K is not problematic in isolation, but we’re generally cautious about increasing heap pointer density, and since callers can leak handlers, any bound is ultimately soft.
>
> That said, I’m happy to land this as-is, and I can follow up with a separate PR to share the closure caches across handlers. Appreciate the work on this!

Thanks - all I have is my poor laptop to test 'at scale' with Omes, and it's not the greatest setup, and I'm not running long lived perf tests, just 5-15m or so. Since I'm somewhat reluctant to change the schema (which is the main pain point I've identified), I'm left with the rest of the work - trying to parallelise work (which is challenging with the 27 serial LWT query executions) and reducing memory and GC all over. The rest of the optimizations are really minimalistic.

@rodrigozhou rodrigozhou merged commit e59d6da into temporalio:main May 19, 2026
70 of 72 checks passed