perf: cache tally metrics handler scopes and WithTags handlers to reduce allocations by mykaul · Pull Request #9620 · temporalio/temporal

mykaul · 2026-03-23T07:57:49Z

Summary

Cache WithTags() child handlers via sync.Map to eliminate repeated tagsToMap(), scope.Tagged(), and handler struct allocations on the hot path
Cache scope.Tagged() results per unique inline tag combination in cachedTaggedScope(), bounded to 1024 entries with graceful degradation
Normalize excluded tags before cache key computation so high-cardinality excluded values (e.g. activityType) share a single cache entry, preventing unbounded cache growth

Design

Two complementary caching layers in tallyMetricsHandler:

childCache (sync.Map): caches entire handler subtrees returned by WithTags(). On cache hit: zero allocations.
scopeCache (sync.Map + atomic.Int64 size bound): caches tally.Scope objects returned by scope.Tagged() for inline tags passed to Counter/Gauge/Timer/Histogram Record() calls. Bounded to 1024 entries; beyond that, scopes are created but not cached.

Both caches use LoadOrStore for safe concurrent access. Tag normalization via normalizeTagsForCaching() ensures excluded tag variants collapse to the same cache key. The normalization has a zero-alloc fast path when no tags need substitution.
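To make the bounding strategy concrete, here is a minimal sketch of a sync.Map guarded by an atomic counter, as described above. The names boundedCache, getOrCreate, and maxEntries are hypothetical and not taken from the PR; the sketch only illustrates the "stop caching beyond the bound" behavior and the LoadOrStore race handling.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// maxEntries mirrors the 1024-entry bound described above (name is hypothetical).
const maxEntries = 1024

// boundedCache is a hypothetical stand-in for scopeCache: a sync.Map plus an
// atomic size counter. Once the bound is reached, new values are still
// created and returned, just not stored.
type boundedCache struct {
	entries sync.Map
	size    atomic.Int64
}

// getOrCreate returns the cached value for key, or builds one with create.
func (c *boundedCache) getOrCreate(key string, create func() any) any {
	if v, ok := c.entries.Load(key); ok {
		return v // cache hit: nothing new is built
	}
	v := create()
	if c.size.Load() >= maxEntries {
		return v // graceful degradation: use the value without caching it
	}
	// LoadOrStore keeps whichever value was stored first under concurrency.
	actual, loaded := c.entries.LoadOrStore(key, v)
	if !loaded {
		c.size.Add(1) // only the goroutine that stored the entry counts it
	}
	return actual
}

func main() {
	var c boundedCache
	calls := 0
	create := func() any { calls++; return new(int) }
	a := c.getOrCreate("namespace=default", create)
	b := c.getOrCreate("namespace=default", create)
	fmt.Println(a == b, calls)
}
```

Because the size check and the store are separate steps, concurrent callers can push the map slightly past the bound, which is the check-then-store race the PR accepts.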

Allocation Reduction (pprof alloc_space, 5min ScyllaDB workload)

Commit 1: WithTags handler cache

Metric               Before      After       Reduction
WithTags cumulative  1,930 MB    316 MB      -83.6%
Total server allocs  18,030 MB   16,481 MB   -8.6%

Commit 2: Scope cache for inline tags

Metric               Before      After       Reduction
tagsToMap.func1      1,101 MB    0 MB        -100%
tally Subscope       1,012 MB    0 MB        -100%
Total server allocs  18,465 MB   16,511 MB   -10.6%

Benchmark (omes throughput_stress, mc150, 5 min)

Host networking, i7-1270P 4 cores/component, inter-run data resets:

Database   Baseline  After commit 1  After commit 2
Cassandra  280       294 (+5.0%)     270 (-3.6%)
ScyllaDB   290       296 (+2.1%)     298 (+2.8%)

Note: Throughput variance at mc150 is ~5-10%. The allocation reduction is confirmed by pprof but throughput gains are within noise at this concurrency level.

Testing

Unit tests for all 4 metric types (Counter, Gauge, Timer, Histogram) with inline tags
Concurrency tests with race detector (32 goroutines × 100 iterations)
Cache bound enforcement test
Exclude-tag normalization tests (merge, allowed values, zero-alloc fast path)
Independent per-handler scope cache verification
All existing tests continue to pass

v2 — addressing review feedback

All 6 review comments addressed. Rebased on origin/main, squashed into a single commit.

Changes

tagsCacheKey: Use strings.Builder with Grow pre-allocation — replaced manual []byte construction with strings.Builder. A sizing pass pre-computes the exact capacity via Grow() to avoid internal reallocation (1 alloc/op).
tagsCacheKey: Remove single-tag special case, uniform \x00 separator — removed the len(tags) == 1 branch. Every tag pair now unconditionally appends \x00 after both key and value, making the format uniform and the code simpler.
normalizeTagsForCaching: Use slices.Clone — replaced make([]Tag, len(tags)) + copy(tags[:i]) with slices.Clone(tags), which copies the entire slice upfront. This eliminates the if normalized != nil { normalized[i] = t } guard for unchanged tags after the clone point.
Extract shared normalizeTag function — the exclude-tag check was duplicated between normalizeTagsForCaching and the convert closure in tagsToMap. Extracted normalizeTag(tag Tag, excl excludeTags) (Tag, bool) used by both, removing the duplication.
Bound childCache to scopeCacheMaxSize — childCache (used by WithTags) was previously unbounded. Applied the same bounding strategy as scopeCache: atomic counter + stop caching beyond 1024 entries. Added childCacheSize atomic.Int64 field and TestWithTags_BoundedChildCacheSize test.
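As an illustration of the tagsCacheKey change listed above, the following sketch builds the key with strings.Builder, using a sizing pass so that Grow reserves the exact capacity. The Tag struct here is a simplified assumption (two string fields), not the actual type in the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag is a simplified stand-in for the metrics Tag type (assumption).
type Tag struct{ Key, Value string }

// tagsCacheKey joins tags into one string, appending \x00 after every key
// and every value. A sizing pass lets Grow reserve the exact capacity, so
// the builder does not reallocate while writing.
func tagsCacheKey(tags []Tag) string {
	n := 0
	for _, t := range tags {
		n += len(t.Key) + len(t.Value) + 2 // two separators per tag
	}
	var sb strings.Builder
	sb.Grow(n)
	for _, t := range tags {
		sb.WriteString(t.Key)
		sb.WriteByte(0)
		sb.WriteString(t.Value)
		sb.WriteByte(0)
	}
	return sb.String()
}

func main() {
	key := tagsCacheKey([]Tag{{"namespace", "default"}, {"operation", "poll"}})
	fmt.Printf("%q\n", key)
}
```

With no single-tag special case, one tag and many tags go through the same loop, which is the uniformity the review asked for.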

Micro-benchmark results (cached vs uncached)

goos: linux, goarch: amd64, cpu: 12th Gen Intel(R) Core(TM) i7-1270P

                                    ns/op     B/op   allocs/op
CounterRecord_Uncached              344       592    3
CounterRecord_CachedScope            93        48    2           ← 3.7x faster, 12x less memory
WithTags_Uncached                   325       592    3
WithTags_CacheHit                    50        16    1           ← 6.5x faster, 37x less memory
TagsCacheKey_SingleTag               29        24    1
TagsCacheKey_ThreeTags               49        64    1

Not changed (deliberate)

Cache key is order-sensitive / does not deduplicate keys — Tally internally canonicalizes tag maps (sorted keys, rightmost precedence), so the cache key could theoretically miss on reordered-but-equivalent tag sets. Verified across 100+ call sites: tag ordering is fully consistent and duplicate keys never appear in the codebase. Adding sort+dedup to the hot path would add cost without real-world benefit.

mykaul · 2026-03-27T08:14:08Z

v2 — rebased on main, fix unbounded childCache growth

Rebased on current main and added a fix for unbounded childCache memory growth with high-cardinality excluded tags.

What changed

WithTags now normalizes excluded tags before computing the childCache key (using the same normalizeTagsForCaching that scopeCache already uses).

Previously, childCache keyed on raw tag values. Since excluded tags like activityType have an empty allow-list, every distinct activity type name (e.g., TypeA, TypeB, ...) created a separate entry in childCache — a sync.Map with no eviction or size bound. In workloads with many distinct activity types, this leaked memory monotonically.

After this fix, all excluded-tag variants normalize to the same key (e.g., activityType\x00__excluded__), so they share a single cached child handler.
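The normalization step can be sketched roughly as follows. The normalizeTag signature and the __excluded__ placeholder come from the description above; the shape of excludeTags (tag key mapped to a set of allowed values) and the simplified Tag struct are assumptions made only for this illustration.

```go
package main

import (
	"fmt"
	"slices"
)

// Tag is a simplified stand-in for the metrics Tag type (assumption).
type Tag struct{ Key, Value string }

// excludeTags maps a tag key to its allowed values (assumed shape). An empty
// set means every value of that tag is replaced.
type excludeTags map[string]map[string]struct{}

const excludedValue = "__excluded__"

// normalizeTag returns the tag to use for caching and whether it changed.
func normalizeTag(tag Tag, excl excludeTags) (Tag, bool) {
	allowed, ok := excl[tag.Key]
	if !ok {
		return tag, false // tag is not subject to exclusion
	}
	if _, ok := allowed[tag.Value]; ok {
		return tag, false // value is on the allow-list
	}
	return Tag{Key: tag.Key, Value: excludedValue}, true
}

// normalizeTagsForCaching returns tags itself (no allocation) when nothing
// needs substitution, and clones the slice only on the first substitution.
func normalizeTagsForCaching(tags []Tag, excl excludeTags) []Tag {
	var out []Tag
	for i, t := range tags {
		nt, changed := normalizeTag(t, excl)
		if !changed {
			continue
		}
		if out == nil {
			out = slices.Clone(tags)
		}
		out[i] = nt
	}
	if out == nil {
		return tags
	}
	return out
}

func main() {
	excl := excludeTags{"activityType": {}}
	a := normalizeTagsForCaching([]Tag{{"activityType", "TypeA"}}, excl)
	b := normalizeTagsForCaching([]Tag{{"activityType", "TypeB"}}, excl)
	fmt.Println(a[0] == b[0])
}
```

Since TypeA and TypeB normalize to the same tag, any key derived from the normalized slice is identical, so both calls would land on one cache entry.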

Test added

TestWithTags_ExcludedTagsShareChildHandler — verifies that WithTags(ActivityTypeTag("TypeA")), WithTags(ActivityTypeTag("TypeB")), and WithTags(ActivityTypeTag("TypeC")) all return the same handler pointer and correctly accumulate metrics.

rodrigozhou · 2026-05-01T22:38:17Z

+		childCache     sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler
+		scopeCache     sync.Map // tagsCacheKey(normalized tags) -> tally.Scope


childCache seems to be an unbounded cache, and scopeCache seems to be bounded at scopeCacheMaxSize (set to 1024) which will stop caching afterwards.

In both cases, this could be bad if the cache grows too much, especially since the tag value is part of the key. Wondering if an LRU cache would be better.
cc: @yycptt

rodrigozhou · 2026-05-05T18:15:30Z

+// scopeCacheMaxSize is the approximate upper bound on cached scope entries.
+// The bound may be slightly exceeded under high concurrency due to
+// check-then-store races, which is acceptable.
+const scopeCacheMaxSize = 1024


Rename maxCacheSize

This limit could probably be more generous, maybe 10k. There are a lot of metrics and once the handler hits the limit, all additional metrics are permanently uncached.

Consider clearing the cache periodically to compensate for working set shifts over time and reduce performance volatility.

What is periodically? Once a ... ? Based on usage perhaps? Hit/Miss ratio?

rodrigozhou · 2026-05-05T18:45:12Z

+		childCache     sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler
+		scopeCache     sync.Map // tagsCacheKey(normalized tags) -> tally.Scope


On second thought, this cache is per handler, i.e., we get a cache every time a new handler is created (e.g., from calling WithTags). I wonder how this is going to perform at very large scale with thousands of namespaces, workflow types, etc., which are common tags. I'm especially curious about the additional memory usage.
cc: @yycptt

BenEddy

The PR description mentions micro-benchmarks -- could you please include the benchmarks alongside the tests or in a separate bench file?

BenEddy · 2026-05-06T05:26:31Z

+		childCache     sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler
+		scopeCache     sync.Map // tagsCacheKey(normalized tags) -> tally.Scope


Agreed with @rodrigozhou that the cache should be more strictly bounded.

I'd suggest dropping childCache in favor of passing a shared scopeCache down to child handlers. Then introduce a scopeKey alongside the scope field so that child handlers can generate the correct cache key. Something like:

 tallyMetricsHandler struct {
 	scope          tally.Scope
+	scopeKey       string
 	perUnitBuckets map[MetricUnit]tally.Buckets
 	excludeTags    excludeTags
-	childCache     sync.Map
-	childCacheSize atomic.Int64
-	scopeCache     sync.Map
+	scopeCache     *sync.Map // shared by all handlers
-	scopeCacheSize atomic.Int64
+	scopeCacheSize *atomic.Int64
 }
 )

Incorporate scopeKey into the cache key as a prefix.

-func tagsCacheKey(tags []Tag) string {
+func tagsCacheKey(prefix string, tags []Tag) string {
 	// ....
 	sb.WriteString(prefix)
 	// ...
 }

And propagate scopeCache to descendants.

 func (tmh *tallyMetricsHandler) WithTags(tags ...Tag) Handler {
-	key := tagsCacheKey(normalizeTagsForCaching(tags, tmh.excludeTags))
+	key := tagsCacheKey(tmh.scopeKey, normalizeTagsForCaching(tags, tmh.excludeTags))
 	child := &tallyMetricsHandler{
 		scope:          tmh.scope.Tagged(tagsToMap(tags, tmh.excludeTags)),
+		scopeKey:       key,
 		perUnitBuckets: tmh.perUnitBuckets,
 		excludeTags:    tmh.excludeTags,
+		scopeCache:     tmh.scopeCache,
+		scopeCacheSize: tmh.scopeCacheSize,
 	}
 }

BenEddy · 2026-05-06T06:19:35Z

+// scopeCacheMaxSize is the approximate upper bound on cached scope entries.
+// The bound may be slightly exceeded under high concurrency due to
+// check-then-store races, which is acceptable.
+const scopeCacheMaxSize = 1024


This limit could probably be more generous, maybe 10k. There are a lot of metrics and once the handler hits the limit, all additional metrics are permanently uncached.

Consider clearing the cache periodically to compensate for working set shifts over time and reduce performance volatility.

mykaul · 2026-05-06T19:58:08Z

v2 — Review Cycle 4/5 Update

Force-pushed a single squashed commit replacing the previous 2-commit history.

Changes since v1:

Length-prefixed cache keys — replaced \x00 delimiter with binary.PutUvarint + raw bytes encoding. Prevents collisions when tag keys/values contain NUL bytes.
histogramCacheKey struct — replaced name + "\x00" + unit string concatenation with a typed struct as the sync.Map key. Eliminates ambiguity.
Shared bounded cache — single sharedScopeCache with sync.RWMutex + double-check locking (replaces per-handler sync.Map from v1 for scope/handler caching).
Per-handler closure caching — sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1).
normalizeTagsForCaching — zero-alloc fast path when no tags are excluded; slices.Clone only when substitution is needed.
Clear-on-overflow — configurable TagsCacheMaxSize (default 10000) with full cache clear when limit is reached.
Comprehensive tests — 20+ test cases covering concurrency, cache bounds, excluded tag merging, key collision regression.
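The collision that length-prefixed keys prevent can be shown with a small sketch. Both functions below are hypothetical reconstructions written for this illustration (the Tag struct is simplified as well); only the use of binary.PutUvarint followed by the raw bytes is taken from the description above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Tag is a simplified stand-in for the metrics Tag type (assumption).
type Tag struct{ Key, Value string }

// nulSeparatedKey is the older scheme: \x00 after every key and value.
func nulSeparatedKey(tags []Tag) string {
	var buf []byte
	for _, t := range tags {
		buf = append(buf, t.Key...)
		buf = append(buf, 0)
		buf = append(buf, t.Value...)
		buf = append(buf, 0)
	}
	return string(buf)
}

// lengthPrefixedKey writes each key and value as a uvarint length followed
// by the raw bytes, so no byte inside a tag can be mistaken for a boundary.
func lengthPrefixedKey(tags []Tag) string {
	var tmp [binary.MaxVarintLen64]byte
	var buf []byte
	put := func(s string) {
		n := binary.PutUvarint(tmp[:], uint64(len(s)))
		buf = append(buf, tmp[:n]...)
		buf = append(buf, s...)
	}
	for _, t := range tags {
		put(t.Key)
		put(t.Value)
	}
	return string(buf)
}

func main() {
	// One tag whose value contains NUL bytes versus two ordinary tags.
	one := []Tag{{"a", "b\x00c\x00d"}}
	two := []Tag{{"a", "b"}, {"c", "d"}}
	fmt.Println("NUL-separated keys equal:  ", nulSeparatedKey(one) == nulSeparatedKey(two))
	fmt.Println("length-prefixed keys equal:", lengthPrefixedKey(one) == lengthPrefixedKey(two))
}
```

The two tag sets produce the same bytes under the separator scheme but different bytes once each field carries its own length.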

Benchmark Results (6 server cores, 4 ScyllaDB cores, mc1200, 3 min):

Metric                Baseline (origin/main)  With cache    Delta
Throughput            168.4 iter/s            176.9 iter/s  +5.0%
alloc_objects (60s)   160.9M                  150.7M        -6.3%
alloc_space (60s)     13,563 MB               11,957 MB     -11.8%
Metrics layer allocs  1,742 MB                410 MB        -76.5%

Review Cycle 5 (self-review):

No correctness issues found
Double-check locking pattern verified sound
Length-prefix encoding verified collision-free
create() outside write lock is intentional (avoids holding lock during scope.Tagged())
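For readers unfamiliar with the pattern, a rough sketch of a shared cache with sync.RWMutex, double-check locking, and clear-on-overflow might look like this. The type sharedCache and its fields are hypothetical simplifications; the real sharedScopeCache and loadOrStoreScope() may differ in detail.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedCache is a hypothetical simplification of the shared scope cache.
type sharedCache struct {
	mu      sync.RWMutex
	entries map[string]any
	maxSize int
}

func newSharedCache(maxSize int) *sharedCache {
	return &sharedCache{entries: make(map[string]any), maxSize: maxSize}
}

// loadOrStore returns the cached value for key, creating it if needed.
func (c *sharedCache) loadOrStore(key string, create func() any) any {
	// First check under the shared read lock (the common, cheap path).
	c.mu.RLock()
	v, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return v
	}

	// Build the value without holding any lock, since create may be slow.
	v = create()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Second check: another goroutine may have stored the key meanwhile.
	if existing, ok := c.entries[key]; ok {
		return existing
	}
	// Clear-on-overflow: drop every entry once the limit is reached.
	if len(c.entries) >= c.maxSize {
		c.entries = make(map[string]any)
	}
	c.entries[key] = v
	return v
}

func main() {
	c := newSharedCache(2)
	mk := func() any { return new(int) }
	c.loadOrStore("a", mk)
	c.loadOrStore("b", mk)
	c.loadOrStore("c", mk) // limit reached: cache is cleared, then "c" is stored
	fmt.Println(len(c.entries))
}
```

Calling create before taking the write lock means two racing goroutines may both build a value and one result is discarded; in exchange, the write lock is never held during the potentially expensive call.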

Introduce a shared bounded scope cache with sync.RWMutex and double-check locking to eliminate repeated tally.Scope.Tagged() allocations. Cache Counter/Timer/Histogram/Gauge closures per handler via sync.Map.

Key design choices:
- Length-prefixed cache keys (binary.PutUvarint) prevent collisions when tag keys/values contain NUL bytes
- histogramCacheKey struct avoids string concatenation ambiguity
- normalizeTagsForCaching applies excludeTags before key computation so high-cardinality excluded values share a single cache entry
- Clear-on-overflow bounds memory (configurable TagsCacheMaxSize, default 10000)
- Zero-alloc fast path when no tags need normalization

Benchmark results (6 server cores, ScyllaDB 2026.2.0-rc0, mc1200):
- Throughput: 176.9 iter/s (+5.0% vs baseline 168.4)
- alloc_objects: 150.7M (-6.3% vs baseline 160.9M)
- alloc_space: 11,957 MB (-11.8% vs baseline 13,563 MB)
- Metrics layer allocs: 410 MB (-76.5% vs baseline 1,742 MB)

BenEddy

Per-handler closure caching — sync.Map for Counter/Timer/Histogram/Gauge closures (unchanged from v1)

Hmm maybe obscured by force pushing, but this change is net new since I last reviewed the PR (I might have missed v1). The closure caches are per-handler and have no eviction and no bound, so memory usage scales with handlers x metrics and undermines the sharedScopeCache bound. We should either remove them or share them across handler instances.

mykaul · 2026-05-18T07:16:43Z

ARGH - did I break something? Or confuse branches? Let me check. If I remember correctly, I only wanted to fix the formatting CI failures.

mykaul · 2026-05-18T07:26:31Z

OK, checked - only formatting changes.

To your concern:

cachedTaggedScope is bounded.
The closure cache keys use the metric name, so in practice they are bounded by the metric catalog (about 200 names?).
sharedScopeCache is limited to 10K entries: loadOrStoreScope(), for example, clears the cache when it is over the max, which is 10K by default.

So overall, 200 * 10K. Is that acceptable?

BenEddy

@mykaul Thanks for your patience here -- and good point, the closure cache is effectively bounded by the metric catalog.

The reason I’m pushing on the cache bound is that we’ve seen large heaps with millions of long-lived, repeatedly scanned objects drive significant CPU volatility in production. Longer mark phases also tend to increase GC assist overhead. 200 * 10K is not problematic in isolation, but we’re generally cautious about increasing heap pointer density, and since callers can leak handlers, any bound is ultimately soft.

That said, I’m happy to land this as-is, and I can follow up with a separate PR to share the closure caches across handlers. Appreciate the work on this!

mykaul · 2026-05-19T06:25:17Z

Thanks - all I have is my poor laptop to test 'at scale' with Omes, and it's not the greatest setup, and I'm not running long lived perf tests, just 5-15m or so. Since I'm somewhat reluctant to change the schema (which is the main pain point I've identified), I'm left with the rest of the work - trying to parallelise work (which is challenging with the 27 serial LWT query executions) and reducing memory and GC all over. The rest of the optimizations are really minimalistic.

mykaul requested review from a team as code owners March 23, 2026 07:57

mykaul force-pushed the perf/cache-metrics-handler-tags branch from 5cd1af6 to 0069456 Compare March 27, 2026 08:13

mykaul requested a review from a team as a code owner March 27, 2026 08:13

mykaul force-pushed the perf/cache-metrics-handler-tags branch from 0069456 to 7a986ca Compare April 24, 2026 09:16

mykaul changed the title ~~perf: cache metrics handler WithTags and Tagged scope lookups~~ perf: cache tally metrics handler scopes and WithTags handlers to reduce allocations Apr 24, 2026

This was referenced Apr 24, 2026

Memory usage improvements #9565

Closed

perf: cache WithTags handlers in tallyMetricsHandler to reduce allocations (replaced with https://github.com/temporalio/temporal/pull/9620 ) #10049

Closed

rodrigozhou requested changes May 1, 2026

rodrigozhou requested a review from yycptt May 1, 2026 22:38

mykaul force-pushed the perf/cache-metrics-handler-tags branch from 7a986ca to 297c342 Compare May 2, 2026 09:50

rodrigozhou requested changes May 5, 2026

BenEddy reviewed May 6, 2026

mykaul force-pushed the perf/cache-metrics-handler-tags branch from 297c342 to 06419dd Compare May 6, 2026 19:57

rodrigozhou approved these changes May 15, 2026

Comment thread common/metrics/tally_metrics_handler.go

rodrigozhou requested a review from BenEddy May 15, 2026 18:56

mykaul force-pushed the perf/cache-metrics-handler-tags branch from 06419dd to 84136be Compare May 17, 2026 12:51

BenEddy requested changes May 17, 2026

BenEddy approved these changes May 19, 2026

rodrigozhou merged commit e59d6da into temporalio:main May 19, 2026
70 of 72 checks passed
