perf: cache tally metrics handler scopes and WithTags handlers to reduce allocations#9620
Conversation
5cd1af6 to
0069456
Compare
v2 — rebased on main, fix unbounded childCache growthRebased on current What changed
Previously, After this fix, all excluded-tag variants normalize to the same key (e.g., Test added
|
0069456 to
7a986ca
Compare
| childCache sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler | ||
| scopeCache sync.Map // tagsCacheKey(normalized tags) -> tally.Scope |
There was a problem hiding this comment.
childCache seems to be an unbounded cache, and scopeCache seems to be bounded at scopeCacheMaxSize (set to 1024) which will stop caching afterwards.
In both cases, this could be bad if the cache grows too much, specially since the tag value is part of the key. Wondering if an LRU cache would be better.
cc: @yycptt
There was a problem hiding this comment.
On a second though, this cache is per handler, ie., we get a cache every time a new handler is created (eg: from calling WithTags). I wonder how this is gonna perform in very large scale with thousands of namespaces, workflow types, etc., that are common tags. Specially curious about the additional memory usage.
cc: @yycptt
There was a problem hiding this comment.
Agreed with @rodrigozhou that the cache should be more strictly bounded.
I'd suggest dropping childCache in favor of passing a shared scopeCache down to child handlers. Then introduce a scopeKey alongside the scope field so that child handlers can generate the correct cache key. Something like:
tallyMetricsHandler struct {
scope tally.Scope
+ scopeKey string
perUnitBuckets map[MetricUnit]tally.Buckets
excludeTags excludeTags
- childCache sync.Map
- childCacheSize atomic.Int64
- scopeCache sync.Map
+ scopeCache *sync.Map // shared by all handlers
- scopeCacheSize atomic.Int64
+ scopeCacheSize *atomic.Int64
}
)Incorporate scopeKey into the cache key as a prefix.
-func tagsCacheKey(tags []Tag) string {
+func tagsCacheKey(prefix string, tags []Tag) string {
// ....
sb.WriteString(prefix)
// ...
}And propagate scopeCache to descendants.
func (tmh *tallyMetricsHandler) WithTags(tags ...Tag) Handler {
- key := tagsCacheKey(normalizeTagsForCaching(tags, tmh.excludeTags))
+ key := tagsCacheKey(tmh.scopeKey, normalizeTagsForCaching(tags, tmh.excl
udeTags))
child := &tallyMetricsHandler{
scope: tmh.scope.Tagged(tagsToMap(tags, tmh.excludeTags)),
+ scopeKey: key
perUnitBuckets: tmh.perUnitBuckets,
excludeTags: tmh.excludeTags,
+ scopeCache: tmh.scopeCache,
+ scopeCacheSize: tmh.scopeCacheSize,
}
}7a986ca to
297c342
Compare
| // scopeCacheMaxSize is the approximate upper bound on cached scope entries. | ||
| // The bound may be slightly exceeded under high concurrency due to | ||
| // check-then-store races, which is acceptable. | ||
| const scopeCacheMaxSize = 1024 |
There was a problem hiding this comment.
This limit could probably be more generous, maybe 10k. There are a lot of metrics and once the handler hits the limit, all additional metrics are permanently uncached.
Consider clearing the cache periodically to compensate for working set shifts over time and reduce performance volatility.
There was a problem hiding this comment.
What is periodically? Once a ... ? Based on usage perhaps? Hit/Miss ratio?
| childCache sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler | ||
| scopeCache sync.Map // tagsCacheKey(normalized tags) -> tally.Scope |
There was a problem hiding this comment.
On a second though, this cache is per handler, ie., we get a cache every time a new handler is created (eg: from calling WithTags). I wonder how this is gonna perform in very large scale with thousands of namespaces, workflow types, etc., that are common tags. Specially curious about the additional memory usage.
cc: @yycptt
BenEddy
left a comment
There was a problem hiding this comment.
The PR description mentions micro-benchmarks -- could you please include the benchmarks alongside the tests or in a separate bench file?
| childCache sync.Map // tagsCacheKey(tags) -> *tallyMetricsHandler | ||
| scopeCache sync.Map // tagsCacheKey(normalized tags) -> tally.Scope |
There was a problem hiding this comment.
Agreed with @rodrigozhou that the cache should be more strictly bounded.
I'd suggest dropping childCache in favor of passing a shared scopeCache down to child handlers. Then introduce a scopeKey alongside the scope field so that child handlers can generate the correct cache key. Something like:
tallyMetricsHandler struct {
scope tally.Scope
+ scopeKey string
perUnitBuckets map[MetricUnit]tally.Buckets
excludeTags excludeTags
- childCache sync.Map
- childCacheSize atomic.Int64
- scopeCache sync.Map
+ scopeCache *sync.Map // shared by all handlers
- scopeCacheSize atomic.Int64
+ scopeCacheSize *atomic.Int64
}
)Incorporate scopeKey into the cache key as a prefix.
-func tagsCacheKey(tags []Tag) string {
+func tagsCacheKey(prefix string, tags []Tag) string {
// ....
sb.WriteString(prefix)
// ...
}And propagate scopeCache to descendants.
func (tmh *tallyMetricsHandler) WithTags(tags ...Tag) Handler {
- key := tagsCacheKey(normalizeTagsForCaching(tags, tmh.excludeTags))
+ key := tagsCacheKey(tmh.scopeKey, normalizeTagsForCaching(tags, tmh.excl
udeTags))
child := &tallyMetricsHandler{
scope: tmh.scope.Tagged(tagsToMap(tags, tmh.excludeTags)),
+ scopeKey: key
perUnitBuckets: tmh.perUnitBuckets,
excludeTags: tmh.excludeTags,
+ scopeCache: tmh.scopeCache,
+ scopeCacheSize: tmh.scopeCacheSize,
}
}| // scopeCacheMaxSize is the approximate upper bound on cached scope entries. | ||
| // The bound may be slightly exceeded under high concurrency due to | ||
| // check-then-store races, which is acceptable. | ||
| const scopeCacheMaxSize = 1024 |
There was a problem hiding this comment.
This limit could probably be more generous, maybe 10k. There are a lot of metrics and once the handler hits the limit, all additional metrics are permanently uncached.
Consider clearing the cache periodically to compensate for working set shifts over time and reduce performance volatility.
297c342 to
06419dd
Compare
v2 — Review Cycle 4/5 UpdateForce-pushed a single squashed commit replacing the previous 2-commit history. Changes since v1:
Benchmark Results (6 server cores, 4 ScyllaDB cores, mc1200, 3 min):
Review Cycle 5 (self-review):
|
Introduce a shared bounded scope cache with sync.RWMutex and double-check locking to eliminate repeated tally.Scope.Tagged() allocations. Cache Counter/Timer/Histogram/Gauge closures per handler via sync.Map. Key design choices: - Length-prefixed cache keys (binary.PutUvarint) prevent collisions when tag keys/values contain NUL bytes - histogramCacheKey struct avoids string concatenation ambiguity - normalizeTagsForCaching applies excludeTags before key computation so high-cardinality excluded values share a single cache entry - Clear-on-overflow bounds memory (configurable TagsCacheMaxSize, default 10000) - Zero-alloc fast path when no tags need normalization Benchmark results (6 server cores, ScyllaDB 2026.2.0-rc0, mc1200): - Throughput: 176.9 iter/s (+5.0% vs baseline 168.4) - alloc_objects: 150.7M (-6.3% vs baseline 160.9M) - alloc_space: 11,957 MB (-11.8% vs baseline 13,563 MB) - Metrics layer allocs: 410 MB (-76.5% vs baseline 1,742 MB)
06419dd to
84136be
Compare
BenEddy
left a comment
There was a problem hiding this comment.
Per-handler closure caching —
sync.Mapfor Counter/Timer/Histogram/Gauge closures (unchanged from v1)
Hmm maybe obscured by force pushing, but this change is net new since I last reviewed the PR (I might have missed v1). The closure caches are per-handler and have no eviction and no bound, so memory usage scales with handlers x metrics and undermines the sharedScopeCache bound. We should either remove them or share them across handler instances.
ARGH - did I break something? Or confused branches? Let me check. If I remember I only wanted to fix the formatting CI failures. |
OK, checked - only formatting changes.
So overall, 200 * 10K - is that acceptable? |
BenEddy
left a comment
There was a problem hiding this comment.
@mykaul Thanks for your patience here -- and good point, the closure cache is effectively bounded by the metric catalog.
The reason I’m pushing on the cache bound is that we’ve seen large heaps with millions of long-lived, repeatedly scanned objects drive significant CPU volatility in production. Longer mark phases also tend to increase GC assist overhead. 200 * 10K is not problematic in isolation, but we’re generally cautious about increasing heap pointer density, and since callers can leak handlers, any bound is ultimately soft.
That said, I’m happy to land this as-is, and I can follow up with a separate PR to share the closure caches across handlers. Appreciate the work on this!
Thanks - all I have is my poor laptop to test 'at scale' with Omes, and it's not the greatest setup, and I'm not running long lived perf tests, just 5-15m or so. Since I'm somewhat reluctant to change the schema (which is the main pain point I've identified), I'm left with the rest of the work - trying to parallelise work (which is challenging with the 27 serial LWT query executions) and reducing memory and GC all over. The rest of the optimizations are really minimalistic. |
Summary
WithTags()child handlers viasync.Mapto eliminate repeatedtagsToMap(),scope.Tagged(), and handler struct allocations on the hot pathscope.Tagged()results per unique inline tag combination incachedTaggedScope(), bounded to 1024 entries with graceful degradationactivityType) share a single cache entry, preventing unbounded cache growthDesign
Two complementary caching layers in
tallyMetricsHandler:childCache(sync.Map): caches entire handler subtrees returned byWithTags(). On cache hit: zero allocations.scopeCache(sync.Map+atomic.Int64size bound): cachestally.Scopeobjects returned byscope.Tagged()for inline tags passed toCounter/Gauge/Timer/HistogramRecord()calls. Bounded to 1024 entries; beyond that, scopes are created but not cached.Both caches use
LoadOrStorefor safe concurrent access. Tag normalization vianormalizeTagsForCaching()ensures excluded tag variants collapse to the same cache key. The normalization has a zero-alloc fast path when no tags need substitution.Allocation Reduction (pprof alloc_space, 5min ScyllaDB workload)
Commit 1: WithTags handler cache
Commit 2: Scope cache for inline tags
Benchmark (omes throughput_stress, mc150, 5 min)
Host networking, i7-1270P 4 cores/component, inter-run data resets:
Note: Throughput variance at mc150 is ~5-10%. The allocation reduction is confirmed by pprof but throughput gains are within noise at this concurrency level.
Testing
v2 — addressing review feedback
All 6 review comments addressed. Rebased on
origin/main, squashed into a single commit.Changes
tagsCacheKey: Usestrings.BuilderwithGrowpre-allocation — replaced manual[]byteconstruction withstrings.Builder. A sizing pass pre-computes the exact capacity viaGrow()to avoid internal reallocation (1 alloc/op).tagsCacheKey: Remove single-tag special case, uniform\x00separator — removed thelen(tags) == 1branch. Every tag pair now unconditionally appends\x00after both key and value, making the format uniform and the code simpler.normalizeTagsForCaching: Useslices.Clone— replacedmake([]Tag, len(tags))+copy(tags[:i])withslices.Clone(tags), which copies the entire slice upfront. This eliminates theif normalized != nil { normalized[i] = t }guard for unchanged tags after the clone point.Extract shared
normalizeTagfunction — the exclude-tag check was duplicated betweennormalizeTagsForCachingand theconvertclosure intagsToMap. ExtractednormalizeTag(tag Tag, excl excludeTags) (Tag, bool)used by both, removing the duplication.Bound
childCachetoscopeCacheMaxSize—childCache(used byWithTags) was previously unbounded. Applied the same bounding strategy asscopeCache: atomic counter + stop caching beyond 1024 entries. AddedchildCacheSize atomic.Int64field andTestWithTags_BoundedChildCacheSizetest.Micro-benchmark results (cached vs uncached)
Not changed (deliberate)