VersionMembershipCache: Metrics and refactorings! #8894

Merged
Shivs11 merged 6 commits into main from ss/history-cache-metrics
Dec 24, 2025

@Shivs11 Shivs11 commented Dec 22, 2025

What changed?

  • Added metrics, such as cache hits and cache misses, so that we can understand if the currently set TTL for this cache (of 1 second) is too low or too high.
  • Also did some refactoring: while working on this, I realized it was much simpler to add a new wrapper with a metrics handler, one that specifically serves the use case of tracking cache hits and misses, than to use any of the existing cache implementations that have a metrics handler attached to them (see NewWithMetrics).
  • Thus, I took some inspiration from the way newEventsCache was implemented and came up with this.

Why?

  • Explained above.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • None.

Note

Introduces a typed, instrumented cache for worker versioning and refactors call sites to use it.

  • Adds VersionMembershipCache interface and NewVersionMembershipCache wrapper emitting metrics (VersionMembershipCacheGet/Put with cache_type=version_membership)
  • Replaces direct cache.Cache usage and ad-hoc keys in worker_versioning validation with the new cache API
  • Wires the cache via FX provider in history service, returning the wrapped cache; plumbs through engine, starters, and APIs (startworkflow, signalwithstartworkflow, resetworkflow, multioperation, updateworkflowoptions)
  • Updates metrics definitions with new cache type tag and operation scopes
  • Test updates: introduce simple/noop implementations and adjust existing tests to the new interface
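The wrapper pattern summarized in this note can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `untypedCache`, `mapCache`, `membershipKey`, `instrumentedCache`, and the counter names are all hypothetical, and a plain map of counters stands in for a real metrics handler.

```go
package main

import "fmt"

// untypedCache mimics the general shape of an any-keyed cache.
// Hypothetical stand-in; this is not Temporal's cache.Cache interface.
type untypedCache interface {
	Get(key any) any
	Put(key any, value any)
}

// mapCache is a toy untypedCache with no eviction, TTL, or locking.
type mapCache map[any]any

func (m mapCache) Get(key any) any        { return m[key] }
func (m mapCache) Put(key any, value any) { m[key] = value }

// membershipKey is a hypothetical typed key that replaces ad-hoc key values.
type membershipKey struct {
	NamespaceID string
	TaskQueue   string
	BuildID     string
}

// instrumentedCache wraps an untypedCache behind a typed API and counts
// puts, hits, and misses. It is not safe for concurrent use.
type instrumentedCache struct {
	inner  untypedCache
	counts map[string]int
}

func newInstrumentedCache(inner untypedCache) *instrumentedCache {
	return &instrumentedCache{inner: inner, counts: map[string]int{}}
}

// Get returns the cached membership flag and whether the key was present.
func (c *instrumentedCache) Get(key membershipKey) (isMember bool, ok bool) {
	v := c.inner.Get(key)
	isMember, ok = v.(bool) // a nil or non-bool value is treated as a miss
	if !ok {
		c.counts["miss"]++
		return false, false
	}
	c.counts["hit"]++
	return isMember, true
}

// Put stores the membership flag under the typed key.
func (c *instrumentedCache) Put(key membershipKey, isMember bool) {
	c.counts["put"]++
	c.inner.Put(key, isMember)
}

func main() {
	c := newInstrumentedCache(mapCache{})
	k := membershipKey{NamespaceID: "ns", TaskQueue: "tq", BuildID: "b1"}
	c.Get(k) // miss: nothing cached yet
	c.Put(k, true)
	c.Get(k) // hit
	fmt.Println(c.counts)
}
```

Keeping the typed key and the counters in the wrapper means call sites never build keys by hand, and the underlying cache can be swapped without touching them.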

Written by Cursor Bugbot for commit 920106f.

@Shivs11 Shivs11 requested review from a team as code owners December 22, 2025 22:15
Comment on lines +50 to +53
type testVersionMembershipCache struct {
	mu sync.Mutex
	m  map[testVersionMembershipCacheKey]bool
}
Member Author

I was actually using an instance of cache.Cache in some of the unit tests in this file. I thought I might as well switch those tests to the newly implemented cache, since that also exercises its functionality.
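A test double with the struct shape shown above could look roughly like this. It is a sketch only: the `Get` and `Put` signatures and the fields of `testKey` are assumptions, since the actual interface methods and key type are not shown on this page.

```go
package main

import (
	"fmt"
	"sync"
)

// testKey is a hypothetical key; the real key's fields are not shown here.
type testKey struct {
	namespaceID string
	taskQueue   string
	buildID     string
}

// testCache is a mutex-guarded map, mirroring the struct in the diff above.
type testCache struct {
	mu sync.Mutex
	m  map[testKey]bool
}

func newTestCache() *testCache {
	return &testCache{m: make(map[testKey]bool)}
}

// Get returns the stored flag and whether the key was present.
// The signature is an assumption made for this sketch.
func (c *testCache) Get(key testKey) (isMember bool, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	isMember, ok = c.m[key]
	return isMember, ok
}

// Put stores the flag; entries never expire in this test double.
func (c *testCache) Put(key testKey, isMember bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = isMember
}

func main() {
	c := newTestCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Concurrent writers are safe because Put takes the mutex.
			c.Put(testKey{buildID: fmt.Sprintf("build-%d", i)}, i%2 == 0)
		}(i)
	}
	wg.Wait()
	isMember, ok := c.Get(testKey{buildID: "build-2"})
	fmt.Println(isMember, ok)
}
```

Returning a separate `ok` flag lets tests distinguish "cached as not a member" from "not cached at all", which a bare `bool` return could not express.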

Comment thread service/history/version_membership_cache.go Outdated
func newVersionMembershipCache(c cache.Cache, metricsHandler metrics.Handler) worker_versioning.VersionMembershipCache {
	h := metricsHandler.WithTags(metrics.CacheTypeTag("version_membership"))
	return &versionMembershipCache{
		cache: c,
Contributor

it's concerning to me that I don't see eviction happening anywhere. But maybe I've just missed it. In the events cache, the underlying cache is the LRU cache, which makes sense to me. That also provides the hit and miss metrics that you need, in a way that is already standardized. Is it possible to just use the existing LRU cache instead of having to reimplement and re-test eviction logic elsewhere?

Member Author

This is how the cache here is being initialized (in the fx.go file):

func VersionMembershipCacheProvider(
	lc fx.Lifecycle,
	serviceConfig *configs.Config,
	metricsHandler metrics.Handler,
) worker_versioning.VersionMembershipCache {
	c := commoncache.New(serviceConfig.VersionMembershipCacheMaxSize(), &commoncache.Options{
		TTL: max(1*time.Second, serviceConfig.VersionMembershipCacheTTL()),
	})
	lc.Append(fx.Hook{
		OnStop: func(context.Context) error {
			c.Stop()
			return nil
		},
	})
	return newVersionMembershipCache(c, metricsHandler)
}

The underlying cache here is a StoppableCache (just like the events cache), which is also an LRU cache. Since the versioning cache is an LRU cache, the eviction logic lives in the lru.go file, and the tests I added in my previous PR validate it.

The underlying StoppableCache does not emit cache hit and cache miss metrics (which is what we are interested in), which is why I defined this new wrapper on top of it. This also seems to be one of the main reasons why the events cache was implemented as a layer on top of the StoppableCache.

Contributor

@prathyushpv prathyushpv Dec 23, 2025

Actually LRU cache can emit metrics. There is this constructor which will create a cache with a metrics handler:

func NewWithMetrics(maxSize int, opts *Options, handler metrics.Handler) StoppableCache {

Oh, I see that it doesn't emit hit and miss metrics :/

Contributor

ty for showing me where the TTL is set! my main concern is fine then.
I thought that the LRU cache / stoppable would have metrics to provide the hit rate, but honestly I'm not sure what some of the metrics it emits even mean, or how to compute hit rate from them:

  • NewGaugeDef("cache_pinned_usage") // I looked this up and it's the count of elements that are blocked from being evicted even if they are the LRU element
  • NewTimerDef("cache_entry_age_on_eviction")
  • NewGaugeDef("cache_usage")
  • NewTimerDef("cache_entry_age_on_get")
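None of the gauges and timers listed above counts lookups, so a hit rate needs separate hit and miss counts. Assuming those are available as two counters (how they are named and exported is not shown on this page), the computation could be sketched as:

```go
package main

import "fmt"

// hitRate returns hits / (hits + misses).
// The second result is false when there were no lookups at all,
// in which case the rate is undefined.
func hitRate(hits, misses int64) (float64, bool) {
	total := hits + misses
	if total == 0 {
		return 0, false
	}
	return float64(hits) / float64(total), true
}

func main() {
	// Example numbers only; these are not measurements from this PR.
	rate, ok := hitRate(750, 250)
	fmt.Println(rate, ok)
}
```

A persistently low rate at the current 1-second TTL would suggest that entries expire before they are reused, which is the question the PR description says these metrics are meant to answer.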

Comment thread common/worker_versioning/version_membership_cache.go Outdated
@Shivs11 Shivs11 enabled auto-merge (squash) December 23, 2025 23:27
@Shivs11 Shivs11 merged commit d8e8685 into main Dec 24, 2025
61 checks passed
@Shivs11 Shivs11 deleted the ss/history-cache-metrics branch December 24, 2025 00:00