VersionMembershipCache: Metrics and refactorings #8894
Conversation
```go
type testVersionMembershipCache struct {
	mu sync.Mutex
	m  map[testVersionMembershipCacheKey]bool
}
```
I was actually using an instance of cache.Cache in some of the unit tests in this file. I thought I might as well change that to use this newly implemented cache, since that also tests its functionality.
```go
func newVersionMembershipCache(c cache.Cache, metricsHandler metrics.Handler) worker_versioning.VersionMembershipCache {
	h := metricsHandler.WithTags(metrics.CacheTypeTag("version_membership"))
	return &versionMembershipCache{
		cache: c,
```
It's concerning to me that I don't see eviction happening anywhere. But maybe I've just missed it. In the events cache, the underlying cache is the LRU cache, which makes sense to me. That also provides the hit and miss metrics that you need, in a way that is already standardized. Is it possible to just use the existing LRU cache instead of having to reimplement and re-test eviction logic elsewhere?
This is how the cache here is being initialized (in the fx.go file):
```go
func VersionMembershipCacheProvider(
	lc fx.Lifecycle,
	serviceConfig *configs.Config,
	metricsHandler metrics.Handler,
) worker_versioning.VersionMembershipCache {
	c := commoncache.New(serviceConfig.VersionMembershipCacheMaxSize(), &commoncache.Options{
		TTL: max(1*time.Second, serviceConfig.VersionMembershipCacheTTL()),
	})
	lc.Append(fx.Hook{
		OnStop: func(context.Context) error {
			c.Stop()
			return nil
		},
	})
	return newVersionMembershipCache(c, metricsHandler)
}
```
The underlying cache here is a StoppableCache (exactly like the events cache), which is also an LRU cache. Since the versioning cache is an LRU cache, the eviction logic lives in the lru.go file, and the tests I added in my previous PR validate it.
The underlying StoppableCache does not emit cache-hit and cache-miss metrics (which is what we are interested in), which is why I defined this new wrapper on top of it. That also seems to be one of the main reasons the events cache was implemented as a layer on top of the StoppableCache.
Actually, the LRU cache can emit metrics. There is a constructor which will create a cache with a metrics handler:
Line 143 in abb6359
Oh, I see that it doesn't emit hit and miss metrics :/
ty for showing me where the TTL is set! my main concern is fine then.
I thought that the LRU cache / stoppable would have metrics to provide the hit rate, but honestly I'm not sure what some of the metrics it emits even mean, or how to compute hit rate from them:
```go
NewGaugeDef("cache_pinned_usage")          // I looked this up: the count of elements blocked from eviction even if they are the LRU element
NewTimerDef("cache_entry_age_on_eviction")
NewGaugeDef("cache_usage")
NewTimerDef("cache_entry_age_on_get")
```
Note
Introduces a typed, instrumented cache for worker versioning and refactors call sites to use it.
- Adds a VersionMembershipCache interface and a NewVersionMembershipCache wrapper emitting metrics (VersionMembershipCacheGet/Put with cache_type=version_membership)
- Replaces cache.Cache usage and ad-hoc keys in worker_versioning validation with the new cache API
- Updates the relevant call sites (startworkflow, signalwithstartworkflow, resetworkflow, multioperation, updateworkflowoptions)

Written by Cursor Bugbot for commit 920106f.