Add cache for task queue routing info in History#9168
Add cache for task queue routing info in History#9168ShahabT merged 4 commits intotemporalio:mainfrom
Conversation
| // backlog but because of an ongoing transition. See ActivityStartDuringTransition error usage. | ||
| // TODO (shahab): can we limit this adjustment to apply only for activities who actually faced the | ||
| // ActivityStartDuringTransition error, and not all others? | ||
| info.ScheduledTime = timestamppb.New(ms.timeSource.Now()) |
There was a problem hiding this comment.
@yycptt could you please review this line? what to ensure it does not breaks some assumptions in the system.
There was a problem hiding this comment.
sorry to be pendantic but I wonder if this should come in a different PR which has some tests that test that there the time delay has been present
There was a problem hiding this comment.
It's a fair comment. I did try to spend some time writing the test but this part is soooo internal that I could not come up with a good way to write a non-flaky test. I did though test it manually and verified the delay existed before and my change is canceling it.
Shivs11
left a comment
There was a problem hiding this comment.
approved but I don't think we should have that activity scheduled time change going in without tests
| // RoutingInfoCache is used to cache results of GetTaskQueueUserData | ||
| // calls followed by CalculateTaskQueueVersioningInfo computation. |
There was a problem hiding this comment.
nit: we can just say that this cache is used to cache results of GetTaskQueueUserData operations since that is the main RPC that is actually intensive.
| } | ||
| tqData, ok := resp.GetUserData().GetData().GetPerType()[int32(taskQueueType)] | ||
| // Check cache first for task queue routing info (independent of workflow ID) | ||
| current, currentRevisionNumber, ramping, rampingPercentage, rampingRevisionNumber, ok := routingInfoCache.Get(namespaceID, taskQueueName, taskQueueType) |
There was a problem hiding this comment.
something to remember about this change - I know that the cache TTL is only one second which is quite small, but during that one second, there could in theory be stale task queue user data reads which could make AU workflow/activity tasks bounce back and forth.
I think this is fine, just wanted to paste this here.
| // backlog but because of an ongoing transition. See ActivityStartDuringTransition error usage. | ||
| // TODO (shahab): can we limit this adjustment to apply only for activities who actually faced the | ||
| // ActivityStartDuringTransition error, and not all others? | ||
| info.ScheduledTime = timestamppb.New(ms.timeSource.Now()) |
There was a problem hiding this comment.
sorry to be pendantic but I wonder if this should come in a different PR which has some tests that test that there the time delay has been present
| "Error message should include the limit value") | ||
| } | ||
|
|
||
| func (s *Versioning3Suite) TestActivityRetryAutoUpgradeDuringBackoff() { |
There was a problem hiding this comment.
should we also have some functional tests that test the actual cache implementation that is now present?
I had placed some tests in this file where I increased the TTL and noticed if the calls are going through or not.
There was a problem hiding this comment.
sure, I see you have placed unit tests for the cache and since we are in a time crunch, you can ignore my top comment.
tests/versioning_3_test.go
Outdated
| err := workflow.ExecuteActivity(workflow.WithActivityOptions(ctx, workflow.ActivityOptions{ | ||
| StartToCloseTimeout: 10 * time.Second, | ||
| RetryPolicy: &temporal.RetryPolicy{ | ||
| InitialInterval: 3 * time.Second, // Give us time to change deployment |
There was a problem hiding this comment.
could this be flaky? 3 seconds seems less....like what happens if we change the deployment before that?
## What changed? Cache the result of `GetTaskQueueUserData` that history makes to Matching when an activity wants to start a deployment version transition. ## Why? This protects the matching root partition from being hammered by requests when a lot of AutoUpgrade workflows want to start activity-initiated transitions. Activity initiated transitions happen in one of the following cases: 1) Target version changed while activity was backlogged. 2) Target version changed while activity was in retry backoff 3) Target version changed in some edge cases involving parallel activities ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks The cache can potentially increase history mem usage but there are knobs to adjust size and ttl.
## What changed? Cache the result of `GetTaskQueueUserData` that history makes to Matching when an activity wants to start a deployment version transition. ## Why? This protects the matching root partition from being hammered by requests when a lot of AutoUpgrade workflows want to start activity-initiated transitions. Activity initiated transitions happen in one of the following cases: 1) Target version changed while activity was backlogged. 2) Target version changed while activity was in retry backoff 3) Target version changed in some edge cases involving parallel activities ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks The cache can potentially increase history mem usage but there are knobs to adjust size and ttl.
## What changed? Cherry-pick versioning PRs - #9168 - Cache for system protection - #9262 - Cache for system protection - #9239 - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature - #9147 - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`. - #9300 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #9316 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #8957 - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above - #9250 - Bug fix of task rescheduling edge case during AutoUpgrade Transition ## Why? For OSS v1.30.2 ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability. > > **Overview** > Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config. > > Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions. > > Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled. > > Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Shahab Tajik <shahab@temporal.io> Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
What changed?
Cache the result of
GetTaskQueueUserDatathat history makes to Matching when an activity wants to start a deployment version transition.Why?
This protects the matching root partition from being hammered by requests when a lot of AutoUpgrade workflows want to start activity-initiated transitions. Activity initiated transitions happen in one of the following cases:
How did you test it?
Potential risks
The cache can potentially increase history mem usage but there are knobs to adjust size and ttl.