Add cache for task queue routing info in History#9168

Merged
ShahabT merged 4 commits into temporalio:main from ShahabT:cache-routing-info
Jan 30, 2026

ShahabT (Contributor) commented Jan 29, 2026

What changed?

Cache the result of the GetTaskQueueUserData call that History makes to Matching when an activity wants to start a deployment version transition.

Why?

This protects the Matching root partition from being hammered by requests when a lot of AutoUpgrade workflows want to start activity-initiated transitions. Activity-initiated transitions happen in one of the following cases:

  1. Target version changed while the activity was backlogged.
  2. Target version changed while the activity was in retry backoff.
  3. Target version changed in some edge cases involving parallel activities.

How did you test it?

  • [ ] built
  • [ ] run locally and tested manually
  • [ ] covered by existing tests
  • [x] added new unit test(s)
  • [ ] added new functional test(s)

Potential risks

The cache can potentially increase History memory usage, but there are knobs to adjust its size and TTL.

@ShahabT ShahabT requested a review from Shivs11 January 29, 2026 19:50
@ShahabT ShahabT requested review from a team as code owners January 29, 2026 19:50
// backlog but because of an ongoing transition. See ActivityStartDuringTransition error usage.
// TODO (shahab): can we limit this adjustment to apply only for activities who actually faced the
// ActivityStartDuringTransition error, and not all others?
info.ScheduledTime = timestamppb.New(ms.timeSource.Now())

Contributor Author:

@yycptt could you please review this line? I want to ensure it does not break any assumptions in the system.

Member:

Sorry to be pedantic, but I wonder if this should come in a different PR that has some tests verifying that the time delay was present.

Contributor Author:

It's a fair comment. I did spend some time trying to write the test, but this part is so internal that I could not come up with a good way to write a non-flaky test. I did, though, test it manually and verified that the delay existed before and that my change cancels it.

Shivs11 (Member) left a comment:

Approved, but I don't think the activity scheduled-time change should go in without tests.

Comment on lines +11 to +12
// RoutingInfoCache is used to cache results of GetTaskQueueUserData
// calls followed by CalculateTaskQueueVersioningInfo computation.

Member:

nit: we can just say that this cache is used to cache results of GetTaskQueueUserData operations since that is the main RPC that is actually intensive.

}
tqData, ok := resp.GetUserData().GetData().GetPerType()[int32(taskQueueType)]
// Check cache first for task queue routing info (independent of workflow ID)
current, currentRevisionNumber, ramping, rampingPercentage, rampingRevisionNumber, ok := routingInfoCache.Get(namespaceID, taskQueueName, taskQueueType)

Member:

Something to remember about this change: I know the cache TTL is only one second, which is quite small, but during that second there could in theory be stale task queue user data reads, which could make AutoUpgrade (AU) workflow/activity tasks bounce back and forth.

I think this is fine, just wanted to paste this here.

"Error message should include the limit value")
}

func (s *Versioning3Suite) TestActivityRetryAutoUpgradeDuringBackoff() {

Member:

Should we also have some functional tests that exercise the actual cache implementation?
I had placed some tests in this file where I increased the TTL and checked whether the calls were going through.

Member:

Sure, I see you have added unit tests for the cache, and since we are in a time crunch, you can ignore my comment above.

err := workflow.ExecuteActivity(workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: 3 * time.Second, // Give us time to change deployment

Member:

Could this be flaky? 3 seconds seems short. What happens if we change the deployment before that?

Contributor Author:

Changed it to 5s.

@ShahabT ShahabT enabled auto-merge (squash) January 30, 2026 02:16
@ShahabT ShahabT merged commit b6e5e1e into temporalio:main Jan 30, 2026
108 of 110 checks passed
carlydf pushed a commit that referenced this pull request Mar 12, 2026
## What changed?
Cache the result of `GetTaskQueueUserData` that history makes to
Matching when an activity wants to start a deployment version
transition.

## Why?
This protects the matching root partition from being hammered by
requests when a lot of AutoUpgrade workflows want to start
activity-initiated transitions. Activity initiated transitions happen in
one of the following cases:
1) Target version changed while activity was backlogged.
2) Target version changed while activity was in retry backoff
3) Target version changed in some edge cases involving parallel
activities

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The cache can potentially increase history mem usage but there are knobs
to adjust size and ttl.
carlydf pushed a commit that referenced this pull request Mar 17, 2026
chaptersix pushed a commit that referenced this pull request Mar 17, 2026
## What changed?
Cherry-pick versioning PRs
- #9168
  - Cache for system protection
- #9262
  - Cache for system protection
- #9239
  - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature
- #9147
  - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`.
- #9300
  - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly
- #9316
  - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly
- #8957
  - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above
- #9250
  - Bug fix of task rescheduling edge case during AutoUpgrade Transition

## Why?
For OSS v1.30.2

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability.
>
> **Overview**
> Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config.
>
> Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions.
>
> Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled.
>
> Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Shahab Tajik <shahab@temporal.io>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>