Activate Drained/Inactive version if a workflow is "moved" to it#9147
Conversation
| // The history and mutable state generated above will be deleted by a background process. | ||
| resp, outcome, conflictErr := s.handleConflict(ctx, creationParams, currentWorkflowConditionFailedError) | ||
| if conflictErr == nil { | ||
| if conflictErr == nil && outcome == StartNew { |
There was a problem hiding this comment.
the main idea here was to only send the reactivation signal iff we have actually started/created a new workflow execution - all other outcomes, except StartNew, do not do that.
| shardContext historyi.ShardContext, | ||
| namespaceID namespace.ID, |
There was a problem hiding this comment.
I think most places who call this func already have the namespace entry. we should just get that instead of these two args.
There was a problem hiding this comment.
hmmm, you are right. Noticed that SignalVersionReactivation on the client already logs all errors via convertAndRecordError - so the Warn log in ReactivateVersionWorkflowIfPinned is redundant.
| // VersionReactivationSignalerFn is a function type for sending reactivation signals to version workflows. | ||
| // This abstraction allows the history API layer to use the deployment client without importing it directly, | ||
| // avoiding import cycles between history/api and worker/workerdeployment packages. | ||
| type VersionReactivationSignalerFn func( |
There was a problem hiding this comment.
any reason for this type instead of passing client itself?
There was a problem hiding this comment.
circular dependencies; my first stab at this involved me just directly passing the workerDeployment.Client, but that did not go well as:
service/history/api → service/worker/workerdeployment → service/history/api
| } | ||
|
|
||
| // Use signaler function to send the signal | ||
| err = signaler(ctx, nsEntry, pinnedVersion.GetDeploymentName(), pinnedVersion.GetBuildId()) |
There was a problem hiding this comment.
let' make sure this call is async so we don't add latency to user's request.
| go func() { | ||
| signaler(context.Background(), namespaceEntry, pinnedVersion.GetDeploymentName(), pinnedVersion.GetBuildId()) | ||
| }() |
There was a problem hiding this comment.
a thing to remember here is that any error, if present, would be caught in the client's recordAndError method and would be logged out. However, I am not returning the error over here since I don't think we should be stopping existing operations based on this error
carlydf
left a comment
There was a problem hiding this comment.
approved with small name change nit
Drained/Inactive versions if a workflow is "moved" to that version
Drained/Inactive versions if a workflow is "moved" to that versionDrained/Inactive version if a workflow is "moved" to it
…#9147) ## What changed? - activate drained/inactive versions to draining when they get a workflow started on it - also added a history cache, per history node, so that we don't bombard our version workflows with signals that shall change the drainage status of these workflows. ## Why? - versioning correctness, in the sense that if someone were to move a workflow on to a version that is drained, the drainage status should be updated to draining (since it now has one open workflow working on it) ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [x] added new functional test(s) ## Potential risks - Sure, this change is lowkey risky. Would appreciate a thorough review. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches history start/reset/update flows and adds asynchronous signaling into worker-deployment workflows; mis-wiring or cache/config issues could cause missing or excessive reactivation signals and unexpected version state churn under load. > > **Overview** > When workflows are **pinned to a specific deployment version**, history now triggers a fire-and-forget `reactivate-version` signal to the corresponding worker-deployment *version workflow* so versions in `DRAINED/INACTIVE` transition back to `DRAINING`. > > This wiring is applied across start paths (`StartWorkflowExecution` incl. conflict handling, `SignalWithStart`, multi-op start), option changes (`UpdateWorkflowExecutionOptions` after persistence), and `ResetWorkflowExecution` post-reset operations. A new per-history-node `ReactivationSignalCache` (TTL/max-size + metrics tags) deduplicates signals, and a new dynamic config flag (`history.enableVersionReactivationSignals`) plus cache settings control/limit load. > > Worker-deployment adds `Client.SignalVersionReactivation` and the version workflow gains a version-gated handler for `reactivate-version` to update drainage/status and sync summaries; extensive functional tests cover reactivation and cache dedup behavior. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 15a5ac6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
…#9147) ## What changed? - activate drained/inactive versions to draining when they get a workflow started on it - also added a history cache, per history node, so that we don't bombard our version workflows with signals that shall change the drainage status of these workflows. ## Why? - versioning correctness, in the sense that if someone were to move a workflow on to a version that is drained, the drainage status should be updated to draining (since it now has one open workflow working on it) ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [x] added new functional test(s) ## Potential risks - Sure, this change is lowkey risky. Would appreciate a thorough review. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches history start/reset/update flows and adds asynchronous signaling into worker-deployment workflows; mis-wiring or cache/config issues could cause missing or excessive reactivation signals and unexpected version state churn under load. > > **Overview** > When workflows are **pinned to a specific deployment version**, history now triggers a fire-and-forget `reactivate-version` signal to the corresponding worker-deployment *version workflow* so versions in `DRAINED/INACTIVE` transition back to `DRAINING`. > > This wiring is applied across start paths (`StartWorkflowExecution` incl. conflict handling, `SignalWithStart`, multi-op start), option changes (`UpdateWorkflowExecutionOptions` after persistence), and `ResetWorkflowExecution` post-reset operations. A new per-history-node `ReactivationSignalCache` (TTL/max-size + metrics tags) deduplicates signals, and a new dynamic config flag (`history.enableVersionReactivationSignals`) plus cache settings control/limit load. > > Worker-deployment adds `Client.SignalVersionReactivation` and the version workflow gains a version-gated handler for `reactivate-version` to update drainage/status and sync summaries; extensive functional tests cover reactivation and cache dedup behavior. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 15a5ac6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
## What changed? Cherry-pick versioning PRs - #9168 - Cache for system protection - #9262 - Cache for system protection - #9239 - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature - #9147 - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`. - #9300 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #9316 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #8957 - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above - #9250 - Bug fix of task rescheduling edge case during AutoUpgrade Transition ## Why? For OSS v1.30.2 ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability. > > **Overview** > Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config. > > Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions. > > Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled. > > Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Shahab Tajik <shahab@temporal.io> Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
What changed?
Why?
How did you test it?
Potential risks
Note
Medium Risk
Touches history start/reset/signal-with-start/update-options code paths and introduces asynchronous signaling + caching; failures won’t block requests but could cause missed or excessive reactivation signals if misconfigured.
Overview
Ensures worker-versioning drained/inactive versions get reactivated: when a workflow is started, signal-with-started, reset with post-reset option updates, or has execution options updated with a pinned
VersioningOverride, the history service now fire-and-forgets areactivate-versionsignal to the corresponding version workflow to move it back to DRAINING.Adds a per-history-node
version_reactivation_signaldedup cache (TTL/max-size + metrics) and a global kill-switch (history.enableVersionReactivationSignals) to avoid flooding version workflows; wires the cache +workerdeployment.Client.SignalVersionReactivationthrough history engine/fx and adds functional tests covering reactivation and cache dedup behavior.Written by Cursor Bugbot for commit 06ad026. This will update automatically on new commits. Configure here.