Skip to content

Activate Drained/Inactive version if a workflow is "moved" to it#9147

Merged
carlydf merged 31 commits into
mainfrom
ss/reactive-drained-inactive-wfs
Feb 12, 2026
Merged

Activate Drained/Inactive version if a workflow is "moved" to it#9147
carlydf merged 31 commits into
mainfrom
ss/reactive-drained-inactive-wfs

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented Jan 27, 2026

What changed?

  • activate drained/inactive versions to draining when they get a workflow started on it
  • also added a history cache, per history node, so that we don't bombard our version workflows with signals that shall change the drainage status of these workflows.

Why?

  • versioning correctness, in the sense that if someone were to move a workflow on to a version that is drained, the drainage status should be updated to draining (since it now has one open workflow working on it)

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • Sure, this change is lowkey risky. Would appreciate a thorough review.

Note

Medium Risk
Touches history start/reset/signal-with-start/update-options code paths and introduces asynchronous signaling + caching; failures won’t block requests but could cause missed or excessive reactivation signals if misconfigured.

Overview
Ensures worker-versioning drained/inactive versions get reactivated: when a workflow is started, signal-with-started, reset with post-reset option updates, or has execution options updated with a pinned VersioningOverride, the history service now fire-and-forgets a reactivate-version signal to the corresponding version workflow to move it back to DRAINING.

Adds a per-history-node version_reactivation_signal dedup cache (TTL/max-size + metrics) and a global kill-switch (history.enableVersionReactivationSignals) to avoid flooding version workflows; wires the cache + workerdeployment.Client.SignalVersionReactivation through history engine/fx and adds functional tests covering reactivation and cache dedup behavior.

Written by Cursor Bugbot for commit 06ad026. This will update automatically on new commits. Configure here.

Comment thread common/worker_versioning/worker_versioning.go Outdated
@Shivs11 Shivs11 marked this pull request as ready for review January 29, 2026 21:58
@Shivs11 Shivs11 requested review from a team as code owners January 29, 2026 21:58
Comment thread service/history/api/worker_versioning_util.go Outdated
Comment thread service/history/api/worker_versioning_util.go
Comment thread common/worker_versioning/version_reactivation_signal_cache.go
Comment thread service/worker/workerdeployment/version_workflow.go
Comment thread service/worker/workerdeployment/util.go
Comment thread service/history/api/worker_versioning_util.go Outdated
Comment thread service/history/api/updateworkflowoptions/api.go Outdated
Comment thread service/history/api/startworkflow/api.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread service/history/api/startworkflow/api.go
// The history and mutable state generated above will be deleted by a background process.
resp, outcome, conflictErr := s.handleConflict(ctx, creationParams, currentWorkflowConditionFailedError)
if conflictErr == nil {
if conflictErr == nil && outcome == StartNew {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main idea here was to only send the reactivation signal iff we have actually started/created a new workflow execution - all other outcomes, except StartNew, do not do that.

Comment thread common/worker_versioning/worker_versioning.go Outdated
Comment thread service/worker/workerdeployment/version_workflow.go
Comment on lines +28 to +29
shardContext historyi.ShardContext,
namespaceID namespace.ID,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most places who call this func already have the namespace entry. we should just get that instead of these two args.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, you are right. Noticed that SignalVersionReactivation on the client already logs all errors via convertAndRecordError - so the Warn log in ReactivateVersionWorkflowIfPinned is redundant.

// VersionReactivationSignalerFn is a function type for sending reactivation signals to version workflows.
// This abstraction allows the history API layer to use the deployment client without importing it directly,
// avoiding import cycles between history/api and worker/workerdeployment packages.
type VersionReactivationSignalerFn func(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for this type instead of passing client itself?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

circular dependencies; my first stab at this involved me just directly passing the workerDeployment.Client, but that did not go well as:

service/history/api → service/worker/workerdeployment → service/history/api

}

// Use signaler function to send the signal
err = signaler(ctx, nsEntry, pinnedVersion.GetDeploymentName(), pinnedVersion.GetBuildId())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let' make sure this call is async so we don't add latency to user's request.

Comment on lines +59 to +61
go func() {
signaler(context.Background(), namespaceEntry, pinnedVersion.GetDeploymentName(), pinnedVersion.GetBuildId())
}()
Copy link
Copy Markdown
Member Author

@Shivs11 Shivs11 Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a thing to remember here is that any error, if present, would be caught in the client's recordAndError method and would be logged out. However, I am not returning the error over here since I don't think we should be stopping existing operations based on this error

Copy link
Copy Markdown
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with small name change nit

Comment thread common/dynamicconfig/constants.go Outdated
Comment thread common/metrics/metric_defs.go Outdated
Comment thread service/history/api/updateworkflowoptions/api.go
@carlydf carlydf enabled auto-merge (squash) February 12, 2026 22:22
@carlydf carlydf changed the title [Worker-Versioning improvement]: Activate "Drained/Inactive" versions Activate "Drained/Inactive" versions if a workflow is "moved" to that version Feb 12, 2026
@carlydf carlydf changed the title Activate "Drained/Inactive" versions if a workflow is "moved" to that version Activate Drained/Inactive versions if a workflow is "moved" to that version Feb 12, 2026
@carlydf carlydf changed the title Activate Drained/Inactive versions if a workflow is "moved" to that version Activate Drained/Inactive version if a workflow is "moved" to it Feb 12, 2026
@carlydf carlydf merged commit 4668ca5 into main Feb 12, 2026
74 checks passed
@carlydf carlydf deleted the ss/reactive-drained-inactive-wfs branch February 12, 2026 23:30
carlydf added a commit that referenced this pull request Mar 12, 2026
…#9147)

## What changed?
- activate drained/inactive versions to draining when they get a
workflow started on it
- also added a history cache, per history node, so that we don't bombard
our version workflows with signals that shall change the drainage status
of these workflows.

## Why?
- versioning correctness, in the sense that if someone were to move a
workflow on to a version that is drained, the drainage status should be
updated to draining (since it now has one open workflow working on it)

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
- Sure, this change is lowkey risky. Would appreciate a thorough review.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history start/reset/update flows and adds asynchronous
signaling into worker-deployment workflows; mis-wiring or cache/config
issues could cause missing or excessive reactivation signals and
unexpected version state churn under load.
> 
> **Overview**
> When workflows are **pinned to a specific deployment version**,
history now triggers a fire-and-forget `reactivate-version` signal to
the corresponding worker-deployment *version workflow* so versions in
`DRAINED/INACTIVE` transition back to `DRAINING`.
> 
> This wiring is applied across start paths (`StartWorkflowExecution`
incl. conflict handling, `SignalWithStart`, multi-op start), option
changes (`UpdateWorkflowExecutionOptions` after persistence), and
`ResetWorkflowExecution` post-reset operations. A new per-history-node
`ReactivationSignalCache` (TTL/max-size + metrics tags) deduplicates
signals, and a new dynamic config flag
(`history.enableVersionReactivationSignals`) plus cache settings
control/limit load.
> 
> Worker-deployment adds `Client.SignalVersionReactivation` and the
version workflow gains a version-gated handler for `reactivate-version`
to update drainage/status and sync summaries; extensive functional tests
cover reactivation and cache dedup behavior.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
15a5ac6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
carlydf added a commit that referenced this pull request Mar 17, 2026
…#9147)

## What changed?
- activate drained/inactive versions to draining when they get a
workflow started on it
- also added a history cache, per history node, so that we don't bombard
our version workflows with signals that shall change the drainage status
of these workflows.

## Why?
- versioning correctness, in the sense that if someone were to move a
workflow on to a version that is drained, the drainage status should be
updated to draining (since it now has one open workflow working on it)

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
- Sure, this change is lowkey risky. Would appreciate a thorough review.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history start/reset/update flows and adds asynchronous
signaling into worker-deployment workflows; mis-wiring or cache/config
issues could cause missing or excessive reactivation signals and
unexpected version state churn under load.
> 
> **Overview**
> When workflows are **pinned to a specific deployment version**,
history now triggers a fire-and-forget `reactivate-version` signal to
the corresponding worker-deployment *version workflow* so versions in
`DRAINED/INACTIVE` transition back to `DRAINING`.
> 
> This wiring is applied across start paths (`StartWorkflowExecution`
incl. conflict handling, `SignalWithStart`, multi-op start), option
changes (`UpdateWorkflowExecutionOptions` after persistence), and
`ResetWorkflowExecution` post-reset operations. A new per-history-node
`ReactivationSignalCache` (TTL/max-size + metrics tags) deduplicates
signals, and a new dynamic config flag
(`history.enableVersionReactivationSignals`) plus cache settings
control/limit load.
> 
> Worker-deployment adds `Client.SignalVersionReactivation` and the
version workflow gains a version-gated handler for `reactivate-version`
to update drainage/status and sync summaries; extensive functional tests
cover reactivation and cache dedup behavior.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
15a5ac6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
chaptersix pushed a commit that referenced this pull request Mar 17, 2026
## What changed?
Cherry-pick versioning PRs
- #9168
  - Cache for system protection
- #9262 
  - Cache for system protection
- #9239 
- Critical PR to enable sending `TargetVersionChanged` flag for
Upgrade-on-CaN feature
- #9147 
- Tracks version drainage properly when version receives workflows via
`VersioningOverride`. Needed for automated worker controllers to
correctly scale versioned workers that received workflows via
`VersioningOverride`.
- #9300 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #9316 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #8957 
- Contains minor metric improvement. Included because it adds a test
harness that is used in the two metrics PRs above
- #9250 
  - Bug fix of task rescheduling edge case during AutoUpgrade Transition

## Why?
For OSS v1.30.2

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history/matching worker-versioning paths, adding new caches
and changing workflow task/start handling and backlog metric emission;
incorrect caching or signaling could affect dispatch/upgrade behavior
and observability.
> 
> **Overview**
> Adds new worker-versioning protections and upgrade signaling: workflow
task started events now persist a
`workflow_task_target_worker_deployment_version_changed` flag (and emit
a new `workflow_target_version_changed_count` metric) under a new
`EnableSendTargetVersionChanged` dynamic config.
> 
> Introduces two new caches with metrics and dynamic config knobs: a
`RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups
during activity start/transition logic, and a `ReactivationSignalCache`
plus `EnableVersionReactivationSignals` to dedupe and asynchronously
send “reactivation” signals when workflows are pinned (via
start/signal-with-start/reset/update-options) to potentially
drained/inactive worker versions.
> 
> Extends matching backlog metrics to support version-attributed
reporting by adding `BacklogMetricsEmitInterval` and switching queue DB
emission to *physical* backlog gauges (`physical_approximate_backlog_*`)
when attribution is enabled, while keeping legacy gauges when disabled.
> 
> Adds frontend scaffolding for a new visibility RPC `CountSchedules`
(client plumbing, interception/metadata/quota wiring) but leaves the
frontend handler unimplemented, and bumps `go.temporal.io/api` to
`v1.62.2`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cb8ae14. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Shahab Tajik <shahab@temporal.io>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants