Conversation
de967ad to
6bdf79e
Compare
| if attributionEnabled { | ||
| backlogCountGauge = metrics.PhysicalApproximateBacklogCount | ||
| backlogAgeGauge = metrics.PhysicalApproximateBacklogAgeSeconds | ||
| } |
There was a problem hiding this comment.
Metric family switch leaves stale gauges
Medium Severity
emitZeroPhysicalBacklogGauges now clears only one gauge family based on BacklogMetricsEmitInterval(). If the config flips after metrics were emitted, the previously used family is never reset, so old backlog values remain published for unloaded queues. This leaves stale values in either physical_* or approximate_* metrics.
Additional Locations (1)
There was a problem hiding this comment.
I think this might be ok, because, when attribution is enabled, emitZeroPhysicalBacklogGauges zeroes the physical_* metric, and emitZeroLogicalBacklogForQueue zeroes the approximate_* metric.
When attribution is disabled, emitZeroPhysicalBacklogGauges zeroes the approximate_* metric, and emitZeroLogicalBacklogForQueue also zeroes the `approximate_* metric. That seems weird, but I think double-zeroing might be ok? It's only happening on unload...
anyway, ideally we aren't flip flopping between enable and disable much anyways..
There was a problem hiding this comment.
Most important is that if we need to revert the new change, we can, and after we revert, things get added to and zeroed out correctly. I think it's ok if the newly-added physical_* metrics become stale as a result, in the worst case scenario
## What changed? - [This](#9300) PR is the part 1 of this effort. - PR #9300 added a dynamic config with the hope that simply making it a 0 would revert the backlog count (and age) metric to be emitted from the old location. However, I quickly realized that it was not doing that. This PR fixes that by first checking if that dynamic config is enabled or not. If it is enabled, we emit the newly created *physical level backlog* metrics. If it is not emitted, we emit the backlog metrics the way they used to before #9300 went in. - Additionally, cursor bot had this [comment](#9300 (comment)) pointing out that our use of `Describe` task queue to get the backlog metrics to emit would cause task queues to remain "alive" indefinitely and never be unloaded, potentially causing strain on Matching service. This PR fixes that by skipping the "mark alive" step when `Describe` is called internally for metrics use, and only marking it alive when it's called externally. ## Why? - Safety and correctness. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks - these are risky, but i have a dynamic config flag now to quickly revert things in prod if things go wrong. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes affect Matching backlog metrics and task-queue liveness behavior; incorrect gating could break dashboards/alerts or impact queue unloading behavior under load. > > **Overview** > Adjusts backlog gauge emission to **conditionally switch** between the new physical backlog metrics and the legacy `approximate_backlog_*` metrics based on `BacklogMetricsEmitInterval`: when attribution is enabled, only the unversioned queue emits `physical_approximate_backlog_*`; when disabled, all applicable queues fall back to emitting the original `approximate_backlog_*` series. > > Refactors `Describe` into an internal `describe(..., skipMarkAlive)` variant and uses it for periodic logical backlog metric emission so internal metric collection no longer calls `MarkAlive`, avoiding unintended prevention of idle task queue unloading. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e79f933. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Carly de Frondeville <carly.defrondeville@temporal.io> Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
## What changed? - [This](#9300) PR is the part 1 of this effort. - PR #9300 added a dynamic config with the hope that simply making it a 0 would revert the backlog count (and age) metric to be emitted from the old location. However, I quickly realized that it was not doing that. This PR fixes that by first checking if that dynamic config is enabled or not. If it is enabled, we emit the newly created *physical level backlog* metrics. If it is not emitted, we emit the backlog metrics the way they used to before #9300 went in. - Additionally, cursor bot had this [comment](#9300 (comment)) pointing out that our use of `Describe` task queue to get the backlog metrics to emit would cause task queues to remain "alive" indefinitely and never be unloaded, potentially causing strain on Matching service. This PR fixes that by skipping the "mark alive" step when `Describe` is called internally for metrics use, and only marking it alive when it's called externally. ## Why? - Safety and correctness. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks - these are risky, but i have a dynamic config flag now to quickly revert things in prod if things go wrong. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes affect Matching backlog metrics and task-queue liveness behavior; incorrect gating could break dashboards/alerts or impact queue unloading behavior under load. > > **Overview** > Adjusts backlog gauge emission to **conditionally switch** between the new physical backlog metrics and the legacy `approximate_backlog_*` metrics based on `BacklogMetricsEmitInterval`: when attribution is enabled, only the unversioned queue emits `physical_approximate_backlog_*`; when disabled, all applicable queues fall back to emitting the original `approximate_backlog_*` series. > > Refactors `Describe` into an internal `describe(..., skipMarkAlive)` variant and uses it for periodic logical backlog metric emission so internal metric collection no longer calls `MarkAlive`, avoiding unintended prevention of idle task queue unloading. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e79f933. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Carly de Frondeville <carly.defrondeville@temporal.io> Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
## What changed? Cherry-pick versioning PRs - #9168 - Cache for system protection - #9262 - Cache for system protection - #9239 - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature - #9147 - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`. - #9300 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #9316 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #8957 - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above - #9250 - Bug fix of task rescheduling edge case during AutoUpgrade Transition ## Why? For OSS v1.30.2 ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability. > > **Overview** > Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config. > > Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions. > > Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled. > > Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Shahab Tajik <shahab@temporal.io> Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>


What changed?
Describetask queue to get the backlog metrics to emit would cause task queues to remain "alive" indefinitely and never be unloaded, potentially causing strain on Matching service. This PR fixes that by skipping the "mark alive" step whenDescribeis called internally for metrics use, and only marking it alive when it's called externally.Why?
How did you test it?
Potential risks
Note
Medium Risk
Changes affect Matching backlog metrics and task-queue liveness behavior; incorrect gating could break dashboards/alerts or impact queue unloading behavior under load.
Overview
Adjusts backlog gauge emission to conditionally switch between the new physical backlog metrics and the legacy
approximate_backlog_*metrics based onBacklogMetricsEmitInterval: when attribution is enabled, only the unversioned queue emitsphysical_approximate_backlog_*; when disabled, all applicable queues fall back to emitting the originalapproximate_backlog_*series.Refactors
Describeinto an internaldescribe(..., skipMarkAlive)variant and uses it for periodic logical backlog metric emission so internal metric collection no longer callsMarkAlive, avoiding unintended prevention of idle task queue unloading.Written by Cursor Bugbot for commit e79f933. This will update automatically on new commits. Configure here.