Backlog Metric Attribution: part 2 by Shivs11 · Pull Request #9316 · temporalio/temporal

Shivs11 · 2026-02-13T03:29:40Z

What changed?

PR #9300 is part 1 of this effort.
PR #9300 (Backlog metrics (count, age) to now have versioning based attribution) added a dynamic config with the hope that simply setting it to 0 would revert the backlog count (and age) metrics to being emitted from the old location. However, I quickly realized that it was not doing that. This PR fixes that by first checking whether that dynamic config is enabled. If it is enabled, we emit the newly created physical-level backlog metrics. If it is not enabled, we emit the backlog metrics the way they were emitted before #9300 went in.
Additionally, cursor bot left a comment on #9300 pointing out that our use of Describe task queue to get the backlog metrics to emit would cause task queues to remain "alive" indefinitely and never be unloaded, potentially putting strain on the Matching service. This PR fixes that by skipping the "mark alive" step when Describe is called internally for metrics use, and only marking the queue alive when Describe is called externally.
See cdf/revert-9300...cdf/pick-9300-9316 to picture part 1 and part 2 as one diff, if that's helpful
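The gating described above can be pictured with a minimal sketch. It assumes that a positive BacklogMetricsEmitInterval means attribution is enabled and that 0 disables it, as the description implies. The function names and the exact metric name strings here are illustrative and are not taken from the PR diff.

```go
package main

import (
	"fmt"
	"time"
)

// Gauge names are assumptions based on the "approximate_backlog_*" and
// "physical_approximate_backlog_*" families named in the PR overview.
const (
	legacyCount   = "approximate_backlog_count"
	legacyAge     = "approximate_backlog_age_seconds"
	physicalCount = "physical_approximate_backlog_count"
	physicalAge   = "physical_approximate_backlog_age_seconds"
)

// attributionEnabled is a hypothetical helper: a positive emit interval
// turns version-attributed metrics on, and 0 turns them off.
func attributionEnabled(emitInterval time.Duration) bool {
	return emitInterval > 0
}

// backlogGaugeNames picks the gauge family a physical queue should emit.
// It defaults to the legacy family and switches only when attribution is on.
func backlogGaugeNames(emitInterval time.Duration) (count, age string) {
	count, age = legacyCount, legacyAge
	if attributionEnabled(emitInterval) {
		count, age = physicalCount, physicalAge
	}
	return count, age
}

func main() {
	for _, d := range []time.Duration{0, 30 * time.Second} {
		c, a := backlogGaugeNames(d)
		fmt.Printf("interval=%v count=%s age=%s\n", d, c, a)
	}
}
```

Defaulting to the legacy family keeps the revert path simple: setting the interval to 0 restores the pre-#9300 series without any other change.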

Why?

Safety and correctness.

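How did you test it?

Built, run locally and tested manually, and covered by existing tests. No new unit or functional tests were added.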

Potential risks

These changes are risky, but I now have a dynamic config flag to quickly revert things in prod if something goes wrong.

Note

Medium Risk
Changes affect Matching backlog metrics and task-queue liveness behavior; incorrect gating could break dashboards/alerts or impact queue unloading behavior under load.

Overview
Adjusts backlog gauge emission to conditionally switch between the new physical backlog metrics and the legacy approximate_backlog_* metrics based on BacklogMetricsEmitInterval: when attribution is enabled, only the unversioned queue emits physical_approximate_backlog_*; when disabled, all applicable queues fall back to emitting the original approximate_backlog_* series.

Refactors Describe into an internal describe(..., skipMarkAlive) variant and uses it for periodic logical backlog metric emission so internal metric collection no longer calls MarkAlive, avoiding unintended prevention of idle task queue unloading.

Written by Cursor Bugbot for commit e79f933.
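The Describe refactor can be sketched as follows. This is a toy model, not the Matching service code: the taskQueue type, its fields, and the idle check are hypothetical, and time is represented as plain seconds to keep the example small. Only the shape (an exported Describe that always marks the queue alive, wrapping a private describe that takes skipMarkAlive) follows the PR summary.

```go
package main

import "fmt"

// taskQueue is a hypothetical stand-in for a loaded task queue.
type taskQueue struct {
	lastAliveSec int64 // last time the queue was marked alive, in seconds
	backlog      int64 // current approximate backlog count
}

// markAlive refreshes the liveness timestamp, which postpones unloading.
func (q *taskQueue) markAlive(nowSec int64) { q.lastAliveSec = nowSec }

// Describe is the exported entry point. External callers always mark the
// queue alive and have no way to skip that step.
func (q *taskQueue) Describe(nowSec int64) int64 {
	return q.describe(nowSec, false)
}

// describe is private, so only in-package callers (such as periodic metric
// emission) can pass skipMarkAlive=true.
func (q *taskQueue) describe(nowSec int64, skipMarkAlive bool) int64 {
	if !skipMarkAlive {
		q.markAlive(nowSec)
	}
	return q.backlog
}

// idle reports whether the queue has gone unmarked for at least maxIdleSec.
func (q *taskQueue) idle(nowSec, maxIdleSec int64) bool {
	return nowSec-q.lastAliveSec >= maxIdleSec
}

func main() {
	q := &taskQueue{lastAliveSec: 0, backlog: 7}

	// Internal metrics read at t=600s: liveness is not refreshed.
	fmt.Println(q.describe(600, true), q.idle(600, 300))

	// External Describe at t=600s: the queue is marked alive again.
	fmt.Println(q.Describe(600), q.idle(600, 300))
}
```

Keeping the flag on the private method means an external caller cannot accidentally skip the liveness update, which matches the intent of commit 6bdf79e.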


Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-13T07:25:26Z

service/matching/db.go

+	if attributionEnabled {
+		backlogCountGauge = metrics.PhysicalApproximateBacklogCount
+		backlogAgeGauge = metrics.PhysicalApproximateBacklogAgeSeconds
+	}


Metric family switch leaves stale gauges

Medium Severity

emitZeroPhysicalBacklogGauges now clears only one gauge family based on BacklogMetricsEmitInterval(). If the config flips after metrics were emitted, the previously used family is never reset, so old backlog values remain published for unloaded queues. This leaves stale values in either physical_* or approximate_* metrics.

Additional Locations (1)

service/matching/db.go#L781-L787

I think this might be ok because, when attribution is enabled, emitZeroPhysicalBacklogGauges zeroes the physical_* metric and emitZeroLogicalBacklogForQueue zeroes the approximate_* metric.

When attribution is disabled, emitZeroPhysicalBacklogGauges zeroes the approximate_* metric, and emitZeroLogicalBacklogForQueue also zeroes the approximate_* metric. That seems weird, but I think double-zeroing might be ok? It only happens on unload...

Anyway, ideally we aren't flip-flopping between enabled and disabled much.

Most important is that if we need to revert the new change, we can, and after we revert, things get added to and zeroed out correctly. In the worst case, I think it's ok if the newly added physical_* metrics become stale as a result.
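The scenario Bugbot flagged, and the worst case accepted in the reply, can be reproduced in a small model. This is only a sketch: gaugeStore, emit, and zeroOnUnload are hypothetical stand-ins, only the count gauge is modeled, and the separate logical-queue zeroing path (emitZeroLogicalBacklogForQueue) is left out.

```go
package main

import "fmt"

// gaugeStore is a hypothetical stand-in for a metrics backend: it keeps the
// last value written for each gauge name.
type gaugeStore map[string]float64

// Gauge names are assumptions based on the metric families named in the thread.
const (
	legacyCount   = "approximate_backlog_count"
	physicalCount = "physical_approximate_backlog_count"
)

// countGauge selects the gauge family from the attribution flag.
func countGauge(attributionEnabled bool) string {
	if attributionEnabled {
		return physicalCount
	}
	return legacyCount
}

// emit records the backlog under whichever family the flag selects right now.
func (g gaugeStore) emit(attributionEnabled bool, backlog float64) {
	g[countGauge(attributionEnabled)] = backlog
}

// zeroOnUnload clears only the family selected by the flag at unload time.
func (g gaugeStore) zeroOnUnload(attributionEnabled bool) {
	g[countGauge(attributionEnabled)] = 0
}

func main() {
	g := gaugeStore{}
	g.emit(true, 42)      // attribution on: the physical gauge is written
	g.zeroOnUnload(false) // flag flipped off before unload: only the legacy gauge is zeroed
	fmt.Println(g[physicalCount], g[legacyCount]) // prints "42 0"
}
```

In this model the physical gauge keeps its last value after the flip, which is the stale-metric case the reply considers acceptable as long as the legacy series behaves correctly after a revert.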

Co-authored-by: Carly de Frondeville <carly.defrondeville@temporal.io>
Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>

## What changed?

Cherry-pick versioning PRs

- #9168 - Cache for system protection
- #9262 - Cache for system protection
- #9239 - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature
- #9147 - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`.
- #9300 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly
- #9316 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly
- #8957 - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above
- #9250 - Bug fix of task rescheduling edge case during AutoUpgrade Transition

## Why?

For OSS v1.30.2

## How did you test it?

- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---

> [!NOTE]
> **Medium Risk**
> Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability.
>
> **Overview**
> Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config.
>
> Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions.
>
> Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled.
>
> Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------

Co-authored-by: Shahab Tajik <shahab@temporal.io>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>

fixes

8564897

Shivs11 marked this pull request as ready for review February 13, 2026 03:39

Shivs11 requested review from a team as code owners February 13, 2026 03:39

code cleanliness

5e44bb2

carlydf mentioned this pull request Feb 13, 2026

Backlog metrics (count, age) to now have versioning based attribution #9300

Merged

5 tasks

carlydf reviewed Feb 13, 2026

carlydf and others added 3 commits February 12, 2026 22:36

Comment: "When disabled" -> "When version-attributed metrics are disabled"

af59562

change conditions from not attributionEnabled to attributionEnabled

399cd75

only allow skipMarkAlive flag in private method, so that it cannot be accidentally skipped by external callers

6bdf79e

carlydf force-pushed the metrics-attribution-v2 branch from de967ad to 6bdf79e Compare February 13, 2026 07:22

cursor bot reviewed Feb 13, 2026

fix lint

e79f933

carlydf approved these changes Feb 13, 2026

carlydf enabled auto-merge (squash) February 13, 2026 07:49

carlydf merged commit 2a9e477 into main Feb 13, 2026
48 checks passed

carlydf deleted the metrics-attribution-v2 branch February 13, 2026 08:09

carlydf added the release/1.30.2 label Mar 12, 2026

carlydf mentioned this pull request Mar 17, 2026

Versioning cherry-pick for OSS v1.30.2 #9543

Merged

5 tasks
