Skip to content

Backlog Metric Attribution: part 2#9316

Merged
carlydf merged 6 commits intomainfrom
metrics-attribution-v2
Feb 13, 2026
Merged

Backlog Metric Attribution: part 2#9316
carlydf merged 6 commits intomainfrom
metrics-attribution-v2

Conversation

@Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Feb 13, 2026

What changed?

  • This PR is the part 1 of this effort.
  • PR Backlog metrics (count, age) to now have versioning based attribution #9300 added a dynamic config with the hope that simply making it a 0 would revert the backlog count (and age) metric to be emitted from the old location. However, I quickly realized that it was not doing that. This PR fixes that by first checking if that dynamic config is enabled or not. If it is enabled, we emit the newly created physical level backlog metrics. If it is not emitted, we emit the backlog metrics the way they used to before Backlog metrics (count, age) to now have versioning based attribution #9300 went in.
  • Additionally, cursor bot had this comment pointing out that our use of Describe task queue to get the backlog metrics to emit would cause task queues to remain "alive" indefinitely and never be unloaded, potentially causing strain on Matching service. This PR fixes that by skipping the "mark alive" step when Describe is called internally for metrics use, and only marking it alive when it's called externally.
  • See cdf/revert-9300...cdf/pick-9300-9316 to picture part 1 and part 2 as one diff, if that's helpful

Why?

  • Safety and correctness.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • these are risky, but i have a dynamic config flag now to quickly revert things in prod if things go wrong.

Note

Medium Risk
Changes affect Matching backlog metrics and task-queue liveness behavior; incorrect gating could break dashboards/alerts or impact queue unloading behavior under load.

Overview
Adjusts backlog gauge emission to conditionally switch between the new physical backlog metrics and the legacy approximate_backlog_* metrics based on BacklogMetricsEmitInterval: when attribution is enabled, only the unversioned queue emits physical_approximate_backlog_*; when disabled, all applicable queues fall back to emitting the original approximate_backlog_* series.

Refactors Describe into an internal describe(..., skipMarkAlive) variant and uses it for periodic logical backlog metric emission so internal metric collection no longer calls MarkAlive, avoiding unintended prevention of idle task queue unloading.

Written by Cursor Bugbot for commit e79f933. This will update automatically on new commits. Configure here.

@Shivs11 Shivs11 marked this pull request as ready for review February 13, 2026 03:39
@Shivs11 Shivs11 requested review from a team as code owners February 13, 2026 03:39
@carlydf carlydf force-pushed the metrics-attribution-v2 branch from de967ad to 6bdf79e Compare February 13, 2026 07:22
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

if attributionEnabled {
backlogCountGauge = metrics.PhysicalApproximateBacklogCount
backlogAgeGauge = metrics.PhysicalApproximateBacklogAgeSeconds
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric family switch leaves stale gauges

Medium Severity

emitZeroPhysicalBacklogGauges now clears only one gauge family based on BacklogMetricsEmitInterval(). If the config flips after metrics were emitted, the previously used family is never reset, so old backlog values remain published for unloaded queues. This leaves stale values in either physical_* or approximate_* metrics.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be ok, because, when attribution is enabled, emitZeroPhysicalBacklogGauges zeroes the physical_* metric, and emitZeroLogicalBacklogForQueue zeroes the approximate_* metric.

When attribution is disabled, emitZeroPhysicalBacklogGauges zeroes the approximate_* metric, and emitZeroLogicalBacklogForQueue also zeroes the `approximate_* metric. That seems weird, but I think double-zeroing might be ok? It's only happening on unload...

anyway, ideally we aren't flip flopping between enable and disable much anyways..

Copy link
Contributor

@carlydf carlydf Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most important is that if we need to revert the new change, we can, and after we revert, things get added to and zeroed out correctly. I think it's ok if the newly-added physical_* metrics become stale as a result, in the worst case scenario

@carlydf carlydf enabled auto-merge (squash) February 13, 2026 07:49
@carlydf carlydf merged commit 2a9e477 into main Feb 13, 2026
48 checks passed
@carlydf carlydf deleted the metrics-attribution-v2 branch February 13, 2026 08:09
carlydf added a commit that referenced this pull request Mar 12, 2026
## What changed?
- [This](#9300) PR is the
part 1 of this effort.
- PR #9300 added a dynamic config with the hope that simply making it a
0 would revert the backlog count (and age) metric to be emitted from the
old location. However, I quickly realized that it was not doing that.
This PR fixes that by first checking if that dynamic config is enabled
or not. If it is enabled, we emit the newly created *physical level
backlog* metrics. If it is not emitted, we emit the backlog metrics the
way they used to before #9300 went in.
- Additionally, cursor bot had this
[comment](#9300 (comment))
pointing out that our use of `Describe` task queue to get the backlog
metrics to emit would cause task queues to remain "alive" indefinitely
and never be unloaded, potentially causing strain on Matching service.
This PR fixes that by skipping the "mark alive" step when `Describe` is
called internally for metrics use, and only marking it alive when it's
called externally.

## Why?
- Safety and correctness.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
- these are risky, but i have a dynamic config flag now to quickly
revert things in prod if things go wrong.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes affect Matching backlog metrics and task-queue liveness
behavior; incorrect gating could break dashboards/alerts or impact queue
unloading behavior under load.
> 
> **Overview**
> Adjusts backlog gauge emission to **conditionally switch** between the
new physical backlog metrics and the legacy `approximate_backlog_*`
metrics based on `BacklogMetricsEmitInterval`: when attribution is
enabled, only the unversioned queue emits
`physical_approximate_backlog_*`; when disabled, all applicable queues
fall back to emitting the original `approximate_backlog_*` series.
> 
> Refactors `Describe` into an internal `describe(..., skipMarkAlive)`
variant and uses it for periodic logical backlog metric emission so
internal metric collection no longer calls `MarkAlive`, avoiding
unintended prevention of idle task queue unloading.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e79f933. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Carly de Frondeville <carly.defrondeville@temporal.io>
Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
carlydf added a commit that referenced this pull request Mar 17, 2026
## What changed?
- [This](#9300) PR is the
part 1 of this effort.
- PR #9300 added a dynamic config with the hope that simply making it a
0 would revert the backlog count (and age) metric to be emitted from the
old location. However, I quickly realized that it was not doing that.
This PR fixes that by first checking if that dynamic config is enabled
or not. If it is enabled, we emit the newly created *physical level
backlog* metrics. If it is not emitted, we emit the backlog metrics the
way they used to before #9300 went in.
- Additionally, cursor bot had this
[comment](#9300 (comment))
pointing out that our use of `Describe` task queue to get the backlog
metrics to emit would cause task queues to remain "alive" indefinitely
and never be unloaded, potentially causing strain on Matching service.
This PR fixes that by skipping the "mark alive" step when `Describe` is
called internally for metrics use, and only marking it alive when it's
called externally.

## Why?
- Safety and correctness.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
- these are risky, but i have a dynamic config flag now to quickly
revert things in prod if things go wrong.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes affect Matching backlog metrics and task-queue liveness
behavior; incorrect gating could break dashboards/alerts or impact queue
unloading behavior under load.
> 
> **Overview**
> Adjusts backlog gauge emission to **conditionally switch** between the
new physical backlog metrics and the legacy `approximate_backlog_*`
metrics based on `BacklogMetricsEmitInterval`: when attribution is
enabled, only the unversioned queue emits
`physical_approximate_backlog_*`; when disabled, all applicable queues
fall back to emitting the original `approximate_backlog_*` series.
> 
> Refactors `Describe` into an internal `describe(..., skipMarkAlive)`
variant and uses it for periodic logical backlog metric emission so
internal metric collection no longer calls `MarkAlive`, avoiding
unintended prevention of idle task queue unloading.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e79f933. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Carly de Frondeville <carly.defrondeville@temporal.io>
Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
chaptersix pushed a commit that referenced this pull request Mar 17, 2026
## What changed?
Cherry-pick versioning PRs
- #9168
  - Cache for system protection
- #9262 
  - Cache for system protection
- #9239 
- Critical PR to enable sending `TargetVersionChanged` flag for
Upgrade-on-CaN feature
- #9147 
- Tracks version drainage properly when version receives workflows via
`VersioningOverride`. Needed for automated worker controllers to
correctly scale versioned workers that received workflows via
`VersioningOverride`.
- #9300 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #9316 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #8957 
- Contains minor metric improvement. Included because it adds a test
harness that is used in the two metrics PRs above
- #9250 
  - Bug fix of task rescheduling edge case during AutoUpgrade Transition

## Why?
For OSS v1.30.2

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history/matching worker-versioning paths, adding new caches
and changing workflow task/start handling and backlog metric emission;
incorrect caching or signaling could affect dispatch/upgrade behavior
and observability.
> 
> **Overview**
> Adds new worker-versioning protections and upgrade signaling: workflow
task started events now persist a
`workflow_task_target_worker_deployment_version_changed` flag (and emit
a new `workflow_target_version_changed_count` metric) under a new
`EnableSendTargetVersionChanged` dynamic config.
> 
> Introduces two new caches with metrics and dynamic config knobs: a
`RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups
during activity start/transition logic, and a `ReactivationSignalCache`
plus `EnableVersionReactivationSignals` to dedupe and asynchronously
send “reactivation” signals when workflows are pinned (via
start/signal-with-start/reset/update-options) to potentially
drained/inactive worker versions.
> 
> Extends matching backlog metrics to support version-attributed
reporting by adding `BacklogMetricsEmitInterval` and switching queue DB
emission to *physical* backlog gauges (`physical_approximate_backlog_*`)
when attribution is enabled, while keeping legacy gauges when disabled.
> 
> Adds frontend scaffolding for a new visibility RPC `CountSchedules`
(client plumbing, interception/metadata/quota wiring) but leaves the
frontend handler unimplemented, and bumps `go.temporal.io/api` to
`v1.62.2`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cb8ae14. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Shahab Tajik <shahab@temporal.io>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants