Improve no_poller_tasks metric by Shivs11 · Pull Request #8957 · temporalio/temporal

Shivs11 · 2026-01-07T21:20:32Z

What changed?

This metric, although debatable if it should exist or not, is incremented when there are no pollers on the root partition for 2 minutes (configurable by DC) and with tasks being added to this partition.
A customer pointed out a race condition recently where we noticed that on pod restarts, which resulted in reloading the root partition, led to this metric being wrongly emitted. This was because it did not take into consideration pod restarts.

Why?

Correctness

How did you test it?

Potential risks

None, I think

Note

Prevents false positives in the no-recent-poller metric by tracking partition load time and gating emission.

Add loadTime to taskQueuePartitionManagerImpl, set in Start()
Emit metrics.NoRecentPollerTasksPerTaskQueueCounter only when the root partition has been loaded > noPollerThreshold and there have been no pollers in the last threshold
Add focused unit tests using metricstest.Capture covering newly loaded vs. older partitions and with/without recent pollers
Minor wiring: import metricstest, helper to setup a manager with capturing metrics

^{Written by Cursor Bugbot for commit b6380e5. This will update automatically on new commits. Configure here.}

## What changed? - This metric, although debatable if it should exist or not, is incremented when there are no pollers on the root partition for 2 minutes (configurable by DC) and with tasks being added to this partition. - A customer pointed out a race condition recently where we noticed that on pod restarts, which resulted in reloading the root partition, led to this metric being wrongly emitted. This was because it did not take into consideration pod restarts. ## Why? - Correctness ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks - None, I think  --- > [!NOTE] > Prevents false positives in the no-recent-poller metric by tracking partition load time and gating emission. > > - Add `loadTime` to `taskQueuePartitionManagerImpl`, set in `Start()` > - Emit `metrics.NoRecentPollerTasksPerTaskQueueCounter` only when the root partition has been loaded > `noPollerThreshold` and there have been no pollers in the last threshold > - Add focused unit tests using `metricstest.Capture` covering newly loaded vs. older partitions and with/without recent pollers > - Minor wiring: import `metricstest`, helper to setup a manager with capturing metrics > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b6380e5. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

## What changed? Cherry-pick versioning PRs - #9168 - Cache for system protection - #9262 - Cache for system protection - #9239 - Critical PR to enable sending `TargetVersionChanged` flag for Upgrade-on-CaN feature - #9147 - Tracks version drainage properly when version receives workflows via `VersioningOverride`. Needed for automated worker controllers to correctly scale versioned workers that received workflows via `VersioningOverride`. - #9300 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #9316 - Needed for `approximate_backlog_count` metric to track Current and Ramping version tasks correctly - #8957 - Contains minor metric improvement. Included because it adds a test harness that is used in the two metrics PRs above - #9250 - Bug fix of task rescheduling edge case during AutoUpgrade Transition ## Why? For OSS v1.30.2 ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)  --- > [!NOTE] > **Medium Risk** > Touches history/matching worker-versioning paths, adding new caches and changing workflow task/start handling and backlog metric emission; incorrect caching or signaling could affect dispatch/upgrade behavior and observability. > > **Overview** > Adds new worker-versioning protections and upgrade signaling: workflow task started events now persist a `workflow_task_target_worker_deployment_version_changed` flag (and emit a new `workflow_target_version_changed_count` metric) under a new `EnableSendTargetVersionChanged` dynamic config. > > Introduces two new caches with metrics and dynamic config knobs: a `RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups during activity start/transition logic, and a `ReactivationSignalCache` plus `EnableVersionReactivationSignals` to dedupe and asynchronously send “reactivation” signals when workflows are pinned (via start/signal-with-start/reset/update-options) to potentially drained/inactive worker versions. > > Extends matching backlog metrics to support version-attributed reporting by adding `BacklogMetricsEmitInterval` and switching queue DB emission to *physical* backlog gauges (`physical_approximate_backlog_*`) when attribution is enabled, while keeping legacy gauges when disabled. > > Adds frontend scaffolding for a new visibility RPC `CountSchedules` (client plumbing, interception/metadata/quota wiring) but leaves the frontend handler unimplemented, and bumps `go.temporal.io/api` to `v1.62.2`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit cb8ae14. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>  --------- Co-authored-by: Shahab Tajik <shahab@temporal.io> Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>

Shivs11 added 2 commits January 7, 2026 16:11

changes and unit tests

27dbbe9

refactorings

362213c

Shivs11 requested review from a team as code owners January 7, 2026 21:20

lints

b6380e5

ShahabT approved these changes Jan 7, 2026

View reviewed changes

Shivs11 merged commit c425598 into main Jan 8, 2026
63 checks passed

Shivs11 deleted the ss/fix_no_poll_metric_race branch January 8, 2026 15:21

carlydf mentioned this pull request Mar 17, 2026

Versioning cherry-pick for OSS v1.30.2 #9543

Merged

5 tasks

carlydf added the release/1.30.2 label Mar 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve no_poller_tasks metric#8957

Improve no_poller_tasks metric#8957
Shivs11 merged 3 commits into
mainfrom
ss/fix_no_poll_metric_race

Shivs11 commented Jan 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Shivs11 commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed?

Why?

How did you test it?

Potential risks

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Shivs11 commented Jan 7, 2026 •

edited

Loading