Skip to content

Improve no_poller_tasks metric#8957

Merged
Shivs11 merged 3 commits into
mainfrom
ss/fix_no_poll_metric_race
Jan 8, 2026
Merged

Improve no_poller_tasks metric#8957
Shivs11 merged 3 commits into
mainfrom
ss/fix_no_poll_metric_race

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented Jan 7, 2026

What changed?

  • This metric, although debatable if it should exist or not, is incremented when there are no pollers on the root partition for 2 minutes (configurable by DC) and with tasks being added to this partition.
  • A customer pointed out a race condition recently where we noticed that on pod restarts, which resulted in reloading the root partition, led to this metric being wrongly emitted. This was because it did not take into consideration pod restarts.

Why?

  • Correctness

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • None, I think

Note

Prevents false positives in the no-recent-poller metric by tracking partition load time and gating emission.

  • Add loadTime to taskQueuePartitionManagerImpl, set in Start()
  • Emit metrics.NoRecentPollerTasksPerTaskQueueCounter only when the root partition has been loaded > noPollerThreshold and there have been no pollers in the last threshold
  • Add focused unit tests using metricstest.Capture covering newly loaded vs. older partitions and with/without recent pollers
  • Minor wiring: import metricstest, helper to setup a manager with capturing metrics

Written by Cursor Bugbot for commit b6380e5. This will update automatically on new commits. Configure here.

@Shivs11 Shivs11 requested review from a team as code owners January 7, 2026 21:20
@Shivs11 Shivs11 merged commit c425598 into main Jan 8, 2026
63 checks passed
@Shivs11 Shivs11 deleted the ss/fix_no_poll_metric_race branch January 8, 2026 15:21
stephanos pushed a commit to stephanos/temporal that referenced this pull request Jan 12, 2026
## What changed?
- This metric, although debatable if it should exist or not, is
incremented when there are no pollers on the root partition for 2
minutes (configurable by DC) and with tasks being added to this
partition.
- A customer pointed out a race condition recently where we noticed that
on pod restarts, which resulted in reloading the root partition, led to
this metric being wrongly emitted. This was because it did not take into
consideration pod restarts.

## Why?
- Correctness

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
- None, I think


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Prevents false positives in the no-recent-poller metric by tracking
partition load time and gating emission.
> 
> - Add `loadTime` to `taskQueuePartitionManagerImpl`, set in `Start()`
> - Emit `metrics.NoRecentPollerTasksPerTaskQueueCounter` only when the
root partition has been loaded > `noPollerThreshold` and there have been
no pollers in the last threshold
> - Add focused unit tests using `metricstest.Capture` covering newly
loaded vs. older partitions and with/without recent pollers
> - Minor wiring: import `metricstest`, helper to setup a manager with
capturing metrics
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
b6380e5. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
carlydf pushed a commit that referenced this pull request Mar 17, 2026
## What changed?
- This metric, although debatable if it should exist or not, is
incremented when there are no pollers on the root partition for 2
minutes (configurable by DC) and with tasks being added to this
partition.
- A customer pointed out a race condition recently where we noticed that
on pod restarts, which resulted in reloading the root partition, led to
this metric being wrongly emitted. This was because it did not take into
consideration pod restarts.

## Why?
- Correctness

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
- None, I think


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Prevents false positives in the no-recent-poller metric by tracking
partition load time and gating emission.
> 
> - Add `loadTime` to `taskQueuePartitionManagerImpl`, set in `Start()`
> - Emit `metrics.NoRecentPollerTasksPerTaskQueueCounter` only when the
root partition has been loaded > `noPollerThreshold` and there have been
no pollers in the last threshold
> - Add focused unit tests using `metricstest.Capture` covering newly
loaded vs. older partitions and with/without recent pollers
> - Minor wiring: import `metricstest`, helper to setup a manager with
capturing metrics
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
b6380e5. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
chaptersix pushed a commit that referenced this pull request Mar 17, 2026
## What changed?
Cherry-pick versioning PRs
- #9168
  - Cache for system protection
- #9262 
  - Cache for system protection
- #9239 
- Critical PR to enable sending `TargetVersionChanged` flag for
Upgrade-on-CaN feature
- #9147 
- Tracks version drainage properly when version receives workflows via
`VersioningOverride`. Needed for automated worker controllers to
correctly scale versioned workers that received workflows via
`VersioningOverride`.
- #9300 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #9316 
- Needed for `approximate_backlog_count` metric to track Current and
Ramping version tasks correctly
- #8957 
- Contains minor metric improvement. Included because it adds a test
harness that is used in the two metrics PRs above
- #9250 
  - Bug fix of task rescheduling edge case during AutoUpgrade Transition

## Why?
For OSS v1.30.2

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches history/matching worker-versioning paths, adding new caches
and changing workflow task/start handling and backlog metric emission;
incorrect caching or signaling could affect dispatch/upgrade behavior
and observability.
> 
> **Overview**
> Adds new worker-versioning protections and upgrade signaling: workflow
task started events now persist a
`workflow_task_target_worker_deployment_version_changed` flag (and emit
a new `workflow_target_version_changed_count` metric) under a new
`EnableSendTargetVersionChanged` dynamic config.
> 
> Introduces two new caches with metrics and dynamic config knobs: a
`RoutingInfoCache` to avoid repeated `GetTaskQueueUserData` lookups
during activity start/transition logic, and a `ReactivationSignalCache`
plus `EnableVersionReactivationSignals` to dedupe and asynchronously
send “reactivation” signals when workflows are pinned (via
start/signal-with-start/reset/update-options) to potentially
drained/inactive worker versions.
> 
> Extends matching backlog metrics to support version-attributed
reporting by adding `BacklogMetricsEmitInterval` and switching queue DB
emission to *physical* backlog gauges (`physical_approximate_backlog_*`)
when attribution is enabled, while keeping legacy gauges when disabled.
> 
> Adds frontend scaffolding for a new visibility RPC `CountSchedules`
(client plumbing, interception/metadata/quota wiring) but leaves the
frontend handler unimplemented, and bumps `go.temporal.io/api` to
`v1.62.2`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cb8ae14. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Shahab Tajik <shahab@temporal.io>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants