Implement dynamic sticky queue polling based on the reported backlog #1438
Conversation
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
I assume it is ok that this conditional is not atomic as a whole? Meaning, it's ok if technically in a racy situation this results in an unexpected sticky or normal poll when `finishPoll` is run by another thread at the same time? Otherwise, you'd be better off switching the atomic integers to regular ones and marking these methods `synchronized` (calls shouldn't be so frequent and time sensitive here that `synchronized` would cause a real problem IMO).
Yeah, I think it's ok here. The main argument is that the value of `stickyBacklogSize` is imprecise anyway: we don't control or verify the order in which responses arrive, so the value we write into `stickyBacklogSize` may not be the actual latest one. Given that, it's not worth fighting for precise decisions in this condition.
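For context on the atomicity discussion, here is a minimal sketch of a poll balancer built around two `AtomicInteger` counters. The class and method names (`PollBalancerSketch`, `pollSticky`, `finishPoll`, `reportBacklog`) are hypothetical, not the PR's actual code; the sketch only illustrates why the composite condition tolerates the race: each counter read is individually atomic, and the backlog value is already approximate.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the PR's implementation.
final class PollBalancerSketch {
  private final AtomicInteger stickyPollers = new AtomicInteger();
  private final AtomicInteger normalPollers = new AtomicInteger();
  // Updated from poll responses; inherently stale, as discussed above.
  private volatile long stickyBacklogSize = 0;

  /** Decides which queue the next poll should target and registers it. */
  boolean pollSticky(int pollersCount) {
    // Not atomic as a whole: the counters may change between the two
    // get() calls. That only skews a single decision, which is acceptable
    // because stickyBacklogSize is approximate anyway.
    if (stickyBacklogSize > pollersCount
        || stickyPollers.get() <= normalPollers.get()) {
      stickyPollers.incrementAndGet();
      return true;
    }
    normalPollers.incrementAndGet();
    return false;
  }

  /** Called when a poll completes, releasing its slot. */
  void finishPoll(boolean wasSticky) {
    (wasSticky ? stickyPollers : normalPollers).decrementAndGet();
  }

  /** Records the sticky backlog size reported by the server. */
  void reportBacklog(long backlog) {
    stickyBacklogSize = backlog;
  }
}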
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
In Go, this is `stickyBacklogSize > 0` instead of `stickyBacklogSize > pollersCount`. Is it intentional to check against the poller count here?
The comment above this statement is actually addressing exactly this.
// If pollersCount >= stickyBacklogSize > 0 we want to go back to a normal ratio to avoid a
// situation that too many pollers (all of them in the worst case) will open only sticky queue
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
So yeah, I decided to change the Go condition a little to reduce the probability of a worker listening on sticky only because of a tiny backlog of 1 or 2.
Ah, makes sense. So this still guarantees at least as many sticky pollers as normal pollers. In my head I'd change `pollerCount` to `stickyBacklogThreshold` or something to clarify that it's not related to poll threads, but this works too.
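To make the difference between the two conditions concrete, here is a small hypothetical walkthrough (the values and the `ConditionDemo` class are illustrative, not from the PR): with 5 pollers, Go's `stickyBacklogSize > 0` would already force sticky polls at a tiny backlog of 2, while this PR's `stickyBacklogSize > pollersCount` falls back to the 1:1 ratio until the backlog exceeds the poller count.

// Illustrative demo, not part of the PR.
public class ConditionDemo {
  public static void main(String[] args) {
    int pollersCount = 5;
    for (long backlog : new long[] {0, 2, 50}) {
      boolean javaForcesSticky = backlog > pollersCount; // this PR's condition
      boolean goForcesSticky = backlog > 0;              // Go SDK's condition
      System.out.printf(
          "backlog=%d: java forces sticky=%b, go forces sticky=%b%n",
          backlog, javaForcesSticky, goForcesSticky);
    }
  }
}

Running it shows that only a backlog above the poller count forces sticky polling under the PR's condition; the 1:1 clause handles small backlogs.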
What was changed
This PR brings behavior similar to the Go SDK, balancing the number of sticky pollers based on the sticky queue backlog reported by the server.
Why?
After looking at the behavior of workers in high-throughput scenarios, it's clear that the static sticky:normal ratio is prone to the following scenario:
Under a high volume of Workflow Tasks, workers continue to pull tasks from the normal queue even when the sticky queue is already backlogged. This grows the sticky queue backlog even further, which in turn leads to the expiration of sticky workflow tasks and workflow cache churn. The result is a snowball effect: more replays and decreased worker throughput. A worker should pull less from the normal task queue if it can't keep up with its sticky queue.
Closes #998