
Conversation

Contributor

@Spikhalskiy Spikhalskiy commented Sep 15, 2022

What was changed

This PR brings behavior similar to the Go SDK: it balances the number of sticky pollers based on the sticky queue backlog size reported by the server.

Why?

After looking at worker behavior in high-throughput scenarios, it's clear that the static sticky:normal ratio is prone to the following problem:
Under a high volume of Workflow Tasks, workers keep pulling tasks from the normal queue even when the sticky queue is already backlogged. This grows the sticky queue further, which in turn leads to expired sticky workflow tasks and workflow cache churn. The effect snowballs, causing more replays and decreasing worker throughput. A worker should pull less from the normal task queue if it can't keep up with its sticky queue.
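In rough terms, the balancing works like the sketch below. This is a minimal illustration, not the SDK's actual API: class and method names are made up, and the only assumption is that a sticky poll response carries a backlog size hint from the server, as described above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names do not refer to real SDK classes.
class StickyPollBalancerSketch {
  private final int pollersCount;
  private final AtomicInteger stickyPollers = new AtomicInteger();
  private final AtomicInteger normalPollers = new AtomicInteger();
  private final AtomicLong stickyBacklogSize = new AtomicLong();

  StickyPollBalancerSketch(int pollersCount) {
    this.pollersCount = pollersCount;
  }

  /** Called by a poll thread to decide which task queue to poll next. */
  boolean makePollSticky() {
    long backlog = stickyBacklogSize.get();
    // If the sticky backlog exceeds the worker's poller count, shift polls to the sticky queue;
    // otherwise keep roughly a 1:1 sticky:normal ratio.
    if (backlog > pollersCount || stickyPollers.get() <= normalPollers.get()) {
      stickyPollers.incrementAndGet();
      return true;
    }
    normalPollers.incrementAndGet();
    return false;
  }

  /** Called when a sticky poll completes, with the backlog hint from the poll response. */
  void finishStickyPoll(long backlogHintFromServer) {
    stickyPollers.decrementAndGet();
    stickyBacklogSize.set(backlogHintFromServer);
  }

  /** Called when a normal poll completes. */
  void finishNormalPoll() {
    normalPollers.decrementAndGet();
  }
}
```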

Closes #998

@Spikhalskiy Spikhalskiy force-pushed the dynamic-sticky-pollers branch 3 times, most recently from b7386bc to 08205ce Compare September 15, 2022 20:21
@Spikhalskiy Spikhalskiy force-pushed the dynamic-sticky-pollers branch 2 times, most recently from 2d5e875 to 6bcc386 Compare September 16, 2022 06:25
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
Member

I assume it is ok that this conditional is not atomic as a whole? Meaning, it's ok if technically in a racy situation this results in an unexpected sticky or normal poll when finishPoll is run by another thread at the same time? Otherwise, you'd be better off switching the atomic integers to regular ones and marking these methods synchronized (calls shouldn't be so frequent and time sensitive here that synchronized would cause a real problem IMO).
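For illustration, the suggested alternative might look roughly like the sketch below, assuming the same three pieces of state as the reviewed code: plain fields guarded by synchronized methods instead of atomic counters. Names are made up; the PR keeps the non-atomic check instead.

```java
// Sketch of the reviewer's suggested alternative, not the PR's actual implementation.
class SynchronizedStickyBalancerSketch {
  private final int pollersCount;
  private int stickyPollers;
  private int normalPollers;
  private long stickyBacklogSize;

  SynchronizedStickyBalancerSketch(int pollersCount) {
    this.pollersCount = pollersCount;
  }

  // The whole decision and the counter update happen under one lock,
  // so a concurrent finishPoll cannot interleave with the check.
  synchronized boolean makePollSticky() {
    if (stickyBacklogSize > pollersCount || stickyPollers <= normalPollers) {
      stickyPollers++;
      return true;
    }
    normalPollers++;
    return false;
  }

  synchronized void finishStickyPoll(long backlogHintFromServer) {
    stickyPollers--;
    stickyBacklogSize = backlogHintFromServer;
  }

  synchronized void finishNormalPoll() {
    normalPollers--;
  }
}
```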

Contributor Author

Yeah, I think it's ok here. The main argument is that the value of stickyBacklogSize is imprecise anyway: we don't control or verify the order in which responses arrive, so the value we write into stickyBacklogSize isn't necessarily the latest one. Given that, it's not worth fighting for a precise decision in this condition.

// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
Member

In Go, this is stickyBacklogSize > 0 instead of stickyBacklogSize > pollersCount. Is it intentional to check against the poller count here?

Contributor Author

@Spikhalskiy Spikhalskiy Sep 16, 2022

The comment above this statement addresses exactly this.

  // If pollersCount >= stickyBacklogSize > 0 we want to go back to a normal ratio to avoid a
  // situation that too many pollers (all of them in the worst case) will open only sticky queue
  // polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
  // that moment) and get stuck causing dip in worker load.

So yeah, I decided to change the Go condition a little to reduce the probability of a worker listening on sticky only because of a tiny backlog of 1 or 2.
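To make the difference concrete, a side-by-side of the two conditions as plain static helpers; this is illustration only (the real code uses the atomic counters shown in the diff, and the Go condition is paraphrased):

```java
final class StickyPollCondition {
  // Go SDK behavior (roughly): any non-empty sticky backlog shifts the next poll to sticky.
  static boolean goStyle(long stickyBacklogSize, int stickyPollers, int normalPollers) {
    return stickyBacklogSize > 0 || stickyPollers <= normalPollers;
  }

  // This PR: the backlog has to exceed the worker's poller count; a backlog of 1-2 tasks
  // (which may already be drained) keeps the normal 1:1 sticky:normal ratio.
  static boolean thisPr(
      long stickyBacklogSize, int pollersCount, int stickyPollers, int normalPollers) {
    return stickyBacklogSize > pollersCount || stickyPollers <= normalPollers;
  }
}
```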

Member

Ah, makes sense. So this still guarantees at least as many sticky pollers as normal pollers. In my head I'd rename pollersCount to stickyBacklogThreshold or something to clarify that it's not related to poll threads, but this works too.

@Spikhalskiy Spikhalskiy force-pushed the dynamic-sticky-pollers branch from 6bcc386 to ad5bd24 Compare September 16, 2022 16:46
@Spikhalskiy Spikhalskiy enabled auto-merge (squash) September 16, 2022 16:47
@Spikhalskiy Spikhalskiy merged commit 67a8022 into temporalio:master Sep 16, 2022
@Spikhalskiy Spikhalskiy deleted the dynamic-sticky-pollers branch September 16, 2022 22:15
Successfully merging this pull request may close these issues.

Bring Pollers & Workers on par with Core and Go implementations