Implement dynamic sticky queue polling based on the reported backlog #1438
Conversation
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
I assume it is ok that this conditional is not atomic as a whole? Meaning, it's ok if technically in a racy situation this results in an unexpected sticky or normal poll when `finishPoll` is run by another thread at the same time? Otherwise, you'd be better off switching the atomic integers to regular ones and marking these methods `synchronized` (calls shouldn't be so frequent and time sensitive here that `synchronized` would cause a real problem IMO).
Yeah, I think it's ok here. The main argument is that the value of `stickyBacklogSize` is imprecise anyway: we don't control or verify the order in which responses arrive, so the value we write into `stickyBacklogSize` may not be the actual latest one. Given that, it's not worth fighting for precise decisions in this condition.
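For context on the atomicity discussion, here is a minimal sketch of a poll balancer built around two `AtomicInteger` counters. The class and method names (`PollBalancerSketch`, `pollSticky`, `finishPoll`, `reportBacklog`) are hypothetical, not the PR's actual code; the sketch only illustrates why the composite condition tolerates the race: each counter read is individually atomic, and the backlog value is already approximate.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the PR's implementation.
final class PollBalancerSketch {
  private final AtomicInteger stickyPollers = new AtomicInteger();
  private final AtomicInteger normalPollers = new AtomicInteger();
  // Updated from poll responses; inherently stale, as discussed above.
  private volatile long stickyBacklogSize = 0;

  /** Decides which queue the next poll should target and registers it. */
  boolean pollSticky(int pollersCount) {
    // Not atomic as a whole: the counters may change between the two
    // get() calls. That only skews a single decision, which is acceptable
    // because stickyBacklogSize is approximate anyway.
    if (stickyBacklogSize > pollersCount
        || stickyPollers.get() <= normalPollers.get()) {
      stickyPollers.incrementAndGet();
      return true;
    }
    normalPollers.incrementAndGet();
    return false;
  }

  /** Called when a poll completes, releasing its slot. */
  void finishPoll(boolean wasSticky) {
    (wasSticky ? stickyPollers : normalPollers).decrementAndGet();
  }

  /** Records the sticky backlog size reported by the server. */
  void reportBacklog(long backlog) {
    stickyBacklogSize = backlog;
  }
}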
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
In Go, this is `stickyBacklogSize > 0` instead of `stickyBacklogSize > pollersCount`. Is it intentional to check against the poller count here?
The comment above this statement is actually addressing exactly this.
// If pollersCount >= stickyBacklogSize > 0 we want to go back to a normal ratio to avoid a
// situation that too many pollers (all of them in the worst case) will open only sticky queue
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
So yeah, I decided to change the Go condition a little to reduce the probability of a worker listening on sticky only because of a tiny backlog of 1 or 2.
Ah, makes sense. So this still guarantees at least as many sticky pollers as normal pollers. In my head I'd change `pollerCount` to `stickyBacklogThreshold` or something to clarify that it's not related to poll threads, but this works too.
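To make the difference between the two conditions concrete, here is a small hypothetical walkthrough (the values and the `ConditionDemo` class are illustrative, not from the PR): with 5 pollers, Go's `stickyBacklogSize > 0` would already force sticky polls at a tiny backlog of 2, while this PR's `stickyBacklogSize > pollersCount` falls back to the 1:1 ratio until the backlog exceeds the poller count.

// Illustrative demo, not part of the PR.
public class ConditionDemo {
  public static void main(String[] args) {
    int pollersCount = 5;
    for (long backlog : new long[] {0, 2, 50}) {
      boolean javaForcesSticky = backlog > pollersCount; // this PR's condition
      boolean goForcesSticky = backlog > 0;              // Go SDK's condition
      System.out.printf(
          "backlog=%d: java forces sticky=%b, go forces sticky=%b%n",
          backlog, javaForcesSticky, goForcesSticky);
    }
  }
}

Running it shows that only a backlog above the poller count forces sticky polling under the PR's condition; the 1:1 clause handles small backlogs.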
What was changed
This PR brings behavior similar to the Go SDK, balancing the number of sticky pollers based on the sticky queue backlog reported by the server.
Why?
After looking at the behavior of workers in high-throughput scenarios, it's clear that the static sticky:normal ratio is prone to the following scenario:
Under a high volume of Workflow Tasks, workers continue to pull tasks from the normal queue even when the sticky queue is already backlogged. This grows the sticky queue backlog even further, which in turn leads to the expiration of sticky workflow tasks and workflow cache churn. The result is a snowball effect: more replays and decreased worker throughput. A worker should pull less from the normal task queue if it can't keep up with its sticky queue.
Closes #998