
Fix bad balancing between sticky & nonsticky WFT polling with low slot/poller counts #925

Merged
Sushisource merged 11 commits into master from fix-low-wft-slot-sticky-balance
May 28, 2025
Conversation

@Sushisource
Member

Fixes an issue that could cause task timeouts when using very small (<2) numbers of WFT slots or pollers.

@Sushisource Sushisource requested a review from a team as a code owner May 27, 2025 00:45
@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch from 3fb334e to ee26a92 Compare May 27, 2025 00:47
Comment thread core/src/worker/workflow/wft_poller.rs Outdated
```rust
.wait_for(|v| {
    *v < *sticky_active.borrow()
        || !self.have_done_first_poll.load(Ordering::Relaxed)
})
```
Contributor

Is there an argument for a comment here to help future readers?

Contributor

For now, it looks to me like this condition will always evaluate to true, since have_done_first_poll is set to false on creation and never changed afterward.
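As a sketch of what the comment above expects (illustrative names, not the actual sdk-core code): `have_done_first_poll` should be a one-shot latch that is flipped to true once a poll completes. If nothing ever stores `true`, the `!have_done_first_poll` arm of the wait condition is permanently satisfied and the wait never blocks.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical one-shot latch; `FirstPollLatch`, `record_poll`, and
// `may_skip_wait` are invented names for illustration.
struct FirstPollLatch {
    have_done_first_poll: AtomicBool,
}

impl FirstPollLatch {
    fn new() -> Self {
        Self { have_done_first_poll: AtomicBool::new(false) }
    }

    // Must be called when any poll completes, or the latch never trips.
    fn record_poll(&self) {
        self.have_done_first_poll.store(true, Ordering::Relaxed);
    }

    // Mirrors the `!self.have_done_first_poll.load(..)` arm of the condition.
    fn may_skip_wait(&self) -> bool {
        !self.have_done_first_poll.load(Ordering::Relaxed)
    }
}

fn main() {
    let latch = FirstPollLatch::new();
    assert!(latch.may_skip_wait()); // first poll may proceed unconditionally
    latch.record_poll();
    assert!(!latch.may_skip_wait()); // later polls must pass the real check
    println!("ok");
}
```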

Contributor

Independently of that, however, there's something I'm not totally clear about in the existing wait logic here. The wait_for condition only gets reevaluated when the observed subject changes value, right? So in this specific branch (non-sticky), the condition gets reevaluated when the value of the non_sticky_active subject changes, but not when the value of sticky_active changes. Isn't there a risk that sticky_active gets increased but this condition never (or only much later) gets reevaluated, so that we don't unblock a non-sticky poller when we should?

Same question applies to the other branch.
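As a standard-library analogy of the hazard raised here (a toy sketch, not the sdk-core implementation; `Balance`, `bump_sticky`, and `wait_nonsticky_turn` are invented names): when a waiter's predicate reads two counters, the safe pattern is to signal one shared condition variable whenever either counter changes, so the waiter cannot miss an update to the counter it is not directly watching.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Two counters guarded by one mutex; one Condvar notified on ANY change.
struct Balance {
    counts: Mutex<(usize, usize)>, // (non_sticky_active, sticky_active)
    changed: Condvar,
}

impl Balance {
    fn bump_sticky(&self) {
        let mut c = self.counts.lock().unwrap();
        c.1 += 1;
        // Wake waiters even though only the sticky counter moved; skipping
        // this is the missed-wakeup risk described in the comment above.
        self.changed.notify_all();
    }

    // Block a non-sticky poll until it is no longer "ahead" of sticky.
    fn wait_nonsticky_turn(&self) {
        let mut c = self.counts.lock().unwrap();
        while c.0 > c.1 {
            c = self.changed.wait(c).unwrap();
        }
    }
}

fn main() {
    let b = Arc::new(Balance {
        counts: Mutex::new((1, 0)), // non-sticky is ahead; waiter must block
        changed: Condvar::new(),
    });
    let b2 = Arc::clone(&b);
    let waiter = thread::spawn(move || b2.wait_nonsticky_turn());
    thread::sleep(Duration::from_millis(50));
    b.bump_sticky(); // sticky catches up; the waiter is released
    waiter.join().unwrap();
    println!("unblocked");
}
```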

Contributor

Another question: assuming have_done_first_poll worked as its name suggests, your change would prioritize sticky pollers over non-sticky except on the first polls. So in the case where the max WFT poller count is one, it would forcibly be a sticky poller from that point on, no? No more polling on the non-sticky task queue, ever? Is that really the intended behavior?

Also, what if the workflow cache is disabled (max_cached_workflows = 0), so that we don't even have a sticky task queue? I really don't know how this is handled internally, but it seems like everything would still go through WFTPollerShared, so I assume sticky_active would remain 0 forever in that case. We'd then find ourselves incapable of doing any more polls past the very first one.

Member Author

Yeah, you're right that those somehow just weren't used. They were before, and then I refactored to simplify things, and somehow this weird broken version actually does the right thing (probably accidentally, by just not using sticky). Will fix.

Member Author

OK, these are in fact used now.

Member Author

Added a comment as well

Comment thread core/src/worker/workflow/wft_poller.rs Outdated
@Quinn-With-Two-Ns
Contributor

Can you describe what this "fix" is? And why is it only relevant if you have a low number of pollers/slots?

@Sushisource
Member Author

> Can you describe what this "fix" is? And why is it only relevant if you have a low number of pollers/slots?

The fix is enforcing the 2-value minimums now, as well as dealing with the problem James mentioned, where the balancing was waiting on changes to one channel when it needed to respect changes to both channels. That caused a problem where there was some random chance of getting unbalanced and "stuck" because only one channel was changing. This is more obvious at low poller counts.
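The "2-value minimums" part of the fix can be sketched as a simple clamp (the function name is illustrative, not sdk-core's API): when the user configures fewer than two WFT pollers/slots, treat the internal floor as two, so sticky and non-sticky polling can each make progress.

```rust
// Hypothetical sketch of the minimum described above; `effective_wft_count`
// is an invented name for illustration.
fn effective_wft_count(configured: usize) -> usize {
    configured.max(2)
}

fn main() {
    assert_eq!(effective_wft_count(1), 2); // too few: raised to the floor
    assert_eq!(effective_wft_count(8), 8); // ample: left alone
    println!("ok");
}
```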

Comment on lines -96 to -97
```rust
#[tokio::test]
async fn workflow_lru_cache_evictions() {
```
Member Author


This test is removed now since it depended on having only 1 poller to work. It's covered by unit tests anyway.

Comment thread core-api/src/worker.rs
Comment thread core-api/src/worker.rs Outdated
Contributor

@mjameswh left a comment


LGTM

```diff
@@ -96,31 +100,39 @@ impl WFTPollerShared {
     // If there's a sticky backlog, prioritize it.
```
Contributor

This whole function only applies to the simple max case and not the autoscaling case, right?

Member Author
@Sushisource, May 27, 2025

No, it applies to both. That can be a little confusing because it's not saying there has to be an equal number of each; it's saying there has to be an equal number of opportunities to acquire permits and get scaled.

Contributor

Oh, that's very surprising to me. Is this documented? What do you mean by "opportunities"?

Member Author

What I mean is that this code runs before permit acquisition and the scaler. So we balance attempts to acquire permits, and then scale. I.e., it's possible in autoscale mode to, for example, have both the sticky and nonsticky sides proceed past this balance, acquire permits, and allow scaling, but from then on the scaler might hold up nonsticky indefinitely until the sticky backlog is clear.

However... it is making me realize that the whole "balance" part does not need to apply to autoscale at all, only the backlog-clearing part. I will fix that. I was just running some tests with autoscale on, and since the sticky backlog constantly bounces off of 0, it ties them together more than necessary.

Contributor

Hm, OK. I'm sorry, I still don't quite understand the logic. It looks like wait_if_needed will block normal polls if there is a sticky backlog?

Member Author

Yeah, it will. Whether or not we actually want to do that can be debated. Yimin was very strong on the point that we ought to always prioritize sticky tasks if there's a known backlog, and I can understand why.

That said, the actual observed behavior I see is that it is still very possible for the number of normal pollers to be higher than the number of sticky pollers. I think this makes sense because it's finding a setpoint where the backlog is consistently at 0 or bouncing off of it. So, even though it seems like it means we'll end up not making use of the scale information, we still do.

Member Author
@Sushisource, May 27, 2025

Here's an example. In this image the high red poller count is actually nonsticky (and sticky is down at the bottom with some low value):

[screenshot]

Contributor

Reading through the comments, I'm sorry, but I still don't understand why this logic exists for the autoscaling case at all.

In the fixed-poller case we need some algorithm to balance the number of pollers, but in the autoscaling case the SDK picks the number of pollers based on feedback from the server; that should be all the balancing we need. I understand there is some argument for balancing with autoscaling if we are low on slots, but that logic is not being applied if we are autoscaling.
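To make the trade-off being debated concrete, here is a toy sketch (all names and the exact predicate are invented for illustration; this is not the merged sdk-core logic): explicit sticky/non-sticky balancing applies only with fixed pollers, while both modes avoid letting one task-queue kind consume every WFT slot.

```rust
// Toy model of the trade-off discussed above. `should_block_nonsticky` and
// its inputs are hypothetical; the real implementation differs.
fn should_block_nonsticky(
    autoscaling: bool,
    sticky_backlog: usize,
    nonsticky_in_flight: usize,
    total_slots: usize,
) -> bool {
    // In both modes, never let non-sticky polls tie up every available slot
    // while sticky work is known to be waiting.
    let would_hog_slots = nonsticky_in_flight + 1 >= total_slots;
    if autoscaling {
        // Server-driven poller scaling already balances poller counts;
        // only guard against slot exhaustion.
        would_hog_slots && sticky_backlog > 0
    } else {
        // Fixed pollers: additionally prioritize a known sticky backlog.
        would_hog_slots || sticky_backlog > 0
    }
}

fn main() {
    // Fixed pollers: a sticky backlog alone blocks a normal poll.
    assert!(should_block_nonsticky(false, 1, 0, 10));
    // Autoscaling with free slots and no backlog: never block.
    assert!(!should_block_nonsticky(true, 0, 3, 10));
    // Autoscaling, slots nearly exhausted and sticky work waiting: block.
    assert!(should_block_nonsticky(true, 1, 9, 10));
    println!("ok");
}
```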

Member Author

Yeah, I feel like I really tied myself in knots on this one. I've changed it to be much more obvious about what the real problem is here, which is tying up all the slots with one kind of poll.

@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch 2 times, most recently from 95100bc to d9e00e0 Compare May 27, 2025 23:53
@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch from d9e00e0 to 18e9186 Compare May 28, 2025 00:39
@Sushisource Sushisource merged commit 4d82754 into master May 28, 2025
20 checks passed
@Sushisource Sushisource deleted the fix-low-wft-slot-sticky-balance branch May 28, 2025 00:54
4 participants