
Fix bad balancing between sticky & nonsticky WFT polling with low slot/poller counts #925

Merged
Sushisource merged 11 commits into master from fix-low-wft-slot-sticky-balance
May 28, 2025
Conversation

@Sushisource
Member

Fixes an issue that could cause task timeouts when using very small (<2) numbers of WFT slots or pollers.

@Sushisource Sushisource requested a review from a team as a code owner May 27, 2025 00:45
@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch from 3fb334e to ee26a92 Compare May 27, 2025 00:47
Comment thread core/src/worker/workflow/wft_poller.rs Outdated
```rust
.wait_for(|v| {
    *v < *sticky_active.borrow()
        || !self.have_done_first_poll.load(Ordering::Relaxed)
})
```
Contributor

Is there an argument for a comment here to help future readers?

Contributor

For now, it looks to me like this condition will always evaluate to true, since have_done_first_poll is set to false on creation and never changed afterward.
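As a sketch of what the comment above expects (illustrative names, not the actual sdk-core code): `have_done_first_poll` should be a one-shot latch that is flipped to true once a poll completes. If nothing ever stores `true`, the `!have_done_first_poll` arm of the wait condition is permanently satisfied and the wait never blocks.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical one-shot latch; `FirstPollLatch`, `record_poll`, and
// `may_skip_wait` are invented names for illustration.
struct FirstPollLatch {
    have_done_first_poll: AtomicBool,
}

impl FirstPollLatch {
    fn new() -> Self {
        Self { have_done_first_poll: AtomicBool::new(false) }
    }

    // Must be called when any poll completes, or the latch never trips.
    fn record_poll(&self) {
        self.have_done_first_poll.store(true, Ordering::Relaxed);
    }

    // Mirrors the `!self.have_done_first_poll.load(..)` arm of the condition.
    fn may_skip_wait(&self) -> bool {
        !self.have_done_first_poll.load(Ordering::Relaxed)
    }
}

fn main() {
    let latch = FirstPollLatch::new();
    assert!(latch.may_skip_wait()); // first poll may proceed unconditionally
    latch.record_poll();
    assert!(!latch.may_skip_wait()); // later polls must pass the real check
    println!("ok");
}
```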

Contributor

Independently of that, however, there's something I'm not totally clear about in the existing wait logic here. The wait_for condition only gets reevaluated when the observed subject changes value, right? So in this specific branch (non-sticky), the condition gets reevaluated when the value of the non_sticky_active subject changes, but not when the value of sticky_active changes. Isn't there a risk that sticky_active gets increased but this condition never (or only much later) gets reevaluated, so that we don't unblock a non-sticky poller when we should?

Same question applies to the other branch.
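As a standard-library analogy of the hazard raised here (a toy sketch, not the sdk-core implementation; `Balance`, `bump_sticky`, and `wait_nonsticky_turn` are invented names): when a waiter's predicate reads two counters, the safe pattern is to signal one shared condition variable whenever either counter changes, so the waiter cannot miss an update to the counter it is not directly watching.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Two counters guarded by one mutex; one Condvar notified on ANY change.
struct Balance {
    counts: Mutex<(usize, usize)>, // (non_sticky_active, sticky_active)
    changed: Condvar,
}

impl Balance {
    fn bump_sticky(&self) {
        let mut c = self.counts.lock().unwrap();
        c.1 += 1;
        // Wake waiters even though only the sticky counter moved; skipping
        // this is the missed-wakeup risk described in the comment above.
        self.changed.notify_all();
    }

    // Block a non-sticky poll until it is no longer "ahead" of sticky.
    fn wait_nonsticky_turn(&self) {
        let mut c = self.counts.lock().unwrap();
        while c.0 > c.1 {
            c = self.changed.wait(c).unwrap();
        }
    }
}

fn main() {
    let b = Arc::new(Balance {
        counts: Mutex::new((1, 0)), // non-sticky is ahead; waiter must block
        changed: Condvar::new(),
    });
    let b2 = Arc::clone(&b);
    let waiter = thread::spawn(move || b2.wait_nonsticky_turn());
    thread::sleep(Duration::from_millis(50));
    b.bump_sticky(); // sticky catches up; the waiter is released
    waiter.join().unwrap();
    println!("unblocked");
}
```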

Contributor

Another question: assuming have_done_first_poll worked as its name suggests, your change would prioritize sticky pollers over non-sticky except on the first polls. So in the case where the max WFT poller count is one, it would forcibly be a sticky poller from that point on, no? No more polling on the non-sticky task queue, ever? Is that really the intended behavior?

Also, what if the workflow cache is disabled (max_cached_workflows = 0), so that we don't even have a sticky task queue? I really don't know how this is handled internally, but it seems like everything would still go through WFTPollerShared, so I assume sticky_active would remain 0 forever in that case. We'd then find ourselves incapable of doing any more polls past the very first one.

Member Author

Yeah, you're right that those somehow just weren't used. They were before, and then I refactored to simplify things, and somehow this weird broken version actually does the right thing (probably accidentally, by just not using sticky). Will fix.

Member Author

OK, these are in fact used now.

Member Author

Added a comment as well

Comment thread core/src/worker/workflow/wft_poller.rs Outdated
@Quinn-With-Two-Ns
Contributor

Can you describe what this "fix" is? And why is it only relevant if you have a low number of pollers/slots?

@Sushisource
Member Author

> Can you describe what this "fix" is? And why is it only relevant if you have a low number of pollers/slots?

The fix is enforcing the 2-value minimums now, as well as dealing with the problem James mentioned, where the balancing was waiting on changes to one channel when it needed to respect changes to both channels. That caused a problem where there was some random chance of getting unbalanced and "stuck" because only one channel was changing. This is more obvious at low poller counts.
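The "2-value minimums" part of the fix can be sketched as a simple clamp (the function name is illustrative, not sdk-core's API): when the user configures fewer than two WFT pollers/slots, treat the internal floor as two, so sticky and non-sticky polling can each make progress.

```rust
// Hypothetical sketch of the minimum described above; `effective_wft_count`
// is an invented name for illustration.
fn effective_wft_count(configured: usize) -> usize {
    configured.max(2)
}

fn main() {
    assert_eq!(effective_wft_count(1), 2); // too few: raised to the floor
    assert_eq!(effective_wft_count(8), 8); // ample: left alone
    println!("ok");
}
```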

Comment on lines -96 to -97
```rust
#[tokio::test]
async fn workflow_lru_cache_evictions() {
```
Member Author


This test is removed now since it depended on having only 1 poller to work. It's covered by unit tests anyway.

Comment thread core-api/src/worker.rs
Comment thread core-api/src/worker.rs Outdated
Contributor

@mjameswh left a comment


LGTM

```diff
@@ -96,31 +100,39 @@ impl WFTPollerShared {
     // If there's a sticky backlog, prioritize it.
```
Contributor

This whole function only applies to the simple max case and not the autoscaling case, right?

Member Author
@Sushisource, May 27, 2025

No, it applies to both. That can be a little confusing because it's not saying there has to be an equal number of each; it's saying there has to be an equal number of opportunities to acquire permits and get scaled.

Contributor

Oh, that's very surprising to me. Is this documented? What do you mean by "opportunities"?

Member Author

What I mean is that this code runs before permit acquisition and the scaler. So we balance attempts to acquire permits, and then scale. I.e., it's possible in autoscale mode to, for example, have both the sticky and nonsticky sides proceed past this balance, acquire permits, and allow scaling, but from then on the scaler might hold up nonsticky indefinitely until the sticky backlog is clear.

However... it is making me realize that the whole "balance" part does not need to apply to autoscale at all, only the backlog-clearing part. I will fix that. I was just running some tests with autoscale on, and since the sticky backlog constantly bounces off of 0, it ties them together more than necessary.

Contributor

Hm, OK. I'm sorry, I still don't quite understand the logic. It looks like wait_if_needed will block normal polls if there is a sticky backlog?

Member Author

Yeah, it will. Whether or not we actually want to do that can be debated. Yimin was very strong on the point that we ought to always prioritize sticky tasks if there's a known backlog, and I can understand why.

That said, the actual observed behavior I see is that it is still very possible for the number of normal pollers to be higher than the number of sticky pollers. I think this makes sense because it's finding a setpoint where the backlog is consistently at 0 or bouncing off of it. So, even though it seems like it means we'll end up not making use of the scale information, we still do.

Member Author
@Sushisource, May 27, 2025

Here's an example. In this image the high red poller count is actually nonsticky (and sticky is down at the bottom with some low value):

[screenshot]

Contributor

Reading through the comments, I'm sorry, but I still don't understand why this logic exists for the autoscaling case at all.

In the fixed-poller case we need some algorithm to balance the number of pollers, but in the autoscaling case the SDK picks the number of pollers based on feedback from the server; that should be all the balancing we need. I understand there is some argument for balancing with autoscaling if we are low on slots, but that logic is not being applied if we are autoscaling.
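To make the trade-off being debated concrete, here is a toy sketch (all names and the exact predicate are invented for illustration; this is not the merged sdk-core logic): explicit sticky/non-sticky balancing applies only with fixed pollers, while both modes avoid letting one task-queue kind consume every WFT slot.

```rust
// Toy model of the trade-off discussed above. `should_block_nonsticky` and
// its inputs are hypothetical; the real implementation differs.
fn should_block_nonsticky(
    autoscaling: bool,
    sticky_backlog: usize,
    nonsticky_in_flight: usize,
    total_slots: usize,
) -> bool {
    // In both modes, never let non-sticky polls tie up every available slot
    // while sticky work is known to be waiting.
    let would_hog_slots = nonsticky_in_flight + 1 >= total_slots;
    if autoscaling {
        // Server-driven poller scaling already balances poller counts;
        // only guard against slot exhaustion.
        would_hog_slots && sticky_backlog > 0
    } else {
        // Fixed pollers: additionally prioritize a known sticky backlog.
        would_hog_slots || sticky_backlog > 0
    }
}

fn main() {
    // Fixed pollers: a sticky backlog alone blocks a normal poll.
    assert!(should_block_nonsticky(false, 1, 0, 10));
    // Autoscaling with free slots and no backlog: never block.
    assert!(!should_block_nonsticky(true, 0, 3, 10));
    // Autoscaling, slots nearly exhausted and sticky work waiting: block.
    assert!(should_block_nonsticky(true, 1, 9, 10));
    println!("ok");
}
```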

Member Author

Yeah, I feel like I really tied myself in knots on this one. I've changed it to be much more obvious about what the real problem is here, which is tying up all the slots with one kind of poll.

@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch 2 times, most recently from 95100bc to d9e00e0 Compare May 27, 2025 23:53
@Sushisource Sushisource force-pushed the fix-low-wft-slot-sticky-balance branch from d9e00e0 to 18e9186 Compare May 28, 2025 00:39
@Sushisource Sushisource merged commit 4d82754 into master May 28, 2025
20 checks passed
@Sushisource Sushisource deleted the fix-low-wft-slot-sticky-balance branch May 28, 2025 00:54
4 participants