Fix bad balancing between sticky & nonsticky WFT polling with low slot/poller counts #925
Conversation
```rust
.wait_for(|v| {
    *v < *sticky_active.borrow()
        || !self.have_done_first_poll.load(Ordering::Relaxed)
})
```
Is there an argument for a comment here to help future readers?
For now, it looks to me like this condition will always eval to true, since have_done_first_poll is set to false on creation, and never changed afterward.
Independently of that, however, there's something I'm not totally clear about in the existing wait logic here. The wait_for condition will only get reevaluated when the observed subject's value changes, right? So in this specific branch (non-sticky), the condition will get reevaluated when the value of the non_sticky_active subject changes, but not when the value of sticky_active changes. Isn't there a risk that sticky_active gets increased but this condition never (or only much later) gets reevaluated, so that we are not unblocking a non-sticky poller when we should?
Same question applies to the other branch.
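A minimal sketch of the pitfall being described, assuming tokio-style watch channels (the channel names and the predicate here are illustrative stand-ins, not the SDK's actual fields):

```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // Two counters, each behind its own watch channel.
    let (non_sticky_tx, mut non_sticky_rx) = watch::channel(0usize);
    let (sticky_tx, sticky_rx) = watch::channel(0usize);

    let waiter = tokio::spawn(async move {
        // The predicate reads BOTH counters, but wait_for only re-runs it when
        // the channel it was called on (non-sticky) publishes a new value.
        non_sticky_rx
            .wait_for(|non_sticky| *non_sticky < *sticky_rx.borrow())
            .await
            .unwrap();
        println!("unblocked");
    });

    // Bumping only the sticky counter does not wake the waiter, even though
    // the predicate would now return true if it were re-evaluated...
    sticky_tx.send(5).unwrap();

    // ...it takes a publish on the watched (non-sticky) channel to trigger
    // the re-evaluation and let the waiter proceed.
    non_sticky_tx.send(0).unwrap();
    waiter.await.unwrap();
}
```

If that matches the real behavior, any predicate that mixes in state from the other channel can go stale until the watched channel happens to change.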
Another question - Assuming have_done_first_poll would work as its name suggests, your change would prioritize sticky pollers over non-sticky except on the first polls. So in the case where max WFT pollers is one, it would forcibly be a sticky one from that point on, no? No more polling on the non-sticky tq, ever? Is that really the intended behavior?
Also, what if the workflow cache is disabled (max_cached_workflows = 0), so that we don't even have a sticky task queue? I really don't know how this is handled internally, but it seems like everything would still go through WFTPollerShared, so I assume that sticky_active would remain 0 forever in that case. And then we'd find ourselves incapable of doing any more polls past the very first one.
Yeah, you're right that those somehow just weren't used. They were before and then I refactored to simplify things and somehow this weird broken version actually does the right thing (probably accidentally by just not using sticky). Will fix.
OK, these are in fact used now.
Added a comment as well
Can you describe what this "fix" is? And why is it only relevant if you have a low number of pollers/slots?

The fix is enforcing the 2-value minimums now, as well as dealing with the problem James mentioned where the balancing was waiting on changes to one channel when it needed to respect changes to both channels. This caused a problem where there was some random chance of getting unbalanced and "stuck" because only one channel was changing. This is more obvious at low poller counts.
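A rough sketch of what "respect changes to both channels" could look like, again assuming tokio-style watch channels and illustrative names and condition rather than the SDK's real predicate:

```rust
use tokio::sync::watch;

/// Wait until the balancing condition over both counters holds, re-checking it
/// whenever EITHER channel publishes a new value (closed-channel handling omitted).
async fn wait_balanced(
    non_sticky_active: &mut watch::Receiver<usize>,
    sticky_active: &mut watch::Receiver<usize>,
) {
    loop {
        // Re-evaluate against the latest values from both channels.
        if *non_sticky_active.borrow() <= *sticky_active.borrow() {
            return;
        }
        // Sleep until either side changes, not just one of them.
        tokio::select! {
            _ = non_sticky_active.changed() => {}
            _ = sticky_active.changed() => {}
        }
    }
}
```

The condition itself is a stand-in; the point is only that the wakeup is driven by both channels, so neither side can get stuck waiting on the other's counter.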
```rust
#[tokio::test]
async fn workflow_lru_cache_evictions() {
```
This test is removed now since it depended on having only 1 poller to work. It's covered by unit tests anyway.
```diff
@@ -96,31 +100,39 @@ impl WFTPollerShared {
     // If there's a sticky backlog, prioritize it.
```
This whole function only applies to the simple max case and not the autoscaling case, right?
No, it applies to both - that can be a little confusing because it's not saying there have to be an equal number of each - it's saying there has to be an equal number of opportunities to acquire permits and get scaled.
Oh, that's very surprising to me; is this documented? What do you mean by opportunities?
What I mean is this code runs before permit acquisition and the scaler. So, we balance attempts to acquire permits, and then scale. I.e., it's possible in autoscale mode to, for example, have both the sticky and nonsticky proceed past this balance, acquire permits, allow scale, but from then on the scaler might hold up nonsticky indefinitely until the sticky backlog is clear.
However... it is making me realize the whole "balance" part does not need to apply to autoscale at all. Only the backlog clearing part. I will fix that. I was just running some tests with the autoscale on and since the sticky backlog constantly bounces off of 0, it ties them together more than necessary.
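A stand-in skeleton of that ordering (balance the attempt, then acquire a slot permit, then the scaler gets its say); the types and method bodies are hypothetical, not the SDK's real API:

```rust
use tokio::sync::Semaphore;

// Stand-in for the shared balancing state; the real WFTPollerShared has more to it.
struct BalanceGate;

impl BalanceGate {
    async fn wait_if_needed(&self) {
        // Hold this poller back if its side has had more than its share of attempts.
    }
}

// One poll attempt, only to show the ordering: balance attempts first, then
// slot permit acquisition, and only then does the scaler weigh in.
async fn one_poll_attempt(gate: &BalanceGate, slots: &Semaphore) {
    // 1. Balance *attempts* between sticky and non-sticky.
    gate.wait_if_needed().await;

    // 2. Acquire a workflow task slot.
    let _slot = slots.acquire().await.expect("semaphore closed");

    // 3. In autoscale mode, the scaler runs from here on and can still hold a
    //    non-sticky poll indefinitely while a sticky backlog drains, so the
    //    balancing above does not determine the final mix of polls.
    //    (the actual poll request would be issued here)
}
```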
Hm, OK, I'm sorry, I still don't quite understand the logic. It looks like wait_if_needed will block normal polls if there is a sticky backlog?
Yeah, it will. Whether or not we actually want to do that can be debated. Yimin was very strong on the point that we ought to always prioritize sticky tasks if there's a known backlog, and I can understand why.
That said, the actual observed behavior I see is that it is still very possible for the number of normal pollers to be higher than the number of sticky pollers. I think this makes sense because it's finding a setpoint where the backlog is consistently at 0 or bouncing off of it. So, even though it seems like it means we'll end up not making use of the scale information, we still do.
Reading through the comments, I'm sorry, I still don't understand why this logic exists for the autoscaling case at all.
In the fixed-pollers case we need some algorithm to balance the number of pollers, but in the autoscaling case the SDK is picking the number of pollers based on feedback from the server, and that should be all the balancing we need. I understand there is some argument for balancing with autoscaling if we are low on slots, but that logic is not being applied if we are autoscaling.
Yeah, I feel like I really tied myself in knots on this one. I've changed it to just be way more obvious about what's the real problem here, which is tying up all the slots with one kind.

Fixes an issue that could cause task timeouts when using very small (<2) numbers of WFT slots or pollers.