Fix possibly incorrect is_replaying values when processing empty WFTs #910
Sushisource merged 8 commits into master
if !saw_command
    && next_next_event.event_type() == EventType::WorkflowTaskScheduled
This is the fix.
Before this change, if the history looked something like this (newest event at the bottom):
- WF started
- Full workflow task <- this is previous WFT started
- Full activity sched/start/compl
- Full workflow task
- WFT Scheduled
- WFT Started
I'd end up sending an activation to lang with the activity resolution but with replaying = false. That's because the function here, which decides the task boundaries, did not treat the span from the end of the last WFT, through the activity events, up to the next WFT as its own sequence. Instead it also included the following (partial) WFT, and since that WFT was the end of history, it decided replaying was false when processing those events.
It did so because WFT "heartbeats" normally don't count as a "real" WFT and are skipped over, since they shouldn't cause spurious wakeups (ex: when LAs are running).
However, in this case it really should count as a separate WFT, because the activity resolution happened in that sequence and should be considered replaying. Only then should we move on to the new (partial) WFT and set replaying to false.
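The boundary decision described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the actual Core implementation: the `Ev` enum, `end_of_next_wft_seq`, `is_replaying`, and `example_history` are hypothetical names, the history is reduced to the event kinds from the example, and "replaying" is simplified to "the sequence does not end at the last event in history". The `fixed` parameter toggles the extra `!saw_command` condition from the diff.

```rust
// Toy event kinds; a hypothetical stand-in for real history events.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ev {
    WorkflowExecutionStarted,
    WorkflowTaskScheduled,
    WorkflowTaskStarted,
    WorkflowTaskCompleted,
    ActivityTaskScheduled,
    ActivityTaskStarted,
    ActivityTaskCompleted,
}

// Simplified: only activity scheduling counts as a command event here.
fn is_command_event(e: Ev) -> bool {
    matches!(e, Ev::ActivityTaskScheduled)
}

// Returns the index of the WFT-started event that ends the next sequence,
// scanning from `from`. With `fixed = false`, every completed -> scheduled
// pair is treated as a heartbeat and skipped. With `fixed = true`, it is
// skipped only if no command event was seen in the sequence so far.
fn end_of_next_wft_seq(events: &[Ev], from: usize, fixed: bool) -> Option<usize> {
    let mut saw_command = false;
    for (ix, &e) in events.iter().enumerate().skip(from) {
        if is_command_event(e) || e == Ev::WorkflowExecutionStarted {
            saw_command = true;
        }
        if e != Ev::WorkflowTaskStarted {
            continue;
        }
        let looks_like_heartbeat = events.get(ix + 1) == Some(&Ev::WorkflowTaskCompleted)
            && events.get(ix + 2) == Some(&Ev::WorkflowTaskScheduled);
        let skip = if fixed {
            looks_like_heartbeat && !saw_command
        } else {
            looks_like_heartbeat
        };
        if !skip {
            return Some(ix);
        }
    }
    None
}

// Simplified assumption: a sequence is replayed if history continues past it.
fn is_replaying(events: &[Ev], end_ix: usize) -> bool {
    end_ix + 1 < events.len()
}

// The history from the example above, newest event last.
fn example_history() -> Vec<Ev> {
    use Ev::*;
    vec![
        WorkflowExecutionStarted,
        WorkflowTaskScheduled,
        WorkflowTaskStarted, // index 2: previous WFT started
        WorkflowTaskCompleted,
        ActivityTaskScheduled,
        ActivityTaskStarted,
        ActivityTaskCompleted,
        WorkflowTaskScheduled,
        WorkflowTaskStarted, // index 8: empty WFT that follows the activity
        WorkflowTaskCompleted,
        WorkflowTaskScheduled,
        WorkflowTaskStarted, // index 11: partial WFT at the end of history
    ]
}
```

In this model, scanning from index 3 without the extra condition skips the WFT at index 8 and ends the sequence at index 11, the end of history, so the activity resolution is delivered with replaying = false. With the condition, the sequence ends at index 8 and is marked as replaying, and the following sequence (from index 9) ends at index 11 with replaying = false.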
assert_eq!(seq.len(), 6);
let seq = next_check_peek(&mut update, 6);
assert_eq!(seq.len(), 13);
assert_eq!(seq.len(), 4);
This is the test that is updated to work with the new boundaries
cretz left a comment
LGTM. I can't think of an obvious way to break existing workflows that may inadvertently rely on the existing task-end-boundary expectation. We may want to consider an environment variable to be able to flip back for a release or two just in case.
* One UT fixed to show, others fail
Added emergency env var
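For illustration only, an escape hatch of this kind might look like the sketch below. The variable name `EXAMPLE_LEGACY_WFT_BOUNDARIES` and both function names are made up for this example; the name and parsing rules of the variable actually added in this PR are not shown here.

```rust
use std::env;

// Hypothetical name, chosen only for this sketch.
const LEGACY_BOUNDARY_ENV_VAR: &str = "EXAMPLE_LEGACY_WFT_BOUNDARIES";

// Treats "1" or "true" (case-insensitive, surrounding whitespace ignored)
// as enabled. Anything else, including an unset variable, is disabled.
fn parse_flag(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|s| s.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true")
    )
}

// Reads the process environment to decide whether to fall back to the old
// task-boundary behavior.
fn use_legacy_boundaries() -> bool {
    parse_flag(env::var(LEGACY_BOUNDARY_ENV_VAR).ok().as_deref())
}
```

Keeping the parsing in a separate function from the environment read makes the flag easy to test without mutating process-wide state.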
Currently, everything passes except some UTs that specifically test WFT boundaries involving histories with empty WFTs, which this fix changes.
I tried this change with Python, and everything passes. My intuition is that this is probably a compatible change, but we could always flag it to be sure. If anyone can come up with a counterexample, please do. This change can move jobs from one activation to another (without changing their ordering), but I don't think that affects how user code is woken up in any way that matters. The one visible difference is that is_replay becomes correct in places where it previously wasn't, and those cases would have led to incorrect behavior like in the bug, or to NDEs, anyway.