-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always yield when registering update handles #1325
Always yield when registering update handles #1325
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let me first check I understand the bug by trying to explain it in my own words. I think the bug is:
- Consider an update with a validator registered
- An update (that will be rejected by the validator) is received by the worker in the first WFT
- It gets buffered, and the main workflow coroutine hits the
SetUpdateHandler
- The main coroutine yields, because there's a buffered update
- The update handler coroutine starts to execute, and rejects the update
...
Now, a worker is replaying this workflow:
- The main workflow coroutine hits the
SetUpdateHandler
. But there are no buffered updates (the rejected update never went into history), so we do not yield - This is a replay non-determinism bug.
Is that right?
ts.NoError(ts.client.SignalWorkflow(ctx, run.GetID(), run.GetRunID(), "finish", "finished")) | ||
ts.NoError(run.Get(ctx, nil)) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite understanding these tests yet. As I understand it, the bug being fixed affects replay. But these tests don't seem to do anything that would trigger a replay. (I was expecting them to kill and then restart a worker or something, to force it to replay from the start.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All integration tests are run with and without cache. Running this test without cache will test that it replays deterministically. Explicit replay tests are only needed in rare situations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@dandavison yes you are correct. |
if getWorkflowEnvironment(ctx).TryUse(SDKPriorityUpdateHandling) { | ||
getWorkflowEnvironment(ctx).HandleQueuedUpdates(updateName) | ||
state := getState(ctx) | ||
defer state.unblocked() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only bit I don't understand now is why we're adding state.unblocked()
calls after returning from the yield
here and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this coroutine is no longer blocked after calling yield
. The scheduler detects a coroutine is blocked if it calls yield
twice with no unblocked
between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! (I have not yet found time to study this situation to the extent where I feel comfortable adding my approval to the PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like #1306 not released yet, so not even concerned with compat issues (not that I really would be anyways beyond a simple release note)
@@ -513,6 +513,7 @@ func (d *syncWorkflowDefinition) Execute(env WorkflowEnvironment, header *common | |||
// we are yielding. | |||
state := getState(d.rootCtx) | |||
state.yield("yield before executing to setup state") | |||
state.unblocked() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cretz any backwards compatibility concerns about this unblock? Based on my understand of the scheduler this unblock call should always have been here, but it didn't matter till the other changes in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit not understanding much of the behavior here. But I would think since this all happens before user code, there should be no affect right? However, the note in the yield
makes me think that there are other coroutines setting up state that come back and unblock this one. I think to feel confident here, we're going to have to find everything this is yielding to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the note in the yield makes me think that there are other coroutines setting up state that come back and unblock this on
as far as I am aware this is not true, yields don't need to be "unblocked". My concern is this yield
without a block was putting the scheduler in a bad state on the first WFT that was never exposed before. The question I am struggling to answer is if this bad state actually effected any user code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yield before executing to setup state
^ That's my bigger concern. What's it yielding to? And can what it's yielding to start its own coroutines that need to be added to the list while this one remains blocked? In other words, why does this yield at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's it yielding to is not a concern because I am not removing the yield if anything does. I don't actually think it does yield to anything. Maybe a workflow interceptor could spawn a workflow goroutine? I am more concerned with the coroutines state on the first WFT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's it yielding to is not a concern because I am not removing the yield if anything does.
If it yields to something that itself starts a coroutine, does marking this as unblocked mean this may run before that other coroutine where it didn't before? Now that I think about it, if it's not waiting to be unblocked I guess it doesn't matter. If there are no concerns that whatever it is yielding to will now have its order affected by this call, I'm good w/ this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah after talking it over I think it is safe
a48f457
to
4847176
Compare
4847176
to
02dd15a
Compare
Always yield when registering update handles. When we register an update handle we need to yield to allow any queued update to run. Previously we only yielded if we knew we had queued updates to run, but that logic does not work since on rejected updates are not stored in history we can't know if we should yield or not on replay. Therefore we always have to yield.