Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always yield when registering update handles #1325

Merged

Conversation

Quinn-With-Two-Ns
Copy link
Contributor

Always yield when registering update handles. When we register an update handle we need to yield to allow any queued update to run. Previously we only yielded if we knew we had queued updates to run, but that logic does not work since on rejected updates are not stored in history we can't know if we should yield or not on replay. Therefore we always have to yield.

@Quinn-With-Two-Ns Quinn-With-Two-Ns marked this pull request as ready for review December 19, 2023 18:13
@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner December 19, 2023 18:13
Copy link
Contributor

@dandavison dandavison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK let me first check I understand the bug by trying to explain it in my own words. I think the bug is:

  1. Consider an update with a validator registered
  2. An update (that will be rejected by the validator) is received by the worker in the first WFT
  3. It gets buffered, and the main workflow coroutine hits the SetUpdateHandler
  4. The main coroutine yields, because there's a buffered update
  5. The update handler coroutine starts to execute, and rejects the update

...

Now, a worker is replaying this workflow:

  1. The main workflow coroutine hits the SetUpdateHandler. But there are no buffered updates (the rejected update never went into history), so we do not yield
  2. This is a replay non-determinism bug.

Is that right?

ts.NoError(ts.client.SignalWorkflow(ctx, run.GetID(), run.GetRunID(), "finish", "finished"))
ts.NoError(run.Get(ctx, nil))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite understanding these tests yet. As I understand it, the bug being fixed affects replay. But these tests don't seem to do anything that would trigger a replay. (I was expecting them to kill and then restart a worker or something, to force it to replay from the start.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All integration tests are run with and without cache. Running this test without cache will test that it replays deterministically. Explicit replay tests are only needed in rare situations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Quinn-With-Two-Ns
Copy link
Contributor Author

@dandavison yes you are correct.

if getWorkflowEnvironment(ctx).TryUse(SDKPriorityUpdateHandling) {
getWorkflowEnvironment(ctx).HandleQueuedUpdates(updateName)
state := getState(ctx)
defer state.unblocked()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only bit I don't understand now is why we're adding state.unblocked() calls after returning from the yield here and above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this coroutine is no longer blocked after calling yield. The scheduler detects a coroutine is blocked if it calls yield twice with no unblocked between them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! (I have not yet found time to study this situation to the extent where I feel comfortable adding my approval to the PR.)

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like #1306 not released yet, so not even concerned with compat issues (not that I really would be anyways beyond a simple release note)

@@ -513,6 +513,7 @@ func (d *syncWorkflowDefinition) Execute(env WorkflowEnvironment, header *common
// we are yielding.
state := getState(d.rootCtx)
state.yield("yield before executing to setup state")
state.unblocked()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cretz any backwards compatibility concerns about this unblock? Based on my understand of the scheduler this unblock call should always have been here, but it didn't matter till the other changes in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit not understanding much of the behavior here. But I would think since this all happens before user code, there should be no affect right? However, the note in the yield makes me think that there are other coroutines setting up state that come back and unblock this one. I think to feel confident here, we're going to have to find everything this is yielding to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the note in the yield makes me think that there are other coroutines setting up state that come back and unblock this on

as far as I am aware this is not true, yields don't need to be "unblocked". My concern is this yield without a block was putting the scheduler in a bad state on the first WFT that was never exposed before. The question I am struggling to answer is if this bad state actually effected any user code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yield before executing to setup state

^ That's my bigger concern. What's it yielding to? And can what it's yielding to start its own coroutines that need to be added to the list while this one remains blocked? In other words, why does this yield at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's it yielding to is not a concern because I am not removing the yield if anything does. I don't actually think it does yield to anything. Maybe a workflow interceptor could spawn a workflow goroutine? I am more concerned with the coroutines state on the first WFT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's it yielding to is not a concern because I am not removing the yield if anything does.

If it yields to something that itself starts a coroutine, does marking this as unblocked mean this may run before that other coroutine where it didn't before? Now that I think about it, if it's not waiting to be unblocked I guess it doesn't matter. If there are no concerns that whatever it is yielding to will now have its order affected by this call, I'm good w/ this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah after talking it over I think it is safe

@Quinn-With-Two-Ns Quinn-With-Two-Ns force-pushed the fix-rejected-update branch 3 times, most recently from a48f457 to 4847176 Compare January 17, 2024 20:57
@Quinn-With-Two-Ns Quinn-With-Two-Ns merged commit e3a3d95 into temporalio:master Jan 18, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants