Refactor matching operations when user data disabled #4493
Conversation
@@ -448,12 +444,6 @@ func (e *matchingEngineImpl) DispatchSpooledTask(
	unversionedOrigTaskQueue := newTaskQueueIDWithVersionSet(origTaskQueue, "")
	// Redirect and re-resolve if we're blocked in matcher and user data changes.
	for {
		shouldDrop, err := e.shouldDropTask(unversionedOrigTaskQueue, directive)
Curious why you chose to keep these spooled tasks; aren't you worried that all spooled tasks will be blocked? I don't see much of a difference between dropping AddTask requests and dropping spooled tasks, but I might be missing something.
The only tasks that would be blocked here are the ones from specific-version-set queues. So basically all spooled versioned tasks will be blocked and spooled unversioned tasks won't be, which I think is fine.
I agree there's not much difference and this isn't a dramatic improvement but I think it could lessen the impact of this kill switch, and takes zero code to do, so why not? Just consistency?
(After I saw this I thought for a moment that we could keep all versioned tasks that we get if user data is disabled by adding them to the unversioned queue, but that would block the unversioned queue so we can't do that. (We could do a fixed set id but that has issues too as I mentioned.))
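The distinction being discussed above can be sketched as a tiny predicate. This is a hypothetical simplification, not the actual `shouldDropTask` implementation: the type, the kill-switch function, and the predicate name are all made up for illustration. The point it captures is that when user data is disabled, only tasks bound for a specific-version-set queue are held back, while unversioned tasks dispatch normally.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a spooled task; an empty versionSet
// means the task targets the unversioned queue.
type task struct {
	versionSet string
}

// userDataDisabled stands in for the kill switch discussed in the thread.
func userDataDisabled() bool { return true }

// shouldBlockSpooledTask mirrors the behavior described above: only tasks
// from specific-version-set queues are blocked while user data is disabled.
func shouldBlockSpooledTask(t task) bool {
	return userDataDisabled() && t.versionSet != ""
}

func main() {
	fmt.Println(shouldBlockSpooledTask(task{versionSet: "v1-set"})) // versioned: blocked
	fmt.Println(shouldBlockSpooledTask(task{versionSet: ""}))       // unversioned: dispatched
}
```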
Doesn't this block new tasks with a versioning directive of `default`?
It's late so I'm not trusting myself, but I think this change will cause all new tasks (which get a `default` directive) to be dropped and spooled tasks to be stuck.
Looks like I introduced this bug 🤦
I think the problem you're talking about isn't here, it's in AddWorkflowTask:
Case 1) AddWorkflowTask happens with a directive of default when user data is loaded. It redirects to a versioned queue. Sync match fails so it's spooled on that versioned queue. Now user data is disabled. Task comes back to DispatchSpooledTask, gets blocked (or maybe dropped). This is arguably okay.
Case 2) AddWorkflowTask happens with a directive of default when user data is disabled. It gets dropped. This is... not so good.
Actually it's not clear what the semantics are. If someone has added versioning data saying new workflows should run on v1, we disable user data, then someone starts a workflow... we should run it on the unversioned queue? That's pretty much breaking semantics. But blocking new workflows is worse than blocking existing ones.
We could redirect everything to a fixed fallback build id and ask users to run workers for that build id if they want workflows to make progress while we fix our bugs. This is a slight variation of my DLQ idea, except it pushes more onto users.
The other option is to give up on the "drop" idea and send everything to the unversioned queue, which is in some sense the same thing, just letting the fixed fallback be the unversioned queue.
Doesn't failing potentially create strain on the matching nodes and the DB?
Can you clarify what happens after a spooled task has failed? How and when is it retried?
It just loops in taskReader.dispatchBufferedTasks and retries with a constant 1s timeout. I don't think it adds any real load: the goroutine is already running, and it's not interacting with the DB at all on this path.
So the other tasks will be stuck behind it, which is okay AFAIU because those tasks will all be versioned as well (not using the default directive).
@@ -1216,6 +1204,10 @@ func (e *matchingEngineImpl) getTask(
	stickyInfo,
)
if err != nil {
	if err == errUserDataDisabled {
		// Rewrite to nicer error message
		err = serviceerror.NewFailedPrecondition("Operations on versioned workflows are disabled")
Just noting that the implication here is that this will crash Core-based SDKs after a grace period of 1 minute.
I think that's better than silently keeping those workers around idle.
Right. We can change this to a NewerBuild error to keep them idle?
Not sure, it depends on how transient this state is.
It seems more permanent and less expected than a new build, so I tend to think we should crash and be noisy.
I still want some clarifications, but this is already better than what we had before, so I'm merging.
**What changed?**

**Why?**

**How did you test it?**
existing + new tests

**Potential risks**

**Is hotfix candidate?**