Refactor matching operations when user data disabled #4493
Conversation
@@ -448,12 +444,6 @@ func (e *matchingEngineImpl) DispatchSpooledTask(
	unversionedOrigTaskQueue := newTaskQueueIDWithVersionSet(origTaskQueue, "")
	// Redirect and re-resolve if we're blocked in matcher and user data changes.
	for {
		shouldDrop, err := e.shouldDropTask(unversionedOrigTaskQueue, directive)
Curious why you chose to keep these spooled tasks; aren't you worried that all spooled tasks will be blocked? I don't see much of a difference between dropping AddTask requests and dropping spooled tasks, but I might be missing something.
The only tasks that would be blocked here are the ones from specific-version-set queues. So basically all spooled versioned tasks will be blocked and spooled unversioned tasks won't be, which I think is fine.
I agree there's not much difference and this isn't a dramatic improvement but I think it could lessen the impact of this kill switch, and takes zero code to do, so why not? Just consistency?
(After I saw this I thought for a moment that we could keep all versioned tasks that we get if user data is disabled by adding them to the unversioned queue, but that would block the unversioned queue so we can't do that. (We could do a fixed set id but that has issues too as I mentioned.))
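The distinction being discussed above can be sketched as a tiny predicate. This is a hypothetical simplification, not the actual `shouldDropTask` implementation: the type, the kill-switch function, and the predicate name are all made up for illustration. The point it captures is that when user data is disabled, only tasks bound for a specific-version-set queue are held back, while unversioned tasks dispatch normally.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a spooled task; an empty versionSet
// means the task targets the unversioned queue.
type task struct {
	versionSet string
}

// userDataDisabled stands in for the kill switch discussed in the thread.
func userDataDisabled() bool { return true }

// shouldBlockSpooledTask mirrors the behavior described above: only tasks
// from specific-version-set queues are blocked while user data is disabled.
func shouldBlockSpooledTask(t task) bool {
	return userDataDisabled() && t.versionSet != ""
}

func main() {
	fmt.Println(shouldBlockSpooledTask(task{versionSet: "v1-set"})) // versioned: blocked
	fmt.Println(shouldBlockSpooledTask(task{versionSet: ""}))       // unversioned: dispatched
}
```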
Doesn't this block new tasks with a versioning directive of `default`?
It's late so I'm not trusting myself, but I think this change will cause all new tasks (which get a `default` directive) to be dropped and spooled tasks to be stuck.
Looks like I introduced this bug 🤦
I think the problem you're talking about isn't here, it's in AddWorkflowTask:
Case 1) AddWorkflowTask happens with a directive of default when user data is loaded. It redirects to a versioned queue. Sync match fails so it's spooled on that versioned queue. Now user data is disabled. Task comes back to DispatchSpooledTask, gets blocked (or maybe dropped). This is arguably okay.
Case 2) AddWorkflowTask happens with a directive of default when user data is disabled. It gets dropped. This is... not so good.
Actually it's not clear what the semantics are. If someone has added versioning data saying new workflows should run on v1, we disable user data, then someone starts a workflow... we should run it on the unversioned queue? That's pretty much breaking semantics. But blocking new workflows is worse than blocking existing ones.
We could redirect everything to a fixed fallback build id and ask users to run workers for that build id if they want workflows to make progress while we fix our bugs. This is a slight variation of my DLQ idea, except it pushes more onto users.
The other option is to give up on the "drop" idea and send everything to the unversioned queue, which is in some sense the same thing, just letting the fixed fallback be the unversioned queue.
Doesn't failing potentially create strain on the matching nodes and the DB?
Can you clarify what happens after a spooled task has failed? How and when is it retried?
It just loops in taskReader.dispatchBufferedTasks and retries with a constant 1s timeout. I don't think it adds any real load: the goroutine is already running, and it's not interacting with the DB at all on this path.
So the other tasks will be stuck behind it, which is okay AFAIU because those tasks will all be versioned as well (not using the default directive).
@@ -1216,6 +1204,10 @@ func (e *matchingEngineImpl) getTask(
	stickyInfo,
)
if err != nil {
	if err == errUserDataDisabled {
		// Rewrite to nicer error message
		err = serviceerror.NewFailedPrecondition("Operations on versioned workflows are disabled")
Just noting that the implication here is that this will crash Core-based SDKs after a grace period of 1 minute.
I think that's better than silently keeping those workers around idle.
Right. We can change this to a NewerBuild error to keep them idle?
Not sure, it depends on how transient this state is.
It seems more permanent and less expected than a new build, so I tend to think we should crash and be noisy.
I still want some clarifications, but this is already better than what we had before, so I'm merging.
**What changed?**

**Why?**

**How did you test it?**
existing + new tests

**Potential risks**

**Is hotfix candidate?**