Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor matching operations when user data disabled #4493

Merged
merged 2 commits into from Jun 16, 2023

Conversation

dnr
Copy link
Member

@dnr dnr commented Jun 14, 2023

What changed?

  • Use error from GetUserData for "drop task" decisions
  • New workflows ("use default") can start on the unversioned queue when user data is disabled
  • Buffered versioned tasks are not dropped, they're retried indefinitely
  • Dispatch buffered tasks loop exits sooner on context canceled
  • GetWorkerBuildIdCompatibility, etc. return error when user data disabled

Why?

  • Consolidate logic, easier to understand
  • Allow workflow to start when user data disabled
  • Preserve tasks when possible
  • Return error instead of misleading data

How did you test it?
existing + new tests

Potential risks

Is hotfix candidate?

@dnr dnr requested a review from bergundy June 14, 2023 16:06
@dnr dnr requested a review from a team as a code owner June 14, 2023 16:06
common/dynamicconfig/static_properties.go Show resolved Hide resolved
common/util.go Show resolved Hide resolved
@@ -448,12 +444,6 @@ func (e *matchingEngineImpl) DispatchSpooledTask(
unversionedOrigTaskQueue := newTaskQueueIDWithVersionSet(origTaskQueue, "")
// Redirect and re-resolve if we're blocked in matcher and user data changes.
for {
shouldDrop, err := e.shouldDropTask(unversionedOrigTaskQueue, directive)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you chose to keep these spooled tasks, aren't you worried that all spooled tasks will be blocked? I don't see much of difference between dropping AddTask requests and dropping spooled tasks but I might be missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only tasks that would be blocked here are the ones from specific-version-set queues. So basically all spooled versioned tasks will be blocked and spooled unversioned tasks won't be, which I think is fine.

I agree there's not much difference and this isn't a dramatic improvement but I think it could lessen the impact of this kill switch, and takes zero code to do, so why not? Just consistency?

(After I saw this I thought for a moment that we could keep all versioned tasks that we get if user data is disabled by adding them to the unversioned queue, but that would block the unversioned queue so we can't do that. (We could do a fixed set id but that has issues too as I mentioned.))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this block new tasks with versioning directive of default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's late so I'm not trusting myself, but I think this change will cause all new tasks (which get default directive) to be dropped and spooled tasks to be stuck.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I introduced this bug 🤦

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem you're talking about isn't here, it's in AddWorkflowTask:

Case 1) AddWorkflowTask happens with a directive of default when user data is loaded. It redirects to a versioned queue. Sync match fails so it's spooled on that versioned queue. Now user data is disabled. Task comes back to DispatchSpooledTask, gets blocked (or maybe dropped). This is arguably okay.

Case 2) AddWorkflowTask happens with a directive of default when user data is disabled. It gets dropped. This is... not so good.

Actually it's not clear what the semantics are. If someone has added versioning data saying new workflows should run on v1, we disable user data, then someone starts a workflow... we should run it on the unversioned queue? That's pretty much breaking semantics. But blocking new workflows is worse than blocking existing ones.

We could redirect everything to a fixed fallback build id and ask users to run workers for that build id if they want workflows to make progress while we fix our bugs. This is kind of a slight variation of my dlq idea except push it on users more.

The other option is to give up on the "drop" idea and send everything to the unversioned queue. Which is in some sense the same thing, just letting the fixed one be the unversioned one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't failing potentially creating strain on the matching nodes and DB?
Can you clarify what happens after a spooled task has failed? How/when is it retried?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just loops in taskReader.dispatchBufferedTasks and retries with a constant 1s timeout. I don't think it's any real additional load.. the goroutine is already running and it's not interacting with the db at all on this path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the other tasks will be stuck behind, which is okay AFAIU because those tasks will all be versioned as well (not using default) IIUC.

@@ -1216,6 +1204,10 @@ func (e *matchingEngineImpl) getTask(
stickyInfo,
)
if err != nil {
if err == errUserDataDisabled {
// Rewrite to nicer error message
err = serviceerror.NewFailedPrecondition("Operations on versioned workflows are disabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that the implication here is that this will crash Core based SDKs after a grace period of 1 minute.
I think that's better than silently keeping those workers around idle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We can change this to a NewerBuild error to keep them idle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, it depends on how transient this state is.
Seems more permanent and less expected than than new build so I tend to think we should crash and be noisy.

Copy link
Member

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still want some clarifications but this is already better than what we had before so I'm merging.

@bergundy bergundy merged commit ae2d232 into temporalio:master Jun 16, 2023
9 of 10 checks passed
@dnr dnr deleted the ver27 branch June 16, 2023 19:37
@dnr dnr mentioned this pull request Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants