Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete running workflow executions #2819

Merged

Conversation

alexshtin
Copy link
Member

@alexshtin alexshtin commented May 8, 2022

What changed?
Add ability to delete running workflow executions. When running workflow execution is deleted then workflow is terminated with special flag deleteOnTerminate set to true. This allows to terminate and delete workflow execution within the same transfer task and avoid race condition between terminate/close and delete transfer tasks. Possibility of race condition is still exist in case if running workflow it getting terminated and immediately deleted. In this case, if delete transfer task is processed first, then close transfer task won't be able to access workflow execution mutable state and won't be able to run actions required to run by closed workflow (archive, report to parent, process parent close policy).

Why?
To prevent race condition while deleting running workflows and allow to delete running workflows w/o extra API call.

How did you test it?
Added new integration test and modified existing unit tests.

Potential risks
Race condition between terminate and delete calls is still possible.

Is hotfix candidate?
No.

@alexshtin alexshtin requested a review from a team as a code owner May 8, 2022 02:45
@yiminc yiminc requested a review from yycptt May 9, 2022 01:59
Copy link
Member

@yycptt yycptt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach occurred to me when reviewing the pr: when updating workflow, record the latest taskID in mutable state. On workflow delete, check if transfer ack level in shard Info > closed workflow last taskID. If so, perform the deletion, otherwise return error, backoff, and retry later.

service/history/transferQueueStandbyTaskExecutor.go Outdated Show resolved Hide resolved
service/history/timerQueueTaskExecutorBase.go Outdated Show resolved Hide resolved
@alexshtin alexshtin force-pushed the feature/delete-running-execution branch from 746dcaa to 7760640 Compare May 10, 2022 22:03
common/rpc/interceptor/telemetry.go Outdated Show resolved Hide resolved
service/history/historyEngine.go Outdated Show resolved Hide resolved
Comment on lines 102 to 108
// Create delete workflow execution task only if workflow is closed successfully and all pending tasks are completed.
// Otherwise, mutable state might be deleted before close tasks are executed.
// Unfortunately, queue ack levels are updated with delay (default 60s),
// therefore this API will return error if workflow is deleted within 60 seconds after close.
if (ms.GetExecutionInfo().CloseTransferTaskId != 0 && // backward compatibility (remove in 1.18).
ms.GetExecutionInfo().CloseTransferTaskId > transferQueueAckLevel) || // Transfer close task wasn't executed.
(ms.GetExecutionInfo().CloseVisibilityTaskId != 0 && // backward compatibility (remove in 1.18).
ms.GetExecutionInfo().CloseVisibilityTaskId > visibilityQueueAckLevel) {
return consts.ErrWorkflowNotReady
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail the DeleteWorkflow API call. I think it is better to let this task in, and verify / wait this in the task processing. If the ack level is not reached, fail the task processing and let the task processing automatically retry it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also visibility delete task doesn't have access to mutable state and moving this check there will add penalty (db read, which in case of Cassandra will go to the deepest internal SS) to every visibility delete operation.

service/history/historyEngine.go Outdated Show resolved Hide resolved
Comment on lines 135 to 136
int64 close_transfer_task_id = 64;
int64 close_visibility_task_id = 65;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we want to make this more general, to something like lastTaskKey[category], which might be useful for detecting stuck workflow?

But I don't have strong opinion on this since I don't have a clear use case in mind right now and it might be a premature optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this option too but having map for just 2 ints w/o proper understanding of future use seems to be too much.

Comment on lines 102 to 108
// Create delete workflow execution task only if workflow is closed successfully and all pending tasks are completed.
// Otherwise, mutable state might be deleted before close tasks are executed.
// Unfortunately, queue ack levels are updated with delay (default 60s),
// therefore this API will return error if workflow is deleted within 60 seconds after close.
if (ms.GetExecutionInfo().CloseTransferTaskId != 0 && // backward compatibility (remove in 1.18).
ms.GetExecutionInfo().CloseTransferTaskId > transferQueueAckLevel) || // Transfer close task wasn't executed.
(ms.GetExecutionInfo().CloseVisibilityTaskId != 0 && // backward compatibility (remove in 1.18).
ms.GetExecutionInfo().CloseVisibilityTaskId > visibilityQueueAckLevel) {
return consts.ErrWorkflowNotReady
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.

@alexshtin alexshtin force-pushed the feature/delete-running-execution branch from d23228c to 3b6baea Compare May 11, 2022 22:29
@alexshtin alexshtin merged commit cf4153c into temporalio:master May 12, 2022
@alexshtin alexshtin deleted the feature/delete-running-execution branch May 12, 2022 04:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants