-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete running workflow executions #2819
Delete running workflow executions #2819
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach occurred to me when reviewing the pr: when updating workflow, record the latest taskID in mutable state. On workflow delete, check if transfer ack level in shard Info > closed workflow last taskID. If so, perform the deletion, otherwise return error, backoff, and retry later.
746dcaa
to
7760640
Compare
// Create delete workflow execution task only if workflow is closed successfully and all pending tasks are completed. | ||
// Otherwise, mutable state might be deleted before close tasks are executed. | ||
// Unfortunately, queue ack levels are updated with delay (default 60s), | ||
// therefore this API will return error if workflow is deleted within 60 seconds after close. | ||
if (ms.GetExecutionInfo().CloseTransferTaskId != 0 && // backward compatibility (remove in 1.18). | ||
ms.GetExecutionInfo().CloseTransferTaskId > transferQueueAckLevel) || // Transfer close task wasn't executed. | ||
(ms.GetExecutionInfo().CloseVisibilityTaskId != 0 && // backward compatibility (remove in 1.18). | ||
ms.GetExecutionInfo().CloseVisibilityTaskId > visibilityQueueAckLevel) { | ||
return consts.ErrWorkflowNotReady | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail the DeleteWorkflow API call. I think it is better to let this task in, and verify / wait this in the task processing. If the ack level is not reached, fail the task processing and let the task processing automatically retry it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also visibility delete task doesn't have access to mutable state and moving this check there will add penalty (db read, which in case of Cassandra will go to the deepest internal SS) to every visibility delete operation.
int64 close_transfer_task_id = 64; | ||
int64 close_visibility_task_id = 65; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we want to make this more general, to something like lastTaskKey[category], which might be useful for detecting stuck workflow?
But I don't have strong opinion on this since I don't have a clear use case in mind right now and it might be a premature optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this option too but having map for just 2 ints w/o proper understanding of future use seems to be too much.
// Create delete workflow execution task only if workflow is closed successfully and all pending tasks are completed. | ||
// Otherwise, mutable state might be deleted before close tasks are executed. | ||
// Unfortunately, queue ack levels are updated with delay (default 60s), | ||
// therefore this API will return error if workflow is deleted within 60 seconds after close. | ||
if (ms.GetExecutionInfo().CloseTransferTaskId != 0 && // backward compatibility (remove in 1.18). | ||
ms.GetExecutionInfo().CloseTransferTaskId > transferQueueAckLevel) || // Transfer close task wasn't executed. | ||
(ms.GetExecutionInfo().CloseVisibilityTaskId != 0 && // backward compatibility (remove in 1.18). | ||
ms.GetExecutionInfo().CloseVisibilityTaskId > visibilityQueueAckLevel) { | ||
return consts.ErrWorkflowNotReady | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.
d23228c
to
3b6baea
Compare
What changed?
Add ability to delete running workflow executions. When running workflow execution is deleted then workflow is terminated with special flag
deleteOnTerminate
set totrue
. This allows to terminate and delete workflow execution within the same transfer task and avoid race condition between terminate/close and delete transfer tasks. Possibility of race condition is still exist in case if running workflow it getting terminated and immediately deleted. In this case, if delete transfer task is processed first, then close transfer task won't be able to access workflow execution mutable state and won't be able to run actions required to run by closed workflow (archive, report to parent, process parent close policy).Why?
To prevent race condition while deleting running workflows and allow to delete running workflows w/o extra API call.
How did you test it?
Added new integration test and modified existing unit tests.
Potential risks
Race condition between terminate and delete calls is still possible.
Is hotfix candidate?
No.