Delete running workflow executions #2819

alexshtin · 2022-05-08T02:45:10Z

What changed?
Add the ability to delete running workflow executions. When a running workflow execution is deleted, the workflow is terminated with a special flag, deleteOnTerminate, set to true. This allows the workflow execution to be terminated and deleted within the same transfer task, which avoids a race condition between the terminate/close and delete transfer tasks. A race condition is still possible if a running workflow is terminated and then immediately deleted. In that case, if the delete transfer task is processed first, the close transfer task won't be able to access the workflow execution's mutable state and won't be able to run the actions required for a closed workflow (archive, report to parent, process parent close policy).
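A minimal sketch of the idea, not the code in this PR: if deletion is carried as a flag on the close task, the close actions and the delete are ordered inside one task, so they cannot race. The `closeTask` type, its `DeleteOnTerminate` field, and `processCloseTask` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// closeTask is a hypothetical stand-in for a close-execution transfer task.
type closeTask struct {
	WorkflowID        string
	DeleteOnTerminate bool // hypothetical field mirroring the deleteOnTerminate flag
}

// processCloseTask returns the ordered steps a single task would run.
// The close actions always come first; deletion is appended only when
// the flag is set, so it can never run ahead of them.
func processCloseTask(t closeTask) []string {
	steps := []string{"archive", "report to parent", "process parent close policy"}
	if t.DeleteOnTerminate {
		steps = append(steps, "delete execution")
	}
	return steps
}

func main() {
	task := closeTask{WorkflowID: "wf-1", DeleteOnTerminate: true}
	for _, step := range processCloseTask(task) {
		fmt.Println(step)
	}
}
```

With two separate tasks, nothing would force this ordering, which is the remaining race described above for a terminate call followed immediately by a delete call.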

Why?
To prevent a race condition when deleting running workflows and to allow deleting running workflows without an extra API call.

How did you test it?
Added a new integration test and modified existing unit tests.

Potential risks
A race condition between terminate and delete calls is still possible.

Is hotfix candidate?
No.

service/history/api/workflow_id_reuse_policy.go

service/history/replicationTaskExecutor.go

service/history/transferQueueActiveTaskExecutor.go

service/history/transferQueueStandbyTaskExecutor.go

yycptt

Another approach occurred to me when reviewing the PR: when updating a workflow, record the latest taskID in mutable state. On workflow delete, check whether the transfer ack level in the shard info > the closed workflow's last taskID. If so, perform the deletion; otherwise return an error, back off, and retry later.
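A rough sketch of that suggestion, under stated assumptions: `deleteWithBackoff`, `errNotReady`, and the `readAck` callback are hypothetical names, and the ack level is simulated rather than read from shard info. The comparison follows the comment as written (ack level strictly greater than the last task ID).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error for "close tasks not yet acked".
var errNotReady = errors.New("workflow not ready for deletion")

// deleteWithBackoff retries until the ack level passes lastTaskID or the
// attempts run out. It returns the number of attempts made. The delay
// doubles after each failed attempt.
func deleteWithBackoff(readAck func() int64, lastTaskID int64, maxAttempts int, base time.Duration) (int, error) {
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if readAck() > lastTaskID {
			// All tasks up to lastTaskID are acked; deletion would happen here.
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, errNotReady
}

func main() {
	ack := int64(8)
	// Simulated ack level that advances by one on every read.
	readAck := func() int64 { ack++; return ack }
	attempts, err := deleteWithBackoff(readAck, 10, 5, time.Millisecond)
	fmt.Println(attempts, err)
}
```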

service/history/transferQueueStandbyTaskExecutor.go

service/history/timerQueueTaskExecutorBase.go

common/rpc/interceptor/telemetry.go

service/history/historyEngine.go

yiminc · 2022-05-11T03:48:38Z

service/history/workflow/delete_manager.go

+	// Create delete workflow execution task only if workflow is closed successfully and all pending tasks are completed.
+	// Otherwise, mutable state might be deleted before close tasks are executed.
+	// Unfortunately, queue ack levels are updated with delay (default 60s),
+	// therefore this API will return error if workflow is deleted within 60 seconds after close.
+	if (ms.GetExecutionInfo().CloseTransferTaskId != 0 && // backward compatibility (remove in 1.18).
+		ms.GetExecutionInfo().CloseTransferTaskId > transferQueueAckLevel) || // Transfer close task wasn't executed.
+		(ms.GetExecutionInfo().CloseVisibilityTaskId != 0 && // backward compatibility (remove in 1.18).
+			ms.GetExecutionInfo().CloseVisibilityTaskId > visibilityQueueAckLevel) {
+		return consts.ErrWorkflowNotReady
+	}
+


This will fail the DeleteWorkflow API call. I think it is better to let this task in and verify / wait for this during task processing. If the ack level has not been reached, fail the task processing and let it retry automatically.

Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.

Also, the visibility delete task doesn't have access to mutable state, and moving this check there would add a penalty (a DB read, which in the case of Cassandra will go to the deepest internal SS) to every visibility delete operation.
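Whichever place the check ends up in, the condition in the diff above can be isolated as a pure function, which makes the zero-ID backward-compatibility case easier to see. This is a sketch only: `closeTaskIDs` and `closeTasksPending` are hypothetical names, and the IDs and ack levels are plain integers rather than the real server types.

```go
package main

import "fmt"

// closeTaskIDs is a hypothetical struct holding the two task IDs recorded
// when the workflow closes. The PR stores them on the execution info.
type closeTaskIDs struct {
	Transfer   int64
	Visibility int64
}

// closeTasksPending reports whether either close task may still be
// unprocessed. An ID of zero means the ID was never recorded (older
// executions), so that half of the check is skipped.
func closeTasksPending(ids closeTaskIDs, transferAck, visibilityAck int64) bool {
	transferPending := ids.Transfer != 0 && ids.Transfer > transferAck
	visibilityPending := ids.Visibility != 0 && ids.Visibility > visibilityAck
	return transferPending || visibilityPending
}

func main() {
	ids := closeTaskIDs{Transfer: 10, Visibility: 20}
	fmt.Println(closeTasksPending(ids, 9, 20))  // transfer close task not acked yet
	fmt.Println(closeTasksPending(ids, 10, 20)) // both ack levels have reached the IDs
}
```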

service/history/historyEngine.go

yycptt · 2022-05-11T04:10:48Z

proto/internal/temporal/server/api/persistence/v1/executions.proto

+    int64 close_transfer_task_id = 64;
+    int64 close_visibility_task_id = 65;


I am not sure if we want to make this more general, something like lastTaskKey[category], which might be useful for detecting stuck workflows.

But I don't have a strong opinion on this, since I don't have a clear use case in mind right now and it might be a premature optimization.

I considered this option too, but having a map for just 2 ints without a clear understanding of future use seems like too much.
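For comparison, the more general shape floated in this thread might look roughly like the sketch below. It was not adopted in the PR; `taskCategory`, the category constants, and `pendingCategories` are all hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// taskCategory is a hypothetical key type for a per-category map.
type taskCategory string

const (
	categoryTransfer   taskCategory = "transfer"
	categoryVisibility taskCategory = "visibility"
)

// pendingCategories returns, in sorted order, the categories whose last
// recorded task key is still above that category's ack level. A category
// missing from ackLevel is treated as having ack level zero.
func pendingCategories(lastTaskKey, ackLevel map[taskCategory]int64) []taskCategory {
	var pending []taskCategory
	for category, key := range lastTaskKey {
		if key > ackLevel[category] {
			pending = append(pending, category)
		}
	}
	sort.Slice(pending, func(i, j int) bool { return pending[i] < pending[j] })
	return pending
}

func main() {
	last := map[taskCategory]int64{categoryTransfer: 10, categoryVisibility: 20}
	ack := map[taskCategory]int64{categoryTransfer: 10, categoryVisibility: 15}
	fmt.Println(pendingCategories(last, ack))
}
```

The two fixed fields in the proto cover the same two categories with less machinery, which matches the reasoning given for keeping them.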

yycptt · 2022-05-11T04:17:07Z

Both approaches look reasonable to me.
With the delete flag, only tasks for recently closed wf will have to wait, so task processing shouldn't be blocked.

alexshtin requested a review from a team as a code owner May 8, 2022 02:45

yiminc reviewed May 9, 2022

yiminc requested a review from yycptt May 9, 2022 01:59

yycptt reviewed May 9, 2022

alexshtin force-pushed the feature/delete-running-execution branch from 746dcaa to 7760640 Compare May 10, 2022 22:03

yiminc reviewed May 11, 2022

yiminc reviewed May 11, 2022

alexshtin added 6 commits May 10, 2022 21:16

Delete running workflow executions

e64b2b0

Final cleanup

1e0e0cf

Release workflow execution context ASAP

6879e6d

Update comment

da1bc95

Add close task check to DeleteExecution API

e9a5a83

Fix integration test

d07c339

yycptt reviewed May 11, 2022

alexshtin added 2 commits May 11, 2022 12:55

Address feedback

440ab9c

Remove IsWorkflowRunning check

3b6baea

alexshtin force-pushed the feature/delete-running-execution branch from d23228c to 3b6baea Compare May 11, 2022 22:29

alexshtin added 2 commits May 11, 2022 16:42

Fix comment

0fee224

Add unit test

6f8488b

yiminc approved these changes May 12, 2022

Fix unit tests

3d2c7d1

alexshtin merged commit cf4153c into temporalio:master May 12, 2022

alexshtin deleted the feature/delete-running-execution branch May 12, 2022 04:54
