Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix archival activities error handling #3227

Merged
merged 4 commits into from
Aug 15, 2022

Conversation

yycptt
Copy link
Member

@yycptt yycptt commented Aug 14, 2022

What changed?

  • Do not mark all activities errors as non-retryable 馃槥
  • Change the condition for what errors is non-retryable for workflow deletion, more specifically, workflow not ready error and context errors (like context deadline exceeded, since each new attempt get a new startToClose timeout) should be retryable.

Why?

  • Bug fix

How did you test it?

  • Tested locally
  • Updated unit test

Potential risks

Is hotfix candidate?

  • Yes.

@yycptt yycptt requested review from yux0 and yiminc August 14, 2022 07:21
@yycptt yycptt requested a review from a team as a code owner August 14, 2022 07:21
if !common.IsPersistenceTransientError(err) {

if _, ok := err.(*serviceerror.WorkflowNotReady); !ok {
logger := tagLoggerWithHistoryRequest(tagLoggerWithActivityInfo(container.Logger, activity.GetInfo(ctx)), &request)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppress logs for notReady error which is expected for deleteWorkflowExecution call. The error will still be logged if all retry fails.

@@ -167,7 +167,7 @@ func (h *handler) handleHistoryRequest(ctx workflow.Context, request *ArchiveReq
localActCtx := workflow.WithLocalActivityOptions(ctx, lao)
err = workflow.ExecuteLocalActivity(localActCtx, deleteHistoryActivity, *request).Get(localActCtx, nil)
if err != nil {
logger.Error("deleting history failed, this means zombie histories are left", tag.Error(err))
logger.Error("deleting workflow execution failed all retires, skip workflow deletion", tag.Error(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the workflow data stay in Db if it just skip the deletion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Archival workflow will only retry up to 5min for both uploading and deletion and then give up. This is the limitation of the existing archival design. The issue will be gone once we have a separate archival queue.

For now, user should monitor the metrics for archival delete non-retryable error and use admin wf del command to manually delete those workflows from DB.

Comment on lines 66 to 67
} else {
err = temporal.NewApplicationError(err.Error(), "", nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is default error to ApplicationError conversion. I would just return err here.

Suggested change
} else {
err = temporal.NewApplicationError(err.Error(), "", nil)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And everywhere bellow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Then no change is needed to err and now the defer looks like the following:

	defer func() {
		sw.Stop()
		if err == errUploadNonRetryable {
			scope.IncCounter(metrics.ArchiverNonRetryableErrorCount)
		}
	}()

@yycptt yycptt merged commit 4873231 into temporalio:master Aug 15, 2022
@yycptt yycptt deleted the fix-archival-del branch August 15, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants