Sending a task to dlq after a number of attempts #5367
The attempt count is currently incremented whenever any error is encountered during task processing, including expected errors such as resource exhausted. I don't think those errors/attempts should be taken into account when deciding whether a task should be sent to the DLQ.
I'd propose maintaining a separate count of unexpected errors and using that to decide whether the task should be put into the DLQ. We should also include that error count in the logs.
That makes sense! I have made the changes.
service/history/queues/executable.go
@@ -398,9 +406,17 @@ func (e *executableImpl) HandleErr(err error) (retErr error) {
return nil
}

e.attemptsWithUnexpectedError++
if e.attemptsWithUnexpectedError >= e.attemptsBeforeSendingToDlq() && e.dlqEnabled() {
e.logger.Error("Marking task as terminally failed after maximum number of attempts with unexpected errors, will send to DLQ", |
Can you extract a constant for the log message used above, "Marking task as terminally failed, will send to DLQ", and reuse it here? Then add a comment that this message should be kept in sync with the one in the operational guide in develop/docs/dlq.md. If the two messages differ, people might miss this log line, because we tell users to search for that exact string when looking for DLQ reasons.
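One way to read this request is sketched below. The constant name, the `logFunc` type, and the `logTerminalFailure` helper are hypothetical and only illustrate sharing one message between call sites; the real code uses the project's own logger and tags.

```go
package main

// terminalFailureLogMessage is logged whenever a task is about to be sent to
// the DLQ. Keep this string in sync with the operational guide in
// develop/docs/dlq.md, which tells users to search for it verbatim.
const terminalFailureLogMessage = "Marking task as terminally failed, will send to DLQ"

// logFunc is a hypothetical stand-in for a structured logger's Error method.
type logFunc func(msg string, keyvals ...any)

// logTerminalFailure always emits the shared message and puts the varying
// details (why the task failed, how many unexpected errors it hit) in fields,
// so a search for the message finds every DLQ decision.
func logTerminalFailure(log logFunc, reason string, attemptsWithUnexpectedError int) {
	log(terminalFailureLogMessage,
		"reason", reason,
		"attempts-with-unexpected-error", attemptsWithUnexpectedError,
	)
}
```

Keeping the reason in a field rather than in the message text means both the terminal-error path and the maximum-attempts path produce the same searchable string.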
(cherry picked from commit 251778c)
What changed?
Adding code to send a task to DLQ after a number of attempts.
This number can be configured through dynamic config.
Why?
A repeatedly failing task can be moved to the DLQ and inspected.
How did you test it?
Unit tests
Potential risks
None
Is hotfix candidate?
No