New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix aggressive poller for non-retriable error #977
Conversation
@@ -70,7 +70,8 @@ func isServiceTransientError(err error) bool { | |||
*s.DomainAlreadyExistsError, | |||
*s.QueryFailedError, | |||
*s.DomainNotActiveError, | |||
*s.CancellationAlreadyRequestedError: | |||
*s.CancellationAlreadyRequestedError, | |||
*s.ClientVersionNotSupportedError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this transient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not transient, so return false
} | ||
switch err.(type) { | ||
case *shared.BadRequestError, | ||
*shared.ClientVersionNotSupportedError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientVersionNotSupportedError will not happen in prod. What is the BadRequestError that you want to protect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&gen.BadRequestError{
Message: fmt.Sprintf("binary %v already marked as bad deployment", binaryChecksum),
}
This is the one help me find the bug. When set badbinary for a domain, all workers with that badbinary should stop working, but without this fix it end up in dead loop polling same error.
I checked and believe all other BadRequest that poller can get from PollForDecisionTask
call should also be protected. Because currently once BadRequest happens, it means something like tasklist or domain not correct/exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug on the auto reset? The auto reset happens when a workflow(marked with a bad binary checksum) try to make progress. But the BadRequestError returns from the frontend and workflow will never make progress.
And if we do this, how can it be auto reset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bug on auto reset. As discussed, when different binary (either older or newer) worker out, workflow will be reseted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).
Pull Request Test Coverage Report for Build 43b6091e-3224-46e5-8fdd-7982e2629201
💛 - Coveralls |
@@ -275,7 +275,7 @@ func awaitWaitGroup(wg *sync.WaitGroup, timeout time.Duration) bool { | |||
|
|||
func getKillSignal() <-chan os.Signal { | |||
c := make(chan os.Signal, 1) | |||
signal.Notify(c, os.Interrupt, syscall.SIGTERM) | |||
signal.Notify(c, syscall.SIGINT, syscall.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may lack some context here. But I think it's better to use os
package if we can as it's system independent? Also, from this page (https://golang.org/pkg/syscall/) seems like syscall
package has been deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are same thing so just make it unified, in os.
Interrupt Signal = syscall.SIGINT
} | ||
switch err.(type) { | ||
case *shared.BadRequestError, | ||
*shared.ClientVersionNotSupportedError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).
Currently poller will retry immediately with error response such as BadRequestError.
This doesn't make sense, because all the retry will get same error response until correct poller is deployed.
As a result, server frontend be DDOS by such poller.
An example to repro is: update badbinary for a domain.
This PR fix the issue.
If err is non-retriable, just hang.
With any other kinds of error, do exponential retry