Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix aggressive poller for non-retriable error #977

Merged
merged 5 commits into from Jun 4, 2020
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 2 additions & 1 deletion internal/internal_retry.go
Expand Up @@ -70,7 +70,8 @@ func isServiceTransientError(err error) bool {
*s.DomainAlreadyExistsError,
*s.QueryFailedError,
*s.DomainNotActiveError,
*s.CancellationAlreadyRequestedError:
*s.CancellationAlreadyRequestedError,
*s.ClientVersionNotSupportedError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this transient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not transient, so return false

return false
}

Expand Down
2 changes: 1 addition & 1 deletion internal/internal_utils.go
Expand Up @@ -275,7 +275,7 @@ func awaitWaitGroup(wg *sync.WaitGroup, timeout time.Duration) bool {

func getKillSignal() <-chan os.Signal {
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
Copy link
Contributor

@yycptt yycptt Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may lack some context here. But I think it's better to use os package if we can as it's system independent? Also, from this page (https://golang.org/pkg/syscall/) seems like syscall package has been deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are same thing so just make it unified, in os.

Interrupt Signal = syscall.SIGINT

return c
}

Expand Down
20 changes: 19 additions & 1 deletion internal/internal_worker_base.go
Expand Up @@ -28,6 +28,7 @@ import (
"errors"
"fmt"
"sync"
"syscall"
"time"

"github.com/uber-go/tally"
Expand Down Expand Up @@ -261,7 +262,12 @@ func (bw *baseWorker) pollTask() {
if err != nil && enableVerboseLogging {
bw.logger.Debug("Failed to poll for task.", zap.Error(err))
}
if err != nil && isServiceTransientError(err) {
if err != nil {
if isNonRetriableError(err) {
bw.logger.Error("Worker received non-retriable error. Shutting down.", zap.Error(err))
syscall.Kill(syscall.Getpid(), syscall.SIGINT)
return
yycptt marked this conversation as resolved.
Show resolved Hide resolved
}
bw.retrier.Failed()
} else {
bw.retrier.Succeeded()
Expand All @@ -278,6 +284,18 @@ func (bw *baseWorker) pollTask() {
}
}

func isNonRetriableError(err error) bool {
if err == nil {
return false
}
switch err.(type) {
case *shared.BadRequestError,
*shared.ClientVersionNotSupportedError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientVersionNotSupportedError will not happen in prod. What is the BadRequestError that you want to protect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&gen.BadRequestError{
				Message: fmt.Sprintf("binary %v already marked as bad deployment", binaryChecksum),
			}

This is the one help me find the bug. When set badbinary for a domain, all workers with that badbinary should stop working, but without this fix it end up in dead loop polling same error.

I checked and believe all other BadRequest that poller can get from PollForDecisionTask call should also be protected. Because currently once BadRequest happens, it means something like tasklist or domain not correct/exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug on the auto reset? The auto reset happens when a workflow(marked with a bad binary checksum) try to make progress. But the BadRequestError returns from the frontend and workflow will never make progress.

And if we do this, how can it be auto reset?

Copy link
Contributor Author

@vancexu vancexu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bug on auto reset. As discussed, when different binary (either older or newer) worker out, workflow will be reseted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).

return true
}
return false
}

func (bw *baseWorker) processTask(task interface{}) {
defer bw.shutdownWG.Done()
// If the task is from poller, after processing it we would need to request a new poll. Otherwise, the task is from
Expand Down
28 changes: 28 additions & 0 deletions internal/internal_worker_test.go
Expand Up @@ -1320,3 +1320,31 @@ func testEncodeFunctionArgs(dataConverter DataConverter, args ...interface{}) []
}
return input
}

func TestIsNonRetriableError(t *testing.T) {
tests := []struct {
err error
expected bool
}{
{
err: nil,
expected: false,
},
{
err: &shared.ServiceBusyError{},
expected: false,
},
{
err: &shared.BadRequestError{},
expected: true,
},
{
err: &shared.ClientVersionNotSupportedError{},
expected: true,
},
}

for _, test := range tests {
require.Equal(t, test.expected, isNonRetriableError(test.err))
}
}