Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix aggressive poller for non-retriable error #977

Merged
merged 5 commits into from Jun 4, 2020
Merged

Fix aggressive poller for non-retriable error #977

merged 5 commits into from Jun 4, 2020

Conversation

vancexu
Copy link
Contributor

@vancexu vancexu commented May 29, 2020

Currently poller will retry immediately with error response such as BadRequestError.
This doesn't make sense, because all the retry will get same error response until correct poller is deployed.
As a result, server frontend be DDOS by such poller.

An example to repro is: update badbinary for a domain.

This PR fix the issue.
If err is non-retriable, just hang.
With any other kinds of error, do exponential retry

@@ -70,7 +70,8 @@ func isServiceTransientError(err error) bool {
*s.DomainAlreadyExistsError,
*s.QueryFailedError,
*s.DomainNotActiveError,
*s.CancellationAlreadyRequestedError:
*s.CancellationAlreadyRequestedError,
*s.ClientVersionNotSupportedError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this transient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not transient, so return false

}
switch err.(type) {
case *shared.BadRequestError,
*shared.ClientVersionNotSupportedError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientVersionNotSupportedError will not happen in prod. What is the BadRequestError that you want to protect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&gen.BadRequestError{
				Message: fmt.Sprintf("binary %v already marked as bad deployment", binaryChecksum),
			}

This is the one help me find the bug. When set badbinary for a domain, all workers with that badbinary should stop working, but without this fix it end up in dead loop polling same error.

I checked and believe all other BadRequest that poller can get from PollForDecisionTask call should also be protected. Because currently once BadRequest happens, it means something like tasklist or domain not correct/exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug on the auto reset? The auto reset happens when a workflow(marked with a bad binary checksum) try to make progress. But the BadRequestError returns from the frontend and workflow will never make progress.

And if we do this, how can it be auto reset?

Copy link
Contributor Author

@vancexu vancexu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bug on auto reset. As discussed, when different binary (either older or newer) worker out, workflow will be reseted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).

@vancexu vancexu requested a review from yycptt June 3, 2020 18:09
@vancexu vancexu requested a review from meiliang86 June 3, 2020 18:12
@coveralls
Copy link

coveralls commented Jun 3, 2020

Pull Request Test Coverage Report for Build 43b6091e-3224-46e5-8fdd-7982e2629201

  • 12 of 16 (75.0%) changed or added relevant lines in 3 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.002%) to 74.646%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/internal_worker_base.go 10 14 71.43%
Files with Coverage Reduction New Missed Lines %
internal/internal_retry.go 2 84.62%
internal/internal_task_pollers.go 2 73.04%
Totals Coverage Status
Change from base Build 1358f684-76bf-4c95-bfaf-22e070553450: 0.002%
Covered Lines: 9283
Relevant Lines: 12436

💛 - Coveralls

@@ -275,7 +275,7 @@ func awaitWaitGroup(wg *sync.WaitGroup, timeout time.Duration) bool {

func getKillSignal() <-chan os.Signal {
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
Copy link
Contributor

@yycptt yycptt Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may lack some context here. But I think it's better to use os package if we can as it's system independent? Also, from this page (https://golang.org/pkg/syscall/) seems like syscall package has been deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are same thing so just make it unified, in os.

Interrupt Signal = syscall.SIGINT

@vancexu vancexu requested a review from yycptt June 4, 2020 18:14
}
switch err.(type) {
case *shared.BadRequestError,
*shared.ClientVersionNotSupportedError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).

@vancexu vancexu merged commit 5cc879b into master Jun 4, 2020
@vancexu vancexu deleted the fixpoll branch June 4, 2020 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants