Fix aggressive poller for non-retriable error #977

vancexu · 2020-05-29T00:34:30Z

Currently the poller retries immediately on error responses such as BadRequestError.
This doesn't make sense, because every retry will get the same error response until a corrected poller is deployed.
As a result, the server frontend gets DDoSed by such pollers.

An example repro: update badbinary for a domain.

This PR fixes the issue.
If the error is non-retriable, just hang.
With any other kind of error, do exponential retry.

internal/internal_worker_base.go

meiliang86 · 2020-05-29T17:19:58Z

internal/internal_retry.go

@@ -70,7 +70,8 @@ func isServiceTransientError(err error) bool {
 		*s.DomainAlreadyExistsError,
 		*s.QueryFailedError,
 		*s.DomainNotActiveError,
-		*s.CancellationAlreadyRequestedError:
+		*s.CancellationAlreadyRequestedError,
+		*s.ClientVersionNotSupportedError:


Why is this transient?

It is not transient, so we return false.

meiliang86 · 2020-05-29T17:20:33Z

internal/internal_worker_base.go

+	}
+	switch err.(type) {
+	case *shared.BadRequestError,
+		*shared.ClientVersionNotSupportedError:


ClientVersionNotSupportedError will not happen in prod. What is the BadRequestError that you want to protect?

&gen.BadRequestError{ Message: fmt.Sprintf("binary %v already marked as bad deployment", binaryChecksum), }

This is the one that helped me find the bug. When badbinary is set for a domain, all workers with that bad binary should stop working, but without this fix they end up in a dead loop polling the same error.

I checked, and I believe all other BadRequest errors that the poller can get from the PollForDecisionTask call should also be protected against, because currently a BadRequest means something like the task list or domain is incorrect or doesn't exist.

Is this a bug in the auto reset? The auto reset happens when a workflow (marked with a bad binary checksum) tries to make progress. But the BadRequestError is returned from the frontend, and the workflow will never make progress.

And if we do this, how can it be auto reset?

Not a bug in auto reset. As discussed, when a worker with a different binary (either older or newer) is rolled out, the workflow will be reset.

See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).

coveralls · 2020-06-03T18:16:04Z

Pull Request Test Coverage Report for Build 43b6091e-3224-46e5-8fdd-7982e2629201

12 of 16 (75.0%) changed or added relevant lines in 3 files are covered.
4 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.002%) to 74.646%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
internal/internal_worker_base.go	10	14	71.43%

Files with Coverage Reduction	New Missed Lines	%
internal/internal_retry.go	2	84.62%
internal/internal_task_pollers.go	2	73.04%

Totals
Change from base Build 1358f684-76bf-4c95-bfaf-22e070553450:	0.002%
Covered Lines:	9283
Relevant Lines:	12436

💛 - Coveralls

yycptt · 2020-06-03T19:49:33Z

internal/internal_utils.go

@@ -275,7 +275,7 @@ func awaitWaitGroup(wg *sync.WaitGroup, timeout time.Duration) bool {

 func getKillSignal() <-chan os.Signal {
 	c := make(chan os.Signal, 1)
-	signal.Notify(c, os.Interrupt, syscall.SIGTERM)
+	signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)


I may lack some context here, but I think it's better to use the os package if we can, as it's system independent? Also, from this page (https://golang.org/pkg/syscall/), it seems like the syscall package has been deprecated.

They are the same thing, so I just made it unified. In os:

Interrupt Signal = syscall.SIGINT

meiliang86 · 2020-06-04T21:29:17Z

internal/internal_worker_base.go

+	}
+	switch err.(type) {
+	case *shared.BadRequestError,
+		*shared.ClientVersionNotSupportedError:


See if you can add a unit test case to test this logic (i.e. mock return bad request and verify that worker got shutdown).

Fix aggressive poller for non-retriable error

7d4cf46

vancexu requested review from meiliang86, yux0, yycptt and emrahs May 29, 2020 00:34

yycptt reviewed May 29, 2020

internal/internal_worker_base.go

Merge branch 'master' into fixpoll

258d644

meiliang86 reviewed May 29, 2020

terminate worker for non-retriable error

18a05c9

vancexu requested a review from yycptt June 3, 2020 18:09

Merge branch 'master' into fixpoll

c58ee6c

vancexu requested a review from meiliang86 June 3, 2020 18:12

yycptt reviewed Jun 3, 2020

Merge branch 'master' into fixpoll

9892de4

vancexu requested a review from yycptt June 4, 2020 18:14

meiliang86 approved these changes Jun 4, 2020

vancexu merged commit 5cc879b into master Jun 4, 2020

vancexu deleted the fixpoll branch June 4, 2020 21:38

vancexu added a commit that referenced this pull request Jun 11, 2020

Fix aggressive poller for non-retriable error (#977)

6ef2bf4


vancexu commented May 29, 2020 (edited)

meiliang86 May 29, 2020

vancexu May 29, 2020

meiliang86 May 29, 2020

vancexu May 29, 2020

yux0 May 29, 2020

vancexu Jun 3, 2020 (edited)

meiliang86 Jun 4, 2020

coveralls commented Jun 3, 2020 (edited)

yycptt Jun 3, 2020 (edited)

vancexu Jun 4, 2020

meiliang86 Jun 4, 2020
