Support graceful shutdown incase of multiple errors #617

mohinithakkar · 2018-03-31T00:19:29Z

When more than one module fails to start, manager.go gets stuck because of the two issues below:

1. errChan is a buffered channel of size 1, and the code uses a for loop to emit all errors over it. As a result, the send blocks when len(errs) > 1. Fix: size errChan to len(errs).
2. The code unlocks the shutdown mutex inside the for loop, so when the loop runs more than once it panics: after the first iteration, the code tries to unlock an already unlocked mutex. Fix: unlock and shut down outside the for loop.

Added a test, TestStartManager_WithMultipleErrors, that passes on this branch but fails on the base branch (glue-dig-b3).

akshayjshah

This is a really, really good start - thanks for following up with a PR just a few days after learning Go! I'd like a few changes before we merge this. (I'm sorry that some of the existing code doesn't match what I'm asking of you - this project had a large change of ownership and direction between the beta and 1.0.)

Also, please make sure that the build passes. We can't merge anything that doesn't get a clean bill of health from Travis CI.

If any of this is confusing or doesn't make sense, schedule some time to pair with me - my calendar accurately reflects my availability.

akshayjshah · 2018-04-02T22:44:59Z

service/manager.go

@@ -1,4 +1,4 @@
-// Copyright (c) 2017 Uber Technologies, Inc.
+// Copyright (c) 2018 Uber Technologies, Inc.


Small nit: this should be 2017-2018, since copyright starts when we initially created the file. (Yes, this is the Official Legal Position on copyright dates. No, we don't actually follow it consistently...but ya gotta start somewhere.)

I see, thanks for sharing the details.

akshayjshah · 2018-04-02T22:49:22Z

service/manager_test.go

+	require.NoError(t, s.addModule(moduleProvider2))
+
+	time.AfterFunc(10*time.Second, func() {
+		log.Fatalf("Service dint shut down on its own for over 10 secs so forcefully killing it!")


You don't actually need this, since go tests automatically time out. If you want a shorter timeout, use t.Fatalf instead - that only terminates this test instead of making the whole process exit.

Interesting, it never timed out automatically for me. I'm guessing that's because the default timeout is 10 minutes (https://golang.org/cmd/go/#hdr-Description_of_testing_flags) and I usually gave up waiting after about 5 minutes. Anyway, updated this to t.Fatalf; good suggestion.

I commented before testing thoroughly, my bad. Using t.Fatalf doesn't kill the underlying stuck Go process running manager.go, so I have to use log.Fatalf. My understanding is that t.Fatalf calls FailNow, which only stops the running test and does not stop other goroutines created during the test. Let me know if you know a better way to handle this.

I'd still prefer to remove this. Now that you're done debugging, we shouldn't need this; the tests will only get stuck if there's a bug. If we're running into this, I'd rather adjust the default timeout via go test command-line flags - it'll be chaos if every test chooses a timeout.

akshayjshah · 2018-04-02T22:53:38Z

service/manager_test.go

+	control := s.StartAsync()
+	go func() {
+		<-control.ExitChan
+	}()


I don't think that this needs to be in a separate goroutine. You should be able to:

    <- s.StartAsync()
    require.Error(t, control.ServiceError)
    assert.Contains(t, control.ServiceError.Error(), "can't start stubModule1")
    assert.Contains(t, control.ServiceError.Error(), "can't start stubModule2")

If this hangs (which I expect it to, given the code above), there's a bug.

(And yes, I realize that this is copy-pasted from the existing test. Sorry.)

Glad you called it out. I never really understood the motivation for this separate goroutine, but since I was lacking context I didn't touch it.

akshayjshah · 2018-04-02T23:20:12Z

service/manager.go

+			}
+
+			// emit all errors over the errChan channel for logging purposes
+			errChan := make(chan Exit, len(errs))


I think we can actually make this quite a bit simpler: rather than attempting to send separate errors for each failing module, we can compose all failures into a single error. That should eliminate most of the complexity in handling channels and mutexes.

Use go.uber.org/multierr, and try something like this:

    if len(errs) > 0 {
        combined := multierr.Combine(errs...)
        errChan := make(chan Exit, 1)
        errChan <- Exit{ ...stuff... }
        zap.L().Error("Error starting modules", zap.Error(combined))
        m.shutdownMu.Unlock()
        return Control{...stuff...}
    }

Since that brings some sanity to the mutex handling (lock once at the beginning of the method, unlock before every return), we can remove the duplicated unlocks and change the beginning of the method to

    m.shutdownMu.Lock()
    defer m.shutdownMu.Unlock()

Thanks for introducing me to go.uber.org/multierr, it's great! Cleaned up the code as you suggested, so much easier to read now :)

akshayjshah · 2018-04-05T23:51:43Z

Looks like the build is failing because of an unrelated (and now unnecessary) lint check - I'll open a PR to remove that check.

coveralls · 2018-04-09T21:31:55Z

Coverage decreased (-0.3%) to 91.429% when pulling eb5497d on mohinithakkar:UFMD-262_enable_graceful_shutdown_incase_of_multiple_erros into b470eaf on uber-go:glue-dig-b3.

akshayjshah · 2018-04-09T22:23:20Z

Two small nits, otherwise looks great.

akshayjshah · 2018-04-09T22:20:30Z

glide.yaml

@@ -35,6 +35,7 @@ import:
 - package: github.com/uber/cherami-thrift
  subpackages:
  - .generated/go/cherami
+- package: go.uber.org/multierr


Please pin a version here:

- package: go.uber.org/multierr
  version: ^1

akshayjshah suggested changes Apr 2, 2018

mohinithakkar added 4 commits April 9, 2018 14:21

Support graceful shutdown incase of multiple errors

dcc341c

moving if statement outside for loop for readability

9a47e0e

moving if statement outside for loop for readability

aed0047

using multierr to clean up code

78c5579

mohinithakkar force-pushed the UFMD-262_enable_graceful_shutdown_incase_of_multiple_erros branch from 0b5993a to 78c5579 Compare April 9, 2018 21:28

akshayjshah reviewed Apr 9, 2018

Pinning multierr & removing code forcefully killing the test

eb5497d

akshayjshah approved these changes Apr 10, 2018

akshayjshah merged commit 048cd91 into uber-go:glue-dig-b3 Apr 10, 2018
