Support graceful shutdown in case of multiple errors #617

mohinithakkar

When multiple errors occur, manager.go gets stuck because of the following two issues:

  1. errChan is a buffered channel of size 1, and the code uses a for loop to send every error over it. As a result, the send blocks when len(errs) > 1. Fix: size errChan to len(errs).
  2. The code unlocks the shutdown mutex inside a for loop, so when the loop runs more than once, panic.go gets invoked: after the first iteration, the code tries to unlock an already unlocked mutex. Fix: unlock and shut down outside the for loop.

Added TestStartManager_WithMultipleErrors, which passes on this branch but fails on the base branch (glue-dig-b3).
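
To make the first issue concrete, here is a minimal, self-contained sketch. It is not the code from manager.go: trySendAll is a hypothetical helper that uses a non-blocking send to count how many errors fit in the buffer before a plain send would block.

```go
package main

import (
	"errors"
	"fmt"
)

// trySendAll is a hypothetical helper (not part of manager.go). It tries to
// queue every error on a channel with the given capacity without blocking,
// and reports how many sends succeeded before the buffer filled up.
func trySendAll(errs []error, capacity int) int {
	ch := make(chan error, capacity)
	sent := 0
	for _, err := range errs {
		select {
		case ch <- err:
			sent++
		default:
			// The buffer is full and nobody is receiving, so a plain
			// `ch <- err` would block forever at this point.
			return sent
		}
	}
	return sent
}

func main() {
	errs := []error{
		errors.New("can't start stubModule1"),
		errors.New("can't start stubModule2"),
	}
	fmt.Println(trySendAll(errs, 1))         // prints 1
	fmt.Println(trySendAll(errs, len(errs))) // prints 2
}
```

With two errors, a capacity of 1 accepts only the first send, while a capacity of len(errs) accepts both. The second issue is separate: per the sync package documentation, it is a run-time error to call Unlock on a mutex that is not locked, so each Lock needs exactly one matching Unlock.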

Contributor

@akshayjshah left a comment

This is a really, really good start - thanks for following up with a PR just a few days after learning Go! I'd like a few changes before we merge this. (I'm sorry that some of the existing code doesn't match what I'm asking of you - this project had a large change of ownership and direction between the beta and 1.0.)

Also, please make sure that the build passes. We can't merge anything that doesn't get a clean bill of health from Travis CI.

If any of this is confusing or doesn't make sense, schedule some time to pair with me - my calendar accurately reflects my availability.

@@ -1,4 +1,4 @@
-// Copyright (c) 2017 Uber Technologies, Inc.
+// Copyright (c) 2018 Uber Technologies, Inc.
Contributor

Small nit: this should be 2017-2018, since copyright starts when we initially created the file. (Yes, this is the Official Legal Position on copyright dates. No, we don't actually follow it consistently...but ya gotta start somewhere.)

Author

I see, thanks for sharing the details.

	require.NoError(t, s.addModule(moduleProvider2))

	time.AfterFunc(10*time.Second, func() {
		log.Fatalf("Service dint shut down on its own for over 10 secs so forcefully killing it!")
Contributor

You don't actually need this, since Go tests automatically time out. If you want a shorter timeout, use t.Fatalf instead - that only terminates this test instead of making the whole process exit.

Author

Interesting, it never timed out automatically for me. I'm guessing that's because the default timeout is 10 minutes (https://golang.org/cmd/go/#hdr-Description_of_testing_flags) and I usually gave up waiting after about 5 minutes. Anyway, I updated this to t.Fatalf; good suggestion.

Author

I commented before testing thoroughly, my bad. Using t.Fatalf doesn't kill the underlying stuck Go process running manager.go, so I have to use log.Fatalf. My understanding is that t.Fatalf calls FailNow, which only stops the running test but does not stop other goroutines created during the test. Let me know if you know a better way to handle this.
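
The behavior described here can be reproduced without the testing package. As documented, t.FailNow stops the test by calling runtime.Goexit on the goroutine running the test function. The sketch below calls runtime.Goexit directly; survivesGoexit is a hypothetical name chosen for illustration, and the inner goroutine only stands in for the stuck service.

```go
package main

import (
	"fmt"
	"runtime"
)

// survivesGoexit is a hypothetical helper. It reports whether a goroutine
// started by a "test" goroutine keeps running after that "test" goroutine
// has exited via runtime.Goexit.
func survivesGoexit() bool {
	testDone := make(chan struct{})
	release := make(chan struct{})
	result := make(chan bool, 1)

	go func() { // stands in for the goroutine running the test function
		defer close(testDone) // deferred calls still run on Goexit

		go func() { // stands in for the stuck service goroutine
			<-release      // wait until the "test" goroutine is gone
			result <- true // it can still do work afterwards
		}()

		runtime.Goexit() // what t.FailNow does after marking the test failed
	}()

	<-testDone     // the "test" goroutine has exited
	close(release) // let the child goroutine continue
	return <-result
}

func main() {
	fmt.Println(survivesGoexit()) // prints true
}
```

Only the goroutine that calls runtime.Goexit stops; anything it started keeps running until it returns on its own or the process exits.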

Contributor

I'd still prefer to remove this. Now that you're done debugging, we shouldn't need this; the tests will only get stuck if there's a bug. If we're running into this, I'd rather adjust the default timeout via go test command-line flags - it'll be chaos if every test chooses a timeout.

	control := s.StartAsync()
	go func() {
		<-control.ExitChan
	}()
Contributor

I don't think that this needs to be in a separate goroutine. You should be able to:

control := s.StartAsync()
<-control.ExitChan
require.Error(t, control.ServiceError)
assert.Contains(t, control.ServiceError.Error(), "can't start stubModule1")
assert.Contains(t, control.ServiceError.Error(), "can't start stubModule2")

If this hangs (which I expect it to, given the code above), there's a bug.

(And yes, I realize that this is copy-pasted from the existing test. Sorry.)

Author

Glad you called it out. I never really understood the motivation for this separate goroutine, but since I was lacking context, I didn't touch it.

}

// emit all errors over the errChan channel for logging purposes
errChan := make(chan Exit, len(errs))
Contributor

I think we can actually make this quite a bit simpler: rather than attempting to send separate errors for each failing module, we can compose all failures into a single error. That should eliminate most of the complexity in handling channels and mutexes.

Use go.uber.org/multierr, and try something like this:

if len(errs) > 0 {
  combined := multierr.Combine(errs...)
  errChan := make(chan Exit, 1)
  errChan <- Exit{ ...stuff... }
  zap.L().Error("Error starting modules", zap.Error(combined))
  m.shutdownMu.Unlock()
  return Control{...stuff...}
}

Since that brings some sanity to the mutex handling (lock once at the beginning of the method, unlock before every return), we can remove the duplicated unlocks and change the beginning of the method to

m.shutdownMu.Lock()
defer m.shutdownMu.Unlock()
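
Put together, the suggestion might look like the following sketch. The manager, exit, and startError types and the start method are hypothetical stand-ins, not the project's real API, and the standard library's errors.Join (Go 1.20+) is swapped in for multierr.Combine so the example has no external dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// startError is a hypothetical error type for a module that failed to start.
type startError string

func (e startError) Error() string { return string(e) }

// exit is a hypothetical stand-in for the Exit value sent on the channel.
type exit struct {
	Error error
}

// manager is a hypothetical stand-in for the real manager type.
type manager struct {
	shutdownMu sync.Mutex
}

// start locks once, defers the unlock, and reports all start failures as a
// single combined error on a channel with capacity 1.
func (m *manager) start(errs []error) <-chan exit {
	m.shutdownMu.Lock()
	defer m.shutdownMu.Unlock() // runs exactly once on every return path

	exitChan := make(chan exit, 1)
	// Assumption: errors.Join replaces multierr.Combine here. It returns
	// nil when there are no non-nil errors to combine.
	if combined := errors.Join(errs...); combined != nil {
		// A single send into a size-1 buffer cannot block.
		exitChan <- exit{Error: combined}
	}
	return exitChan
}

func main() {
	m := &manager{}
	ch := m.start([]error{
		startError("can't start stubModule1"),
		startError("can't start stubModule2"),
	})
	e := <-ch
	fmt.Println(e.Error) // one error value carrying both messages
}
```

Because there is only one send and one deferred unlock, the channel capacity no longer depends on len(errs) and the mutex cannot be unlocked twice.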

Author

Thanks for introducing me to go.uber.org/multierr, it's great! I cleaned up the code as you suggested; it's so much easier to read now :)

@akshayjshah
Contributor

Looks like the build is failing because of an unrelated (and now unnecessary) lint check - I'll open a PR to remove that check.

@mohinithakkar force-pushed the UFMD-262_enable_graceful_shutdown_incase_of_multiple_erros branch from 0b5993a to 78c5579 on April 9, 2018 21:28
@coveralls commented Apr 9, 2018

Coverage decreased (-0.3%) to 91.429% when pulling eb5497d on mohinithakkar:UFMD-262_enable_graceful_shutdown_incase_of_multiple_erros into b470eaf on uber-go:glue-dig-b3.

@akshayjshah
Contributor

Two small nits, otherwise looks great.

@@ -35,6 +35,7 @@ import:
- package: github.com/uber/cherami-thrift
  subpackages:
  - .generated/go/cherami
- package: go.uber.org/multierr
Contributor

Please pin a version here:

- package: go.uber.org/multierr
  version: ^1

@akshayjshah merged commit 048cd91 into uber-go:glue-dig-b3 on Apr 10, 2018