Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestMaintenanceSignal is failing #32

Closed
tgross opened this issue Dec 4, 2015 · 15 comments
Closed

TestMaintenanceSignal is failing #32

tgross opened this issue Dec 4, 2015 · 15 comments

Comments

@tgross
Copy link
Contributor

tgross commented Dec 4, 2015

Our first build on TravisCI failed https://travis-ci.org/joyent/containerbuddy/builds/94954762
The failing test is one that passed when run locally. I believe there's a race in the signal sending set up or tear-down.

=== RUN   TestMaintenanceSignal
2015/12/04 20:13:36 we are paused!
--- FAIL: TestMaintenanceSignal (0.00s)
    signals_test.go:51: Should not be in maintenance mode after receiving second SIGUSR1
=== RUN   TestTerminateSignal
.2015/12/04 20:13:36 Caught SIGTERM
2015/12/04 20:13:36 Deregister service: test-service
--- PASS: TestTerminateSignal (1.00s)

I'm looking into it, but I'm going to cc @justenwalker because he's been in this area of the code recently.

@tgross tgross added the bug label Dec 4, 2015
@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

Yeah, there's definitely still some kind of race in the setup/teardown of the signal handlers. I gave justen bad advice on this. I'll have a PR with a fix shortly.

@justenwalker
Copy link
Contributor

I ran it again and it succeeded - so its a race

https://travis-ci.org/justenwalker/containerbuddy/builds/94959813

Is it the order in which sendAndWaitForSignal blocks?

    <-sig
    runtime.Gosched()

vs.

    runtime.Gosched()
    <-sig

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

With that ordering it fails consistently when I run it locally. I suspect the mental model we're using here is flawed somehow about how signal handlers should be set up.

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

I've got a good workaround and I'm testing it now on a WIP branch... we'll set up the signal handler in a helper and then mark that as such in a global bool in the test module. This way we verify that we've only set it up once; the state should be safe to reuse between the signal tests so long as each test covers different signals.

@justenwalker
Copy link
Contributor

I don't know why, but this seems to work:
Edit: Scratch that. I increased it to 1000 and it dies.

    runtime.Gosched()
    <-sig
    runtime.Gosched()
    signal.Stop(sig) //May not be needed

I added a test that re-runs the test 100 times, and it passes.
Dies with 1000 iterations.

func TestMaintenanceSignalRace(t *testing.T) {
    for i := 0; i < 100; i++ {
        TestMaintenanceSignal(t)
    }
}

@tgross tgross changed the title TestTerminateSignal is failing TestMaintenanceSignal is failing Dec 4, 2015
@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

The only problem with that is that it might only work on your machine (the test that's failing never failed on my machine before, for example, and it still doesn't). If we remove the need to reset the handlers, we can avoid the question entirely. I've got a PR coming to that effect.

@justenwalker
Copy link
Contributor

Figured - thats what I get for coding by coincidence

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

Interesting... apparently I'm totally wrong. It worked the first time I pushed up the commit (on a very dirty work-in-progress branch) and then not afterwards: https://travis-ci.org/tgross/containerbuddy/builds/94967601

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

Working on this is a little tricky because I can't force the bug on my machine.

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

@justenwalker I tried what you'd suggested in https://travis-ci.org/tgross/containerbuddy/builds/94972692 and it results in the same kinds of errors on Travis.

I'm going to do some integration testing on this and make sure that the signal handler is actually working as I expect and that this isn't just a test artifact.

@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

Ok, so fortunately in an integration test this seems to work as we'd expect. That doesn't eliminate a possible race in the implementation that happens when signals arrive close together, of course. But it does lead me to believe that the problem is in the test rig.

@tgross tgross mentioned this issue Dec 4, 2015
@tgross
Copy link
Contributor Author

tgross commented Dec 4, 2015

Think I got it: #33

@tgross
Copy link
Contributor Author

tgross commented Dec 5, 2015

@justenwalker I slept on this one a bit and woke up this morning with some new insights. We're making what is probably a classic mistake of trying to force behavior in the test by forcing the test thread to yield or by sleeping. We're trying to fix a concurrency bug within asynchronous signals, and we can't coerce the signals to be received by the signal package synchronously.

I've spent a little time trying to force synchronicity into this and I think the best approach will be to factor out the signal handlers from the signal receivers such that we can unit test the signal handling code properly without worrying about concurrency, and then just have a simple test that verifies we've wired up the handlers correctly. I'll have a PR inbound for this shortly.

@tgross
Copy link
Contributor Author

tgross commented Dec 5, 2015

I've opened #36 with what I believe to be the correct solution to the problem.

@tgross tgross added the open PR label Dec 5, 2015
@tgross
Copy link
Contributor Author

tgross commented Dec 7, 2015

Fix merged and build is now marked as passing. Thanks for the assist on this, @justenwalker

@tgross tgross closed this as completed Dec 7, 2015
@tgross tgross added fixed and removed open PR labels Dec 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants