Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

TestRktStop on stage1-kvm often fails on SemaphoreCI #3091

Closed
lucab opened this issue Aug 18, 2016 · 10 comments
Closed

TestRktStop on stage1-kvm often fails on SemaphoreCI #3091

lucab opened this issue Aug 18, 2016 · 10 comments

Comments

@lucab
Copy link
Member

lucab commented Aug 18, 2016

Recently we are seeing an increasing number of failure on SemaphoreCI, while testing stage1-kvm. They are failing on TestRktStop, as follows:

=== RUN   TestRktStop
--- FAIL: TestRktStop (21.87s)
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image prepare /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/test-tmp/rkt-stop-test.aci
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image run-prepared --interactive b22e5d1a-2a14-44d5-add6-986757de1fbb
    rkt_stop_test.go:92: Running test #0, /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop b22e5d1a-2a14-44d5-add6-986757de1fbb
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop b22e5d1a-2a14-44d5-add6-986757de1fbb
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image prepare /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/test-tmp/rkt-stop-test.aci
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image run-prepared --interactive ad47a77a-f0e7-4442-9bd1-f42afb50c009
    rkt_stop_test.go:92: Running test #1, /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop --force ad47a77a-f0e7-4442-9bd1-f42afb50c009
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop --force ad47a77a-f0e7-4442-9bd1-f42afb50c009
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image prepare /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/test-tmp/rkt-stop-test.aci
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 --insecure-options=image run-prepared --interactive 0e3a3cb9-28c2-4832-872f-b321033252d6
    rkt_stop_test.go:92: Running test #2, /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop --uuid-file=/tmp/rkt-TestRktStop-320355424/uuid-file630404159
    rkt_tests.go:115: Running command: /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/tmp/functional/rkt --dir=/tmp/datadir-393981548 --local-config=/tmp/localdir-097164507 --system-config=/tmp/systemdir-476420222 --user-config=/tmp/userdir-881331141 stop --uuid-file=/tmp/rkt-TestRktStop-320355424/uuid-file630404159
    rkt_stop_test.go:107: Expected pod "0e3a3cb9-28c2-4832-872f-b321033252d6" to be exited, but it is "running"
panic: Failed to remove temporary data directory "/tmp/datadir-393981548": remove /tmp/datadir-393981548/pods/run/0e3a3cb9-28c2-4832-872f-b321033252d6/stage1/rootfs/sys: device or resource busy [recovered]
    panic: Failed to remove temporary data directory "/tmp/datadir-393981548": remove /tmp/datadir-393981548/pods/run/0e3a3cb9-28c2-4832-872f-b321033252d6/stage1/rootfs/sys: device or resource busy

goroutine 418 [running]:
testing.tRunner.func1(0xc820396fc0)
    /usr/local/go/src/testing/testing.go:450 +0x171
github.com/coreos/rkt/tests/testutils.(*dirDesc).cleanup(0xc82020d640)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/tests/testutils/ctx.go:67 +0x27c
github.com/coreos/rkt/tests/testutils.(*RktRunCtx).Cleanup(0xc82020d740)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/tests/testutils/ctx.go:179 +0x1be
runtime.Goexit()
    /usr/local/go/src/runtime/panic.go:320 +0xfb
testing.(*common).FailNow(0xc820396fc0)
    /usr/local/go/src/testing/testing.go:336 +0x36
testing.(*common).Fatalf(0xc820396fc0, 0xc9e360, 0x2a, 0xc8203afe18, 0x2, 0x2)
    /usr/local/go/src/testing/testing.go:379 +0x83
github.com/coreos/rkt/tests.TestRktStop(0xc820396fc0)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/tests/rkt_stop_test.go:107 +0x1254
testing.tRunner(0xc820396fc0, 0x11734c8)
    /usr/local/go/src/testing/testing.go:456 +0x98
created by testing.RunTests
    /usr/local/go/src/testing/testing.go:561 +0x86d

goroutine 1 [chan receive]:
testing.RunTests(0xd1e130, 0x1172c40, 0x60, 0x60, 0xbcfd01)
    /usr/local/go/src/testing/testing.go:562 +0x8ad
testing.(*M).Run(0xc82003df08, 0xc820164686)
    /usr/local/go/src/testing/testing.go:494 +0x70
main.main()
    github.com/coreos/rkt/tests/_test/_testmain.go:276 +0x116

goroutine 17 [syscall, 31 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 156 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc82002c860, 0xc8203bbe60, 0xc82002c868)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 158 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc82002c8d8, 0xc82057ec00, 0xc82002c8e0)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 138 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc820112188, 0xc820132db0, 0xc820112190)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 170 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc820106490, 0xc820550720, 0xc820106498)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 168 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc8201063e8, 0xc82038f6e0, 0xc8201063f0)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 166 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc820106360, 0xc82038e7b0, 0xc820106368)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 160 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc82002c960, 0xc82057fc50, 0xc82002c968)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 140 [chan receive, 15 minutes]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels.func2(0xc820112238, 0xc820133e30, 0xc820112240)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:177 +0x5e
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).AsyncInteractChannels
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:190 +0x127

goroutine 354 [chan send]:
github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).expectTimeoutRegexFind.func2(0xdf8475800, 0xc8201109f0, 0xc788e0, 0x21, 0xc820224ed0, 0xc820015500)
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:239 +0x387
created by github.com/coreos/rkt/vendor/github.com/coreos/gexpect.(*ExpectSubprocess).expectTimeoutRegexFind
    /home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.13.0+git/gopath/src/github.com/coreos/rkt/vendor/github.com/coreos/gexpect/gexpect.go:240 +0x17c
exit status 2
FAIL    github.com/coreos/rkt/tests 1897.924s

One such example is https://semaphoreci.com/coreos/rkt/branches/pull-request-3090/builds/1

@lucab
Copy link
Member Author

lucab commented Aug 18, 2016

@coreos/rkt-kvm-maintainers can you please have a look at it? This is now popping up quite often.

@grahamwhaley
Copy link

@lucab I tried to force this to happen locally by shrinking the memory for kvm - no luck so far (but I will try a little more). I went to have a look at the semaphore link above, but it is 404 for me - maybe it has expired... can you give me a very brief description of they core symptom (whilst I get used to decoding the test fails) - is it a hang up and timeout?

@lucab
Copy link
Member Author

lucab commented Dec 2, 2016

@grahamwhaley semaphore logs get garbage collected after a while, but the full log is above. The main failure is:

rkt_stop_test.go:107: Expected pod "0e3a3cb9-28c2-4832-872f-b321033252d6" to be exited, but it is "running"

Which means that the lvkm process on the host is still running for some reason (it holds a pod lock, thus the pod is still marked as running). My initial guess is that systemctl halt in the VM is failing, perhaps due to the thread limit above.

However I also realized that the timeout for this test is not too high (5s), so I just raised it to 15s in 46e9c15 to make sure it is not just a matter of bad timing.

@grahamwhaley
Copy link

@lucab Right - I've been trying to figure out if the still running/locked status was maybe just because the timeout was too short. I did some experiments with semaphoreci on my repo clone and saw timeouts of upto 2s so far, but have more investigation to do. I was going to ask if we might just want to bump the timeout up and try it, but was going to gather some more info first. So, let's see how that new 15s timeout works out, and I will do a little more diagnostic on my semaphoreci runs.

@grahamwhaley
Copy link

@lucab Could you assign this to me and make it 'in progress' please? I've been slowly trying to beat it to death, but am not sure i have any more clues. I can at least make it fail every time, and have it fail with either lkvm or qemu. I am adding more and more debug and doing more and more semaphoreCI runs, and hopefully one day I'll capture a clue...

@grahamwhaley
Copy link

@lucab, btw, I'm pretty sure the extended timeout is not the fix - with my modified tree I have an extended timeout at 1m5s, and I still see failures (as I also extended the test to run the loops over and over 50 times or so to try and force it more often).

@grahamwhaley
Copy link

I may have stumbled on a clue/workaround for this whilst debugging...

It appears that if we wait for the stop command to complete before we go into the status poll loop, then we don't hang up (during my testing at least). At present we do a bit of a fire-and-forget of the stop, and go straight into the poll loop. I therefore suspect maybe there is a race or bad interaction somewhere in there.

I came across this whilst trying to capture more output from the stop/SSH command to see if it were somehow failing in the hangup/timeout case. Ultimately, if I add the following:

diff --git a/tests/rkt_stop_test.go b/tests/rkt_stop_test.go
index 26bae4d..99ee076 100644
--- a/tests/rkt_stop_test.go
+++ b/tests/rkt_stop_test.go
@@ -367,7 +367,9 @@ func TestRktStop(t *testing.T) {
                runCmd := fmt.Sprintf("%s --debug %s %s", ctx.Cmd(), tt.cmd, args)
                t.Logf("Running test #%d, %s", i, runCmd)
                spawnTime := time.Now()
-               spawnOrFail(t, runCmd)
+               stopExitStatus := 0
+               // Let's try waiting for the stop command to finish executing
+               spawnAndWaitOrFail(t, runCmd, stopExitStatus)
 
                // Make sure the pod is stopped
                exitedSuccessfully := false

then in my always-fails test (extending the rkt_stop test to run it's set of 3 tests 9 times - so 47 stops), it now appears to work for both qemu and lkvm.

Much though I'm happy that maybe I found a workaround, I'm never happy if I don't understand why... so, before I complete some more testing and raise a PR ... @lucab and @jjlakis - any thoughts on what might be going on here?

@lucab
Copy link
Member Author

lucab commented Jan 5, 2017

@grahamwhaley that's a really interesting finding, and I may have an explanation for that.

It looks like the test is gexpect-spawning multiple command but never draining their tty. In a way similar to a04bfbf, the child may get stuck as the parent is not draining its output.

@grahamwhaley
Copy link

Aha, thanks @lucab - in which case, it looks like spawnAndWaitOrFail() might be the right solution - if you concur then I'll draft a PR

@lucab
Copy link
Member Author

lucab commented Jan 5, 2017

Yes, that should at least make the testing code proper. With some luck, we may have reached the end of this :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

No branches or pull requests

2 participants