TestRktStop on stage1-kvm often fails on SemaphoreCI #3091
Comments
@coreos/rkt-kvm-maintainers can you please have a look at it? This is now popping up quite often. |
@lucab I tried to force this to happen locally by shrinking the memory for kvm - no luck so far (but I will try a little more). I went to have a look at the semaphore link above, but it is 404 for me - maybe it has expired... can you give me a very brief description of the core symptom (whilst I get used to decoding the test fails) - is it a hang and timeout? |
@grahamwhaley semaphore logs get garbage collected after a while, but the full log is above. The main failure is:
Which means that the lkvm process on the host is still running for some reason (it holds a pod lock, thus the pod is still marked as running). My initial guess is that However I also realized that the timeout for this test is not too high (5s), so I just raised it to 15s in 46e9c15 to make sure it is not just a matter of bad timing. |
@lucab Right - I've been trying to figure out if the still running/locked status was maybe just because the timeout was too short. I did some experiments with semaphoreci on my repo clone and saw timeouts of up to 2s so far, but have more investigation to do. I was going to ask if we might just want to bump the timeout up and try it, but was going to gather some more info first. So, let's see how that new 15s timeout works out, and I will do a little more diagnostics on my semaphoreci runs. |
@lucab Could you assign this to me and make it 'in progress' please? I've been slowly trying to beat it to death, but am not sure I have any more clues. I can at least make it fail every time, and have it fail with either |
@lucab, btw, I'm pretty sure the extended timeout is not the fix - with my modified tree I have an extended timeout at 1m5s, and I still see failures (as I also extended the test to run the loops over and over 50 times or so to try and force it more often). |
I may have stumbled on a clue/workaround for this whilst debugging... It appears that if we wait for the I came across this whilst trying to capture more output from the
then in my always-fails test (extending the rkt_stop test to run its set of 3 tests 9 times - so 47 Much though I'm happy that maybe I found a workaround, I'm never happy if I don't understand why... so, before I complete some more testing and raise a PR ... @lucab and @jjlakis - any thoughts on what might be going on here? |
@grahamwhaley that's a really interesting finding, and I may have an explanation for that. It looks like the test is gexpect-spawning multiple commands but never draining their tty. In a way similar to a04bfbf, the child may get stuck as the parent is not draining its output. |
Aha, thanks @lucab - in which case, it looks like |
Yes, that should at least make the testing code proper. With some luck, we may have reached the end of this :) |
Recently we are seeing an increasing number of failures on SemaphoreCI, while testing stage1-kvm. They are failing on TestRktStop, as follows:
One such example is https://semaphoreci.com/coreos/rkt/branches/pull-request-3090/builds/1