Errors in the IO threads can cause everything to hang, still #351

Closed
bitprophet opened this Issue May 28, 2016 · 15 comments

@bitprophet
Member

Running into this under #350 - if the encoding code changes in a way that throws errors within the IO loops, things tend to hang until Ctrl-C'd.

Feels like it's the same problem as #289 (comment) (and indeed, I am only able to notice it when I turn debug logging on - but it made it very obvious very fast, so yay past me I guess), except this is for commands definitely not requiring user input (spec, i.e. our test suite), so something else is going on.

Offhand, wondering if it's something to do with the IO thread not consuming from the subprocess' pipe and said pipe filling up, which I think would be another way to keep the child from ever exiting (putting us back in limbo). This sounds super familiar, so I bet it's come up before and I just can't find it when I search.
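
Roughly the failure mode I'm picturing, boiled down to a standalone sketch (not Invoke code, just the underlying pipe behavior):

```python
import subprocess

# If nothing drains the child's stdout pipe, the OS pipe buffer (often ~64KB)
# fills up, the child blocks on write(), and a plain wait() never returns.
proc = subprocess.Popen(
    ["cat", "/usr/share/dict/words"],  # anything producing lots of output
    stdout=subprocess.PIPE,
)
# Pretend our reader thread died here, so proc.stdout is never read...
proc.wait()  # ...and this hangs forever, because cat can never finish writing.
```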

I also don't see a great way to actually handle this, because "child process is blocking on our own ability to read from it" is indistinguishable from "child process is sitting around waiting for something legit". Unless we can make Runner.wait check for its IO threads' health and abort early if it notices they've died...will look into that.
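
Something like this is what I have in mind for that check (a sketch only - the thread-collection and polling names are assumptions, not current API):

```python
import time

def wait(self):
    # Sketch: poll both the subprocess and our own IO workers, instead of
    # blocking solely on the subprocess.
    while True:
        if self.process_is_finished():  # assumed "has the child exited?" hook
            break
        if any(not t.is_alive() for t in self.threads):  # an IO worker died
            break
        time.sleep(0.01)
```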


Side note: at first I thought it was related to #275 but a) I was unable to recreate the issue there, b) inspecting the behavior of the forked pre-execve code doesn't show that it's excepting in this case (it's the parent process' IO threads only) and c) I can get this both with pty on and off.

@bitprophet
Member
bitprophet commented May 28, 2016 edited

Think I have a working test for this; sure, it makes the test suite hang instead of error, but I honestly don't see a good way to surface it otherwise (besides a timer thread or whatever, but I don't care hard enough right now). Travis will surface it via its own native timeout; local users will notice pretty fast, since even the integration suite takes <10s to run normally.

(Downside is it requires knowing where a good, large textfile is to cat, but meh. Both Darwin and Travis' Docker image have /usr/share/dict/words, so.)
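
Rough shape of the test, for posterity (the patch target and naming here are illustrative guesses, not the actual suite code):

```python
from mock import patch
from invoke import run

def _explode(*args, **kwargs):
    # Stand-in for the encoding bug: any exception inside the IO loop works.
    raise UnicodeDecodeError("utf-8", b"", 0, 1, "forced failure")

def output_thread_errors_should_not_hang_forever():
    # Lots of output plus a handler that blows up; if the bug is present this
    # never returns, and the suite hangs until an outer timeout kills it.
    with patch("invoke.runners.Runner._handle_output", _explode):
        run("cat /usr/share/dict/words", hide=True)
```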

On to the solution...

@bitprophet
Member

Hm, one wrinkle in my plan: while we control the wait-loop for the PTY use case, we do not control it for the regular-subprocess use case (it's buried in Popen.wait). Hopefully we can just switch to Popen.poll instead...
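
I.e. something along these lines on the non-PTY side (sketch; assumes the Popen handle is stashed on self.process):

```python
def process_is_finished(self):
    # Popen.poll() is non-blocking: it returns the exit code if the child has
    # exited, or None if it's still running - unlike Popen.wait(), which
    # blocks and takes the loop out of our hands.
    return self.process.poll() is not None
```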

@bitprophet
Member

Also, getting this working (again, if it even DOES work) will require some code shuffling, since at the moment the entirety of Runner.wait is abstract, but this "side" of the problem (testing whether the IO threads have died) should be in the parent class.

@bitprophet
Member
bitprophet commented May 28, 2016 edited

Hrm, even in the PTY branch the while loop seems to not do jack: os.waitpid appears to be blocking, when the loop & comments within imply it would not. But the docs say nothing about this "it returns right away with pid==0" stuff - 99.9% sure we were following older code that used WNOHANG, which would give the nonblocking, return-0 behavior.

Unfortunately, the comment links to https://github.com/pexpect/ptyprocess/blob/4058faa05e2940662ab6da1330aa0586c6f9cd9c/ptyprocess/ptyprocess.py#L680-L687 which implies that WNOHANG doesn't work right on Linux; further, the docs for os.waitpid themselves say that the options param doesn't do jack on Windows.

I was apparently suspicious in the past about that assertion re: WNOHANG, and in basic testing on Debian, it does work just as well as on Darwin, implying the pexpect note is outdated or something. That still leaves us with Windows as a possible issue, but I wonder if Windows even has this particular problem re: pipes filling up (wouldn't be too surprised if it did, but, all bets are really off).

Looked for an alternative to using the 'nonblocking wait' strategy, but I don't see anything in os for checking proc status that is reliably cross-platform yet also allows accessing exit status; even e.g. psutil only seems to offer variants on wait.

So for now, I think "nonblocking wait" is gonna have to be it; if this means this bug is non-solvable on Windows, so be it. Ideally it will only ever be surfaced by other, legit, fixable bugs in the IO threads, so.
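
For the record, the "nonblocking wait" here means the WNOHANG flavor of os.waitpid - roughly this (sketch; assumes we've stashed the forked child's pid):

```python
import os

def process_is_finished(self):
    # With WNOHANG, waitpid returns immediately: (0, 0) while the child is
    # still running, or (pid, status) once it has exited.
    pid, status = os.waitpid(self.pid, os.WNOHANG)
    if pid == 0:
        return False
    self.returncode = os.WEXITSTATUS(status)  # assumes a normal (non-signal) exit
    return True
```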

@bitprophet
Member

Have this working well for the PTY use case.

For the non-PTY case, I have it "working" insofar as I appear able to poll the subprocess' exit state as guessed above - so Runner.wait correctly notices the IO thread dying and aborts.

However, the stderr thread - which is active here, but is not in the PTY case - is hanging on join, and doesn't seem to be triggering the exception log in ExceptionHandlingThread. Due to how I am testing this (causing an exception in _handle_output), both threads should behave identically, so something is goofy here. Suspect some dumb assumptions somewhere. Digging.
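
For context, ExceptionHandlingThread is meant to behave roughly like this (heavily simplified sketch, not the real class):

```python
import sys
import threading

class ExceptionHandlingThread(threading.Thread):
    # Simplified sketch: run the target as usual, but capture any exception
    # so the parent can inspect it after join() instead of it silently
    # vanishing with the thread.
    def run(self):
        self.exc_info = None
        try:
            super(ExceptionHandlingThread, self).run()
        except BaseException:
            # The real class also debug-logs here (the "exception log" above).
            self.exc_info = sys.exc_info()
```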

@bitprophet
Member
  • The stderr thread is hanging on its initial attempt to read - which is actually before the forced exception would even occur (it was, arbitrarily, stuck just prior to the yield/encode).
  • Normally, in the situation under test (where there is no stderr to speak of) that thread eventually gets a single read result of an empty byte-string.
  • So our "pack it in early due to the stdout's explosion" behavior is probably allowing the subprocess to close up shop before we even get that initial stderr pipe read, or something to that effect?
  • An interesting wrinkle: when I changed cat words to cat words | head -n10 while testing the 'normally' situation above...I forgot to change it back; re-enabling the fake encoding explosion then no longer caused the stderr hang - it happily gets its empty byte on read.
    • Sure enough, in the still-failing case, the cat process is still hung, even though our stdout thread has been joined.
    • So the issue here is less "shut down the threads" and more "allow our own flow to reach the point where we raise an exception & terminate execution". I.e. the subprocess can't ever actually quit (and thus, submit its empty byte / closed-the-pipe event to our stderr reader) if we can't consume all of its output.
    • Implies the only way out is for ourselves to exit & thus cause the subprocess to die with us (or, I guess, to forcibly terminate it). This probably has implications for the Fabric 2 remote runner, and/or anybody who is running run in a library context & catching exceptions.
  • Still gotta figure out how to prevent this read deadlock; offhand the only real solution is to use a timeout when joining the out/err threads, then see if any have stored exceptions (see the sketch after this list):
    • If any do, it's probably safe to assume this scenario has occurred & that, since the timeout's expired, we're not getting jack from the non-excepting threads.
    • If none do...I dunno what that would mean, as it should be impossible: we only reach the join calls once wait is over, and wait should only exit when the subprocess exits or has_dead_threads is True.
      • There are certainly race conditions possible where stuff is still happening in a non-excepting post-wait world - but that's the point of having a non-minuscule timeout value here. (I'd guess a default of 1-2s would be enough, and we can make it configurable.)
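
In code, that plan looks roughly like this (thread/attribute names and the wrapper exception are assumptions):

```python
# Sketch of the join-then-check plan; attribute names are assumptions.
exceptions = []
for thread in (self.stdout_thread, self.stderr_thread):
    thread.join(self.join_timeout)  # e.g. 1-2s, possibly configurable
    if getattr(thread, "exc_info", None):
        exceptions.append(thread.exc_info)
if exceptions:
    # At least one IO worker blew up; assume any still-alive sibling is stuck
    # on a pipe that will never deliver, and bail out with an error.
    raise ThreadException(exceptions)  # hypothetical wrapper exception
```
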
@bitprophet
Member

Got that working pretty well, including refactoring - so now wait is concrete in Runner and only the "is the subprocess still running?" bits remain abstract. Seems good.

@bitprophet
Member

Grump. Thought this was good, merged it to master, merged that back into the branch for #350, and now the join timeout seems to be doing the Bad Things I was worried about (if the subproc runs for more than approximately the timeout value, everything just stops w/o error).

Bizarrely, on master I can't recreate it - only in the combined master+#350 branch - and I don't see offhand why the diff there would cause this. Time to dig.

Hoping it's something dumb and easily fixed because otherwise I don't see a good solution for the corner case here (no PTY, one worker dies, other worker hangs out forever), only crappy solutions like "manually kill the subprocess" (maybe that's not SO crappy) or "ignore post-read errors in IO threads so that even if they're super broken, they still read enough times to flush the pipe".

@bitprophet
Member
bitprophet commented May 31, 2016 edited

Seems to be an issue with stdin: its thread is dying almost immediately in the encoding branch, which is triggering the "do we have dead threads" check and causing the early termination.

Guessing it's to do with the reorg/split of read / encode concerns, still digging.

If there's no real bug here, I can simply update the dead-thread check to only look at stdout/err, since that was the original intent anyways. The stdin thread won't cause the deadlock problems with the subprocess & will auto-terminate when the flag is set anyways.

@bitprophet
Member

Yea, the behavior of stdin reading changed and I'm actually wondering why this didn't show up earlier. Previously, the while loop enclosed the "if ready for reading" and not much more, so the "did we read an empty byte" logic only fired when actual reads occurred.

In the new code where the read/decode is split from the loop logic, the read method (read_our_stdin) returns None immediately if the pipe wasn't ready for reading; so in most cases the stdin thread will just close up shop right away (and then trigger the dead threads check).

Pretty sure I need to twiddle things so the older behavior is reinstated, otherwise stdin just won't work at all.
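
Concretely, the loop needs to tell "nothing to read right now" apart from "our stdin hit EOF" - something like this (sketch; the ready-check and writer helpers are assumptions):

```python
import time

def read_our_stdin(input_):
    # None      -> stream wasn't ready; caller should keep looping
    # empty read -> stream hit EOF; caller may stop mirroring stdin
    # data      -> actual input to forward to the subprocess
    if not ready_for_reading(input_):  # hypothetical select()-style check
        return None
    return input_.read(1)

def handle_stdin(self, input_, output):
    while not self.program_finished.is_set():  # assumes a threading.Event flag
        data = read_our_stdin(input_)
        if data is None:
            time.sleep(0.01)  # not ready: do NOT treat this as EOF
            continue
        if not data:
            break  # genuine EOF on our own stdin
        self.write_proc_stdin(data)  # assumed helper that feeds the child
```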

@bitprophet
Member

Got that "working" but now there's a race condition in runners.stdin_mirroring.when_pty_is_False_we_write_in_stream_back_to_out_stream somewhere, only fails ~50-60% of the time, with an arbitrary amount of the expected bytes being yielded (i.e. sometimes it's just I'm, sometimes I'm ty etc etc.

I think this might be the join timeout at work; at least in the test suite, program_is_finished is hardcoded to return True, so the main run logic gets past wait very fast, ends up at the thread-joining step, and seems to be joining the stdin thread before it's finished reading.

Sorta like above, I think the timeout should probably only be applied to stdout/err, not stdin. (I haven't yet altered has_dead_threads to skip stdin, either, but I still don't think that is truly necessary.)

@bitprophet
Member

That didn't entirely help, which isn't too surprising I guess - I was already suspicious the issue was just that fast "is finished" behavior. The stdin thread will naturally self-terminate when the process is finished, so this is just a "real" problem where the subproc dies or exits before our stdin is done.

This should probably result in a broken pipe error or something.

More importantly, I don't get why this wasn't an issue beforehand: the old version of the test suite had a neutered wait, which should still have resulted in a very fast setting of the program-exited flag & thread joining. Something else has changed, either in the code for this ticket or for #350, to make this problem more likely.

@bitprophet
Member

On master, I'm able to recreate the failing test by slapping a sleep(0.5) in handle_stdin. (Still unable to recreate it normally, even after dozens of runs.)

So some timing behavior may have changed in the encoding branch to make things slower (ugh) but it isn't in fact qualitatively different re: the general issue of subprocs + stdin, which is good because I was going crazy trying to figure out what could possibly be a 'hard' difference.

Unfortunately, this does still mean that it's a general problem. Tempted to punt for now tho - already spent way too much time on this & it's not likely to bite most users right away (and there'd still need to be an error like a pipe-broken OSError type thing; there is no actual non-erroring solution for this).

(But the test needs fixing somehow, even if it's some crappy manual thread locking or something; because the test isn't intending for the "subprocess" to finish before the stdin writing is done.)

@bitprophet bitprophet added a commit that referenced this issue Jun 1, 2016
@bitprophet bitprophet Fix bug in stdin threading re #351/#350
Needed stronger differentiation between 'not ready for reading' and 'read an empty byte'.
d9590d2
@bitprophet
Member

Forgot to actually apply the "minor" fix of "don't use the 1s join() timeout for the stdin thread" - that appears to solve this issue in master, even with the artificially slow stdin reader in place. (Because the stdin thread will then only die when program_finished is set and there's nothing left to read from stdin.)
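
The fix amounts to something like this at join time (sketch; how the threads are tracked is an assumption):

```python
# Bounded joins for the output workers only; the stdin mirror gets an
# unbounded join, since it exits on its own once program_finished is set
# and there's nothing left to read.
self.stdout_thread.join(self.join_timeout)  # e.g. the 1s timeout
self.stderr_thread.join(self.join_timeout)
self.stdin_thread.join()  # no timeout for stdin
```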

Applying it to the encoding branch doesn't make as big a difference as I was hoping, but I did tweak the stdin loop logic some so digging there briefly in case I can just revert a braino.

@bitprophet
Member

I think this is actually done now ಠ_ಠ

@bitprophet bitprophet closed this Jun 1, 2016