Running into this under #350: if the encoding code changes in a way that throws errors within the IO loops, things tend to hang overall until Ctrl-C'd.
Feels like it's the same problem as #289 (comment) (and indeed, I'm only able to notice it when I turn debug logging on, but that made it very obvious very fast, so yay past me, I guess), except this is for commands that definitely don't require user input (spec, i.e. our test suite), so something else is going on.
Offhand, wondering if something to do with the IO thread not consuming from the subprocess' pipe and said pipe filling up, which I think would be another way to keep the child from ever exiting (putting us back in limbo). This sounds super familiar so I bet it's come up before and I just can't find it when I search.
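For posterity, a minimal standalone repro sketch of that theory (not Invoke code): a child writing more than the OS pipe buffer (~64KB on most platforms) to an unread pipe blocks in `write()` and can never exit, so a plain `wait()` on it hangs.

```python
import subprocess
import sys

# Child tries to write ~1MB to stdout; nobody reads the pipe, so once
# the OS pipe buffer fills, its write() blocks and it can never exit.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1_000_000)"],
    stdout=subprocess.PIPE,
)
# proc.wait() here would hang forever: child stuck in write(), us in wait().
print(proc.poll())  # None -- still "running" (really: blocked on write)
proc.kill()  # clean up rather than deadlock
proc.wait()
```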
I also don't see a great way to actually handle this, because "child process is blocking on our own ability to read from it" is indistinguishable from "child process is sitting around waiting for something legit". Unless we can make Runner.wait check its IO threads' health and abort early if it notices they've died; will look into that.
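Roughly what I'm imagining (hypothetical sketch; `process_is_finished` and `io_threads` are stand-in names, not real Invoke API):

```python
import threading
import time

def wait(process_is_finished, io_threads, interval=0.1):
    """Poll until the subprocess exits, OR bail if an IO worker died."""
    while not process_is_finished():
        if any(not t.is_alive() for t in io_threads):
            # A dead reader can no longer drain its pipe; waiting any
            # longer risks deadlocking on a full pipe, so abort early.
            return False
        time.sleep(interval)
    return True
```

Returning a flag (vs raising) is arbitrary here; the real question is just whether wait can notice the workers at all.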
Side note: at first I thought it was related to #275 but a) I was unable to recreate the issue there, b) inspecting the behavior of the forked pre-execve code doesn't show that it's excepting in this case (it's the parent process' IO threads only) and c) I can get this both with pty on and off.
Think I have a working test for this; sure, it makes the test suite hang instead of error, but I honestly don't see a good way to surface it otherwise (besides a timer thread or whatever, but I don't care hard enough right now). Travis will surface it via its own native timeout, local users will notice pretty fast since even the integration suite takes <10s to run normally.
(Downside is it requires knowing where a good, large textfile is to cat, but meh. Both Darwin and Travis' Docker image have /usr/share/dict/words, so.)
On to the solution...
Hm, one wrinkle in my plan: while we control the wait loop for the PTY use case, we do not control it for the regular-subprocess use case (it's buried in Popen.wait). Hopefully we can just switch to Popen.poll instead...
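Quick illustration of the swap (standalone, not Invoke's actual code): Popen.poll returns immediately, None while the child is still running and its exit code once it's done, so we own the loop and can do other checks each iteration.

```python
import subprocess
import sys
import time

proc = subprocess.Popen([sys.executable, "-c", "pass"])
# poll() is nonblocking: None while running, exit code once finished.
while proc.poll() is None:
    time.sleep(0.05)  # room here to also check IO-thread health, etc.
print(proc.returncode)  # 0
```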
Also, getting this working (again, if it even DOES work) will require some code shuffling, since at the moment the entirety of Runner.wait is abstract, but this "side" of the problem (testing whether the IO threads have died) should be in the parent class.
Hrm, even in the PTY branch, the while loop seems to not do jack; os.waitpid appears to be blocking, when the loop & comments within imply it would not. But the docs say nothing about this "it returns right away with pid==0" stuff; 99.9% sure that we were following older code using WNOHANG, which would give the nonblocking, returns-0 behavior.
Unfortunately, the comment links to https://github.com/pexpect/ptyprocess/blob/4058faa05e2940662ab6da1330aa0586c6f9cd9c/ptyprocess/ptyprocess.py#L680-L687 which implies that WNOHANG doesn't work right on Linux; further, the docs for os.waitpid themselves say that the options param doesn't do jack on Windows.
I was apparently suspicious in the past about that assertion re: WNOHANG, and in basic testing on Debian, it does work just as well as on Darwin, implying the pexpect note is outdated or something. That still leaves us with Windows as a possible issue, but I wonder if Windows even has this particular problem re: pipes filling up (wouldn't be too surprised if it did, but, all bets are really off).
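For reference, the WNOHANG behavior I verified (illustrative standalone snippet, Unix-only): with os.WNOHANG, waitpid returns (0, 0) immediately while the child is still running, instead of blocking.

```python
import os
import time

pid = os.fork()
if pid == 0:
    time.sleep(0.2)
    os._exit(3)  # child: linger briefly, then exit with status 3
while True:
    # WNOHANG: returns (0, 0) right away if the child hasn't exited yet.
    waited_pid, status = os.waitpid(pid, os.WNOHANG)
    if waited_pid == pid:
        break  # child really exited; status is now meaningful
    time.sleep(0.05)  # still running; loop (room for other checks here)
print(os.WEXITSTATUS(status))  # 3
```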
Looked for an alternative to using the 'nonblocking wait' strategy, but I don't see anything in os for checking proc status that is reliably cross-platform yet also allows accessing exit status; even e.g. psutil only seems to offer variants on wait.
So for now, I think "nonblocking wait" is gonna have to be it; if this means this bug is non-solvable on Windows, so be it. Ideally it will only ever be surfaced by other, legit, fixable bugs in the IO threads, so.
Have this working well for the PTY use case.
For the non-PTY case, I have it "working" insofar as I appear able to poll the subprocess' exit state as guessed above - so Runner.wait correctly notices the IO thread dying and aborts.
However, the stderr thread - which is active here, but is not in the PTY case - is hanging on join, and doesn't seem to be triggering the exception log in ExceptionHandlingThread. Due to how I am testing this (causing an exception in _handle_output) both threads should behave identically, so something is goofy here. Suspect some dumb assumptions somewhere. Digging.
`cat words | head -n10`
Got that working pretty well, including refactoring - so now wait is concrete in Runner and only the "is the subprocess still running?" bits remain abstract. Seems good.
Grump. Thought this was good, merged to master, merged back into the branch for #350, and now the join timeout seems to be doing the Bad Things I was worried about (if the subproc runs for longer than roughly the timeout value, everything just stops without error).
Bizarrely, on master I can't recreate it - only in the combined master+#350 branch - and I don't see offhand why the diff there would cause this. Time to dig.
Hoping it's something dumb and easily fixed because otherwise I don't see a good solution for the corner case here (no PTY, one worker dies, other worker hangs out forever), only crappy solutions like "manually kill the subprocess" (maybe that's not SO crappy) or "ignore post-read errors in IO threads so that even if they're super broken, they still read enough times to flush the pipe".
Seems to be an issue with stdin, its thread is dying almost immediately in the encoding branch, which is triggering the "do we have dead threads" check and causing the early termination.
Guessing it's to do with the reorg/split of read / encode concerns, still digging.
If there's no real bug here, I can simply update the dead-thread check to only look at stdout/err, since that was the original intent anyways. The stdin thread won't cause the deadlock problems with the subprocess & will auto-terminate when the flag is set anyways.
Yeah, the behavior of stdin reading changed, and I'm actually wondering why this didn't show up earlier. Previously, the while loop enclosed the "if ready for reading" check and not much more, so the "did we read an empty byte" logic only fired when actual reads occurred.
In the new code where the read/decode is split from the loop logic, the read method (read_our_stdin) returns None immediately if the pipe wasn't ready for reading; so in most cases the stdin thread will just close up shop right away (and then trigger the dead threads check).
Pretty sure I need to twiddle things so the older behavior is reinstated, otherwise stdin just won't work at all.
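Something along these lines (sketch; the real read_our_stdin signature differs, and select-based readiness is just one way to do it). The crucial bit is that the caller treats "not ready" (None) and "EOF" (empty bytes) as distinct results:

```python
import select

def read_our_stdin(stream, num_bytes=1024):
    # Not ready yet? Return None so the loop keeps waiting -- crucially
    # NOT the same thing as reading an empty byte (true EOF).
    reads, _, _ = select.select([stream], [], [], 0)
    if not reads:
        return None
    return stream.read(num_bytes)  # b'' here means stdin really closed

def stdin_loop(stream, handle, should_stop):
    while not should_stop():
        data = read_our_stdin(stream)
        if data is None:
            continue  # nothing yet; keep looping
        if not data:
            break  # genuine EOF: we're done mirroring stdin
        handle(data)
```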
Got that "working", but now there's a race condition somewhere in runners.stdin_mirroring.when_pty_is_False_we_write_in_stream_back_to_out_stream; it only fails ~50-60% of the time, with an arbitrary amount of the expected bytes being yielded (i.e. sometimes it's just `I'm`, sometimes `I'm ty`, etc.).
I think this might be the join timeout at work; at least in the test suite, program_is_finished is hardcoded to return True, so the main run logic gets past wait very fast and ends up at the thread-joining and seems to be joining the stdin thread before it's finished reading.
Sorta like above, I think the timeout should probably only be applied to stdout/err, not stdin. (I haven't yet altered has_dead_threads to skip stdin, either, but I still don't think that is truly necessary.)
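I.e. something like this (hypothetical helper; the parameter names are made up for illustration):

```python
def join_workers(stdout_thread, stderr_thread, stdin_thread, timeout=1):
    # stdout/err workers get a bounded join: if they've wedged, we don't
    # want the whole run to hang on them.
    for t in (stdout_thread, stderr_thread):
        if t is not None:
            t.join(timeout)
    # The stdin worker self-terminates once the program-finished flag is
    # set and there's nothing left to read, so join it without a timeout
    # rather than cutting it off mid-read.
    if stdin_thread is not None:
        stdin_thread.join()
```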
That didn't entirely help, which isn't too surprising, I guess; I was already suspicious the issue was just that fast "is finished" behavior. The stdin thread will naturally self-terminate when the process is finished, so this is just a "real" problem where the subproc dies or exits before our stdin is done.
This should probably result in a broken pipe error or something.
More importantly, I don't get why this wasn't an issue beforehand, the old version of the test suite had a neutered wait which should still have resulted in a very-fast setting of program-exited flag & thread joining. Something else has changed, either in the code for this ticket or for #350, to make this problem more likely.
On master, I'm able to recreate the failing test by slapping a sleep(0.5) in handle_stdin. (Still unable to recreate it normally, even after dozens of runs.)
So some timing behavior may have changed in the encoding branch to make things slower (ugh) but it isn't in fact qualitatively different re: the general issue of subprocs + stdin, which is good because I was going crazy trying to figure out what could possibly be a 'hard' difference.
Unfortunately, it does still mean that this is a general problem. Tempted to punt for now tho; already spent way too much time on this & it's not likely to bite most users right away (and there'd still need to be an error like a broken-pipe OSError type thing; there is no actual non-erroring solution for this).
(But the test needs fixing somehow, even if it's some crappy manual thread locking or something; because the test isn't intending for the "subprocess" to finish before the stdin writing is done.)
Fix bug in stdin threading re #351/#350
Needed stronger differentiation between 'not ready for reading' and 'read an empty byte'.
Forgot to actually apply the "minor" fix of "don't use the 1s join() timeout for the stdin thread"; that appears to solve this issue in master, even with the artificially slow stdin reader in place. (Because the stdin thread will then only die when program_finished is set and there's nothing left to read from stdin.)
Applying it to the encoding branch doesn't make as big a difference as I was hoping, but I did tweak the stdin loop logic some so digging there briefly in case I can just revert a braino.
I think this is actually done now ಠ_ಠ
Prevent race condition introduced as part of #351; closes #359