Errors in the IO threads can cause everything to hang, still #351
Think I have a working test for this; sure, it makes the test suite hang instead of error, but I honestly don't see a good way to surface it otherwise (besides a timer thread or whatever, but I don't care hard enough right now). Travis will surface it via its own native timeout; local users will notice pretty fast, since even the integration suite takes <10s to run normally. (Downside is it requires knowing where a good, large textfile is.)

On to the solution...
Hm, one wrinkle in my plan: while we control the wait-loop for the PTY use case, we do not control it for the regular-subprocess use case (it's buried in …)
Also, getting this working (again, if it even DOES work) will require some code shuffling, since at the moment the entirety of …
Hrm, even in the PTY branch, the while loop seems to not do jack. Unfortunately, the comment links to https://github.com/pexpect/ptyprocess/blob/4058faa05e2940662ab6da1330aa0586c6f9cd9c/ptyprocess/ptyprocess.py#L680-L687 which implies that WNOHANG doesn't work right on Linux; further, the docs for …

I was apparently suspicious in the past about that assertion re: WNOHANG, and in basic testing on Debian, it does work just as well as on Darwin, implying the pexpect note is outdated or something. That still leaves us with Windows as a possible issue, but I wonder if Windows even has this particular problem re: pipes filling up (wouldn't be too surprised if it did, but all bets are really off).

Looked for an alternative to using the 'nonblocking wait' strategy, but I don't see anything in …

So for now, I think "nonblocking wait" is gonna have to be it; if this means this bug is non-solvable on Windows, so be it. Ideally it will only ever be surfaced by other, legit, fixable bugs in the IO threads, so.
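For reference, a minimal, POSIX-only sketch of the "nonblocking wait" strategy being discussed (illustrative only, not Invoke's actual implementation): `waitpid` with `WNOHANG` returns `(0, 0)` while the child is still running, so the wait loop can poll without blocking.

```python
import os
import time

pid = os.fork()
if pid == 0:
    # Child: pretend to do some work, then exit with a known code.
    time.sleep(0.2)
    os._exit(3)

# Parent: poll without blocking. waitpid(..., WNOHANG) returns (0, 0) while
# the child is still running, and (pid, status) once it has exited.
while True:
    finished_pid, status = os.waitpid(pid, os.WNOHANG)
    if finished_pid != 0:
        break
    time.sleep(0.05)  # a real wait loop would also inspect IO-thread state here

print("exit code:", os.WEXITSTATUS(status))  # -> exit code: 3
```

In quick testing this behaves the same on Linux and macOS, consistent with the suspicion above that the pexpect note is outdated.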
Have this working well for the PTY use case. For the non-PTY case, I have it "working" insofar as I appear able to poll the subprocess' exit state as guessed above - so …

However, the stderr thread - which is active here, but is not in the PTY case - is hanging on …
Got that working pretty well, including refactoring - so now …
Grump. Thought this was good, merged to master, merged back into the thread for #350, and now the …

Bizarrely, on master I can't recreate it - only in the combined master+#350 branch - and I don't see offhand why the diff there would cause this. Time to dig. Hoping it's something dumb and easily fixed, because otherwise I don't see a good solution for the corner case here (no PTY, one worker dies, other worker hangs out forever), only crappy solutions like "manually kill the subprocess" (maybe that's not SO crappy) or "ignore post-read errors in IO threads so that even if they're super broken, they still …"
Seems to be an issue with stdin: its thread is dying almost immediately in the encoding branch, which is triggering the "do we have dead threads" check and causing the early termination. Guessing it's to do with the reorg/split of read / encode concerns; still digging. If there's no real bug here, I can simply update the dead-thread check to only look at stdout/err, since that was the original intent anyways. The stdin thread won't cause the deadlock problems with the subprocess & will auto-terminate when the flag is set anyways.
Yea, the behavior of stdin reading changed, and I'm actually wondering why this didn't show up earlier. Previously, the while loop enclosed the "if ready for reading" and not much more, so the "did we read an empty byte" logic only fired when actual reads occurred. In the new code, where the read/decode is split from the loop logic, the read method (…)

Pretty sure I need to twiddle things so the older behavior is reinstated, otherwise stdin just won't work at all.
Got that "working", but now there's a race condition in …

I think this might be the join timeout at work; at least in the test suite, …

Sorta like above, I think the timeout should probably only be applied to stdout/err, not stdin. (I haven't yet altered …)
That didn't entirely help, which isn't too surprising I guess; I was already suspicious the issue was just that fast "is finished" behavior. The stdin thread will naturally self-terminate when the process is finished, so this is just a "real" problem where the subproc dies or exits before our stdin is done. This should probably result in a broken pipe error or something.

More importantly, I don't get why this wasn't an issue beforehand; the old version of the test suite had a neutered …
On master, I'm able to recreate the failing test by slapping a …

So some timing behavior may have changed in the encoding branch to make things slower (ugh), but it isn't in fact qualitatively different re: the general issue of subprocs + stdin, which is good, because I was going crazy trying to figure out what could possibly be a 'hard' difference.

Unfortunately, it does still mean that this is a general problem. Tempted to punt for now tho; already spent way too much time on this & it's not likely to bite most users right away (and there'd still need to be an error like a pipe-broken OSError type thing; there is no actual non-erroring solution for this.)

(But the test needs fixing somehow, even if it's some crappy manual thread locking or something, because the test isn't intending for the "subprocess" to finish before the stdin writing is done.)
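The "subproc dies before our stdin is done" case is easy to reproduce standalone; since Python ignores SIGPIPE by default, the write surfaces as a `BrokenPipeError` rather than silently killing anything. A sketch (using `head -c 1` as a stand-in child that stops reading almost immediately):

```python
import subprocess

# Child reads a single byte of stdin and exits.
proc = subprocess.Popen(
    ["head", "-c", "1"],
    stdin=subprocess.PIPE,
    stdout=subprocess.DEVNULL,
)
proc.stdin.write(b"x")
proc.stdin.flush()
proc.wait()  # the child has definitely exited now

# Further writes to its stdin hit a pipe whose read end is closed: EPIPE,
# which Python raises as BrokenPipeError instead of delivering SIGPIPE.
broke = False
try:
    proc.stdin.write(b"more data the child will never read")
    proc.stdin.flush()
except BrokenPipeError:
    broke = True

print("broken pipe surfaced:", broke)
```

This supports the note above that the right behavior is an explicit broken-pipe error, not a hang.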
Needed stronger differentiation between 'not ready for reading' and 'read an empty byte'.
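That distinction can be sketched as a three-state reader (names here are hypothetical, not the real method): `None` means "not ready right now, keep looping", while an empty bytestring is reserved for genuine EOF.

```python
import os
import select

def read_chunk(fd: int, timeout: float = 0.1):
    """Return None when fd simply isn't ready, b'' on true EOF,
    and the data otherwise -- three distinct states, not two."""
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None           # not ready: caller should loop, NOT treat as EOF
    return os.read(fd, 1024)  # b'' here genuinely means the other end closed

# Demo with a pipe: nothing written yet -> None; data -> data; closed -> b''.
r, w = os.pipe()
assert read_chunk(r) is None
os.write(w, b"hi")
assert read_chunk(r) == b"hi"
os.close(w)
assert read_chunk(r) == b""
```

Collapsing the first two states into one is exactly what makes a loop think stdin hit EOF when it was merely idle.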
Forgot to actually apply the "minor" fix of "don't use the 1s join() timeout for the stdin thread" - that appears to solve this issue in master, even with the artificially slow stdin reader in place. (Because the stdin thread will then only die when program_finished is set and there's nothing left to read from stdin.) Applying it to the encoding branch doesn't make as big a difference as I was hoping, but I did tweak the stdin loop logic some, so digging there briefly in case I can just revert a braino.
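The join policy described there might look roughly like this (a hedged sketch; thread structure and names are illustrative, not the real code): output workers get a bounded join so a wedged reader can't hang shutdown, while the stdin worker is joined without a timeout only after the finished flag is set, since it then exits on its own.

```python
import threading
import time

program_finished = threading.Event()

def stdin_worker():
    # Self-terminates once the program-finished flag is set.
    while not program_finished.is_set():
        time.sleep(0.01)

def output_worker():
    time.sleep(0.05)  # stand-in for draining stdout/stderr

out_t = threading.Thread(target=output_worker)
in_t = threading.Thread(target=stdin_worker)
out_t.start()
in_t.start()

# Output workers: bounded join, so a stuck reader can't hang us forever...
out_t.join(timeout=1)
# ...but stdin is only joined unbounded AFTER the flag is set, since it
# exits on its own once there's nothing left to feed the subprocess.
program_finished.set()
in_t.join()
```

With the old 1s timeout applied to stdin too, a slow writer could be abandoned mid-stream, which matches the symptom above.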
I think this is actually done now ಠ_ಠ
Prevents accidental early closure of IO workers when there is a lot of stuff in the pipe and/or reading takes some time (e.g. if network is involved, etc.)
Running into this under #350 - if the encoding code changes in a way to throw errors within the IO loops, things tend to hang overall until Ctrl-C'd.
Feels like it's the same problem as #289 (comment) (and indeed, I am only able to notice it when I turn debug logging on - but it made it very obvious very fast, so yay past me I guess) except this is for commands definitely not requiring user input (`spec`, i.e. our test suite), so something else is going on.

Offhand, wondering if it's something to do with the IO thread not consuming from the subprocess' pipe and said pipe filling up, which I think would be another way to keep the child from ever exiting (putting us back in limbo). This sounds super familiar, so I bet it's come up before and I just can't find it when I search.
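That pipe-filling hypothesis is easy to demonstrate in isolation: an OS pipe's buffer is finite (often 64 KiB on Linux), so a child whose output nobody drains eventually blocks in `write()`. A standalone POSIX sketch, using a nonblocking writer so the demo itself can't hang:

```python
import errno
import fcntl
import os

# Create a pipe and make the write end nonblocking, so filling it raises
# instead of blocking this demo forever.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # Buffer full: a normal (blocking) writer -- e.g. a chatty subprocess
    # whose reader thread died -- would now be stuck here indefinitely.
    pass

print(f"pipe filled after {written} bytes")
```

A child wedged like this never exits, so a plain blocking `wait()` on it never returns either, which is consistent with the hang described above.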
I also don't see a great way to actually handle this, because "child process is blocking on our own ability to read from it" is indistinguishable from "child process is sitting around waiting for something legit". Unless we can make `Runner.wait` check for its IO threads' health and abort early if it notices they've died...will look into that.

Side note: at first I thought it was related to #275, but a) I was unable to recreate the issue there, b) inspecting the behavior of the forked pre-`execve` code doesn't show that it's excepting in this case (it's the parent process' IO threads only), and c) I can get this both with pty on and off.
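A rough sketch of what such a health-checking wait loop could look like (all names here are hypothetical, not the actual `Runner.wait` implementation): poll the subprocess' exit state, but bail out early if any IO worker thread has died.

```python
import threading
import time

class DeadThreadError(Exception):
    """Raised when an IO worker dies while the subprocess is still running."""

def wait(proc_finished, workers, poll_interval=0.05):
    # Poll the subprocess' exit state, but abort early if any IO worker
    # thread has died -- otherwise a wedged pipe hangs us forever.
    while not proc_finished():
        if any(not w.is_alive() for w in workers):
            raise DeadThreadError("an IO worker died; aborting wait")
        time.sleep(poll_interval)

# Demo: a worker that exits early (simulating a crash) plus a "subprocess"
# that never finishes -- wait() aborts instead of hanging.
worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()

try:
    wait(proc_finished=lambda: False, workers=[worker])
except DeadThreadError as exc:
    print("aborted:", exc)
```

This can't distinguish a legitimately idle child from a blocked one, but it does turn "hang forever" into an explicit error whenever the real culprit is a dead worker.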