Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blocking when running commands looking for stdin close #552

Closed
plockc opened this issue Jul 8, 2018 · 15 comments

Comments

@plockc
Copy link

commented Jul 8, 2018

Example:

@task
def cat(c):
    c.run("cat")
echo Hi | timeout 3 invoke cat || echo failed

This prints Hi but never exits because cat never receives a close of stdin and is waiting for more, and invoke is waiting on the subprocess.

In a different issue, I notice that the console has trouble after running the above command

Caused by two bugs:

First is that ExceptionHandlingThread.is_dead return expression is wrong, it should be "or" instead of "and":
https://github.com/pyinvoke/invoke/blob/master/invoke/util.py#L274

Second, is that the stdin for the process is not closed after receiving EOF:
https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L659

Python 3.6.5
Ubuntu 18.04
Invoke 1.0.0.0

plockc pushed a commit to plockc/invoke that referenced this issue Jul 8, 2018

Chris Plock
Unblock subprocesses waiting for stdin close
Partial fix for pyinvoke#552
Fix ExceptionHandlingThread.is_dead which before would report dead when
thread was not alive and there was no exception info

This does not fix when pty=True, more significant changes are required for that case

New method added close_proc_stdin to allow the parent Runner to ask
subclass to close the stream after it finds the end of the stream.

Two test cases were adjusted that were wrongly asserting that the thread
should be "is_dead" after thread.join()

plockc pushed a commit to plockc/invoke that referenced this issue Jul 8, 2018

Chris Plock
Unblock subprocesses waiting for stdin close
Partial fix for pyinvoke#552

BREAKING CHANGE FOR RUNNER IMPLEMENTATIONS:
New method added close_proc_stdin to allow the parent Runner to ask
subclass to close the stream after it finds the end of the stream.

Fix ExceptionHandlingThread.is_dead which before would report dead when
thread was not alive and there was no exception info

This does not fix when pty=True, more significant changes are required for that case

Two test cases were adjusted that were wrongly asserting that the thread
should be "is_dead" after thread.join()

@bitprophet bitprophet added this to the p2 milestone Jul 13, 2018

@bitprophet

This comment has been minimized.

Copy link
Member

commented Jul 30, 2018

Can confirm the lack of exiting easily enough.

Not getting terminal weirdness however, even under bash, which is where I am able to reproduce the potentially-related fabric/fabric#1812. Still trying to determine if the two issues stem from the same bugs.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

Still ongoing but that Fabric ticket seems to be caused by executing with character_buffered(...) multiple times concurrently (such that all but the first invocation erroneously snapshots, then restores, an already-character-buffered terminal state) which is what causes the weird local terminal behavior (specifically, the echo flag is set to off, and then never properly restored).

Given the code is a contextmanager, I don't see why it would ever hit that particular issue in single-threaded Invoke-only code; no matter what happens inside the block (early exit, error, process termination short of SIGKILL) the post-yield, end-of-block terminal restoration code (which is in a finally clause) should always run.

In addition, when (our) stdin is hooked up to a pipe, as it is here, that contextmanager should become a no-op. The detection of whether to do that has historically been a bit fragile, so maybe this exposes another hole in it. (Though that still shouldn't cause a problem here unless the file descriptor in question is switching targets underneath us - e.g. it's pointing at the echo output pipe initially, which might yield weird tcgetattr results [tho I'd expect an explosion instead], then if we restore those onto the real stdin after for some reason...?)

Either way, I can't reproduce that angle, but I can reproduce the hanging/never-exiting problem so I'll focus on that first and we can make a new issue for terminal state corruption if necessary.

@plockc

This comment has been minimized.

Copy link
Author

commented Sep 3, 2018

hi @bitprophet, the terminal weirdness only happened once, I can't reproduce, I probably should not have mentioned in this ticket.

I have locally ran 1.1.1 and still can reproduce the blocking issue, while taking my 1.0.0 patch and rebasing on top of 1.1.1 works for the "cat" test.

I added test cases and a discussion about why I think the test cases were wrong too, mind looking at my comments and the test case updated assertions?
#553

plockc pushed a commit to plockc/invoke that referenced this issue Sep 3, 2018

Chris Plock
Unblock subprocesses waiting for stdin close
Partial fix for pyinvoke#552

BREAKING CHANGE FOR RUNNER IMPLEMENTATIONS:
New method added close_proc_stdin to allow the parent Runner to ask
subclass to close the stream after it finds the end of the stream.

Fix ExceptionHandlingThread.is_dead which before would report dead when
thread was not alive and there was no exception info

This does not fix when pty=True, more significant changes are required for that case

Two test cases were adjusted that were wrongly asserting that the thread
should be "is_dead" after thread.join()
@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Think I'm hitting this internally now and traced it to something similar-but-different - a lack of submitting EOF down the pipe when non-mirrored in_stream hits its end.

I suspect that the explicit file close in OP's commit (plockc@4d6ab11) will do the same thing and will test that momentarily.

FWIW my own fix of Runner.send_interrupt() results in a later thread deadlock / logic hang (we hit the or logic in Runner.wait, where dead_threads is True but proc_finished is still False) so if that still pops up after switching to the file close approach, more work will be needed.

EDIT: actually when I read that commit's message I see the same problem mentioned there. Duh!

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

When looking at extending that commit to Fabric (which is where I am troubleshooting this for now) I notice we have a mostly-noop Runner.stop method. Thinking maybe we want to fold _close_proc_stdin into that method (or at least call the latter from the former, then call the former where the patch currently calls the latter...). Still figuring out if this even fixes the whole problem for me though.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Does appear to fix my issue, however there's some race condition where maybe 10% of the time, the remote end (in this case, a simple wc -l) exits -1 instead of 0. Gotta remember exactly what that means, though my offhand guess is it means we're closing down before the remote end is and it's getting a broken pipe.

EDIT: when switching to a Linux workstation, the incidence of the problem goes way up (or is 100% of the time perhaps). Guess that'll be useful...

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Interestingly, the bad exit code happens even when the remote program doesn't touch its stdin. I also realize...there's no stdout coming back down the pipe. So something's very clearly wrong here. Removing in_stream from the equation makes everything normal again, even with this patch involved.

The issue with lack of stdout may be a separate, related logic error triggered by in_stream though. EDIT: perhaps not. when this problem appears the remote program doesn't even seem to run at all...eg if I put in a sleep, it doesn't hold things up, if I exit 7, the local end always thinks the returncode is -1. So something's going seriously sideways here.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Thinking this may just be some sort of hilarity where we read a small file from stdin and it gets read/written so fast it's happening before the remote end is even set up correctly; that in tandem with the changed termination logic may mean we close things down before the remote end has even started execution.

That would explain this behavior perfectly. Plus given we clearly haven't tested the non-interactive stdin scenario anywhere near as much as the interactive-stdin scenario, it makes sense we never tripped over this before - in the other scenario, stdin will never exit before the remote end, because it's a live terminal.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

The negative-1 exit status I am getting smells like it may be SSH specific so this is going to be another spot where Fabric bleeds down into Invoke, grump. Does explain why the OP figured the patch worked OK for them. Seems plausible that we'll end up merging this partly as-is and having to address something else on top of it in Fabric's runner subclass.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Yea, as always in this arena, there are too many damned things all competing. When I get the scenario working for a non-reading-from-stdin subprocess, the reading-from-stdin scenario breaks again, and vice versa.

I thought I had something workable where Fabric was just treating Runner._close_proc_stdin as a noop - because Paramiko/SSH have no real concept of closing just stdin, only the entire overall channel - was working ok in a base case (whereas I was closing the whole channel before, like an idiot, and that causes its own problem chain).

But as soon as we return to the real scenario under test - a program actually trying to follow stdin - we're back to square one where that program seems to hang out forever waiting for stdin to close.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Wow, Paramiko's had a stdin-closing related ticket since at least 2014. Not sure why it's never surfaced in my knowledge. Love software. It's paramiko/paramiko#322. Which hints at a Channel member I didn't even remember existed, shutdown_write.

When I dig, it does indeed do what I expected from grepping around for "EOF" in Paramiko - it sends an explicit SSH EOF message type. And plugging THAT into Fabric's version of our new method, seems to make things Just Work.

I'd say hallelujah but I'm sure I'm overlooking something and this isn't over yet, because it is that kind of day.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Also, we still need the usual administrivia here - tests, changelog etc. The tests will probably be an enormous pain in the ass and my tolerance at this point is at an all time low, so I'll probably just do something half-baked.

bitprophet added a commit to fabric/fabric that referenced this issue Apr 26, 2019

bitprophet added a commit that referenced this issue Apr 26, 2019

Unblock subprocesses waiting for stdin close
Partial fix for #552

BREAKING CHANGE FOR RUNNER IMPLEMENTATIONS:
New method added close_proc_stdin to allow the parent Runner to ask
subclass to close the stream after it finds the end of the stream.

Fix ExceptionHandlingThread.is_dead which before would report dead when
thread was not alive and there was no exception info

This does not fix when pty=True, more significant changes are required for that case

Two test cases were adjusted that were wrongly asserting that the thread
should be "is_dead" after thread.join()

bitprophet added a commit that referenced this issue Apr 26, 2019

New method didn't need to be private
The one that contributor probably referenced
did, because it needed a split between base and subclass
implementations. Not currently the case here, so we start
out public only.

Re #552

bitprophet added a commit that referenced this issue Apr 26, 2019

Remove now-extraneous break when detecting end of stdin
It can cause race conditions by triggering our overall 'did any threads
die' logic before remote end has exited, and there's no downside to this
worker hanging out in a noop loop til end of session, as it has a sleep.

Re #552

bitprophet added a commit that referenced this issue Apr 26, 2019

Fix a test broken by c09962a
Re #552 which still needs its own new tests
@bitprophet

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

Revisiting this so it can get closed out. Writing some quick tests & they're exposing a flaw with my reasoning in the commit that removes the break after sending EOF down the pipe: yes, we do need to avoid that race condition, but just letting the loop keep rolling causes the EOF to be re-sent hundreds or thousands of times! (as evidenced by my test naively assuming that Runner.close_proc_stdin is only going to be called once.)

This may not be a huge deal for Local but it's still crazy messy, and for any other Runner subclass (like in Fabric's Remote) it could potentially be far more inefficient if not outright harmful.

Need some way to track state such that we don't trigger a "dead thread" alarm while either exiting the loop after all, or at least skipping over that method call after its first run.

@bitprophet

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

Also, I went through some mental loops today over whether this should be a bugfix or a new feature (on account of adding the new Runner method) and ended up deciding on the latter. Plus the potential for extra buggy hijinx from this behavior change also strikes me as being at odds with the stability emphasis of bugfix branches.

bitprophet added a commit that referenced this issue Jul 19, 2019

Unblock subprocesses waiting for stdin close
Partial fix for #552

BREAKING CHANGE FOR RUNNER IMPLEMENTATIONS:
New method added close_proc_stdin to allow the parent Runner to ask
subclass to close the stream after it finds the end of the stream.

Fix ExceptionHandlingThread.is_dead which before would report dead when
thread was not alive and there was no exception info

This does not fix when pty=True, more significant changes are required for that case

Two test cases were adjusted that were wrongly asserting that the thread
should be "is_dead" after thread.join()

bitprophet added a commit that referenced this issue Jul 19, 2019

New method didn't need to be private
The one that contributor probably referenced
did, because it needed a split between base and subclass
implementations. Not currently the case here, so we start
out public only.

Re #552

bitprophet added a commit that referenced this issue Jul 19, 2019

Remove now-extraneous break when detecting end of stdin
It can cause race conditions by triggering our overall 'did any threads
die' logic before remote end has exited, and there's no downside to this
worker hanging out in a noop loop til end of session, as it has a sleep.

Re #552

bitprophet added a commit that referenced this issue Jul 19, 2019

Fix a test broken by c09962a
Re #552 which still needs its own new tests
@bitprophet

This comment has been minimized.

Copy link
Member

commented Jul 19, 2019

Decided to simply solve the "don't close stdin a lot" problem with basic state-tracking logic, instead of fucking around with thread aliveness yet again. As noted, there's a sleep in that loop which should limit any problems with it hanging out til end-of-subprocess, even if we could technically close it down sooner.

bitprophet added a commit to fabric/fabric that referenced this issue Aug 4, 2019

bitprophet added a commit to fabric/fabric that referenced this issue Aug 6, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.