Interrupting a running task with Ctrl+C #152

Closed
pmdarrow opened this Issue Jun 17, 2014 · 35 comments

Projects

None yet

6 participants

@pmdarrow

Interrupting a running task with Ctrl+C results in this traceback:

^CTraceback (most recent call last):
  File "/Users/peter/Envs/are/bin/inv", line 9, in <module>
    load_entry_point('invoke==0.8.2', 'console_scripts', 'inv')()
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/cli.py", line 295, in main
    dispatch(sys.argv)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/cli.py", line 288, in dispatch
    return executor.execute(*tasks, dedupe=dedupe)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/executor.py", line 89, in execute
    task=task, name=name, args=args, kwargs=kwargs
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/executor.py", line 128, in _execute
    return task(*args, **kwargs)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/tasks.py", line 108, in __call__
    result = self.body(*args, **kwargs)
  File "/Users/peter/Code/are-platform/API/tasks.py", line 56, in start
    run_env('python are/servers/dev_server.py', shell_env, verbose)
  File "/Users/peter/Code/are-platform/API/tasks.py", line 27, in run_env
    run('%s %s' % (env_vars, cmd))
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/runner.py", line 143, in run
    stdout, stderr, exited, exception = func(command, warn, hide)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/runner.py", line 65, in run
    stdout, stderr = process.communicate()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 798, in communicate
    return self._communicate(input)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/monkey.py", line 58, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
KeyboardInterrupt

I'm migrating from Fabric which handles SIGINT nicely by exiting with sys.exit(1) and printing Stopped.. It also properly forwards the interrupt to any processes started with fabric.api.local(). How can I get this behaviour in Invoke?

Willing to contribute a fix for this if I'm pointed in the right direction!

@sophacles

I'm 👍 on this

@presidento
Contributor

I'm also willing to contribute for this if I'm pointed in the right direction.

@frol
frol commented Feb 18, 2016

@presidento Just looking at the code, I would say that the place where handling of KeyboardInterrupt exception should be added is here.

@presidento
Contributor

@frol I don't think so. IMHO, the best way to handling Ctrt+C would be sending the SIGKILL signal to the external process and wait gracefully until it stops. Now the behaviour is very different (#315): Invoke exits, while the started process remain running in the background.

Also I don't want to Invoke handle the Ctrl+C, because we frequently use invoke run bash. And I run other commands within the bash shell, so it would be good to return into it after pressing Ctrl+C.

@frol
frol commented Feb 19, 2016

@presidento I see your point. Then it seems it should be also patched here. Catch KeyboardInterrupt exception, send SIGTERM to a child (why would you want to kill it with SIGKILL?), wait again and once it is terminated, I would raise KeyboardInterrupt to exit since this is what I would expect as a user instead of continuing running. (Also, you may consider handling subsequent CTRL+C hits, which will kill a misbehaved child with SIGKILL signal.)

@bitprophet
Member

My off the cuff feedback:

  • Us not submitting the Ctrl-C "signal" downstream to the subprocess is an oversight and needs to happen. The general design goal in Invoke's runner functionality is "as transparent as we can get without sacrificing our own feature set" :)
    • Honestly, we should replicate most if not all signals we get, to the running subprocess, if we receive signals during the "waiting for subprocess" phase of our execution.
  • Again, just off the cuff, I think the "best" place to 'mirror' the KeyboardInterrupt downstream may be here - the recent changes fixing #303 added that try/except mostly to nab KeyboardExceptions. So within the except, or somewhere soon afterwards, is probably where we ought to signal the subprocess.
    • Likely needs to be implemented as an abstract Runner method + explicit Local method because the act of signaling differs between local and remote execution.
    • I also agree that SIGTERM is probably the best default to go with, though making this configurable might be a good idea...
@bitprophet
Member

Poking at this myself now, referencing #331 as a jumping-off point (thanks @presidento!). Notes as I go:

  • Started with tests & docs :)
  • Put the basic logic around "do things with the exception encountered during wait" inside run instead of wait itself, as it's implementation-agnostic & also just felt cleaner.
    • Then added signal_interrupt() as the implementation-specific subroutine that handles signaling
  • Preserved @presidento's use of killpg instead of kill since that does feel like the most correct call (more likely to correctly handle situations where the subprocess spawns its own subprocesses)
  • Handled the situations where pty=False (which also includes some Windows considerations)

EDIT: getting permission denied errors on killpg; digging into that next.

@bitprophet
Member

Seems like the killpg thing (I get OSError: permission denied) is a Darwin/BSD problem (e.g. this); but it works OK on Linux.

That said, regular kill seems to work OK on both platforms, and while submitting the signal to the whole process group sounds nice, I'm not sure whether it's a big enough benefit to be worth the apparent cross-platform issues. E.g. I'd expect most subprocesses that spawn their own subprocesses to handle a single SIGINT sent to just-them, gracefully enough.

So I'm probably going to go with kill for now, tho as usual I'm open to arguments. (Just, the argument needs to overcome the platform issue or the complexity cost of working around it :))

@presidento
Contributor

Using killpg was not conscious, I just googled for an existing problem and have found a "works for me" solution (for Linux). So I don't have arguments againts kill.

@bitprophet
Member

Great, glad to hear it! :)

@bitprophet bitprophet added a commit that referenced this issue Mar 16, 2016
@bitprophet bitprophet DDD re #152 405e76a
@bitprophet
Member

Back on this. Works reasonably well in practice; been poking at a non crap integration test for it, which grew into a generic "assert signal passthrough" setup (though that is only on the test harness side, Invoke itself still only handles SIGINT/KeyboardInterrupt right now).

Actually invoking the test involves starting a subprocess running Invoke, then sending it SIGINT while it's running. Ideally, this wants a truly asynchronous API for run, but I don't have time to enter that rabbit hole right now. Basic use of threading works pretty well in its stead.

Need to tidy up the WIP (eg merge my tiny threading handler with runners._IOThread) and then finish up with the non-pty implementation & test cases (the base case just looks at pty=True).

@bitprophet
Member

I'm apparently a moron and this may not have worked well after all. Suspect the problem was that I worked on the test harness stuff mostly on its own, and goofed hooking it up to the real behavior.

(This was complicated by the fact that the sub-subprocess expecting a given signal, has no actual way of communicating cleanly with the test runner, besides parsing stdout/stderr. So I was actually missing some "failures" at some point.)

What seems to be happening so far:

  • Earlier work on this and related tickets ended up with a core event logic of:
    • start IO threads
    • wait for natural completion, capturing main-thread exceptions if they occur while waiting
    • either way, signal to the threads that they should clean up, via a threading Event
    • join the threads (which can involve waiting again if they do not clean up correctly)
    • handle main-thread exceptions, if any - this is where we submit the signal downstream to the still-live subprocess
    • handle IO-thread exceptions, if any
  • What I expected (and saw, at least some of the time) in the Ctrl-C/SIGINT event, is, regardless of pty:
    • the 'wait-or-except' step sees KeyboardInterrupt, stores it
    • the IO threading Event is set
    • the IO threads terminate 'early', even if there was ostensibly stuff to read
      • the latter is moot in my current test case as I'm just using sleep or similar, so there is an active subprocess but no text in the pipes
    • joining those threads happens pretty fast
    • we move on to handling the KeyboardInterrupt and send it downstream
  • When pty=False, that happens correctly (save for the actual submission of the SIGINT, which isn't implemented for pty=False, but the point is that we get to calling send_interrupt)
  • But when pty=True, we're blocking on the first IO thread (which happens to be our stdout reader) until the subprocess naturally exits. Looking into that now, but ugh.

This is not happening in the unit tests, but that's because they're unit tests & have mocking designed to allow things to flow fast-but-clean (and the "submit signal downstream" code is not actually broken - os.kill does get called eventually - it just isn't happening until after the subprocess exits naturally).

I would consider skipping past this because ugh, time/complex, but the fact that unit tests can't see this problem, exposes the need for solid integration tests.

@bitprophet
Member

OK, as I suspected earlier but then forgot while writing the above: it's because the stdin handler is the only one honoring program_finished (and this is because it needs that flag to know when to shut down, as Invoke's own stdin is typically an interactive terminal pipe with no clearly defined end).

Presently, the stdout/err handlers prefer to consume "the whole stream" to avoid race conditions, so they ignore that Event object. But that's now actually preventing us from sending the signal that would cause those streams to terminate, in a nice Catch-22.

So I see two ways out of this:

  • Update stdout/stderr to honor the threading Event after all, and figure out a way to deal with the race condition (or ignore it until it's a real problem. That always ends well!! Not.)
  • Move the signal-sending logic somewhere else:
    • Simply separate the "submit signal" and "raise an exception" logic, since they are at this point separate things, and call send_interrupt at KeyboardInterrupt time instead of post-thread-shutdown time. This feels like the right thing to do offhand.
      • The reason we handle it later now is because the signal-send was implemented after "raise stored exceptions" and so the 'obvious' place to stick it was in that same spot.
    • Alternately, set up yet another worker thread responsible solely for mirroring signals, which would be able to do such things independent of the main thread (and thus arguably as close to 'realtime' as possible).
      • Aside from the extra complexity cost, this could also (if not now, maybe later) involve manipulating state that is normally 'owned' by the main thread, e.g. attributes of the Runner instance - something I'd prefer not to do if I can get away with it.
      • Currently the threads only perform read/write actions on pipe/file objects, which isn't so bad.
@bitprophet
Member

Furthermore, while I'm not 100% sure about the why, the problem only happens for PTYs; in the no-PTY case, the signal seems to be getting submitted downstream automatically & immediately - when I test this on the integration's helper module, it sees SIGINT before we even get around to calling send_interrupt. (This also means we probably don't actually need to do anything there in this case, perhaps...perhaps on Windows though).

Offhand I'm thinking this may be due to no-pty using straight up subprocess.Popen, meaning the subprocess is a direct child of the Invoke process, and my testing right now is straight up Ctrl-C in a terminal, which IIRC acts like pgkill instead of pkill.

EDIT: yea, if I change to using pkill -INT from another terminal, the same hanging behavior occurs, even with pty=False. And google confirms - Ctrl-C being interpreted as "SIGINT to foreground process group" is POSIX behavior. Finally, I doublechecked using (Darwin/BSD) ps -ax -O tpgid (control terminal process group ID column displayed) and yes, with no-pty, both invoke and the subprocess share a process group, but with pty, the pty-driven subprocess gets its own process group.

@bitprophet
Member
  • Yup, moving send_interrupt's call to time-of-except works much better, re: by-hand testing & integration test
  • The problem of correctly determining a failure of the integration test - specifically, one where the inner Invoke doesn't send the signal but does still exit 130 - remains.
    • The core issue is that while signaling.py nicely exits w/ exceptions when unexpected things happen, it's being run via the middle Invoke, which is being SIGINT'd, and will thus always exit 130.
      • Even if we e.g. exited 0 on SIGINT for some reason, the full run cycle will never complete due to interruption, and thus the fact that the inner process "failed" cannot enter the picture.
    • So, not sure what my least-terrible option is for having the innermost/sentinel process communicate failure with the test runner. Probably just literally testing stdout or stderr for an expected string. Semi gross, sure, but no real other avenues of communication besides files or whatever, which is even dumber.
      • Except when under pty, as we are in this case, there's no difference. So I'll just have to tweak the subscript to be a good ol' Unix citizen and only speak up if there's a problem.
@bitprophet bitprophet added a commit that referenced this issue Mar 19, 2016
@bitprophet bitprophet Call send_interrupt at interrupt time instead of waiting for IO threa…
…d closure.

Re later parts of #152 - without this, pty-driven procs will appear to ignore Ctrl-C still
3596735
@bitprophet
Member

Updating the innermost script to be good-Unix-citizen and only output on failure, works well enough; then if I revert the implementation to trigger the error case, things fail nicely.

Next is the no-pty use case; as stated it may not be strictly necessary, but we still need to capture SIGINT and submit it in case it is not generated by an interactive Ctrl-C. So far, interactive tests show that it doesn't hurt anything to try sending it "twice" (though I do wonder if there are cases where subprocess.Popen.send_signal will explode if the subprocess has already exited by the time it's called...).

Beyond that, I'm not planning to tackle the "other signals" problem yet:

  • I've already spent gobs of time here!
  • I can't really see too many other cases that matter often:
    • all of the other 'terminating' signals like HUP, KILL, QUIT, TERM etc either cannot be handled by signal or will result in correct-enough cascading process death on most OSes.
    • less-common signals like SIGUSR1/2 and PIPE are either poorly-defined or seem to cause similar halting behaviors as the others.

So those can definitely wait for a solid bug report / PR.

@bitprophet
Member

Seems to also be some issue on Travis (because of course there is!) - https://travis-ci.org/pyinvoke/invoke/jobs/117052576

Will try loading up my copy of their docker image sometime and see wtf.

@bitprophet
Member

I can recreate this on regular ol Linux, no need for the docker image at this time.

The pkill used to submit the SIGINT is what's dying (with -2, and I can't figure out what that's supposed to mean, neither pkill manpage nor google helps much). When I poke by hand some really weird shit is going on:

  • Changing pkill to pgrep within the test runner shows 3 (!) matching PIDs, the one invoke process and two 'sh' processes.
    • Even stranger, if one does the pgrep twice in a row, a different PID appears for the 2nd 'sh'...
    • That makes me think that pgrep on this system (Debian 8) is buggy and is finding its own PID...
  • Invoking pgrep by hand in another shell only shows 2 PIDs, the invoke proc and its spawned sh.
    • Which is still one more than pgrep/pkill find on Darwin - those only ever turn(ed) up the Invoke process. Guessing GNU vs BSD semantics, e.g. maybe the BSD pgrep/pkill only look at process group leaders or something?
  • If I, by hand, identify the Invoke PID and manually kill -INT it, things work correctly, so at least the core concept for the integration test still flies.
  • Question becomes, is my problem the pkill killing itself and thus blowing up the test, or would even killing both Invoke + its own subproc be problematic?
    • Though because this is mostly an issue with the test itself and not normal Invoke use, it's probably not worth digging into deeply - I just need to tweak my crummy shell commands so they work better cross-platform.
@bitprophet
Member

Dug deeper:

  • GNU pkill/grep need -al to show full command line (BSD just needs -l), once I found that I confirmed the 3 PIDs are as expected: invoke, sh, pgrep itself
  • This seems to be a platform difference; Linux is spawning additional explicit sh processes where Darwin is not. While that's perplexing & I really want to/should understand it, it'll have to wait, as stated previously this has been a deep rabbit hole already.
  • Tweaking the pkill/pgrep to use a real regex (e.g. python.*bin/inv -c signal_tasks to select the actual Python process instead of the sh 'wrapper') works equally well on both platforms, by hand.
  • However, when run via Invoke, we still end up with the 'extra' sh process on Linux, which isn't actually pgrep/pkill itself, and so it's selecting it.
  • I can force Linux not to spawn/list the sh process by using pty=True (since this uses fork instead of Python's subprocess module) and this seems like a good-enough workaround for now.
@bitprophet
Member

Have pty=False integration test working now (plus reworked the IO thread stuff so it's more generic and lives in util 😸). Naturally, it too is broken on Travis and thus probably Linux, while working fine on Darwin (EDIT: and Windows/appveyor - phew!). poking...

  • Interactive behavior still works fine. e.g. running the intermediate invoke process by hand and Ctrl-C functions correctly & the inner script sees the SIGINT
  • Using pkill is still "working" insofar as Invoke sees the signal and gets around to calling subprocess.Popen.send_signal
  • However, send_signal is clearly not having the desired effect in this case. Figuring out why.
@bitprophet
Member

Yea, as before, the issue here is Darwin subprocess is not generating intermediate sh processes, but is instead directly spawning the inner Python. (Sending SIGINT to either has the same result, which is good/expected.)

On Linux, the intermediate sh is there, and sending SIGINT to it doesn't percolate into the innermost Python process. (However, sending SIGINT directly to that inner process does satisfy its signal handlers - tho it'd be even stranger if it didn't.)

Unlike with the previous problem, I can't "work around" this, because it means non-pty (which is the default too!) behavior won't pass-through SIGINT cleanly. So I do have to figure this out after all =/

@bitprophet
Member
  • subprocess.Popen(shell=True) uses /bin/sh, hardcoded
    • On my Darwin (Yosemite) system, /bin/sh is GNU bash 3
    • On my Debian 8 system, /bin/sh is a symlink to dash (0.5.7 for whatever that's worth)
  • run(pty=True) uses /bin/bash for the time being; I wonder if changing it to /bin/sh would yield similar issues on Linux for that use case too; something to check.
  • Conversely, tweaking /bin/sh to link to bash not dash might be a fun experiment (presuming it doesn't make my VM explode, but I doubt it would unless I tried poking init scripts/etc)
  • I'm not actually sure "which shell" matters though, unless the bash on Darwin is doing some sort of exec of its arguments (replacing the shell process with the command executed), and dash on Debian isn't.
    • I definitely don't see any /bin/sh processes (or even any related bash processes) on my Darwin system while Invoke + sleeping signaling.py are running; so it's not like my pstree is hiding things (presumably).
  • Random idea: if this can't be narrowed down to something obvious, I wonder if we can simply do away with subprocess at this point, given we handle so much of the work ourselves anyways.
    • We'd lose some of the extra-special junk it does inside, but how much of that is necessary (& how much is equally necessary, and just not implemented yet, for the PTY use case) is probably arguable
    • It's, at heart, just doing os.fork so if it comes down to us just toggling use of os.fork vs pty.fork (& other odds and ends truly pty-specific)...maybe?
@bitprophet
Member
  • One interesting but hopefully irrelevant quirk is that as I noted earlier, on Darwin, our send_signal is spurious because the shell/terminal signals the entire process group: this results in signaling.py emitting two "success" messages (when tweaked to do so), versus only one on Linux.
  • As expected, the actual code in subprocess is the same on both systems, there's no platform-specific stuff besides windows-vs-posix. Both are ending up with args: ['/bin/sh', '-c', 'python signaling.py SIGINT'], executable: /bin/sh in my debuggery.
  • Sure enough, if I manually call /bin/sh -c "python signaling.py SIGINT" in my shell (note: it's zsh on both, which shouldn't matter here), I get the same pstree structures as when subprocess does the same: Darwin is just zsh -> python, Debian is zsh -> sh -> python.
  • Running /bin/bash in place of /bin/sh on Debian, does what I expected too: it does exactly what bash on Darwin does: I'm left with zsh -> python. So this behavior is specific to the system-local /bin/sh. (So much for it being portable!)
  • Question now is how much relevance this truly has...
    • A random user on Debian/Ubuntu is still going to end up wondering why Ctrl-C doesn't work as advertised. that's pretty bad.
    • It also means that my TODO item for "hey switch to /bin/sh in the non-pty use case, for maximum portability/compatibility" would probably just make both use cases break in this manner, in this situation.
    • And why doesn't dash seem to propagate signals to its subprocess (which would still presumably make this a non-issue)? this all seems to come down to dash being an annoying [EXPLETIVE DELETED].
  • Also wondering whether this is an issue with Fabric 1's local() on the same systems...shouldn't it have the exact same problem? I don't recall reports of that but they may exist. Something worth checking.
@bitprophet
Member

Then again, I just realized, that the Ctrl-C use case isn't broken, it's only regular submission of SIGINT that is broken here. The kill-foreground-process-group behavior of a real shell's Ctrl-C suffices here. (Too bad there's no apparent way to force a process-group variant of send_signal...)

So this probably can just wait until it bites somebody; submitting a manual SIGINT instead of using Ctrl-C seems like it'd be quite rare. I'm very very sick of tickets that sound simple and end up taking an inordinate amount of time to correctly handle even the base 80% of real use cases :(


For the record, when I did check Fabric 1's behavior, it does seem to work better; obviously, it still results in an intermediate sh process, but kill -INT <fabric PID> causes things to shut down correctly.

The primary differences between the ways we use subprocess are:

  • Fabric sets the out/err/in fds to None (meaning inheritance) and we of course set it to PIPE so we can intercept them.
  • Fabric sets close_fds=True when not-Windows; we leave it as the default, which is False.
    • However, (and I mostly expected this) setting close_fds=True in Invoke makes zero difference re: this particular behavior.
@bitprophet
Member

Merged to master \o/

@bitprophet bitprophet closed this Mar 20, 2016
@bitprophet
Member

And Travis reminds me that the whole reason I was poking this last bit, is it is breaking the integration test on Linux. Got so wrapped up in figuring out WTF I forgot that it needs addressing in some fashion.

I may just comment out that last integration test for now...will sleep on it.

@bitprophet bitprophet reopened this Mar 20, 2016
@bitprophet
Member

Glad I slept on it, realized this morning a decent workaround is to expose a config option for the executable and use it in both pty and non-pty use cases:

  • Fixes the current discrepancy between pty (which uses /bin/bash) and non-pty (/bin/sh)
  • Config option means users who have issues with the default (like, say...if we were to set it to /bin/sh and they're sending raw SIGINT to dash-based sh...) can tweak it
  • Lets me workaround potential dash shells on test platforms for the integration test by forcing /bin/bash

Going to make that a new ticket (EDIT: reused old one, #67) since from most perspectives it's not technically related to this one.

@bitprophet bitprophet closed this Mar 20, 2016
@bitprophet
Member

Implemented #67 and the tests pass, including the one that was breaking for this issue (because all shells are now /bin/bash by default). Toot toot.

@presidento
Contributor

Hi @bitprophet, thank you for the detailed log! I think, handling SIGINT correctly is far enough now.

@presidento
Contributor

(Didn't you want to add it to the 0.12.3 milestone?)

@noirbizarre

Hi !

I just tested: it seams to works well only when pty=False.
When having pty=True, signals (tested with SIGINT) is not relayed to the child process.

@bitprophet bitprophet reopened this May 28, 2016
@bitprophet
Member
bitprophet commented May 28, 2016 edited

The integration tests for this are, at present, failing intermittently. Not sure if some other change in the interim is causing or what.

They're already "meh" because there's a sleep in there to prevent the signal from happening before the inner Python interpreter has fully started up; unclear if that is the reason things fail or not (I bumped it from 1s to 2s briefly and made no obvious difference).

Happening mostly/entirely under Python 3, probably because it's slower. (I earlier noted it was on both but that was my mistake, I erroneously bumped the sleep but not the alarm. Comment added to prevent that in future.)

@bitprophet
Member

@noirbizarre In real world testing (unrelated to my note above about the integration tests) it still works fine for me with pty=True. Can you provide more details sometime?

@bitprophet
Member
bitprophet commented May 28, 2016 edited

Simply bumping the alarm to 2s from 1s makes this appear to go away; guessing the issue was that on my environment & under Python 3, I was hitting a race condition.

@bitprophet
Member

Re-closing, @noirbizarre do lmk if you can reliably recreate your problem - all details appreciated.

@bitprophet bitprophet closed this May 28, 2016
@bitprophet bitprophet added a commit that referenced this issue May 31, 2016
@bitprophet bitprophet Tighten up integration test further re #152
There was another race condition that shouldn't be considered
a failure - re: whether sub-subproc had time to squirt out
a KeyboardInterrupt stacktrace before dying
abe3525
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment