
Further adventures with dead IO threads and threading timeouts #432

bitprophet opened this issue Mar 2, 2017 · 3 comments


This is a follow-up to #351, specifically:

  • The general cause of that issue was errors within IO or other worker threads causing them to stop consuming from a subprocess' pipe, which in turn locked the subprocess once the pipe filled, which then locked up the main Invoke process (as the subprocess never exits).
  • The fix was to add a 1s timeout for (non-stdin) IO thread joins, which appeared sufficient in our test cases.
  • Integration tests were added to exhibit this, specifically barfing out /usr/share/dict/words (~2.4 MB of text)
    • On the side (later? can't recall, too lazy to look up) another test was added that incidentally exposes the same issue - one that deals with Watcher errors, and since Watchers are run in the stdout-handling worker thread body (Runner._handle_output), exceptions from them will have the same effect as an unexpected IO-handling exception.
  • Much later, when performing incidentally similar tests in Fabric 2, I found that large output was getting mysteriously truncated. Immediately thought "oh, this must be a shutdown/flush problem" — and it was that 1s thread-join timeout! Setting it to None makes the problem mysteriously go away.
    • Generally, the problem is that reading/parsing/etc 2.4 MB of text chunkwise over the network - even to localhost - takes long enough that waiting 1s after process exit is insufficient (In my testing, the thread is naturally joined after about 2-3s, and of course it is reasonable to expect even larger remote jobs to take even longer to wrap up.)
  • So how to reconcile these two things?
    • Is there a better way to deal with the thread death than this timeout? Feels like there should be; we have 'dead thread' detection so wait() ought to be able to continue to cleanup even if the subprocess hasn't finished...
    • Otherwise, we need to deal with it as a regular timeout issue of making it configurable and probably having Fabric 2 default it to a longer time, like 5 or 10 seconds.
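The "dead thread detection" referred to above can be sketched roughly as follows. This is an illustrative stand-in, not Invoke's actual `ExceptionHandlingThread` API; the class and attribute names here are hypothetical:

```python
import threading

class ExceptionTrackingThread(threading.Thread):
    """Worker thread that records any exception raised in its body.

    A simplified sketch of 'dead thread' detection; Invoke's real
    implementation differs, and these names are illustrative only.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.exception = None

    def run(self):
        try:
            super().run()
        except BaseException as e:
            # Store the exception instead of letting the thread die silently.
            self.exception = e

    @property
    def is_dead(self):
        # "Dead" == the thread finished because its body raised.
        return not self.is_alive() and self.exception is not None


def boom():
    raise ValueError("watcher blew up")

t = ExceptionTrackingThread(target=boom)
t.start()
t.join()
print(t.is_dead)  # True: the exception was captured, so the thread counts as dead
```

With this in hand, `wait()` can notice a dead worker and proceed to cleanup even though the subprocess itself never finished.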
bitprophet (Member Author) commented:

Yea, the issue is that the current execution flow looks like this:

  • Wait for things to "finish", which is defined as "the remote process ended OR one+ of our threads died with an exception"
  • Indicate to the otherwise-eternal stdin mirroring thread that it's time to shut down, via a flag
  • Join all threads (stdout, stderr, stdin), with a timeout on the stdout/stderr threads (to avoid the #351 problem, "Errors in the IO threads can cause everything to hang, still").
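That flow can be sketched like so. This is a loose paraphrase under assumed names (`wait_and_shutdown`, the `workers` dict, the `is_dead` flag), not Invoke's actual `Runner` code:

```python
import threading
import time

def wait_and_shutdown(process_done, workers, stop_stdin, join_timeout=1.0):
    """Sketch of the shutdown flow described above (illustrative names).

    - process_done: callable returning True once the subprocess exited
    - workers: {'stdout': t, 'stderr': t, 'stdin': t}, each with an
      .is_dead flag set when its body raised
    - stop_stdin: threading.Event telling the stdin mirror to exit
    """
    # 1. Wait until the process ends OR one+ worker died with an exception.
    while not process_done() and not any(t.is_dead for t in workers.values()):
        time.sleep(0.01)
    # 2. Flag the otherwise-eternal stdin mirror to shut down.
    stop_stdin.set()
    # 3. Join everything; stdout/stderr get a blanket timeout (the #351 fix,
    #    which is what truncates large output when draining takes >1s).
    for name, t in workers.items():
        t.join(join_timeout if name in ("stdout", "stderr") else None)

# Demo with trivially-exiting workers (real ones would be reading pipes).
class Worker(threading.Thread):
    is_dead = False  # the real flag comes from exception tracking

stop = threading.Event()
workers = {n: Worker(target=time.sleep, args=(0.05,))
           for n in ("stdout", "stderr", "stdin")}
for t in workers.values():
    t.start()
wait_and_shutdown(lambda: True, workers, stop)
print(stop.is_set())  # True
```

The problem is step 3: the blanket 1s timeout cannot distinguish "this worker is stuck forever" from "this worker just needs a few more seconds to drain a large pipe".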

Why aren't the non-dead threads exiting cleanly in this case? If my notes/memory are to be believed, it was because the recv call in the non-dead IO thread never exits, as the subprocess never terminates (when it terminates, recv receives an empty read which allows our workers to exit).

I debugged again this time and found:

  • stdout thread, which is what is hosting the error condition (a specifically failing Watcher that dies most of the way through on a specific dict word), croaks and stops reading
  • subprocess thus just sticks around forever, never exiting (it'd be nice to find a way to introspect the pipe in question to confirm it is in fact full...)
  • stderr thread (as there IS no stderr) initiates a recv once and sits there forever
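The "pipe fills up and the subprocess never exits" behavior is easy to reproduce directly. A standalone demo (not Invoke code): the child writes more than the OS pipe buffer (typically 64 KiB on Linux) while the parent deliberately doesn't read, so the child blocks on write() and sticks around until the parent drains the pipe:

```python
import subprocess
import sys
import time

# Child writes 2 MiB to stdout; parent does not consume the pipe.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write('x' * (1 << 21))"],
    stdout=subprocess.PIPE,
)
time.sleep(0.5)
blocked = proc.poll()
print(blocked)  # None: still blocked writing into the full pipe

data = proc.stdout.read()  # start consuming; now the child can finish
print(proc.wait())         # 0
```

This is exactly the situation the dead stdout thread creates: once nobody reads the pipe, the subprocess can never reach its natural exit, so the sibling stderr thread's recv never sees EOF.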

So what else can we do differently here? Our detection of state is OK - we saw a thread die and finished waiting - but we need to be more targeted with our timeout.

  • Giving a timeout to the dead thread is pointless, it'll join no problem (and does, I can see it)
  • Timing out the stdin thread was never necessary
  • The stderr thread, aka "The one that is not dead", is the one that needs the timeout, but only when its sibling is dead.
  • Do we care about the corner case of the stdin thread being the one to die? That would not cause #351 ("Errors in the IO threads can cause everything to hang, still"), so it's less likely to matter for now.

So I think I can "just" update the join logic so that if the thread is stdout/err and sees that its opposite has .is_dead, it initiates the timed-out join, but otherwise does not?
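The proposed join logic could look something like this sketch (illustrative names, not Invoke's API): only apply the timeout to an stdout/stderr worker whose sibling died, since that is the one case where its recv() can block forever; healthy runs get an unbounded join and can finish draining large output.

```python
import threading

OPPOSITE = {"stdout": "stderr", "stderr": "stdout"}

def join_workers(workers, timeout=1.0):
    for name, t in workers.items():
        sibling = workers.get(OPPOSITE.get(name))
        if sibling is not None and sibling.is_dead:
            t.join(timeout)   # sibling croaked: don't wait forever
        else:
            t.join()          # normal case: let it drain completely

# Demo: stdout "died" (as with the failing Watcher); stderr is stuck in a
# read that never returns, modeled here by waiting on an Event.
class Worker(threading.Thread):
    is_dead = False

blocker = threading.Event()
dead = Worker(target=lambda: None)
dead.is_dead = True                 # pretend its body raised
stuck = Worker(target=blocker.wait) # stands in for a recv() that never returns
mirror = Worker(target=lambda: None)
for t in (dead, stuck, mirror):
    t.start()

join_workers({"stdout": dead, "stderr": stuck, "stdin": mirror}, timeout=0.1)
still_stuck = stuck.is_alive()
print(still_stuck)  # True: we timed out instead of hanging the shutdown
blocker.set()       # release the stand-in thread so the demo exits cleanly
```

In the normal (no dead sibling) path, every join is unbounded, so the 1s-truncation problem from Fabric 2 goes away; in the #351 path, the stuck sibling still gets timed out.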

bitprophet (Member Author) commented:

Seems to work!

bitprophet (Member Author) commented:

Added a test to the Fabric 2 integration suite to prove the issue; didn't seem super useful to add one to Invoke itself, besides noting that the existing tests I mentioned above continue to pass.
