highstate hangs on zombie proc #5567
Salt seems to hang on highstate;
This seems to be triggered by;
So the init.d script is becoming a zombie somehow... though I'm not sure why this is happening.
Should this be considered a bug in salt, because it's not handling this gracefully? (the script exits fine for me)
Thanks
Cal
Comments
Started looking through the code to find out what's broken, and came across https://github.com/saltstack/salt/blob/develop/salt/modules/cmdmod.py#L299 - "This is where the magic happens" - heh...
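For context, the "magic" at that line boils down to the standard Popen-then-communicate pattern; a rough sketch (not Salt's exact code):

```python
import subprocess

cmd = 'echo hello'
kwargs = {'executable': '/bin/bash', 'shell': True,
          'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}

# Build the child from the caller's kwargs, then gather its output.
proc = subprocess.Popen(cmd, **kwargs)
out, err = proc.communicate()
print out, err, proc.returncode
```

If I run this manually;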
After adding a bit of debugging, the whole set of kwargs being passed through is;
So I tried this manually;
Added some more debugging to see where it's hanging;
This returns;
Highstate hangs, and ps shows;
If I manually kill the process, the following logs appear;
At which point it repeats the service start command and hangs again. Sadly I just don't know enough about the internals of salt to figure out what is going on here :/
Ah, so I managed to replicate the problem;
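A minimal sketch of the kind of reproduction involved, assuming a daemonizing command (like the freeswitch init script) and the same kwargs shown further down:

```python
import subprocess

# Run a command that daemonizes: the forked daemon inherits the write ends
# of the stdout/stderr pipes and keeps them open.
kwargs = {'executable': '/bin/bash', 'shell': True,
          'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
proc = subprocess.Popen('/etc/init.d/freeswitch start', **kwargs)

# communicate() reads both pipes until EOF before reaping the child; since
# the daemon never closes its inherited fds, EOF never arrives and we hang.
out, err = proc.communicate()
```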
OK, so we need to be able to change the behavior of a command to only execute with wait() instead of communicate()?
Looked into this a bit further; if I exclude all the kwargs options, it behaves as expected;
Okay, I found the offending option;
EDIT: Replacing this with subprocess.PIPE didn't help either (as -1 already means subprocess.PIPE).
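For reference, PIPE is literally defined as -1 in subprocess.py, so the two spellings are identical:

```python
import subprocess

# stdout=-1 and stdout=subprocess.PIPE are the same option spelled two ways.
assert subprocess.PIPE == -1
```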
This might be related; however, a quick check shows no fork/setsid in the strace, so I'm not sure if that's relevant;
I've tried this with wait();
It would appear that returncode is
So the question here is: when should wait() be used, and when should communicate() be used? And can communicate() be fixed to prevent this? Looking at subprocess.py, communicate() and wait() behave very differently; it would appear that communicate() is certainly broken in this case. One of my colleagues was able to implement a fixed version of communicate() based on the following article; however, this is very much alpha, and I would like to know why subprocess.communicate() is broken in the first place.
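To make the difference concrete, a rough sketch of the two call patterns (illustrative, not Salt's code):

```python
import subprocess

cmd = '/etc/init.d/freeswitch start'  # any command that daemonizes

# wait() only reaps the direct child; with no pipes attached there is
# nothing for a daemonized grandchild to keep open, so this returns as
# soon as the init script exits.
p = subprocess.Popen(cmd, shell=True)
rc = p.wait()

# communicate() reads stdout/stderr until EOF and then waits. If the init
# script forked a daemon that inherited the pipe fds, EOF never arrives
# and this call blocks forever.
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
```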
Also, some information about the process causing this issue: freeswitch forks itself to run as another user upon being executed, as seen here;
So, as you can see, the 7766 process is still running if we run this manually. However, when doing it through service, we get;
If I run fs directly again, you can see the fork/clone.
The same problem happens if I use init.d directly without service;
If I try this directly against the freeswitch binary, I get the following;
The line it hangs on is;
So to clarify the bug here: subprocess.communicate() does not cleanly handle communication when the child process fork()s. I'm still digging into this, and will update the thread as I go along.
Okay, I've managed to make a fix for this based on the advice contained in [1] and [2]. This also has the added benefit of fixing the deadlocking bug that occurs if the buffer size is too big :) I've made it so the code can function as a drop-in replacement and behaves exactly as you'd expect communicate() to. The downside is that it writes the output to disk, which isn't exactly the cleanest method.
This works by using a tempfile for each stream in subprocess: it waits for the process to exit with wait(), then does a seek(0) back to the start of each tempfile and reads the content. If you're happy with this patch, let me know and I'll do a fork/merge. [1] http://stackoverflow.com/questions/3575554/python-subprocess-with-timeout-and-large-output-64k
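A minimal sketch of that approach (names are illustrative; not the actual patch):

```python
import subprocess
import tempfile

def communicate_via_tempfiles(cmd, **kwargs):
    # Hand the child real files instead of pipes: a forked daemon can keep
    # these fds open forever without ever blocking us, because we don't
    # read anything until the direct child has exited.
    with tempfile.TemporaryFile() as out, tempfile.TemporaryFile() as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err, **kwargs)
        proc.wait()  # reaps the direct child; cannot deadlock on pipe buffers
        out.seek(0)
        err.seek(0)
        return out.read(), err.read(), proc.returncode
```

Since nothing is buffered in a pipe, this also sidesteps the classic wait()-with-PIPE deadlock on large output mentioned above.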
Ideally we'd want to use the select module. For example:

```python
import signal
import subprocess
import select

def handler(signum, frame):
    print 'Signal handler called with signal', signum

def iterate_fds(handles, buffersize=1024):
    # Preserve the caller's ordering so the return value lines up with
    # the handles that were passed in (a plain dict would not).
    order = list(handles)
    pending = dict((handle, True) for handle in handles)
    results = dict((handle, '') for handle in handles)
    while pending:
        try:
            for handle in select.select(pending.keys(), (), ())[0]:
                # Read a bounded chunk; an unbounded read() would block
                # until EOF even after select() reports the fd as ready.
                data = handle.read(buffersize)
                if data:
                    results[handle] += data
                else:
                    # EOF on this handle - stop watching it.
                    pending.pop(handle)
        except select.error, e:
            # If there's a signal handler installed, select() raises
            # select.error(4, 'Interrupted system call') when the handler
            # returns. Ignore EINTR and retry; re-raise anything else.
            if e.args[0] != 4:
                raise
    return [results[handle] for handle in order]

signal.signal(signal.SIGUSR1, handler)

kwargs = {'executable': '/bin/bash', 'shell': True,
          'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
x = subprocess.Popen("sleep 1000", **kwargs)
out, err = iterate_fds([x.stdout, x.stderr])
x.wait()
```
I'm testing the code from @HarryR now against a real salt deployment, and refactoring it into a drop-in replacement. I'll update once finished (should be in an hour or so).
The code from @HarryR will not work for the following reasons (I've just tested this and got the same problem as before);
The zombie is being caused by stdout/stderr holding a file descriptor open to the parent of the child fork; as soon as this fd is closed, the zombie disappears. According to [1], read() may still block after select.select() in exceptional cases. The suggested fix for this is to use non-blocking reads (via fcntl), but again this means it would be Windows-incompatible. So from what I can tell, the only safe way to do this is to use tempfile. @thatch45 Are you happy to accept tempfile as the fix? If so, I'll go ahead and make a drop-in communicate() replacement which uses tempfile instead. [1] http://stackoverflow.com/questions/5351994/will-read-ever-block-after-select
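For reference, the non-blocking-read approach being ruled out here would look roughly like this (POSIX-only, which is exactly the problem):

```python
import fcntl
import os

def set_nonblocking(fileobj):
    # Flag the pipe fd as non-blocking, so read() returns whatever bytes
    # are available (or raises EAGAIN) instead of blocking after select().
    fd = fileobj.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
```

fcntl has no Windows equivalent, hence the conclusion that tempfile is the only portable option.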
Okay, another slight problem... using tempfile results in a potential race condition. Calling wait() guarantees that the process has terminated, but it doesn't guarantee that the file has been written to yet by the forked process. If we look here;
In the above we can see that... So far my conclusion is that subprocess is...
There was some useful discussion about this in [1] and [2]. They tackle the problem by using a double fork, but this means you cannot read data that is coming back. If we are going to ignore the data, then we wouldn't be using communicate() anyway; we'd use wait(). Interestingly, I managed to stop the zombie from happening by using;
However it still hangs on;
And as you can see here, it doesn't show up in the proc list while it hangs;
Attaching strace shows the fork, and also data being received from the child proc: "14063 Backgrounding"
@thatch45 It's possible we might need to go with your original suggestion, which involves selectively disabling communicate() on calls that don't need it. [1] http://code.activestate.com/recipes/278731-creating-a-daemon-the-python-way/
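For reference, the double-fork technique from [1] boils down to something like this (condensed sketch, not the full recipe):

```python
import os
import sys

def daemonize():
    # First fork: the parent exits, so the daemon is reparented to init,
    # which will reap it (no zombie left behind for us to worry about).
    if os.fork() > 0:
        sys.exit(0)
    os.setsid()  # new session: detach from the controlling terminal
    # Second fork: the session leader exits, so the daemon can never
    # reacquire a controlling terminal.
    if os.fork() > 0:
        sys.exit(0)
    # The recipe then redirects std* to /dev/null, abandoning stdout/stderr
    # entirely - which is why data can't be read back from such a daemon.
```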
Next I tried to reproduce this fork() problem by creating a simple python script which forks. However, I was unable to reproduce the same behavior as freeswitch; communicate() on python fork proc
Results in;
communicate() on freeswitch fork proc
As you can see in the freeswitch example, select() gets some data, read() fetches it, then select() hangs again until the freeswitch process dies. So let's have a look at the difference between them; fork.py
freeswitch
So the manner in which these two examples fork is quite different, and a simple os.fork() does not replicate this problem. At this stage I have completely reached the end of my knowledge, to the point where I'm not even sure what the correct behavior should be lol.
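A hypothetical sketch of such a script (the original fork.py isn't shown above; assuming a child that eventually exits and closes its fds, which would explain why communicate() returns):

```python
import os
import sys
import time

pid = os.fork()
if pid == 0:
    # Child: inherits stdout/stderr, lingers briefly, then exits - at which
    # point its copies of the fds close and communicate() sees EOF.
    print 'Backgrounding'
    time.sleep(2)
    sys.exit(0)
# Parent: exits immediately, like an init script handing off to a daemon.
sys.exit(0)
```

Or as quoted on IRC: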
I just did a real quick hack, and was able to apply a temporary fix which bypasses communicate(). So from what I can tell, the options available to us are;
I was unable to make [1] work; however, initial tests on [2] were promising. @thatch45 If you're happy with [2], let me know and I'll get a patch written up. EDIT: Very alpha patch written up - #5576. Cal
Also, an amusing fact;
It would appear this causes
This is a freeswitch bug: http://fisheye.freeswitch.org/browse/freeswitch.git/src/switch.c?hb=true#to339 At line 339, daemonize() returns instead of redirecting and closing the std* file descriptors.
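What daemonize() is expected to do at that point, sketched in Python for consistency with the rest of the thread (the freeswitch code is C):

```python
import os

def detach_std_streams():
    # After forking, a daemon should point stdin/stdout/stderr at /dev/null.
    # Returning without this step (the bug found above) leaves the daemon
    # holding the pipe fds that Salt's communicate() is reading from, so
    # EOF never arrives and communicate() hangs.
    devnull = os.open(os.devnull, os.O_RDWR)
    os.dup2(devnull, 0)
    os.dup2(devnull, 1)
    os.dup2(devnull, 2)
    if devnull > 2:
        os.close(devnull)
```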
@olliewalsh Oh man, nicely spotted! I'll report that to them now. However, users may not always have the ability to fix their broken binary... in some cases, they will have no control over whether their binary is forking correctly. On that basis, I think we should still avoid using communicate() unless it's needed. Thoughts?
@olliewalsh Btw, I just patched FS to remove that return, and it now functions correctly again - thank you so much for spotting that, really appreciated. Turns out this bug was only just introduced less than 2 months ago lol.
I've seen subprocess.communicate hang in the past, but adding bufsize=-1 to the Popen args always resolved the issue. I've never seen this one, but then I probably wasn't starting many daemons via subprocess, so I've no idea how common this issue might be.
Hi, same bug...
So moving forward, the core devs need to make a decision on how this should be handled. @terminalmage closed #5469 on the grounds that the original program is broken. However, I do not feel that salt should assume that all programs are working correctly; in some cases it could take weeks, even years, to get a patch to fix the upstream. Although there is no way to fully prevent this happening in a safe way, we can avoid it happening by not calling communicate() unnecessarily (e.g. in cmd.retcode). This needs a design decision from a core developer.
I'll bug @thatch45 to look into this; I'm not experienced enough with
cmd.retcode should not call communicate(); it looks like that will solve the underlying problem. We should also make sure to get a comment in there that explains why it is being called that way.
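A sketch of what that might look like (hypothetical; the real cmdmod signature differs):

```python
import subprocess

def retcode(cmd):
    # We only need the exit status, so don't attach pipes and don't call
    # communicate(). wait() just reaps the direct child and cannot be
    # blocked by a daemonized grandchild holding fds open.
    proc = subprocess.Popen(cmd, shell=True)
    return proc.wait()
```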
@thatch45 +1. Although I'm sure it goes without saying that someone will need to manually check every call to
This is a dup of #4410, same underlying issue |