Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

highstate hangs on zombie proc #5567

Closed
foxx opened this issue Jun 16, 2013 · 25 comments
Closed

highstate hangs on zombie proc #5567

foxx opened this issue Jun 16, 2013 · 25 comments
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists

Comments

@foxx
Copy link

foxx commented Jun 16, 2013

Salt seems to hang on highstate;

root      1569  1.0  5.4 187360 27088 ?        Sl   22:09   0:00 python /usr/bin/salt-minion
root      3702  0.0  0.0      0     0 ?        Z    22:10   0:00  \_ [sh] <defunct>

This seems to be triggered by;

2013-06-16 22:10:27,839 [salt.loaded.int.module.cmdmod][INFO    ] Executing command 'service freeswitch start' in directory '/root'

So the init.d script is becoming a zombie somehow.. though I'm not sure why this is happening.

Should this be considered a bug in salt, because it's not handling this gracefully? (the script exits fine for me)

Thanks

Cal

@foxx
Copy link
Author

foxx commented Jun 16, 2013

Started looking through the code to find out what's broken, came across;

https://github.com/saltstack/salt/blob/develop/salt/modules/cmdmod.py#L299

"This is where the magic happens" - heh....

If I run this manually;

>>> import subprocess
>>> x = subprocess.Popen("service freeswitch start", **kwargs); x.wait()
>>> 4397 Backgrounding.

>>> x.wait()
0

After adding a bit of debugging, the whole kwargs being passed through is;

kwargs = {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt/modules', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': -2, 'cwd': '/root'}

So I tried this manually;

>>> import subprocess
>>> x = subprocess.Popen("service freeswitch start", **kwargs); x.wait()
0
>>> 

Added some more debugging to see where it's hanging;

    # This is where the magic happens
    log.debug("starting popen: %s - %s" % ( cmd, kwargs, ))
    proc = subprocess.Popen(cmd, **kwargs)
    log.debug("communicate popen: %s - %s" % ( cmd, kwargs, ))
    out, err = proc.communicate()
    log.debug("stopped popen: %s - %s" % ( cmd, kwargs, ))

This returns;

2013-06-16 23:24:44,444 [salt.loaded.int.module.cmdmod][INFO    ] Executing command 'service freeswitch start' in directory '/root'
2013-06-16 23:24:44,445 [salt.loaded.int.module.cmdmod][DEBUG   ] starting popen: service freeswitch start - {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': None, 'cwd': '/root'}
2013-06-16 23:24:44,446 [salt.loaded.int.module.cmdmod][DEBUG   ] communicate popen: service freeswitch start - {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': None, 'cwd': '/root'}

Highstate hangs, and ps shows;

root      5539  0.8  5.2 291900 26524 ?        Sl   23:24   0:00 python /usr/bin/salt-minion
root      5613  0.0  0.0      0     0 ?        Z    23:24   0:00  \_ [freeswitch] <defunct>
999       5623  0.8  6.5 219668 32960 ?        S<l  23:24   0:00 /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat

If I manually kill the process, the following logs appear;

2013-06-16 23:26:21,950 [salt.loaded.int.module.cmdmod][DEBUG   ] stopped popen: service freeswitch start - {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': None, 'cwd': '/root'}

At which point it repeats the service start command and hangs again.

Sadly I just don't know enough about the internals of salt to figure out what is going on here :/

@foxx
Copy link
Author

foxx commented Jun 16, 2013

Ah, so I managed to replicate the problem;

>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': None, 'cwd': '/root'}
>>> 
>>> x = subprocess.Popen("service freeswitch start", **kwargs)
6197 Backgrounding.
>>>  = proc.communicate()
KeyboardInterrupt
>>> out, err = x.communicate()
--- hangs forever ---
>>> x.pid
6187
root@bare1:~# ps faux | grep 6187
root      6187  0.0  0.0      0     0 pts/0    Z+   23:32   0:00  |           \_ [freeswitch] <defunct>

@thatch45
Copy link
Contributor

ok, so we need to be able to change the behavior of a command to only execute with wait instead of communicate?

@foxx
Copy link
Author

foxx commented Jun 16, 2013

Looked into this a bit further, if I exclude all the kwargs options, it behaves as expected;

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> x = subprocess.Popen("service freeswitch start", shell=True)
>>> exit
6709 Backgrounding.

>>> out, err = proc.communicate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'proc' is not defined
>>> out, err = x.communicate()
>>> 

@foxx
Copy link
Author

foxx commented Jun 16, 2013

Okay, I found the offending option;

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'stdout': -1}
>>> x = subprocess.Popen("service freeswitch start", **kwargs)
>>> 6845 Backgrounding.

>>> out, err = x.communicate()

EDIT: Looks like stdout = -1 is causing this to happen, although I'm not entirely sure why... should this not be subprocess.PIPE? ANOTHER EDIT: -1 means subprocess.PIPE.

Replacing this with subprocess.PIPE didn't seem to help either (as -1 already means subprocess.PIPE)

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
>>> x = subprocess.Popen("service freeswitch start", **kwargs)
>>> 
>>> out, err = x.communicate()

This might be related;
http://grokbase.com/t/rabbitmq/rabbitmq-discuss/12ag9g26aa/service-rabbitmq-server-start-hangs-when-using-pythons-popen-from-subprocess

However a quick check shows no fork/setsid in the strace, so I'm not sure if that's relevant;

root@bare1:~# cat /tmp/lol | grep setsid
root@bare1:~# cat /tmp/lol | grep fork
root@bare1:~# ls -lah /tmp/lol
-rw-r--r-- 1 root root 853K Jun 16 23:50 /tmp/lol

I've tried this with wait();

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt/modules', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': -2, 'cwd': '/root'}
>>> 
>>> x = subprocess.Popen("service freeswitch start", **kwargs)
>>> out, err = x.communicate()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/subprocess.py", line 683, in communicate
    stdout = self.stdout.read()
KeyboardInterrupt
>>> out, err = x.communicate()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/subprocess.py", line 683, in communicate
    stdout = self.stdout.read()
KeyboardInterrupt
>>> x.wait()
0
>>> x.returncode
0

It would appear that returncode is None if the command hasn't finished running, and wait() returns correctly;

>>> x  = subprocess.Popen("/bin/sleep 1000", shell=True)
>>> x.wait()

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/subprocess.py", line 1182, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.6/subprocess.py", line 455, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
>>> x.returncode
>>> 

So the question here is, when should wait() be used, and when should communicate() be used? And can communicate() be fixed to prevent this?

Looking at subprocess.py, communicate() and wait() behave very differently;
https://bitbucket.org/pypy/pypy/src/9d88b4875d6e/lib-python/2.7/subprocess.py

It would appear that communicate() is certainly broken in this case. One of my colleagues was able to implement a fixed version of communicate() based on the following article;

http://alexandre.deverteuil.net/blogue/archives/2011/04/03/monitor_python_subprocess_output_streams_in_real-time/

However, this is very much considered alpha and I would like to know why subprocess.communicate() is broken in the first place.

@foxx
Copy link
Author

foxx commented Jun 16, 2013

Also, some information about the process causing this issue.. freeswitch forks itself to run as another user upon being executed, as seen here;

root      7763 79.1  0.1   4200   812 pts/0    S+   00:30   0:05  |       \_ strace -ifq -p -- /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat
root      6252  0.0  0.6  70460  3372 ?        Ss   Jun16   0:00  \_ sshd: root@pts/1 
root      6255  0.0  0.7 125440  3564 pts/1    Ss   Jun16   0:00      \_ -bash
root      7778  0.0  0.2 121216  1132 pts/1    R+   00:30   0:00          \_ ps faux
999       7766  0.0  0.9  89532  4800 ?        S<l  00:30   0:00 /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat
999       7770 18.5  0.5  89532  2932 ?        RN   00:30   0:01  \_ /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat
[    7ffcfcbf0b31] clone(child_stack=0x7ffcfd8b9f70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ffcfd8ba9d0, tls=0x7ffcfd8ba700, child_tidptr=0x7ffcfd8ba9d0) = 7767
[pid  7767] [    7ffcfce89859] set_robust_list(0x7ffcfd8ba9e0, 0x18) = 0
[pid  7767] [    7ffcfcbea3e3] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid  7766] [    7ffcfcbcb757] sched_getparam(7767, { 1 }) = 0

...

[pid  7766] [    7ffcfcbc1d46] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ffcfd981a50) = 7770
[pid  7770] [    7ffcfcbcb787] sched_setscheduler(0, SCHED_OTHER, { 0 } <unfinished ...>
[pid  7766] [    7ffcfce911ad] close(8 <unfinished ...>
[pid  7770] [    7ffcfcbcb787] <... sched_setscheduler resumed> ) = 0
[pid  7766] [    7ffcfce911ad] <... close resumed> ) = 0

So, as you can see, the 7766 process is still running if we run this manually.

However when doing it through service, we get;

root@bare1:~# service freeswitch start
exit
root@bare1:~# 7797 Backgrounding.

# ps faux
999       7797  1.7  6.4 219076 32344 ?        S<l  00:32   0:00 /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat

If I run fs directly again, you can see the fork/clone.

root@bare1:~# strace /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat
--- snip ---
brk(0x257b000)                          = 0x257b000
rt_sigaction(SIGILL, {0x4037d0, [ILL], SA_RESTORER|SA_RESTART, 0x7f9215db3230}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {0x4037d0, [TERM], SA_RESTORER|SA_RESTART, 0x7f9215db3230}, {SIG_DFL, [], 0}, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f9216be1a50) = 4028
exit_group(0)                           = ?
root@bare1:~# 4029 Backgrounding.

The same problem happens if I use init.d directly without service;

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
>>> 
>>> x = subprocess.Popen("/etc/init.d/freeswitch start", **kwargs)
>>> out, err = x.communicate()

If I try this directly against the freeswitch binary, I get the following;

root@bare1:~# python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> kwargs = {'executable': '/bin/bash', 'shell': True, 'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
>>> x = subprocess.Popen("/usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat", **kwargs)
>>> out, err = x.communicate()
# ps faux
root      8120  0.0  1.1 138800  5696 pts/1    S+   00:37   0:00          \_ python
root      8121  0.0  0.0      0     0 pts/1    Z+   00:38   0:00              \_ [freeswitch] <defunct>
999       8123  2.3  6.4 267104 32548 ?        S<l  00:38   0:00 /usr/bin/freeswitch -u freeswitch -g freeswitch -nc -rp -nonat

The line it hangs on is;

>>> out, err = x.communicate()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/subprocess.py", line 695, in communicate
    return self._communicate(input)
  File "/usr/lib/python2.6/subprocess.py", line 1215, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
KeyboardInterrupt

So to clarify the bug here, subprocess.communicate() does not cleanly handle communication when a process forks().

I'm still digging into this, and will update the thread as I go along.

@foxx
Copy link
Author

foxx commented Jun 17, 2013

Okay I've managed to make a fix for this based on the advice contained in [1] and [2].

This also has the added benefit that it fixes the deadlocking bug that occurs if the buffer size is too big :)

I've made it so the code can function as a drop in replacement and functions exactly as you'd expect communicate() to.

The downside is that it writes the output to disk, which isn't exactly the most cleanest method.


cmd = "service freeswitch start"
kwargs = {'executable': '/bin/bash', 'shell': True, 'env': {'LC_NUMERIC': 'en_US.UTF-8', 'LC_MEASUREMENT': 'en_US.UTF-8', 'SSH_CLIENT': '172.16.2.4 34881 22', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'LC_PAPER': 'en_US.UTF-8', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'UCF_FORCE_CONFFOLD': '1', 'LANG': 'en_GB.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'LANGUAGE': 'en_GB:en', 'SHLVL': '1', '_': '/etc/init.d/salt-minion', 'LC_ADDRESS': 'en_US.UTF-8', 'DEBIAN_FRONTEND': 'noninteractive', 'LC_ALL': 'C', 'APT_LISTCHANGES_FRONTEND': 'none', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'SSH_TTY': '/dev/pts/0', 'OLDPWD': '/usr/lib/pymodules/python2.6/salt/modules', 'LC_TELEPHONE': 'en_US.UTF-8', 'LC_MONETARY': 'en_US.UTF-8', 'PWD': '/root', 'LC_NAME': 'en_US.UTF-8', 'MAIL': '/var/mail/root', 'LC_TIME': 'en_US.UTF-8', 'SSH_CONNECTION': '172.16.2.4 34881 172.16.2.10 22', 'APT_LISTBUGS_FRONTEND': 'none'}, 'stdout': -1, 'close_fds': True, 'stderr': -2, 'cwd': '/root'}

import subprocess
import tempfile
import time
import os

def easy_execute(cmd, kwargs):
    fdout = None
    fderr = None
    try:
        fdout = tempfile.NamedTemporaryFile()
        fderr = tempfile.NamedTemporaryFile()
        kwargs['stdout'] = fdout
        kwargs['stderr'] = fderr
        x = subprocess.Popen(cmd, **kwargs)
        x.wait()
        time.sleep(1)
        for fd in (fdout, fderr):
            fd.flush()
            fd.seek(0)
        return (fdout.read(), fderr.read())
    finally:
        if fdout:
            fdout.close()
        if fderr:
            fderr.close()

print easy_execute(cmd, kwargs)

This works by using a tempfile for subprocess, it waits for the process to exit with wait(), and then does a seek(0) back to the start of each tempfile and reads the content.

If you're happy with this patch, let me know and I'll do a fork/merge.

[1] http://stackoverflow.com/questions/3575554/python-subprocess-with-timeout-and-large-output-64k
[2] http://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

@HarryR
Copy link

HarryR commented Jun 17, 2013

Ideally we'd want to use the NamedTemporaryFile and return the two temporary files instead of returning the whole string in ram.

The iterate_fds approach is uh.... a little hacky, but I've been using this in production for ages to get around the bug so shrug.

import signal
import os
import subprocess
import select

def handler(signum, frame):
    print 'Signal handler called with signal', signum

def iterate_fds(handles, buffersize=1024):
    handles = dict([(handle, True) for handle in handles])
    results = dict([(handle, '') for handle in handles])
    while len(handles):
        try:
            for handle in select.select(handles.keys(), tuple(), tuple())[0]:
                print '.'
                data = handle.read()
                if data:
                    results[handle] += data
                else:
                    handles.pop(handle)
        except select.error, e:
            # If there's a signal handler installed select() will raise a 
            # select.error(4, 'Interrupted system call') when the signal handler
            # returns.
            # By ignoring interrupted select() it continues on as normal rather
            # than breaking...
            if e.args[0] == 4:
                pass
            raise e
    return [results[handle] for handle in results.keys()]

signal.signal(signal.SIGUSR1, handler)

kwargs = {'executable': '/bin/bash', 'shell': True, 'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE}
x = subprocess.Popen("sleep 1000", **kwargs)
out, err = iterate_fds([x.stdout, x.stderr])
x.wait()

@foxx
Copy link
Author

foxx commented Jun 17, 2013

I'm testing the code from @HarryR now against a real salt deployment, and re-factoring into a drop-in replacement. I'll update once finished (should be in an hour or so)

@foxx
Copy link
Author

foxx commented Jun 17, 2013

The code from @HarryR will not work for following reasons (i've just tested this and got the same problem as before);

  1. Any forked processes will hang on .read()
  2. Checking if we need to return using .returncode before read() will not work, as data might still be in stdout/stderr
  3. Checking if we need to return using .returncode after read() will not work, as read() will block indefinitely
  4. select.select() tells us the stdout/stderr handler is ready to read(), but read() will block indefinitely
  5. setting non blocking reads on stdout/stderr using fnctl is not an option as it won't work on windows

The zombie is being caused by stdout/stderr holding a file descriptor open to the parent of the child fork, as soon as this fd is closed, the zombie disappears.

According to [1], read() may still block after select.select() in exceptional cases. The suggested fix for this is to use non blocking reads (via fnctl), but again this means it would be windows incompatible.

So from what I can tell, the only safe way to do this is to use tempfile.

@thatch45 Are you happy to accept tempfile as the fix? If so, I'll go ahead and make a drop in communicate() replacement which uses tempfile instead.

[1] http://stackoverflow.com/questions/5351994/will-read-ever-block-after-select

@foxx
Copy link
Author

foxx commented Jun 17, 2013

Okay, another slight problem.. using tempfile results in a potential race condition. Calling wait() guarantees that the file has terminated, but it doesn't guarantee that the file has been written to yet by forked process. If we look here;

            print "READ ONE"
            for fd in (self._stdout, self._stderr):
                #fd.write("hello")
                fd.flush()
                fd.seek(0)
                #fd.close()
            print (self._stdout.read(), self._stderr.read())
            time.sleep(0.1)

            print "READ TWO"
            for fd in (self._stdout, self._stderr):
                fd.flush()
                fd.seek(0)
            print (self._stdout.read(), self._stderr.read())
write(1, "READ ONE\n", 9READ ONE
)               = 9
lseek(3, 0, SEEK_SET)                   = 0
lseek(4, 0, SEEK_SET)                   = 0
fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "", 8192)                       = 0
fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
lseek(4, 0, SEEK_CUR)                   = 0
read(4, "", 8192)                       = 0
write(1, "('', '')\n", 9('', '')
)               = 9
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

write(1, "READ TWO\n", 9READ TWO
)               = 9
lseek(3, 0, SEEK_SET)                   = 0
lseek(4, 0, SEEK_SET)                   = 0
fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "", 8192)                       = 0
fstat(4, {st_mode=S_IFREG|0600, st_size=21, ...}) = 0
lseek(4, 0, SEEK_CUR)                   = 0
read(4, "13213 Backgrounding.\n", 4096) = 21
read(4, "", 4096)                       = 0

In the above we can see that st_size=0 on the first read attempt, we then sleep for 0.001 seconds, after which st_size=21. However if I write to the fd within the same scope (i.e. just before the .flush()), this appears immediately. Although we could use a sleep(), this isn't really guaranteed to be safe.

So far my conclusion is that subprocess is fucked.

@foxx
Copy link
Author

foxx commented Jun 17, 2013

There was some useful discussion about this in [1] and [2]. They tackle the problem by using a double fork, but this means you cannot read data that is coming back. If we are going to ignore the data, then we wouldn't be using communicate() anyway, we'd use wait().

Interestingly, I managed to stop the zombie from happening by using;

x = subprocess.Popen("service freeswitch start", shell=True, stdout=subprocess.PIPE,
    stderr=subprocess.PIPE, close_fds=True)
print "communicate()"
x.wait() # XXX: this stops the child proc from becoming a zombie
print x.pid
x.communicate()
print "communicate() done"

However it still hangs on;

# python lol.py 
communicate()
14120
--- hangs here ---
^CTraceback (most recent call last):
  File "lol.py", line 78, in <module>
    x.communicate()
  File "/usr/lib/python2.6/subprocess.py", line 691, in communicate
    return self._communicate(input)
  File "/usr/lib/python2.6/subprocess.py", line 1211, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
KeyboardInterrupt

And as you can see here, it doesn't show up in proc list during hanging;

root@bare1:~# ps -p 14120
  PID TTY          TIME CMD

Attaching strace shows the fork, and also data being received from the child proc "14063 Backgrounding"

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f60bead79d0) = 14120
close(8)                                = 0
close(4)                                = 0
close(6)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f60b6be9000
read(7, "", 1048576)                    = 0
mremap(0x7f60b6be9000, 1052672, 4096, MREMAP_MAYMOVE) = 0x7f60b6be9000
close(7)                                = 0
munmap(0x7f60b6be9000, 4096)            = 0
fcntl(3, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f60b6ce9000
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7f60b6ce9000, 4096)            = 0
fcntl(5, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f60b6ce9000
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7f60b6ce9000, 4096)            = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f60b6ce9000
write(1, "communicate()\n", 14communicate()
)         = 14
wait4(14120, 0x7fffe8f3e064, WNOHANG, NULL) = 0
write(1, "14120\n", 614120
)                  = 6
select(6, [3 5], [], [], NULL)          = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
select(6, [3 5], [], [], NULL)          = 1 (in [5])
read(5, "14131 Backgrounding.\n", 1024) = 21
select(6, [3 5], [], [], NULL

@thatch45 It's possible we might need to go with your original suggestion, which involves selectively disabling communicate() on calls that don't need it.

[1] http://code.activestate.com/recipes/278731-creating-a-daemon-the-python-way/
[2] http://pcapriotti.wordpress.com/2008/02/12/fighting-zombies-with-a-double-fork/

@foxx
Copy link
Author

foxx commented Jun 17, 2013

Next I tried to repeat this fork() problem by creating a simple python script which forks. However, I was unable to repeat the same behavior as freeswitch;

communicate() on python fork proc

print "START"
x = subprocess.Popen("/root/fork.py", shell=True, stdout=subprocess.PIPE,
    stderr=subprocess.PIPE, close_fds=True)
print "communicate()"
x.poll()
print x.pid
print x.communicate()
print "communicate() done"

Results in;

write(1, "START\n", 6START
)                  = 6
pipe([3, 4])                            = 0
pipe([5, 6])                            = 0
pipe([7, 8])                            = 0
fcntl(8, F_GETFD)                       = 0
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff74f74f9d0) = 14264
close(8)                                = 0
close(4)                                = 0
close(6)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff747861000
read(7, "", 1048576)                    = 0
mremap(0x7ff747861000, 1052672, 4096, MREMAP_MAYMOVE) = 0x7ff747861000
close(7)                                = 0
munmap(0x7ff747861000, 4096)            = 0
fcntl(3, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff747961000
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7ff747961000, 4096)            = 0
fcntl(5, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff747961000
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7ff747961000, 4096)            = 0
write(1, "communicate()\n", 14communicate()
)         = 14
wait4(14264, 0x7fff93b32544, WNOHANG, NULL) = 0
write(1, "14264\n", 614264
)                  = 6
select(6, [3 5], [], [], NULL)          = 1 (in [3])
read(3, "Parent Process: PID# 14265\n", 1024) = 27
select(6, [3 5], [], [], NULL)          = 1 (in [3])
read(3, "Child Process: PID# 14266\n", 1024) = 26
select(6, [3 5], [], [], NULL)          = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
select(6, [3 5], [], [], NULL)          = 1 (in [3])
read(3, "", 1024)                       = 0
close(3)                                = 0
select(6, [5], [], [], NULL)            = 1 (in [5])
read(5, "", 1024)                       = 0
close(5)                                = 0
wait4(14264, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14264
write(1, "('Parent Process: PID# 14265\\nCh"..., 64('Parent Process: PID# 14265\nChild Process: PID# 14266\n', '')
) = 64
write(1, "communicate() done\n", 19communicate() done
)    = 19
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7ff74f32eff0}, {0x4da8c0, [], SA_RESTORER, 0x7ff74f32eff0}, 8) = 0
exit_group(0)                           = ?

communicate() on freeswitch fork proc

write(1, "START\n", 6START
)                  = 6
pipe([3, 4])                            = 0
pipe([5, 6])                            = 0
pipe([7, 8])                            = 0
fcntl(8, F_GETFD)                       = 0
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd23786b9d0) = 14328
close(8)                                = 0
close(4)                                = 0
close(6)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd22f97d000
read(7, "", 1048576)                    = 0
mremap(0x7fd22f97d000, 1052672, 4096, MREMAP_MAYMOVE) = 0x7fd22f97d000
close(7)                                = 0
munmap(0x7fd22f97d000, 4096)            = 0
fcntl(3, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd22fa7d000
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7fd22fa7d000, 4096)            = 0
fcntl(5, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd22fa7d000
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
munmap(0x7fd22fa7d000, 4096)            = 0
write(1, "communicate()\n", 14communicate()
)         = 14
wait4(14328, 0x7ffffd3b30d4, WNOHANG, NULL) = 0
write(1, "14328\n", 614328
)                  = 6
select(6, [3 5], [], [], NULL)          = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
select(6, [3 5], [], [], NULL)          = 1 (in [5])
read(5, "14331 Backgrounding.\n", 1024) = 21
select(6, [3 5], [], [], NULL

As you can see in the freeswitch example, select() gets some data, read() fetches it, then select() hangs again until the freeswitch process dies. So let's have a look at the difference between them;

fork.py

write(1, "START\n", 6START
)                  = 6
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f044030c9d0) = 14456
write(1, "Parent Process: PID# 14455\n", 27Parent Process: PID# 14455
) = 27
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f043feebff0}, {0x4da8c0, [], SA_RESTORER, 0x7f043feebff0}, 8) = 0
exit_group(0)                           = ?
root@bare1:~# Child Process: PID# 14456

freeswitch

arch_prctl(ARCH_SET_FS, 0x7f4db2006780) = 0
mprotect(0x7f4dae129000, 4096, PROT_READ) = 0
mprotect(0x7f4daf747000, 4096, PROT_READ) = 0
mprotect(0x7f4dafa3f000, 28672, PROT_READ) = 0
mprotect(0x7f4dafee8000, 4096, PROT_READ) = 0
mprotect(0x7f4db0b52000, 4096, PROT_READ) = 0
mprotect(0x7f4db0d5b000, 4096, PROT_READ) = 0
mprotect(0x7f4db0f8d000, 4096, PROT_READ) = 0
mprotect(0x7f4db14fe000, 16384, PROT_READ) = 0
mprotect(0x7f4db171e000, 4096, PROT_READ) = 0
mprotect(0x7f4db201b000, 4096, PROT_READ) = 0
munmap(0x7f4db2015000, 14391)           = 0
set_tid_address(0x7f4db2006a50)         = 14461
set_robust_list(0x7f4db2006a60, 0x18)   = 0
futex(0x7fff41c95dfc, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fff41c95dfc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 7f4db2006780) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x7f4db150d870, [], SA_RESTORER|SA_SIGINFO, 0x7f4db1516ff0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7f4db150d900, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f4db1516ff0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
brk(0)                                  = 0x1cc0000
brk(0x1ce1000)                          = 0x1ce1000
rt_sigaction(SIGILL, {0x4037d0, [ILL], SA_RESTORER|SA_RESTART, 0x7f4db11d8230}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {0x4037d0, [TERM], SA_RESTORER|SA_RESTART, 0x7f4db11d8230}, {SIG_DFL, [], 0}, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4db2006a50) = 14462
exit_group(0)                           = ?

So the manner in which these two examples are forking are quite different, and a simple os.fork() does not replicate this problem. At this stage I have completely reached the end of my knowledge, to the point where I'm not even sure what the correct behavior should be lol.

Or as quoted on IRC:
"i emerged 5 hours later, after being dragged into the rabbit hole and gang raped by alice herself."

@foxx
Copy link
Author

foxx commented Jun 17, 2013

I just did a real quick hack, and was able to apply a temporary fix which bypasses .commuicate() for cmdrun.retcode. However, this would be on the assumption that no where is abusing retcode, i.e. they are calling retcode instead of run_all and expecting stdout/stderr to be filled.

So from what I can tell, the available options to us are;

  1. Find a fix for subprocess to communicate() from hanging
  2. Don't call communicate() unless we need to

I was unable to make [1] work, however initial tests on [2] were promising.

@thatch45 If you're happy with [2], let me know and I'll get a patch written up.

EDIT: Very alpha patch written up - #5576

Cal

@foxx
Copy link
Author

foxx commented Jun 17, 2013

Also, an amusing fact;

ssh bare1.int -- "killall -9 freeswitch; /etc/init.d/freeswitch start &"
foxx@b1 ~ $ f2
freeswitch: no process found
--- hangs forever ---

It would appear this causes ssh to hang as well... It would appear that Python isn't the only thing that has issues with child fork().

@olliewalsh
Copy link
Contributor

This is a freeswitch bug:

http://fisheye.freeswitch.org/browse/freeswitch.git/src/switch.c?hb=true#to339

At line 339 daemonize() returns instead of redirecting and closing the std* file descriptors.

@foxx
Copy link
Author

foxx commented Jun 17, 2013

@olliewalsh Oh man, nicely spotted! I'll report that to them now.

However, users may not always have the ability to fix their broken binary.. in some cases, they will have no control over whether or not their binary is forking correctly or not. On that basis, I think we should still avoid using communicate() unless it's needed.

Thoughts?

@foxx
Copy link
Author

foxx commented Jun 18, 2013

@olliewalsh Btw, I just patched FS to remove that return, and it now functions correctly again - thank you so much for spotting that, really appreciated. Turns out this bug was only just introduced less than 2 months ago lol.

@olliewalsh
Copy link
Contributor

I think we should still avoid using communicate() unless it's needed.

I've seen subprocess.communicate hangs in the past but adding bufsize=-1 to the Popen args always resolved the issue. Never seen this one but then I probably wasn't starting many daemons via subprocess so I've no idea how common this issue might be.

@Gestator
Copy link

Hi same bug ......
#5469

@foxx
Copy link
Author

foxx commented Jun 18, 2013

So moving forward, the core devs need to make a decision on how this should be handled.

@terminalmage closed #5469 on the grounds that the original program is broken. However, I do not feel that salt should assume that all programs are working correctly, in some cases it could take weeks, even years, to get a patch to fix the upstream.

Although there is no way to fully prevent this happening in a safe way, we can avoid it happening by not calling communicate() unnecessarily in cmdmod.retcode. Admittedly this is a rare edge case, but one I believe salt should address. If you need a logical pythonic argument, then cmdmod.retcode is broken purely on the basis that it's making an unnecessary std* call.

This needs a design decision from a core developer.

@basepi
Copy link
Contributor

basepi commented Jun 18, 2013

I'll bug @thatch45 to look into this, I'm not experienced enough with communicate() to make said decision.

@thatch45
Copy link
Contributor

cmd.retcode should not call communicate, looks like that will solve the underlying problem, we should make sure and get a comment in there as well that explains why it is being called that way.

@foxx
Copy link
Author

foxx commented Jun 18, 2013

@thatch45 +1. Although I'm sure it goes without saying that someone will need to manually check every call to cmd.retcode to ensure it's not being used for stdout/stderr. You could perhaps remove stdout/stderr from the return dict, thus causing a KeyError if anyone attempts to access. You could even override set so if anyone tries to manually set 'std*' on the dict afterwards it throws an exception. This would guarantee no one is abusing it. Thoughts?

@thatch45
Copy link
Contributor

This is a dup of #4410, same underlying issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists
Projects
None yet
Development

No branches or pull requests

6 participants