
Minion never returns from state and/or master never sees state completion #4975

Closed

hdemers opened this issue May 10, 2013 · 16 comments

Labels: Bug (broken, incorrect, or confusing behavior), severity-low (4th level, cosmetic problems, workaround exists)

@hdemers

hdemers commented May 10, 2013

Installing CouchDB on a minion fails when salt-minion was started from its init script. When started from the command line, it works.

Installing other packages works; so far, only CouchDB has been found not to work.

The command:

salt "ip-10-196-38-159" state.sls couchdb dev -v

The state file:

couchdb:
  pkg:
    - installed

Note that the following command does not work either (the logs below are from running it):

salt "ip-10-196-38-159" pkg.install couchdb

Versions:

salt-minion --version: 0.15.1
salt-master --version: 0.15.1

Both running on Ubuntu 12.10 under AWS EC2.

The output from pkg.install couchdb:

Executing job with jid 20130510152803303471
-------------------------------------------

Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
Execution is still running on ip-10-196-38-159
^CExiting on Ctrl-C
This job's jid is:
20130510152803303471
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later run:
salt-run jobs.lookup_jid 20130510152803303471

At this point, CouchDB is installed on the minion. I can query it: couchdb -V returns couchdb - Apache CouchDB 1.2.0. No matter how long I wait, the master keeps reporting the job as still running.

Running the minion in debug mode shows that the following command is executed repeatedly:

saltutil.find_job

An excerpt from the minion debug log:

2013-05-10 19:30:30,156 [salt.minion      ][INFO    ] User hdemers Executing command saltutil.find_job with jid 20130510153030089224
2013-05-10 19:30:30,161 [salt.minion      ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20130510153030089224', 'tgt': 'ip-10-196-38-159', 'ret': '', 'user': 'hdemers', 'arg': ['20130510152803303471'], 'fun': 'saltutil.find_job'}
2013-05-10 19:30:30,192 [salt.loaded.int.module.cmdmod][INFO    ] Executing command 'ps -efH' in directory '/root'
2013-05-10 19:30:30,214 [salt.loaded.int.module.cmdmod][DEBUG   ] output: UID        PID  PPID  C STIME TTY          TIME CMD
root         2     0  0 19:26 ?        00:00:00 [kthreadd]
root         3     2  0 19:26 ?        00:00:00   [ksoftirqd/0]
root         4     2  0 19:26 ?        00:00:00   [kworker/0:0]
root         5     2  0 19:26 ?        00:00:00   [kworker/u:0]
root         6     2  0 19:26 ?        00:00:00   [migration/0]
root         7     2  0 19:26 ?        00:00:00   [watchdog/0]
root         8     2  0 19:26 ?        00:00:00   [cpuset]
root         9     2  0 19:26 ?        00:00:00   [khelper]
root        10     2  0 19:26 ?        00:00:00   [kdevtmpfs]
root        11     2  0 19:26 ?        00:00:00   [netns]
root        12     2  0 19:26 ?        00:00:00   [xenwatch]
root        13     2  0 19:26 ?        00:00:00   [xenbus]
root        14     2  0 19:26 ?        00:00:00   [sync_supers]
root        15     2  0 19:26 ?        00:00:00   [bdi-default]
root        16     2  0 19:26 ?        00:00:00   [kintegrityd]
root        17     2  0 19:26 ?        00:00:00   [kblockd]
root        18     2  0 19:26 ?        00:00:00   [ata_sff]
root        19     2  0 19:26 ?        00:00:00   [khubd]
root        20     2  0 19:26 ?        00:00:00   [md]
root        21     2  0 19:26 ?        00:00:00   [kworker/0:1]
root        22     2  0 19:26 ?        00:00:00   [kworker/u:1]
root        23     2  0 19:26 ?        00:00:00   [khungtaskd]
root        24     2  0 19:26 ?        00:00:00   [kswapd0]
root        25     2  0 19:26 ?        00:00:00   [ksmd]
root        26     2  0 19:26 ?        00:00:00   [fsnotify_mark]
root        27     2  0 19:26 ?        00:00:00   [ecryptfs-kthrea]
root        28     2  0 19:26 ?        00:00:00   [crypto]
root        37     2  0 19:26 ?        00:00:00   [kthrotld]
root        38     2  0 19:26 ?        00:00:00   [khvcd]
root        39     2  0 19:26 ?        00:00:00   [binder]
root        58     2  0 19:26 ?        00:00:00   [deferwq]
root        59     2  0 19:26 ?        00:00:00   [charger_manager]
root        60     2  0 19:26 ?        00:00:00   [devfreq_wq]
root       186     2  0 19:26 ?        00:00:00   [jbd2/xvda1-8]
root       187     2  0 19:26 ?        00:00:00   [ext4-dio-unwrit]
root       312     2  0 19:26 ?        00:00:00   [kjournald]
root       853     2  0 19:26 ?        00:00:00   [flush-202:1]
root       854     2  0 19:26 ?        00:00:00   [flush-202:16]
root         1     0  0 19:26 ?        00:00:00 /sbin/init
root       300     1  0 19:26 ?        00:00:00   upstart-udev-bridge --daemon
root       306     1  0 19:26 ?        00:00:00   /sbin/udevd --daemon
root       356   306  0 19:26 ?        00:00:00     /sbin/udevd --daemon
root       357   306  0 19:26 ?        00:00:00     /sbin/udevd --daemon
root       446     1  0 19:26 ?        00:00:00   upstart-socket-bridge --daemon
root       499     1  0 19:26 ?        00:00:00   dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
root       678     1  0 19:26 ?        00:00:00   /usr/sbin/sshd -D
root       895   678  0 19:26 ?        00:00:00     sshd: ubuntu [priv] 
ubuntu    1040   895  0 19:26 ?        00:00:00       sshd: ubuntu@pts/0  
ubuntu    1041  1040  0 19:26 pts/0    00:00:00         -bash
ubuntu    1198  1041  0 19:27 pts/0    00:00:00           tail -f /var/log/salt/minion
102        689     1  0 19:26 ?        00:00:00   dbus-daemon --system --fork
syslog     704     1  0 19:26 ?        00:00:00   rsyslogd -c5
root       762     1  0 19:26 tty4     00:00:00   /sbin/getty -8 38400 tty4
root       769     1  0 19:26 tty5     00:00:00   /sbin/getty -8 38400 tty5
root       780     1  0 19:26 tty2     00:00:00   /sbin/getty -8 38400 tty2
root       781     1  0 19:26 tty3     00:00:00   /sbin/getty -8 38400 tty3
root       786     1  0 19:26 tty6     00:00:00   /sbin/getty -8 38400 tty6
root       796     1  0 19:26 ?        00:00:00   cron
daemon     797     1  0 19:26 ?        00:00:00   atd
root       808     1  0 19:26 ?        00:00:00   acpid -c /etc/acpi/events -s /var/run/acpid.socket
whoopsie   819     1  0 19:26 ?        00:00:00   whoopsie
root       827     1  0 19:26 tty1     00:00:00   /sbin/getty -8 38400 tty1
root       898     1  0 19:26 ?        00:00:00   /usr/sbin/console-kit-daemon --no-daemon
root       965     1  0 19:26 ?        00:00:00   /usr/lib/policykit-1/polkitd --no-debug
root      1148     1  0 19:27 ?        00:00:01   python /usr/bin/salt-minion
root      1201     1  0 19:28 ?        00:00:00   python /usr/bin/salt-minion
root      1204  1201  0 19:28 ?        00:00:00     [sh] <defunct>
couchdb   1860     1  0 19:28 ?        00:00:00   /bin/sh -e /usr/bin/couchdb -a /etc/couchdb/default.ini -a /etc/couchdb/local.ini -b -r 5 -p /var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb   1873  1860  0 19:28 ?        00:00:00     /bin/sh -e /usr/bin/couchdb -a /etc/couchdb/default.ini -a /etc/couchdb/local.ini -b -r 5 -p /var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb   1874  1873  0 19:28 ?        00:00:00       /usr/lib/erlang/erts-5.9.1/bin/beam -Bd -K true -A 4 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/couchdb -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini /etc/couchdb/default.ini /etc/couchdb/local.ini /etc/couchdb/default.ini /etc/couchdb/local.ini -s couch -pidfile /var/run/couchdb/couchdb.pid -heart
couchdb   1882  1874  0 19:28 ?        00:00:00         heart -pid 1874 -ht 11
couchdb   1905  1874  0 19:28 ?        00:00:00         sh -s disksup
root      1953     1  0 19:30 ?        00:00:00   python /usr/bin/salt-minion
root      1954  1953  0 19:30 ?        00:00:00     /bin/sh -c ps -efH
root      1955  1954  0 19:30 ?        00:00:00       ps -efH
2013-05-10 19:30:30,283 [salt.minion      ][INFO    ] Returning information for job: 20130510153030089224

The corresponding master log:

[INFO    ] Clear payload received with command publish
[INFO    ] User hdemers Published command saltutil.find_job with jid 20130510153030089224
[DEBUG   ] Published command details {'tgt_type': 'glob', 'jid': '20130510153030089224', 'tgt': 'ip-10-196-38-159', 'ret': '', 'user': 'hdemers', 'arg': ['20130510152803303471'], 'fun': 'saltutil.find_job'}
[INFO    ] AES payload received with command _return
[INFO    ] Got return from ip-10-196-38-159 for job 20130510153030089224

This:

salt-run jobs.lookup_jid 20130510152803303471

returns nothing. And this:

salt-run jobs.active

returns:

'20130510152803303471':
  Arguments:
  - couchdb
  Function: pkg.install
  Returned:
  - jid
  Running: []
  Target: ip-10-196-38-159
  Target-type: glob

Running the minion from the command line works, no problem at all. Only when it is started from its init script do things break.

@thatch45
Contributor

Is this using an init script or upstart? I imagine this has to do with the issue where scripts block in some cases. We are actively looking into this problem; I will cross-reference the issue when I find it. @UtahDave has been at the forefront of this one.

@hdemers
Author

hdemers commented May 12, 2013

It's using upstart. Installed via the Ubuntu salt-minion package, 0.15.1.

@thatch45
Contributor

Thanks for the feedback, we should be able to reproduce this

@hdemers
Author

hdemers commented May 13, 2013

Let me know if I can help in any way.

Thanks.

@thatch45
Contributor

If you want to try, open up the file:
salt/modules/cmdmod.py
and add:
kwargs['close_fds'] = True
before the subprocess.Popen call in the _run function (it should be around line 309), and see if the problem goes away.
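In isolation, the suggested tweak amounts to something like the following (a minimal sketch, not salt's actual _run; 'echo ok' stands in for the real command):

```python
import subprocess

# Minimal sketch of the suggested tweak: close_fds=True makes the child
# close every inherited descriptor except stdin/stdout/stderr before
# exec, so it cannot keep the minion's sockets or pipes alive.
kwargs = {
    'shell': True,
    'stdout': subprocess.PIPE,
    'stderr': subprocess.PIPE,
    'close_fds': True,  # the line proposed above
}
proc = subprocess.Popen('echo ok', **kwargs)
out, err = proc.communicate()
```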

@hdemers
Author

hdemers commented May 13, 2013

I've tried this to no avail. That was to be expected I guess because close_fds is already set to True at line 301:

if not salt.utils.is_windows():
    # close_fds is not supported on Windows platforms if you redirect
    # stdin/stdout/stderr
    kwargs['executable'] = shell
    log.debug("check this is executed.")
    kwargs['close_fds'] = True

Any other ideas I can try?

@thatch45
Contributor

No, this is only being applied if the platform is Windows; add the line outside of the if block.

@hdemers
Author

hdemers commented May 15, 2013

Hmm, I do see my debug message there (log.debug("check this is executed"); see my previous post), and if not is_windows looks targeted at non-Windows platforms. Am I missing something?

Besides, I did add the line outside the if block, without success.

@thatch45
Contributor

OK, that helps our data; it means this is lower level....

@slai

slai commented Jun 1, 2013

(Not sure if this applies at all to Ubuntu and Upstart.)

I saw something similar for an internal application package. Turned out it was because the init.d script started the app as a different user using /bin/sh, but only redirected stdout to a file. When a redirect was added that redirected stderr to /dev/null, the package installed properly. I'm not quite sure why, but I'm assuming the application/shell was holding on to the stderr fd, causing Python to wait. When running on the command line, your current shell is still active so that fd isn't closed.
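A minimal way to reproduce the failure mode described here, assuming a POSIX shell (the process names are illustrative; the timeout= parameter is Python 3, used only to make the hang observable):

```python
import subprocess

# A backgrounded grandchild ('sleep') inherits sh's stdout/stderr pipes,
# so communicate() keeps waiting for EOF even after sh itself has exited.
# In the Python 2 of this era there is no timeout=, so the call would
# simply hang until the grandchild exits.
hang = subprocess.Popen("sleep 5 & exit 0", shell=True,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    hang.communicate(timeout=1)
    blocked = False
except subprocess.TimeoutExpired:
    blocked = True   # the grandchild still holds the pipes open
hang.wait()          # sh itself exited long ago; reap it

# Redirecting the grandchild's output -- the init-script fix described
# above -- lets the pipes close when sh exits, so communicate() returns
# immediately.
ok = subprocess.Popen("sleep 5 >/dev/null 2>&1 & exit 0", shell=True,
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = ok.communicate(timeout=1)
```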

I'd try adding debug logging lines before Popen, between Popen and communicate, and after communicate, in cmdmod.py, here:

https://github.com/saltstack/salt/blob/develop/salt/modules/cmdmod.py#L298
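A sketch of that instrumentation (hedged: run_instrumented is an illustrative stand-in for salt's _run, not the actual function):

```python
import logging
import subprocess

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

def run_instrumented(cmd):
    # Stand-in for salt's _run with the three suggested debug lines.
    log.debug('BEFORE Popen FOR %r', cmd)
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    log.debug('AFTER Popen FOR %r, BEFORE communicate', cmd)
    out, err = proc.communicate()  # if logging stops here, this is the block
    log.debug('AFTER communicate OF %r', cmd)
    return proc.returncode, out, err
```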

If you find that it gets to communicate but stops there, have a look at your processes and see if there are any zombie processes (i.e. [defunct] processes). I had one when it happened; once I killed the app, it went away and Salt continued on.

I'd also try running the minion in non-daemonised mode under strace and see what the last call is - just prefix your minion command with 'strace' to do that.

@hdemers
Author

hdemers commented Jun 5, 2013

I tried what you suggested, and it looks like we never return from communicate for the pkg.install command:

BEFORE Popen FOR "dpkg-query --showformat='${Status} ${Package} ${Version}\n' -W"
AFTER Popen FOR "dpkg-query --showformat='${Status} ${Package} ${Version}\n' -W", BEFORE communicate
AFTER communicate OF "dpkg-query --showformat='${Status} ${Package} ${Version}\n' -W"

BEFORE Popen FOR 'grep-available -F Provides -s Package,Provides -e "^.+$"'
AFTER Popen FOR 'grep-available -F Provides -s Package,Provides -e "^.+$"', BEFORE communicate
AFTER communicate OF 'grep-available -F Provides -s Package,Provides -e "^.+$"'

BEFORE Popen FOR 'apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef   install couchdb'
AFTER Popen FOR 'apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef   install couchdb', BEFORE communicate
>>> MISSING AFTER communicate <<<

BEFORE Popen FOR 'ps -efH'
AFTER Popen FOR 'ps -efH', BEFORE communicate
AFTER communicate OF 'ps -efH'

BEFORE Popen FOR 'ps -efH'
AFTER Popen FOR 'ps -efH', BEFORE communicate
AFTER communicate OF 'ps -efH'

Any fix for this, aside from modifying the init.d script of couchdb?

There is indeed a defunct process, but killing it does not help.

@ghost ghost assigned thatch45 Jun 11, 2013
@slai

slai commented Jun 30, 2013

@hdemers that looks like the same issue I and others have had. It looks like there's a dev consensus for fixing this problem in issue #5567.

@olliewalsh
Contributor

@slai Installing packages on debian/ubuntu uses cmd.run_all, not cmd.retcode. The changes discussed in #5567 will not help in this case.

@slai

slai commented Jul 9, 2013

Good point. Although looking at the cmd.run_all call in the install method (https://github.com/saltstack/salt/blob/develop/salt/modules/apt.py#L350), there's really no point in using that because it doesn't do anything with the output anyway. Maybe the solution is to simply change it to use cmd.retcode once the #5567 changes are in.

@olliewalsh
Contributor

The problem is that salt is a remote execution system, not just configuration management. Someone will eventually want to run salt <foo> cmd.run 'apt-get install couchdb' and parse the output themselves.

I think non-blocking I/O in subprocess is the only way to put an end to all of these issues with misbehaving init scripts, e.g. http://www.python.org/dev/peps/pep-3145/

Something like:

proc = subprocess.Popen(...)
out = ''
err = ''
while True:
    # asyncread() is the hypothetical non-blocking read from PEP 3145
    out += proc.asyncread()
    err += proc.asyncread(stderr=True)
    if proc.poll() is not None:  # stop as soon as the child exits
        break
out += proc.asyncread()          # final drain
err += proc.asyncread(stderr=True)
...

It would need a lot of testing though!
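The idea above can be approximated with the standard library's selectors module instead of the PEP 3145 API (which never landed in the stdlib). A minimal sketch, with run_nonblocking as an illustrative name, not salt's code:

```python
import os
import selectors
import subprocess

def run_nonblocking(cmd):
    """Run cmd, reading output incrementally, and stop as soon as the
    child exits -- even if a daemonized grandchild still holds the
    stdout/stderr pipes open (which is what makes communicate() hang)."""
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ, 'out')
    sel.register(proc.stderr, selectors.EVENT_READ, 'err')
    buf = {'out': b'', 'err': b''}

    while proc.poll() is None:
        # read whatever is available right now, without waiting for EOF
        for key, _ in sel.select(timeout=0.1):
            data = os.read(key.fileobj.fileno(), 4096)
            if data:
                buf[key.data] += data

    # final drain: pick up anything already buffered, but never block
    while True:
        ready = sel.select(timeout=0)
        if not ready:
            break
        for key, _ in ready:
            data = os.read(key.fileobj.fileno(), 4096)
            if data:
                buf[key.data] += data
            else:
                sel.unregister(key.fileobj)
    sel.close()
    return proc.returncode, buf['out'], buf['err']
```

With this shape, a command like 'echo hi; sleep 5 & exit 0' returns as soon as the shell exits instead of waiting out the backgrounded sleep.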

@thatch45
Contributor

Same underlying problem as #4410.
