150+ salt-minions running on 0.15.3 #5729
Comments
Very strange. This does look like a bug to me, though I'd be interested to know if anyone else has run into this issue. Thanks for the report. |
I can do a simple test as well on a test machine: start a minion, point it at a null host (one not running a master), and put the minion in debug mode. |
And the problem still happens? Definitely a bug. We'll look into it. |
I hit something similar over the weekend. Several minions had duplicate IDs and failed authentication. There were multiple minion processes for each minion retrying the connection to the master. The master ran out of file descriptors eventually. I suspect all of the minion processes were from the _mine schedule. |
Same thing here. While the master is offline, the number of salt-minion processes (0.15.3 from the Ubuntu Precise PPA) grows to almost 200. |
Process list looks like this:
|
I started salt-master and within 10 minutes the number of minion processes dropped back to 1. My setup is described here: https://groups.google.com/d/msg/salt-users/q9YHzL2hmRs/tewPF-0I2QUJ |
I've also experienced this on a few nodes, but not as bad as described here. I just had a few nodes with 2-10 minions running. They could not be stopped using the init script and had to be killed by killing '/usr/bin/python'. |
@thatch45 you should look at this. |
So far I am unable to reproduce this on git head running Arch; I will try again with Ubuntu. Sorry about being late to the party on this one! |
Upgraded to salt-minion 0.16.0. After 10 days (and some usage and restarts in between) there are multiple minion processes again. Nothing suspicious in minion logs except this one (multiple times):
And there is something interesting in syslog (unfortunately, pid 10207 is not running anymore):
Running /etc/init.d/salt-minion stop does not terminate minion processes. Starting salt-master does. Ubuntu 12.04.2 LTS (GNU/Linux 3.9.3-x86_64-linode33 x86_64) |
@max-arnold: the imuxsock message is not necessarily serious. It just means that your rsyslog setup has rate limiting enabled. Depending on the log level of your minion, quite a lot of log messages are created on minion start. You can/should tune your rsyslog settings to higher rate limits: $SystemLogRateLimitInterval 2 would allow 500 messages in 2 seconds. I tuned mine to really high values because I really want all the messages I can get. Also see: http://www.rsyslog.com/changing-the-settings/ |
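For reference, a minimal sketch of the relevant imuxsock rate-limit settings in /etc/rsyslog.conf; the burst value here is an assumption, chosen only to match the "500 messages in 2 seconds" figure above:

$SystemLogRateLimitInterval 2
$SystemLogRateLimitBurst 500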
Looks like I found one way to reproduce the problem more often. The trick is to shut down the master server without proper TCP connection termination.
It is not 100% reliable (I managed to do that only twice), but after some time I got several minion processes hanging around. |
Also this time I enabled more verbose minion logging (log_level_logfile: trace) and this is the output:
|
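One way to simulate losing the master without a clean TCP termination (an assumption about method; the reporter does not say exactly how the master was shut down) is to drop traffic to the master's default ZeroMQ ports with iptables on the minion:

# block the publish (4505) and return (4506) ports without closing the sockets
iptables -A OUTPUT -p tcp --dport 4505 -j DROP
iptables -A OUTPUT -p tcp --dport 4506 -j DROP
# undo afterwards
iptables -D OUTPUT -p tcp --dport 4505 -j DROP
iptables -D OUTPUT -p tcp --dport 4506 -j DROP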
OK, no need to suspend TCP connections. Just set mine_interval in the minion config to 1, restart the minion, then kill salt-master and watch the growing list of minion processes:
|
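A minimal sketch of that reproduction, assuming the default Ubuntu packaging and config paths (the exact commands are an illustration, not the reporter's):

# /etc/salt/minion
mine_interval: 1

# restart the minion, stop the master, then watch the minion process count climb
service salt-minion restart
service salt-master stop
watch 'ps aux | grep [s]alt-minion | wc -l'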
Wait, does everyone who's experiencing this issue have the Salt Mine enabled? That could be the cause. |
Talked to @thatch45, told him it was probably Mine related, and immediately he knew what the problem was. He'll fix it soon. =) |
Yes, that is it: the mine processes block waiting for a master connection to open up and don't have a timeout, so they can just pile up. I am on it! |
Will the fix cover all scheduled jobs or just the Salt Mine? Is adding a timeout the right solution? Maybe it is better not to allow a specific job to start again before the previous one (of the same type) has finished? Or allow both approaches? |
Well, I did not knowingly activate the mine module. If it's enabled by default, I guess I'm using it (which seems to be the case). Little story: well, things like this can happen. It's not like I've never forgotten a timeout or a handling case etc. :-) But what should be considered after a bug like this one is:
Depending on your setup, things like this can really bite you where you don't really appreciate being bitten. Especially in larger setups, I really don't like the idea of asking thousands of minions for data if I have not told the master explicitly to do so. Also, in the current minion and master config I don't see any options to disable this feature. There is only a _mine_interval with a default of 60. The mine module itself mentions a function_cache which is not mentioned in the current master and/or minion config, and not even in salt/conf.py. Is this feature really ready to be used in a production environment and to be enabled by default? |
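For context, a hedged sketch of how the mine is typically configured on a minion; network.interfaces is only an illustrative function here, and the scheduled job calls mine.update every mine_interval minutes (default 60):

mine_functions:
  network.interfaces: []
mine_interval: 60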
Looking at the code, setting mine_interval to 0 is not a good way to disable the salt mine (it leads to a fork bomb). I think the schedule module needs an additional guard against a zero (and negative?) interval. The only way to disable the mine module is to add two lines to the minion config:
The only downside of this approach is the following message every second in the minion log at INFO level:
|
Also, utils/schedule.py probably needs additional protection against time jitter. I've seen some schedulers that can trigger extra events if the time jumps backwards by one second or more. This can happen, for example, when the system administrator manually adjusts the clock. I even wrote a module to avoid some cases like this: https://bitbucket.org/marnold/py-cron/src/7a4a20c9a69c2364fe5fba1c3300647b52f51f6b/pycron.py?at=default |
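Not Salt's actual code, but an illustrative Python sketch of the two guards suggested above: rejecting non-positive intervals, and driving the schedule from a monotonic clock so a backward wall-clock jump cannot fire extra runs.

import time

def run_schedule(job, interval_seconds):
    # Refuse zero or negative intervals instead of looping/forking as fast as possible.
    if interval_seconds <= 0:
        raise ValueError("schedule interval must be a positive number of seconds")

    # time.monotonic() is unaffected by the wall clock being set backwards,
    # so a backward time jump cannot trigger an extra early run.
    next_run = time.monotonic() + interval_seconds
    while True:
        time.sleep(max(0.0, next_run - time.monotonic()))
        job()
        next_run += interval_seconds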
I'm running Salt 0.16.4 on Ubuntu 12.04 and just experienced this issue. It basically made the machine unusable until I finally managed to kill all salt-minion processes. |
No, we have not solved the issue. The solution is to disable the mine using this in the minion config:
Of course, you should only be seeing this in a masterless setup. If your minion is connected to the master, it may be a different issue. |
I'm also experiencing this. Salt 0.16.4 on Ubuntu 12.04 and the minions are connecting to a master. I install via pip and place an upstart config which looks like https://gist.github.com/anveo/6816398. Let me know if there is any other information I can give. |
@anveo If you're not running masterless, then it's a different issue. Can you please open a new issue? The extra processes here are unclosed ZMQ procs from trying (unsuccessfully) to connect to a master for the Salt Mine. (Only happens in masterless) |
We're also seeing this on 0.16.4. 108 on one server and 148 on another taking up GBs of memory. I'd be happy to provide more information on our setup if it would help debug the issue. |
This was fixed in #8222. You should try 0.17.2; that's the first version containing the fix. |
My salt-master runs in an HA cluster, and we were rebuilding the master over the weekend, so the salt-master was down for over 48 hours. All of my minions have spawned around 100-150 processes, all with separate PIDs, but they are all forked from PID 1 as if they are coming from systemd. I am not sure why I am seeing 100+ minion processes though; the systemd manifest also looks pretty straightforward. Curious if this is a bug I have hit; I wouldn't imagine this to be the normal behavior.
Has anyone seen anything like that with the latest Salt? An strace shows them in epoll_wait(); it resumes, then an attempt to open a socket to the master fails, and the cycle repeats, as epoll should in an event loop.
[pid 32378] <... epoll_wait resumed> {}, 256, 5005) = 0
[pid 32378] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 28
[pid 32378] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32378] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32378] connect(28, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("10.8.4.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 32378] epoll_ctl(20, EPOLL_CTL_ADD, 28, {...}) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_MOD, 28, {...}) = 0
[pid 32378] epoll_wait(20, {?} 0x7fe192ffb1d0, 256, -1) = 1
[pid 32378] getsockopt(28, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_DEL, 28, {...}) = 0
[pid 32378] close(28) = 0
[pid 32378] epoll_wait(20,