
150+ salt-minions running on 0.15.3 #5729

Closed
chjohnst opened this issue Jun 25, 2013 · 28 comments · Fixed by #8222
Labels: Bug (broken, incorrect, or confusing behavior), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround)

Comments

@chjohnst
Contributor

My salt-master runs in an HA cluster, and we were rebuilding the master over the weekend, so the salt-master was down for over 48 hours. All of my minions have spawned around 100-150 processes, all with separate PIDs, but they are all forked from PID 1 as if they are coming from systemd. I am not sure why I am seeing 100+ minions, though; the systemd manifest also looks pretty straightforward. Curious if this is a bug I have hit; I wouldn't imagine this to be the normal behavior.

Has anyone seen anything like this with the latest Salt? An strace shows them in epoll_wait(); each resumes, then attempts to open a socket to the master, and the connection fails. The cycle repeats as epoll should in an event loop.

[pid 32378] <... epoll_wait resumed> {}, 256, 5005) = 0
[pid 32378] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 28
[pid 32378] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32378] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32378] connect(28, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("10.8.4.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 32378] epoll_ctl(20, EPOLL_CTL_ADD, 28, {...}) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_MOD, 28, {...}) = 0
[pid 32378] epoll_wait(20, {?} 0x7fe192ffb1d0, 256, -1) = 1
[pid 32378] getsockopt(28, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_DEL, 28, {...}) = 0
[pid 32378] close(28) = 0
[pid 32378] epoll_wait(20,

@basepi
Contributor

basepi commented Jun 25, 2013

Very strange. This does look like a bug to me, though I'd be interested to know if anyone else has run into this issue.

Thanks for the report.

@chjohnst
Contributor Author

I can do a simple test as well on a test machine: start a minion up, point it at a null host (one not running a master), and put the minion in debug mode.
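
For anyone who wants to run the same test, a minimal sketch of that setup (the master address is just a placeholder for a host where nothing listens on 4505/4506):

# /etc/salt/minion on the test machine
master: 192.0.2.1

# start the minion in the foreground with debug logging
salt-minion -l debug

# in another shell, watch the process count grow
watch "ps ax | grep '[s]alt'"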

@basepi
Contributor

basepi commented Jun 26, 2013

And the problem still happens? Definitely a bug. We'll look into it.

@olliewalsh
Contributor

I hit something similar over the weekend. Several minions had duplicate IDs and failed authentication. There were multiple minion processes for each minion retrying the connection to the master. The master ran out of file descriptors eventually.

I suspect all of the minion processes were from the _mine schedule.

@max-arnold
Contributor

Same thing here. While the master is offline, the number of salt-minion processes (0.15.3 from the Ubuntu Precise PPA) grows to almost 200.

@max-arnold
Contributor

Process list looks like this:

  491 ?        Sl     0:38 /usr/bin/python /usr/bin/salt-minion
...
 5438 ?        Ssl   10:40 /usr/bin/python /usr/bin/salt-minion
 6592 ?        Sl     0:05 /usr/bin/python /usr/bin/salt-minion
 9420 ?        Sl     0:07 /usr/bin/python /usr/bin/salt-minion
 9613 ?        Sl     1:23 /usr/bin/python /usr/bin/salt-minion
...
32712 ?        Sl     0:19 /usr/bin/python /usr/bin/salt-minion

# strace -p 491

Process 491 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784496, 199826317}) = 0
write(22, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
poll([{fd=21, events=POLLIN}], 1, -1)   = 1 ([{fd=21, revents=POLLIN}])
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
close(22)                               = 0
close(21)                               = 0
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 21
fstat(21, {st_mode=S_IFREG|0644, st_size=159, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(21, "127.0.0.1\t\tlocalhost\n::1\t\tip6-lo"..., 4096) = 159
read(21, "", 4096)                      = 0
read(21, "", 4096)                      = 0
close(21)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 21
connect(21, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
gettimeofday({1373359195, 572478}, NULL) = 0
poll([{fd=21, events=POLLOUT}], 1, 0)   = 1 ([{fd=21, revents=POLLOUT}])
sendto(21, "\346\272\1\0\0\1\0\0\0\0\0\0\4salt\3example\2com\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
poll([{fd=21, events=POLLIN}], 1, 5000) = 1 ([{fd=21, revents=POLLIN}])
ioctl(21, FIONREAD, [112])              = 0
recvfrom(21, "\346\272\201\200\0\1\0\1\0\1\0\0\4salt\3example\2com\0\0\34\0\1\300\f\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 112
close(21)                               = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 21
connect(21, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
gettimeofday({1373359195, 574351}, NULL) = 0
poll([{fd=21, events=POLLOUT}], 1, 0)   = 1 ([{fd=21, revents=POLLOUT}])
sendto(21, "|\206\1\0\0\1\0\0\0\0\0\0\4salt\3example\2com\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
poll([{fd=21, events=POLLIN}], 1, 5000) = 1 ([{fd=21, revents=POLLIN}])
ioctl(21, FIONREAD, [127])              = 0
recvfrom(21, "|\206\201\200\0\1\0\2\0\3\0\0\4salt\3example\2com\0\0\1\0\1\300\f\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 127
close(21)                               = 0
eventfd2(0, 0)                          = 21
fcntl(21, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(21, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
eventfd2(0, 0)                          = 22
fcntl(22, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(22, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 219294078}) = 0
epoll_create(1)                         = 23
epoll_ctl(23, EPOLL_CTL_ADD, 22, {0, {u32=25560992, u64=25560992}}) = 0
epoll_ctl(23, EPOLL_CTL_MOD, 22, {EPOLLIN, {u32=25560992, u64=25560992}}) = 0
clone(child_stack=0x7fed794b0ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fed794b19d0, tls=0x7fed794b1700, child_tidptr=0x7fed794b19d0) = 24924
eventfd2(0, 0)                          = 24
fcntl(24, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(24, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 220507808}) = 0
epoll_create(1)                         = 25
epoll_ctl(25, EPOLL_CTL_ADD, 24, {0, {u32=33361808, u64=33361808}}) = 0
epoll_ctl(25, EPOLL_CTL_MOD, 24, {EPOLLIN, {u32=33361808, u64=33361808}}) = 0
clone(child_stack=0x7fed78cafff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fed78cb09d0, tls=0x7fed78cb0700, child_tidptr=0x7fed78cb09d0) = 24925
eventfd2(0, 0)                          = 26
fcntl(26, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(26, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(26, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(26, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 221883948}) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
access("/", R_OK)                       = 0
access("/etc", R_OK)                    = 0
access("/etc/salt", R_OK)               = 0
access("/etc/salt/pki", R_OK)           = 0
access("/etc/salt/pki/minion", R_OK)    = 0
stat("/etc/salt/pki/minion/minion.pem", {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
open("/etc/salt/pki/minion/minion.pem", O_RDONLY) = 27
fstat(27, {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
fstat(27, {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(27, "-----BEGIN RSA PRIVATE KEY-----\n"..., 4096) = 3243
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
gettimeofday({1373359195, 582351}, NULL) = 0
open("/tmp/tmp7zy3ar", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 27
fcntl(27, F_GETFD)                      = 0
fcntl(27, F_SETFD, FD_CLOEXEC)          = 0
close(27)                               = 0
open("/tmp/tmp7zy3ar", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 27
fstat(27, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
write(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 800) = 800
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
open("/etc/salt/pki/minion/minion_master.pub", O_RDONLY) = 27
fstat(27, {st_mode=S_IFREG|0644, st_size=800, ...}) = 0
fstat(27, {st_mode=S_IFREG|0644, st_size=800, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 4096) = 800
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
open("/tmp/tmp7zy3ar", O_RDONLY)        = 27
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
fcntl(27, F_GETFD)                      = 0
fcntl(27, F_SETFD, FD_CLOEXEC)          = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
lseek(27, 0, SEEK_CUR)                  = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
lseek(27, 0, SEEK_CUR)                  = 0
read(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 4096) = 800
read(27, "", 4096)                      = 0
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
unlink("/tmp/tmp7zy3ar")                = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 1 ([{fd=26, revents=POLLIN}])
read(26, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784506, 231015349}) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784506, 231488495}) = 0
poll([{fd=26, events=POLLIN}], 1, 60000^C <unfinished ...>

# strace -p 5438

Process 5438 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 760312780}) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 810893159}) = 0
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 811429331}) = 0
poll([{fd=10, events=POLLIN}], 1, 1)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 812968235}) = 0
gettimeofday({1373358985, 170722}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {1784295, 813376577}) = 0
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 813844703}) = 0
poll([{fd=13, events=POLLIN}], 1, 1000) = 0 (Timeout)
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 815484337}) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 866159056}) = 0
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 866653176}) = 0
poll([{fd=10, events=POLLIN}], 1, 1)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 868265383}) = 0
gettimeofday({1373358986, 226023}, NULL) = 0

# strace -p 32712

Process 32712 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784624, 688536898}) = 0
write(22, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
poll([{fd=21, events=POLLIN}], 1, -1)   = 1 ([{fd=21, revents=POLLIN}])
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
close(22)                               = 0
close(21)                               = 0
select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>

@max-arnold
Contributor

I started salt-master and within 10 minutes the number of minion processes dropped back to 1. My setup is described here: https://groups.google.com/d/msg/salt-users/q9YHzL2hmRs/tewPF-0I2QUJ

@felskrone
Contributor

I've also experienced this on a few nodes, but not as bad as described here. I just had a few nodes with 2-10 minions running. They were not killed by the init script and had to be killed by killing '/usr/bin/python' directly.
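
For reference, a slightly more targeted way to clean up the stray processes when the init script no longer tracks them (just a sketch; pkill -f matches against the full command line):

# kill everything whose command line contains salt-minion, escalating if needed
pkill -f salt-minion
pkill -9 -f salt-minion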

@basepi
Contributor

basepi commented Jul 12, 2013

@thatch45 you should look at this.

@thatch45
Contributor

So far I am unable to reproduce this on git HEAD running Arch; I will try again with Ubuntu. Sorry about being late to the party on this one!

@max-arnold
Contributor

Upgraded to salt-minion 0.16.0. After 10 days (and some usage and restarts in between) there are multiple minion processes again. Nothing suspicious in the minion logs except this one message (repeated multiple times):

2013-07-17 11:59:45,523 [salt.config      ][WARNING ] The function 'grains()' defined in '/usr/lib/pymodules/python2.7/salt/loader.py' is not yet using the new 'default_path' argument to `salt.config.load_config()`. As such, the 'SALT_MINION_CONFIG' environment variable will be ignored

And there is something interesting in syslog (unfortunately, pid 10207 is not running anymore):

Jul 19 11:00:47 bf rsyslogd-2177: imuxsock begins to drop messages from pid 10207 due to rate-limiting

Running /etc/init.d/salt-minion stop does not terminate minion processes. Starting salt-master does.

Ubuntu 12.04.2 LTS (GNU/Linux 3.9.3-x86_64-linode33 x86_64)

@felskrone
Contributor

@max-arnold: the imuxsock message is not necessarily serious. It just means that your rsyslog setup has rate limiting enabled. Depending on the log level of your minion, quite a lot of log messages are created on minion start.

You can/should tune your rsyslog-settings to higher rate limits:

$SystemLogRateLimitInterval 2
$SystemLogRateLimitBurst 500

This would allow 500 messages in 2 seconds. I tuned mine to really high values because I want all the messages I can get. Also see: http://www.rsyslog.com/changing-the-settings/

@max-arnold
Contributor

Looks like I found a way to reproduce the problem more reliably. The trick is to shut down the master server without proper TCP connection termination.

  1. Start salt-master on laptop (my setup is described here: https://groups.google.com/d/msg/salt-users/q9YHzL2hmRs/tewPF-0I2QUJ)
  2. SSH to a remote machine with remote port forwarding and wait for the minions to connect to the local machine via the SSH tunnel (see the sketch after this list)
  3. Close the lid or put the laptop to sleep
  4. Wait for several hours

It is not 100% reliable (I managed to do it only twice), but after some time I got several minion processes hanging around.
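
For reference, the tunnel in step 2 looks roughly like this (hostnames are placeholders; 4505/4506 are the master's publish and return ports):

# forward the master ports from the remote minion host back to the laptop
ssh -R 4505:localhost:4505 -R 4506:localhost:4506 user@remote-minion-host

# /etc/salt/minion on the remote machine then points at the tunnel
master: localhost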

@max-arnold
Contributor

Also this time I enabled more verbose minion logging (log_level_logfile: trace) and this is the output:

2013-07-19 14:14:28,273 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:15:38,349 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:16:48,424 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:17:58,498 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:19:08,573 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 15:11:42,136 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:12:52,212 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:14:02,287 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:14:28,529 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 15:14:28,580 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:15:12,363 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:15:38,658 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 19:09:14,422 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:22,504 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:39,791 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:58,128 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:05,866 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:20,713 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:20,713 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:22,255 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:22,255 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:23,058 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:23,058 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:24,499 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:24,999 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:24,1000 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:32,579 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:33,046 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:33,047 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:14:28,484 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 19:14:28,674 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:15:38,753 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:16:48,829 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:17:58,905 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:19:08,982 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:19:55,064 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:19:55,064 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:14:28,603 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 20:14:28,704 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:15:38,781 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:16:48,856 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:17:58,930 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:19:09,005 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:20:19,081 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 22:12:52,222 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:13:36,877 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:02,297 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:28,445 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 22:14:28,489 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:46,952 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:15:12,374 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:15:38,573 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem

@max-arnold
Contributor

OK, no need to suspend TCP connections. Just set mine_interval in the minion config to 1, restart the minion, then kill salt-master and watch the growing list of minion processes:

watch "ps ax | grep '[s]alt'"

@basepi
Contributor

basepi commented Jul 19, 2013

Wait, does everyone who's experiencing this issue have the Salt Mine enabled? That could be the cause.

@chjohnst? @felskrone?

@basepi
Contributor

basepi commented Jul 19, 2013

Talked to @thatch45, told him it was probably Mine related, and immediately he knew what the problem was. He'll fix it soon. =)

@thatch45
Contributor

Yes, that is it: the mine processes block waiting for a master connection to open up and don't have a timeout, which can make them pile up. I am on it!
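
Not Salt's actual code, but a minimal pyzmq sketch of the difference between a request that can block forever and one with a timeout, which is roughly the failure mode described here (the endpoint is a placeholder):

import zmq

def ask_master(endpoint="tcp://10.8.4.97:4506", timeout_ms=60000):
    """Send one request to the master, but give up after timeout_ms
    instead of blocking forever (blocking is what lets callers pile up)."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)             # drop pending messages on close
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # recv() raises zmq.Again on timeout
    sock.setsockopt(zmq.SNDTIMEO, timeout_ms)
    try:
        sock.connect(endpoint)
        sock.send(b"payload")
        return sock.recv()                     # without RCVTIMEO this waits forever
    except zmq.Again:
        return None                            # master unreachable; caller can bail out
    finally:
        sock.close()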

@max-arnold
Contributor

Will the fix cover all scheduled jobs or just the Salt Mine? Is adding a timeout the right solution? Maybe it would be better to not allow a specific job to start again before the previous run (of the same type) has finished? Or allow both approaches?
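
For what it's worth, the "don't start a new run while the previous one is still alive" variant could look something like this (only a sketch, not the actual scheduler code):

import multiprocessing

running = {}  # job name -> Process object of the last run

def maybe_run(name, func):
    """Skip this interval if the previous run of the same job is still alive,
    so stalled jobs cannot stack up new processes."""
    prev = running.get(name)
    if prev is not None and prev.is_alive():
        return False
    proc = multiprocessing.Process(target=func, name=name)
    proc.start()
    running[name] = proc
    return True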

@felskrone
Contributor

Well, I did not knowingly activate the mine module. If it's enabled by default, I guess I'm using it (which seems to be the case).

Little story:
I shut down the salt-master on Friday because it is in a debugging state (for issue 5948) and I did not want anyone to use it over the weekend. On Sunday we had thirty production machines (minions) that had spawned minion after minion process, and eventually we had so many minions running in process state D (uninterruptible) that those servers had to be rebooted.

Well, things like this can happen. It's not like I've never forgotten a timeout or failed to handle case xy, etc. :-)

But what should be considered after a bug like this one is:

  • Do you really want new features/modules to be enabled by default?

Depending on your setup, things like this can really bite you where you don't appreciate being bitten. Especially in larger setups, I really don't like the idea of asking thousands of minions for data if I have not explicitly told the master to do so.

Also, in the current minion and master config I don't see any option to disable this feature. There is only a _mine_interval with a default of 60. The mine module itself mentions a function_cache which is not mentioned in the current master and/or minion config, and not even in salt/conf.py.

Is this feature really ready to be used in a production environment and to be enabled by default?

@max-arnold
Contributor

Looking at the code, setting mine_interval to 0 to disable the Salt Mine is not a good idea (it leads to a fork bomb). I think the schedule module needs an additional guard against zero (and negative?) intervals; a minimal sketch of such a guard follows at the end of this comment. The only way to disable the mine module is to add two lines to the minion config:

disable_modules:
  - mine

The only downside of this approach is the following message every second in the minion log at INFO level:

[salt.utils.schedule][INFO    ] Invalid function: __mine_interval in job mine.update. Ignoring.
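
The guard mentioned above could be as simple as clamping the configured value before the scheduler uses it (a sketch, not the actual salt.utils.schedule code; the minutes-to-seconds conversion is an assumption):

def effective_interval(minutes, default=60):
    """Return the job period in seconds, falling back to the default when the
    configured interval is zero, negative, or not a number."""
    try:
        minutes = int(minutes)
    except (TypeError, ValueError):
        return default * 60
    return (minutes if minutes > 0 else default) * 60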

@max-arnold
Contributor

Also, utils/schedule.py probably needs additional protection against time jitter. I've seen schedulers that can trigger extra events if time jumps backwards by one second or more.

This can happen if the system administrator puts a command like ntpdate -b pool.ntp.org in crontab every hour and another software scheduler runs some job at a more fine-grained interval (less than a minute, for example every hour at second 0 or 1). This can lead to hard-to-reproduce double runs when time jumps back slightly.

I even wrote a module to avoid some cases like this: https://bitbucket.org/marnold/py-cron/src/7a4a20c9a69c2364fe5fba1c3300647b52f51f6b/pycron.py?at=default
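
A sketch of the kind of guard meant here: drive the schedule from a monotonic clock so a backwards jump of the wall clock cannot trigger an extra run (time.monotonic() needs Python 3.3+; on the Python 2.7 used in this thread it would have to be emulated, so this is illustrative only):

import time

class IntervalJob(object):
    """Fire at most once per `interval` seconds, measured on the monotonic
    clock, so wall-clock jumps (e.g. ntpdate -b) cannot cause double runs."""
    def __init__(self, interval):
        self.interval = interval
        self.next_run = time.monotonic()

    def due(self):
        now = time.monotonic()
        if now >= self.next_run:
            self.next_run += self.interval  # anchor to the schedule, not to "now"
            return True
        return False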

@mdanilescu

I'm running Salt 0.16.4 on Ubuntu 12.04 and just experienced this issue. It basically made the machine unusable until I finally managed to kill all salt-minion processes.
Was this issue solved in the newer versions? I saw that the milestone was set for 0.17.0.
Thanks.

@basepi
Contributor

basepi commented Sep 25, 2013

No, we have not solved the issue.

The solution is to disable the mine using this in the minion config:

disable_modules:
  - mine

Of course, you should only be seeing this in a masterless setup. If your minion is connected to the master, it may be a different issue.

@anveo

anveo commented Oct 3, 2013

I'm also experiencing this. Salt 0.16.4 on Ubuntu 12.04 and the minions are connecting to a master. I install via pip and place an upstart config which looks like https://gist.github.com/anveo/6816398. Let me know if there is any other information I can give.

@basepi
Contributor

basepi commented Oct 3, 2013

@anveo If you're not running masterless, then it's a different issue. Can you please open a new issue? The extra processes here are unclosed ZMQ processes from trying (unsuccessfully) to connect to a master for the Salt Mine. (This only happens in masterless setups.)

@tyre

tyre commented Nov 14, 2013

We're also seeing this on 0.16.4: 108 processes on one server and 148 on another, taking up GBs of memory. I'd be happy to provide more information on our setup if it would help debug the issue.

@basepi
Contributor

basepi commented Nov 14, 2013

This was fixed in #8222. You should try 0.17.2; that's the first version containing the fix.
