
150+ salt-minions running on 0.15.3 #5729

Closed
chjohnst opened this issue Jun 25, 2013 · 28 comments · Fixed by #8222
Labels: Bug (broken, incorrect, or confusing behavior), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround)

Comments

@chjohnst
Contributor

My salt-master runs in an HA cluster, and we were rebuilding the master over the weekend, so the salt-master was down for over 48 hours. All of my minions have spawned around 100-150 processes, all with separate PIDs, but they are all forked from PID 1 as if they are coming from systemd. I am not sure why I am seeing 100+ minions, though; the systemd manifest also looks pretty straightforward. Curious if this is a bug I have hit; I wouldn't imagine this to be the normal behavior.

Has anyone seen anything like this with the latest Salt? An strace shows them in epoll_wait(); each resumes, then attempts to open a socket to the master, and the connection fails. The cycle repeats as epoll should in an event loop.

[pid 32378] <... epoll_wait resumed> {}, 256, 5005) = 0
[pid 32378] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 28
[pid 32378] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32378] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32378] connect(28, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("10.8.4.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 32378] epoll_ctl(20, EPOLL_CTL_ADD, 28, {...}) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_MOD, 28, {...}) = 0
[pid 32378] epoll_wait(20, {?} 0x7fe192ffb1d0, 256, -1) = 1
[pid 32378] getsockopt(28, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 32378] epoll_ctl(20, EPOLL_CTL_DEL, 28, {...}) = 0
[pid 32378] close(28) = 0
[pid 32378] epoll_wait(20,

@basepi
Contributor

basepi commented Jun 25, 2013

Very strange. This does look like a bug to me, though I'd be interested to know if anyone else has run into this issue.

Thanks for the report.

@chjohnst
Contributor Author

I can do a simple test as well on a test machine: start a minion up, point it at a null host (one not running a master), and put the minion in debug mode.
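
For anyone who wants to run the same test, a minimal sketch of that setup (the master address is just a placeholder for a host where nothing listens on 4505/4506):

# /etc/salt/minion on the test machine
master: 192.0.2.1

# start the minion in the foreground with debug logging
salt-minion -l debug

# in another shell, watch the process count grow
watch "ps ax | grep '[s]alt'"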

@basepi
Contributor

basepi commented Jun 26, 2013

And the problem still happens? Definitely a bug. We'll look into it.

@olliewalsh
Contributor

I hit something similar over the weekend. Several minions had duplicate IDs and failed authentication. There were multiple minion processes for each minion retrying the connection to the master. The master ran out of file descriptors eventually.

I suspect all of the minion processes were from the _mine schedule.

@max-arnold
Contributor

Same thing here. While the master is offline, the number of salt-minion processes (0.15.3 from the Ubuntu Precise PPA) grows to almost 200.

@max-arnold
Contributor

Process list looks like this:

  491 ?        Sl     0:38 /usr/bin/python /usr/bin/salt-minion
...
 5438 ?        Ssl   10:40 /usr/bin/python /usr/bin/salt-minion
 6592 ?        Sl     0:05 /usr/bin/python /usr/bin/salt-minion
 9420 ?        Sl     0:07 /usr/bin/python /usr/bin/salt-minion
 9613 ?        Sl     1:23 /usr/bin/python /usr/bin/salt-minion
...
32712 ?        Sl     0:19 /usr/bin/python /usr/bin/salt-minion

# strace -p 491

Process 491 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784496, 199826317}) = 0
write(22, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
poll([{fd=21, events=POLLIN}], 1, -1)   = 1 ([{fd=21, revents=POLLIN}])
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
close(22)                               = 0
close(21)                               = 0
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 21
fstat(21, {st_mode=S_IFREG|0644, st_size=159, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(21, "127.0.0.1\t\tlocalhost\n::1\t\tip6-lo"..., 4096) = 159
read(21, "", 4096)                      = 0
read(21, "", 4096)                      = 0
close(21)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 21
connect(21, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
gettimeofday({1373359195, 572478}, NULL) = 0
poll([{fd=21, events=POLLOUT}], 1, 0)   = 1 ([{fd=21, revents=POLLOUT}])
sendto(21, "\346\272\1\0\0\1\0\0\0\0\0\0\4salt\3example\2com\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
poll([{fd=21, events=POLLIN}], 1, 5000) = 1 ([{fd=21, revents=POLLIN}])
ioctl(21, FIONREAD, [112])              = 0
recvfrom(21, "\346\272\201\200\0\1\0\1\0\1\0\0\4salt\3example\2com\0\0\34\0\1\300\f\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 112
close(21)                               = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 21
connect(21, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
gettimeofday({1373359195, 574351}, NULL) = 0
poll([{fd=21, events=POLLOUT}], 1, 0)   = 1 ([{fd=21, revents=POLLOUT}])
sendto(21, "|\206\1\0\0\1\0\0\0\0\0\0\4salt\3example\2com\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
poll([{fd=21, events=POLLIN}], 1, 5000) = 1 ([{fd=21, revents=POLLIN}])
ioctl(21, FIONREAD, [127])              = 0
recvfrom(21, "|\206\201\200\0\1\0\2\0\3\0\0\4salt\3example\2com\0\0\1\0\1\300\f\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 127
close(21)                               = 0
eventfd2(0, 0)                          = 21
fcntl(21, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(21, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
eventfd2(0, 0)                          = 22
fcntl(22, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(22, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 219294078}) = 0
epoll_create(1)                         = 23
epoll_ctl(23, EPOLL_CTL_ADD, 22, {0, {u32=25560992, u64=25560992}}) = 0
epoll_ctl(23, EPOLL_CTL_MOD, 22, {EPOLLIN, {u32=25560992, u64=25560992}}) = 0
clone(child_stack=0x7fed794b0ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fed794b19d0, tls=0x7fed794b1700, child_tidptr=0x7fed794b19d0) = 24924
eventfd2(0, 0)                          = 24
fcntl(24, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(24, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 220507808}) = 0
epoll_create(1)                         = 25
epoll_ctl(25, EPOLL_CTL_ADD, 24, {0, {u32=33361808, u64=33361808}}) = 0
epoll_ctl(25, EPOLL_CTL_MOD, 24, {EPOLLIN, {u32=33361808, u64=33361808}}) = 0
clone(child_stack=0x7fed78cafff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fed78cb09d0, tls=0x7fed78cb0700, child_tidptr=0x7fed78cb09d0) = 24925
eventfd2(0, 0)                          = 26
fcntl(26, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(26, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
fcntl(26, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(26, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
clock_gettime(CLOCK_MONOTONIC, {1784506, 221883948}) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
access("/", R_OK)                       = 0
access("/etc", R_OK)                    = 0
access("/etc/salt", R_OK)               = 0
access("/etc/salt/pki", R_OK)           = 0
access("/etc/salt/pki/minion", R_OK)    = 0
stat("/etc/salt/pki/minion/minion.pem", {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
open("/etc/salt/pki/minion/minion.pem", O_RDONLY) = 27
fstat(27, {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
fstat(27, {st_mode=S_IFREG|0400, st_size=3243, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(27, "-----BEGIN RSA PRIVATE KEY-----\n"..., 4096) = 3243
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
gettimeofday({1373359195, 582351}, NULL) = 0
open("/tmp/tmp7zy3ar", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 27
fcntl(27, F_GETFD)                      = 0
fcntl(27, F_SETFD, FD_CLOEXEC)          = 0
close(27)                               = 0
open("/tmp/tmp7zy3ar", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 27
fstat(27, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
write(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 800) = 800
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
open("/etc/salt/pki/minion/minion_master.pub", O_RDONLY) = 27
fstat(27, {st_mode=S_IFREG|0644, st_size=800, ...}) = 0
fstat(27, {st_mode=S_IFREG|0644, st_size=800, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
read(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 4096) = 800
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
open("/tmp/tmp7zy3ar", O_RDONLY)        = 27
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
fcntl(27, F_GETFD)                      = 0
fcntl(27, F_SETFD, FD_CLOEXEC)          = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
lseek(27, 0, SEEK_CUR)                  = 0
fstat(27, {st_mode=S_IFREG|0600, st_size=800, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed80fab000
lseek(27, 0, SEEK_CUR)                  = 0
read(27, "-----BEGIN PUBLIC KEY-----\nMIICI"..., 4096) = 800
read(27, "", 4096)                      = 0
close(27)                               = 0
munmap(0x7fed80fab000, 4096)            = 0
unlink("/tmp/tmp7zy3ar")                = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 1 ([{fd=26, revents=POLLIN}])
read(26, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784506, 231015349}) = 0
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784506, 231488495}) = 0
poll([{fd=26, events=POLLIN}], 1, 60000^C <unfinished ...>

# strace -p 5438

Process 5438 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 760312780}) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 810893159}) = 0
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 811429331}) = 0
poll([{fd=10, events=POLLIN}], 1, 1)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 812968235}) = 0
gettimeofday({1373358985, 170722}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {1784295, 813376577}) = 0
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784295, 813844703}) = 0
poll([{fd=13, events=POLLIN}], 1, 1000) = 0 (Timeout)
poll([{fd=13, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 815484337}) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 866159056}) = 0
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 866653176}) = 0
poll([{fd=10, events=POLLIN}], 1, 1)    = 0 (Timeout)
poll([{fd=10, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784296, 868265383}) = 0
gettimeofday({1373358986, 226023}, NULL) = 0

# strace -p 32712

Process 32712 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>
poll([{fd=26, events=POLLIN}], 1, 0)    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {1784624, 688536898}) = 0
write(22, "\1\0\0\0\0\0\0\0", 8)        = 8
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
poll([{fd=21, events=POLLIN}], 1, -1)   = 1 ([{fd=21, revents=POLLIN}])
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
close(22)                               = 0
close(21)                               = 0
select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>

@max-arnold
Contributor

I started salt-master and within 10 minutes the number of minion processes dropped back to 1. My setup is described here: https://groups.google.com/d/msg/salt-users/q9YHzL2hmRs/tewPF-0I2QUJ

@felskrone
Contributor

I've also experienced this on a few nodes, but not as bad as described here. I just had a few nodes with 2-10 minions running. They were not killed by the init script and had to be killed by killing '/usr/bin/python' directly.
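
For reference, a slightly more targeted way to clean up the stray processes when the init script no longer tracks them (just a sketch; pkill -f matches against the full command line):

# kill everything whose command line contains salt-minion, escalating if needed
pkill -f salt-minion
pkill -9 -f salt-minion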

@basepi
Contributor

basepi commented Jul 12, 2013

@thatch45 you should look at this.

@thatch45
Contributor

So far I am unable to reproduce this on git HEAD running Arch; I will try again with Ubuntu. Sorry about being late to the party on this one!

@max-arnold
Contributor

Upgraded to salt-minion 0.16.0. After 10 days (and some usage and restarts in between) there are multiple minion processes again. Nothing suspicious in the minion logs except this one message (repeated multiple times):

2013-07-17 11:59:45,523 [salt.config      ][WARNING ] The function 'grains()' defined in '/usr/lib/pymodules/python2.7/salt/loader.py' is not yet using the new 'default_path' argument to `salt.config.load_config()`. As such, the 'SALT_MINION_CONFIG' environment variable will be ignored

And there is something interesting in syslog (unfortunately, pid 10207 is not running anymore):

Jul 19 11:00:47 bf rsyslogd-2177: imuxsock begins to drop messages from pid 10207 due to rate-limiting

Running /etc/init.d/salt-minion stop does not terminate minion processes. Starting salt-master does.

Ubuntu 12.04.2 LTS (GNU/Linux 3.9.3-x86_64-linode33 x86_64)

@felskrone
Contributor

@max-arnold: the imuxsock message is not necessarily serious. It just means that your rsyslog setup has rate limiting enabled. Depending on the log level of your minion, quite a lot of log messages are created on minion start.

You can/should tune your rsyslog-settings to higher rate limits:

$SystemLogRateLimitInterval 2
$SystemLogRateLimitBurst 500

This would allow 500 messages in 2 seconds. I tuned mine to really high values because I want all the messages I can get. Also see: http://www.rsyslog.com/changing-the-settings/

@max-arnold
Contributor

Looks like I found a way to reproduce the problem more reliably. The trick is to shut down the master server without proper TCP connection termination.

  1. Start salt-master on laptop (my setup is described here: https://groups.google.com/d/msg/salt-users/q9YHzL2hmRs/tewPF-0I2QUJ)
  2. SSH to a remote machine with remote port forwarding and wait for the minions to connect to the local machine via the SSH tunnel (see the sketch after this list)
  3. Close the lid or put the laptop to sleep
  4. Wait for several hours

It is not 100% reliable (I managed to do it only twice), but after some time I got several minion processes hanging around.
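
For reference, the tunnel in step 2 looks roughly like this (hostnames are placeholders; 4505/4506 are the master's publish and return ports):

# forward the master ports from the remote minion host back to the laptop
ssh -R 4505:localhost:4505 -R 4506:localhost:4506 user@remote-minion-host

# /etc/salt/minion on the remote machine then points at the tunnel
master: localhost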

@max-arnold
Contributor

Also this time I enabled more verbose minion logging (log_level_logfile: trace) and this is the output:

2013-07-19 14:14:28,273 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:15:38,349 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:16:48,424 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:17:58,498 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 14:19:08,573 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 15:11:42,136 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:12:52,212 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:14:02,287 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:14:28,529 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 15:14:28,580 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:15:12,363 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 15:15:38,658 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 19:09:14,422 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:22,504 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:39,791 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:09:58,128 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:05,866 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:20,713 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:20,713 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:22,255 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:22,255 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:23,058 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:23,058 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:24,499 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:24,999 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:24,1000 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:32,579 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:10:33,046 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:10:33,047 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:14:28,484 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 19:14:28,674 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:15:38,753 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:16:48,829 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:17:58,905 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:19:08,982 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 19:19:55,064 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2013-07-19 19:19:55,064 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:14:28,603 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 20:14:28,704 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:15:38,781 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:16:48,856 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:17:58,930 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:19:09,005 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 20:20:19,081 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
...
2013-07-19 22:12:52,222 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:13:36,877 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:02,297 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:28,445 [salt.utils.schedule][DEBUG   ] Running scheduled job: __mine_interval
2013-07-19 22:14:28,489 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:14:46,952 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:15:12,374 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2013-07-19 22:15:38,573 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem

@max-arnold
Contributor

OK, no need to suspend TCP connections. Just set mine_interval in the minion config to 1, restart the minion, then kill salt-master and watch the growing list of minion processes:

watch "ps ax | grep '[s]alt'"

@basepi
Contributor

basepi commented Jul 19, 2013

Wait, does everyone who's experiencing this issue have the Salt Mine enabled? That could be the cause.

@chjohnst? @felskrone?

@basepi
Contributor

basepi commented Jul 19, 2013

Talked to @thatch45, told him it was probably Mine related, and immediately he knew what the problem was. He'll fix it soon. =)

@thatch45
Contributor

Yes, that is it: the mine processes block waiting for a master connection to open up and don't have a timeout, which can make them pile up. I am on it!
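
Not Salt's actual code, but a minimal pyzmq sketch of the difference between a request that can block forever and one with a timeout, which is roughly the failure mode described here (the endpoint is a placeholder):

import zmq

def ask_master(endpoint="tcp://10.8.4.97:4506", timeout_ms=60000):
    """Send one request to the master, but give up after timeout_ms
    instead of blocking forever (blocking is what lets callers pile up)."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)             # drop pending messages on close
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # recv() raises zmq.Again on timeout
    sock.setsockopt(zmq.SNDTIMEO, timeout_ms)
    try:
        sock.connect(endpoint)
        sock.send(b"payload")
        return sock.recv()                     # without RCVTIMEO this waits forever
    except zmq.Again:
        return None                            # master unreachable; caller can bail out
    finally:
        sock.close()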

@max-arnold
Contributor

Will the fix cover all scheduled jobs or just the Salt Mine? Is adding a timeout the right solution? Maybe it would be better to not allow a specific job to start again before the previous run (of the same type) has finished? Or allow both approaches?
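
For what it's worth, the "don't start a new run while the previous one is still alive" variant could look something like this (only a sketch, not the actual scheduler code):

import multiprocessing

running = {}  # job name -> Process object of the last run

def maybe_run(name, func):
    """Skip this interval if the previous run of the same job is still alive,
    so stalled jobs cannot stack up new processes."""
    prev = running.get(name)
    if prev is not None and prev.is_alive():
        return False
    proc = multiprocessing.Process(target=func, name=name)
    proc.start()
    running[name] = proc
    return True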

@felskrone
Contributor

Well, I did not knowingly activate the mine module. If it's enabled by default, I guess I'm using it (which seems to be the case).

Little story:
I shut down the salt-master on Friday because it is in a debugging state (for issue 5948) and I did not want anyone to use it over the weekend. On Sunday we had thirty production machines (minions) that had spawned minion after minion process, and eventually we had so many minions running in process state D (uninterruptible) that those servers had to be rebooted.

Well, things like this can happen. It's not like I've never forgotten a timeout or failed to handle case xy, etc. :-)

But what should be considered after a bug like this one is:

  • Do you really want new features/modules to be enabled by default?

Depending on your setup, things like this can really bite you where you don't appreciate being bitten. Especially in larger setups, I really don't like the idea of asking thousands of minions for data if I have not explicitly told the master to do so.

Also, in the current minion and master config I don't see any option to disable this feature. There is only a _mine_interval with a default of 60. The mine module itself mentions a function_cache which is not mentioned in the current master and/or minion config, and not even in salt/conf.py.

Is this feature really ready to be used in a production environment and to be enabled by default?

@max-arnold
Contributor

Looking at the code, setting mine_interval to 0 to disable the Salt Mine is not a good idea (it leads to a fork bomb). I think the schedule module needs an additional guard against zero (and negative?) intervals; a minimal sketch of such a guard follows at the end of this comment. The only way to disable the mine module is to add two lines to the minion config:

disable_modules:
  - mine

The only downside of this approach is the following message every second in the minion log at INFO level:

[salt.utils.schedule][INFO    ] Invalid function: __mine_interval in job mine.update. Ignoring.
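
The guard mentioned above could be as simple as clamping the configured value before the scheduler uses it (a sketch, not the actual salt.utils.schedule code; the minutes-to-seconds conversion is an assumption):

def effective_interval(minutes, default=60):
    """Return the job period in seconds, falling back to the default when the
    configured interval is zero, negative, or not a number."""
    try:
        minutes = int(minutes)
    except (TypeError, ValueError):
        return default * 60
    return (minutes if minutes > 0 else default) * 60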

@max-arnold
Contributor

Also, utils/schedule.py probably needs additional protection against time jitter. I've seen schedulers that can trigger extra events if time jumps backwards by one second or more.

This can happen if the system administrator puts a command like ntpdate -b pool.ntp.org in crontab every hour and another software scheduler runs some job at a more fine-grained interval (less than a minute, for example every hour at second 0 or 1). This can lead to hard-to-reproduce double runs when time jumps back slightly.

I even wrote a module to avoid some cases like this: https://bitbucket.org/marnold/py-cron/src/7a4a20c9a69c2364fe5fba1c3300647b52f51f6b/pycron.py?at=default
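
A sketch of the kind of guard meant here: drive the schedule from a monotonic clock so a backwards jump of the wall clock cannot trigger an extra run (time.monotonic() needs Python 3.3+; on the Python 2.7 used in this thread it would have to be emulated, so this is illustrative only):

import time

class IntervalJob(object):
    """Fire at most once per `interval` seconds, measured on the monotonic
    clock, so wall-clock jumps (e.g. ntpdate -b) cannot cause double runs."""
    def __init__(self, interval):
        self.interval = interval
        self.next_run = time.monotonic()

    def due(self):
        now = time.monotonic()
        if now >= self.next_run:
            self.next_run += self.interval  # anchor to the schedule, not to "now"
            return True
        return False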

@mdanilescu

I'm running Salt 0.16.4 on Ubuntu 12.04 and just experienced this issue. It basically made the machine unusable until I finally managed to kill all salt-minion processes.
Was this issue solved in the newer versions? I saw that the milestone was set for 0.17.0.
Thanks.

@basepi
Contributor

basepi commented Sep 25, 2013

No, we have not solved the issue.

The solution is to disable the mine using this in the minion config:

disable_modules:
  - mine

Of course, you should only be seeing this in a masterless setup. If your minion is connected to the master, it may be a different issue.

@anveo

anveo commented Oct 3, 2013

I'm also experiencing this. Salt 0.16.4 on Ubuntu 12.04 and the minions are connecting to a master. I install via pip and place an upstart config which looks like https://gist.github.com/anveo/6816398. Let me know if there is any other information I can give.

@basepi
Contributor

basepi commented Oct 3, 2013

@anveo If you're not running masterless, then it's a different issue. Can you please open a new issue? The extra processes here are unclosed ZMQ processes from trying (unsuccessfully) to connect to a master for the Salt Mine. (This only happens in masterless setups.)

@tyre

tyre commented Nov 14, 2013

We're also seeing this on 0.16.4: 108 processes on one server and 148 on another, taking up GBs of memory. I'd be happy to provide more information on our setup if it would help debug the issue.

@basepi
Contributor

basepi commented Nov 14, 2013

This was fixed in #8222. You should try 0.17.2; that's the first version containing the fix.
