
salt-master EventPublisher crashes every few days #36285

Closed
czhong111 opened this issue Sep 14, 2016 · 13 comments
Assignees
Labels
Bug (broken, incorrect, or confusing behavior) · cannot-reproduce (cannot be replicated with info/context provided) · Core (relates to code central or existential to Salt) · P4 (Priority 4) · Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) · severity-high (2nd top severity, seen by most users, causes major problems)
Milestone

Comments

@czhong111

czhong111 commented Sep 14, 2016

Description of Issue/Question

Every few days (1 or 2 days) salt-master stops responding and we need to restart salt-master.

Setup

We have about 50 salt-syndics that connect to this salt-master. The master of masters and the syndics run a very recent version (2016.3.3). Roughly 30k+ Salt minions are connected through the syndics.

Steps to Reproduce Issue

A few hours after restarting salt-master, the EventPublisher process frequently uses 100% of a CPU core, and top shows the following:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                        
31418 saltadmi  20   0  691m 301m 1540 S  100.0  0.1   4:16.11 /usr/bin/python2.6 /usr/bin/salt-master -d EventPublisher   

Running "strace -p 31418" shows a large number of EAGAIN errors, for example:

sendto(300, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(230, "\202\244body\332\3\256salt/auth\n\n\203\246_stamp\272201"..., 957, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(279, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(261, "\202\244body\332\2Rsalt/auth\n\n\203\246_stamp\272201"..., 609, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(253, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(258, "\202\244body\332\3\257salt/auth\n\n\203\246_stamp\272201"..., 958, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(234, "\202\244body\332\0\22020160914105524082395\n\n\203"..., 159, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(250, "\202\244body\332\3\256salt/auth\n\n\203\246_stamp\272201"..., 957, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(238, "\202\244body\332\3\217syndic/YH_SedMaster_01/"..., 926, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
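
For context on those errors: EAGAIN from sendto() on a non-blocking socket simply means the kernel send buffer for that connection is full because the peer is not reading. The following is a minimal sketch, not Salt code, that reproduces the same errno with a plain socket pair whose receiving end never reads:

# Minimal illustration (not Salt code): a non-blocking sender whose peer
# never reads eventually gets EAGAIN/EWOULDBLOCK, exactly like the
# sendto() calls in the strace output above.
import errno
import socket

sender, receiver = socket.socketpair()   # the receiver never calls recv()
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 4096)
except socket.error as err:              # BlockingIOError on Python 3
    assert err.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    print("send buffer filled after %d bytes -> EAGAIN" % sent)
finally:
    sender.close()
    receiver.close()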

Checking the master log on the master of masters, we can see that an auth event is captured for every one of the 30k+ minions connected through the syndics, like:

2016-09-14 11:26:58,786 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag salt/auth
2016-09-14 11:26:58,792 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag syndic/XZ_SedMaster_02/salt/auth
2016-09-14 11:26:59,219 [salt.utils.event                                     ][DEBUG   ][31550] Sending event - data = {'_stamp': '2016-09-14T03:26:59.218695', 'tag': 'salt/auth', 'data': {'result': True, '_stamp': '2016-09-14T03:26:58.804824', 'id': '10.104.103.105', 'pub': '-----BEGIN PUBLIC KEY-----\nMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAwbJFCSx9jwl0YTT1UTwd\n3PlYijCt3L0mQ3c7QCWuZkuMc1eRIRUvp2awix2ge50UOWa/h8yJ8tD19wbZdMRU\n7TnD0iKucsiMpoaBCiO3uBwYKu04dlDZCxhTc32SGmfIZawVIgXsv0ixKZlQrrSA\n8xgqMxDcVy1Y+pSswE6XXDfMnt+Fr/SGEOA4HsVm4QP57Arvg6OK7jTRH+XSEAJp\nks8drOY4vKgT/35OsSexgv/Yqv5SytKhN3Uzo8ccPvkX9B/zadbp7FlFl2sGindm\ngwB88BodpliK2cEcCkekZPKsC/D56aeuHBflqkmOuR+p2Qm/a9lTkF91AiikJNnI\nOAMDTLERutsOibmbJTM0KhIaqAYLctqpZgx5ZO+TZb1BDM9kA4vFT5cSAbGcK/r/\nTtOgeh7JE2pcIGurJBntF++OzJISDk+CgrHMZeEKFaJAneGmyWWPkycvaoJFa0/a\n8O9Zxmjr/BqIGcVunAA7fSVDEKq62Q3uh/RajVRMaUX9KPCo6aO4SN17S8U6hUsi\nMoOX3I9axH/lDdBRxxmlLQg2fpiemhclMjbeiYZDqzdNxaMT4XXrFMtMYQH+rtrD\nyVkYcR8yJFq2qQuossJyN4o0DDbbJZF9Fc5an0bg5ZlTpa8lgG+YhR9O9f7NFt8/\n+RAsyyUJmO2OGx3ZZiEgXwUCAwEAAQ==\n-----END PUBLIC KEY-----\n', 'act': 'accept'}}
2016-09-14 11:26:59,220 [salt.utils.event                                     ][DEBUG   ][31550] Sending event - data = {'result': True, '_stamp': '2016-09-14T03:26:59.219794', 'id': '10.104.103.105', 'pub': '-----BEGIN PUBLIC KEY-----\nMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAwbJFCSx9jwl0YTT1UTwd\n3PlYijCt3L0mQ3c7QCWuZkuMc1eRIRUvp2awix2ge50UOWa/h8yJ8tD19wbZdMRU\n7TnD0iKucsiMpoaBCiO3uBwYKu04dlDZCxhTc32SGmfIZawVIgXsv0ixKZlQrrSA\n8xgqMxDcVy1Y+pSswE6XXDfMnt+Fr/SGEOA4HsVm4QP57Arvg6OK7jTRH+XSEAJp\nks8drOY4vKgT/35OsSexgv/Yqv5SytKhN3Uzo8ccPvkX9B/zadbp7FlFl2sGindm\ngwB88BodpliK2cEcCkekZPKsC/D56aeuHBflqkmOuR+p2Qm/a9lTkF91AiikJNnI\nOAMDTLERutsOibmbJTM0KhIaqAYLctqpZgx5ZO+TZb1BDM9kA4vFT5cSAbGcK/r/\nTtOgeh7JE2pcIGurJBntF++OzJISDk+CgrHMZeEKFaJAneGmyWWPkycvaoJFa0/a\n8O9Zxmjr/BqIGcVunAA7fSVDEKq62Q3uh/RajVRMaUX9KPCo6aO4SN17S8U6hUsi\nMoOX3I9axH/lDdBRxxmlLQg2fpiemhclMjbeiYZDqzdNxaMT4XXrFMtMYQH+rtrD\nyVkYcR8yJFq2qQuossJyN4o0DDbbJZF9Fc5an0bg5ZlTpa8lgG+YhR9O9f7NFt8/\n+RAsyyUJmO2OGx3ZZiEgXwUCAwEAAQ==\n-----END PUBLIC KEY-----\n', 'act': 'accept'}
2016-09-14 11:26:59,226 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag salt/auth
2016-09-14 11:26:59,236 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag syndic/YH_SedMaster_05/salt/auth

.....
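
To get a feel for how large that auth-event storm is, one rough way is to count the salt/auth reactor lines per minute in the master debug log. A small sketch (the log path is an assumption based on this master's log_file setting; adjust it to your setup):

# Rough sketch (not Salt code): count salt/auth reactor lines per minute in
# the master debug log to size the auth-event storm.
LOG = "/opt/log/salt/master"

per_minute = {}
with open(LOG) as fh:
    for line in fh:
        if "Gathering reactors for tag" in line and "salt/auth" in line:
            minute = line[:16]                     # e.g. "2016-09-14 11:26"
            per_minute[minute] = per_minute.get(minute, 0) + 1

for minute in sorted(per_minute):
    print("%s  %6d salt/auth events" % (minute, per_minute[minute]))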

Versions Report

The master of masters and the syndics are all the same version.

[root@Zabbix_Server_16 ~]# salt --versions-report
Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: 1.4.1
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.20.2
           Mako: 0.3.4
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3c1
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: redhat 6.3 Santiago
        machine: x86_64
        release: 2.6.32-279.el6.x86_64
         system: Linux
        version: Red Hat Enterprise Linux Server 6.3 Santiago
@czhong111
Author

czhong111 commented Sep 14, 2016

Additional info
The master of masters hardware has 256 GB of memory and a 24-core CPU.
The config file on the master of masters, /etc/salt/master:
order_masters: true
id: top_master
master_id: top_master
interface: top_master
worker_threads: 50
cachedir: /opt/cache/salt/master
log_file: /opt/log/salt/master
log_level_logfile: debug
timeout: 60
user: saltadmin
token_expire: 90000
keep_jobs: 5

The /etc/salt/master on the syndic masters:
syndic_master: top_master
zmq_filtering: true
worker_threads: 30
cachedir: /opt/cache/salt/master
keep_jobs: 5

@czhong111
Author

czhong111 commented Sep 14, 2016

It seems similar to #35503.
Can anyone help?

@czhong111 czhong111 changed the title salt-master EventPublisher crashes every few days salt-master EventPublisher crashes every day Sep 14, 2016
@czhong111 czhong111 changed the title salt-master EventPublisher crashes every day salt-master EventPublisher crashes every few days Sep 14, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 15, 2016

@czhong111 just a couple of follow-up questions that will help us dive in and try to replicate this.

Do you notice when this behavior occurs if you are doing something particular at the time? Like running a highstate for example?

Just to verify, are your salt-minions all 2016.3.3 as well?

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P4 Priority 4 labels Sep 15, 2016
@Ch3LL Ch3LL added this to the Approved milestone Sep 15, 2016
@meggiebot meggiebot modified the milestones: C 3, Approved Sep 15, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 15, 2016

This might be a long shot, but I'm wondering if you are also running into issue #35480, where restarting the master does not kill all the master processes when you have a custom cachedir and increased worker threads. Can you test whether running service salt-master stop kills all the processes or leaves some behind? In my testing of that issue I did notice increased CPU usage because of the extra processes, but I don't remember seeing EventPublisher errors, so it might not be related to this issue; I just wanted to give it a try. Thanks.

@czhong111
Author

czhong111 commented Sep 18, 2016

Do you notice when this behavior occurs if you are doing something particular at the time? Like running a highstate for example?

When using a command like

salt SedMaster cmd.run "salt-key -y -a xx.xx.xx.xx" -t 3

to accept a key on a syndic master, the command itself uses some CPU, but meanwhile the EventPublisher process uses far too much CPU. If I restart the salt-master service and run the same command, the command uses CPU as normal and the EventPublisher's CPU usage stays below 0.1%.

Just to verify, are your salt-minions all 2016.3.3 as well?

No, the salt-minions are mostly salt-2015.5.3-5 and salt-2015.5.3-4.

Can you test whether running service salt-master stop kills all the processes or leaves some behind?
I use crontab to restart salt-master at 6:00 AM every day:

[root@Zabbix_Server_16 ~]# crontab -l
00 06 * * *  /etc/init.d/salt-master restart

Checking the salt-master processes with ps shows that the restart indeed does not kill all of them.
BTW, worker_threads is set to 50.

[root@Zabbix_Server_16 ~]# ps aux|grep salt
root      4844  0.0  0.0 419680 24884 ?        S    Sep16   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520       4904  0.0  0.0 514708 31900 ?        Sl   Sep16   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
root      6557  0.0  0.0 419680 24880 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520       6612  0.0  0.0 514704 31888 ?        Sl   Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
520       8507  0.0  0.0 347940 29564 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-api -d NetAPIProcessManager
520       8660  1.3  0.1 8654008 482096 ?      Sl   Sep14  73:24 /usr/bin/python2.6 /usr/bin/salt-api -d
root     15317  0.0  0.0 419684 24888 ?        S    Sep15   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      15379  0.0  0.0 514712 31908 ?        Sl   Sep15   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
520      21359  0.1  0.0 514708 31456 ?        Sl   06:00   0:12 /usr/bin/python2.6 /usr/bin/salt-master -d ProcessManager
root     21360  0.0  0.0 419680 24884 ?        S    06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      21361  0.0  0.0 432472 27580 ?        Sl   06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d ZeroMQPubServerChannel
520      21364  0.0  0.0 590492 191452 ?       S    06:00   0:06 /usr/bin/python2.6 /usr/bin/salt-master -d EventPublisher
520      21366  3.2  0.0 444424 42804 ?        S    06:00   5:37 /usr/bin/python2.6 /usr/bin/salt-master -d Maintenance
520      21413  1.2  0.0 1112436 34448 ?       Sl   06:00   2:06 /usr/bin/python2.6 /usr/bin/salt-master -d Reactor
520      21415  6.7  0.0 439916 32812 ?        Sl   06:00  11:33 /usr/bin/python2.6 /usr/bin/salt-master -d
520      21447  0.0  0.0 514708 29972 ?        S    06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d ReqServer_ProcessManager
520      21448  0.0  0.0 4152148 37864 ?       Sl   06:00   0:04 /usr/bin/python2.6 /usr/bin/salt-master -d MWorkerQueue
520      21454  0.0  0.0 532244 37300 ?        Sl   06:00   0:02 /usr/bin/python2.6 /usr/bin/salt-master -d MWorker-0
...
520      21625  0.0  0.0 532336 37352 ?        Sl   06:00   0:02 /usr/bin/python2.6 /usr/bin/salt-master -d MWorker-49
root     26418  0.0  0.0 419684 24888 ?        S    Sep17   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      26475  0.0  0.0 514712 31900 ?        Sl   Sep17   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
root     26488  0.0  0.0 590288 32748 ?        Sl   Sep14   3:25 /usr/bin/python2.6 /usr/bin/salt-minion -d KeepAlive MinionProcessManager
root     26489  0.0  0.0 494172 23760 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-minion -d KeepAlive MultiprocessingLoggingQ
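
As a quick way to see which of those salt-master processes are strays from earlier restarts, here is a sketch that lists them by start time (it assumes psutil >= 2.0 is installed, which is an extra dependency and not something Salt ships). Anything started before the 06:00 restart is a candidate for a manual kill after service salt-master stop.

# Sketch (assumes psutil >= 2.0): list salt-master processes sorted by
# start time so leftovers from previous restarts stand out.
import time
import psutil

rows = []
for proc in psutil.process_iter():
    try:
        cmd = " ".join(proc.cmdline())
        if "salt-master" in cmd:
            rows.append((proc.create_time(), proc.pid, cmd))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue

for started, pid, cmd in sorted(rows):
    stamp = time.strftime("%b %d %H:%M", time.localtime(started))
    print("%s  pid=%-6d %s" % (stamp, pid, cmd))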

@meggiebot meggiebot modified the milestones: C 2, C 3 Sep 19, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 19, 2016

Thanks for all the info! So if you stop the master process, then manually kill all the remaining processes and start the master back up, does that help?

Also, any chance you could try to upgrade the minions and see if it helps the situation? I know you have a lot, so I understand this might not be feasible.

@czhong111
Author

czhong111 commented Sep 20, 2016

Thanks for all the info! So if you stop the master process, then manually kill all the remaining processes and start the master back up, does that help?
Yes, manually killing all the processes did help.

Also, any chance you could try to upgrade the minions and see if it helps the situation? I know you have a lot, so I understand this might not be feasible.
Sorry, the minions are pinned to a stable version and we are not allowed to upgrade that many of them to 2016.3.3, because the VMs running the minions are used by another department.
Also, yesterday I rolled the top master back to 2015.8.12 (the syndics are still on 2016.3.3), and that seems to work well.

The process info is as follows:

[root@salt]# ll /proc/eventPublisher_PID/fd | grep socket | wc -l
705
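
The same socket count can also be collected programmatically, which makes it easy to graph over time. A small sketch that reads /proc directly (the PID below is a placeholder for the EventPublisher process id):

# Sketch: count socket fds held by a process, equivalent to
# `ls -l /proc/<pid>/fd | grep socket | wc -l`. The PID is a placeholder.
import os

def count_socket_fds(pid):
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:          # fd closed between listdir() and readlink()
            pass
    return count

print(count_socket_fds(12345))   # replace with the EventPublisher PID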

[root@salt]# strace -p eventPublisher_PID

write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
...

@DmitryKuzmenko
Contributor

There is a known issue that looks related. It's fixed in #36024 and will be released in 2016.3.4.
@czhong111 have you checked the salt-master memory consumption right before it crashes? Without the fix, the EventPublisher process leaks memory because one of the master subprocesses subscribes to the event bus but never reads the messages. The salt-key command you mentioned sends one event per key added to the master, so maybe there are a lot of events and the master is being killed by the OOM killer?
Another issue related to master shutdown is #36184.
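
In case it helps with confirming that, here is a minimal sketch that logs the EventPublisher's resident set size once a minute, so the memory trend right before a crash is captured. It reads plain /proc with no extra dependencies; the PID and the output path are placeholders.

# Sketch: append the EventPublisher's RSS to a file once a minute.
# PID and OUT are placeholders; adjust them for your host.
import time

PID = 12345                                    # current EventPublisher PID
OUT = "/var/tmp/eventpublisher_rss.log"

def rss_kb(pid):
    with open("/proc/%d/status" % pid) as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])    # reported in kB
    return 0

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(OUT, "a") as log:
        log.write("%s %d kB\n" % (stamp, rss_kb(PID)))
    time.sleep(60)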

@DmitryKuzmenko
Contributor

@czhong111 BTW, #36024 just removes two lines of code; you could apply the change manually on your master and try it.

@czhong111
Author

Thanks for the reply. It did leak memory before the master crashed. I will remove those lines manually and see if it happens again.

@DmitryKuzmenko
Contributor

@czhong111 thank you!

@DmitryKuzmenko DmitryKuzmenko modified the milestones: C 1, C 2 Nov 1, 2016
@meggiebot meggiebot added this to the C 0 milestone Nov 28, 2016
@meggiebot meggiebot removed this from the C 1 milestone Nov 28, 2016
@meggiebot meggiebot modified the milestones: Nitrogen 2, Nitrogen 1 Dec 19, 2016
@meggiebot meggiebot modified the milestones: Nitrogen 3, Nitrogen 2 Jan 18, 2017
@meggiebot meggiebot modified the milestones: Nitrogen 4, Nitrogen 3 Feb 6, 2017
@Ch3LL Ch3LL modified the milestones: Nitrogen 5, Nitrogen 4 Feb 27, 2017
@DmitryKuzmenko
Contributor

@czhong111 I've tried for a number of days to reproduce your issue with different combinations of states, but without success.
Reviewing the code, I found that some significant issues have been fixed in Salt since 2016.3.3. Could you please try upgrading your master to 2016.3.4 or higher and check whether the issue is gone?

@DmitryKuzmenko DmitryKuzmenko added cannot-reproduce cannot be replicated with info/context provided Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged labels Apr 10, 2017
@cachedout
Contributor

Closed for lack of response. If we do get a response, we can easily re-open this. Thanks.
