
salt-master EventPublisher crashes every few days #36285

Closed
czhong111 opened this issue Sep 14, 2016 · 13 comments
Assignees
Labels
Bug (broken, incorrect, or confusing behavior) · cannot-reproduce (cannot be replicated with info/context provided) · Core (relates to code central or existential to Salt) · P4 (Priority 4) · Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) · severity-high (2nd top severity, seen by most users, causes major problems)
Milestone

Comments

@czhong111

czhong111 commented Sep 14, 2016

Description of Issue/Question

Every few days (1 or 2 days) salt-master stops responding and we need to restart salt-master.

Setup

We have about 50 salt-syndics that connect to this salt-master. The master of masters and the syndics run a very recent version (2016.3.3). Roughly 30k+ Salt minions are connected through the syndics.

Steps to Reproduce Issue

A few hours after restarting salt-master, the EventPublisher process frequently uses 100% of a CPU core, and top shows the following:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                        
31418 saltadmi  20   0  691m 301m 1540 S  100.0  0.1   4:16.11 /usr/bin/python2.6 /usr/bin/salt-master -d EventPublisher   

Running "strace -p 31418" shows a large number of EAGAIN errors, for example:

sendto(300, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(230, "\202\244body\332\3\256salt/auth\n\n\203\246_stamp\272201"..., 957, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(279, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(261, "\202\244body\332\2Rsalt/auth\n\n\203\246_stamp\272201"..., 609, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(253, "\202\244body\332\3\220syndic/XZ_SedMaster_04/"..., 927, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(258, "\202\244body\332\3\257salt/auth\n\n\203\246_stamp\272201"..., 958, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(234, "\202\244body\332\0\22020160914105524082395\n\n\203"..., 159, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(250, "\202\244body\332\3\256salt/auth\n\n\203\246_stamp\272201"..., 957, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(238, "\202\244body\332\3\217syndic/YH_SedMaster_01/"..., 926, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
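
For context on those errors: EAGAIN from sendto() on a non-blocking socket simply means the kernel send buffer for that connection is full because the peer is not reading. The following is a minimal sketch, not Salt code, that reproduces the same errno with a plain socket pair whose receiving end never reads:

# Minimal illustration (not Salt code): a non-blocking sender whose peer
# never reads eventually gets EAGAIN/EWOULDBLOCK, exactly like the
# sendto() calls in the strace output above.
import errno
import socket

sender, receiver = socket.socketpair()   # the receiver never calls recv()
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 4096)
except socket.error as err:              # BlockingIOError on Python 3
    assert err.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    print("send buffer filled after %d bytes -> EAGAIN" % sent)
finally:
    sender.close()
    receiver.close()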

Checking the master log on the master of masters, we can see that an auth event is captured for every one of the 30k+ minions connected through the syndics, like:

2016-09-14 11:26:58,786 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag salt/auth
2016-09-14 11:26:58,792 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag syndic/XZ_SedMaster_02/salt/auth
2016-09-14 11:26:59,219 [salt.utils.event                                     ][DEBUG   ][31550] Sending event - data = {'_stamp': '2016-09-14T03:26:59.218695', 'tag': 'salt/auth', 'data': {'result': True, '_stamp': '2016-09-14T03:26:58.804824', 'id': '10.104.103.105', 'pub': '-----BEGIN PUBLIC KEY-----\nMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAwbJFCSx9jwl0YTT1UTwd\n3PlYijCt3L0mQ3c7QCWuZkuMc1eRIRUvp2awix2ge50UOWa/h8yJ8tD19wbZdMRU\n7TnD0iKucsiMpoaBCiO3uBwYKu04dlDZCxhTc32SGmfIZawVIgXsv0ixKZlQrrSA\n8xgqMxDcVy1Y+pSswE6XXDfMnt+Fr/SGEOA4HsVm4QP57Arvg6OK7jTRH+XSEAJp\nks8drOY4vKgT/35OsSexgv/Yqv5SytKhN3Uzo8ccPvkX9B/zadbp7FlFl2sGindm\ngwB88BodpliK2cEcCkekZPKsC/D56aeuHBflqkmOuR+p2Qm/a9lTkF91AiikJNnI\nOAMDTLERutsOibmbJTM0KhIaqAYLctqpZgx5ZO+TZb1BDM9kA4vFT5cSAbGcK/r/\nTtOgeh7JE2pcIGurJBntF++OzJISDk+CgrHMZeEKFaJAneGmyWWPkycvaoJFa0/a\n8O9Zxmjr/BqIGcVunAA7fSVDEKq62Q3uh/RajVRMaUX9KPCo6aO4SN17S8U6hUsi\nMoOX3I9axH/lDdBRxxmlLQg2fpiemhclMjbeiYZDqzdNxaMT4XXrFMtMYQH+rtrD\nyVkYcR8yJFq2qQuossJyN4o0DDbbJZF9Fc5an0bg5ZlTpa8lgG+YhR9O9f7NFt8/\n+RAsyyUJmO2OGx3ZZiEgXwUCAwEAAQ==\n-----END PUBLIC KEY-----\n', 'act': 'accept'}}
2016-09-14 11:26:59,220 [salt.utils.event                                     ][DEBUG   ][31550] Sending event - data = {'result': True, '_stamp': '2016-09-14T03:26:59.219794', 'id': '10.104.103.105', 'pub': '-----BEGIN PUBLIC KEY-----\nMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAwbJFCSx9jwl0YTT1UTwd\n3PlYijCt3L0mQ3c7QCWuZkuMc1eRIRUvp2awix2ge50UOWa/h8yJ8tD19wbZdMRU\n7TnD0iKucsiMpoaBCiO3uBwYKu04dlDZCxhTc32SGmfIZawVIgXsv0ixKZlQrrSA\n8xgqMxDcVy1Y+pSswE6XXDfMnt+Fr/SGEOA4HsVm4QP57Arvg6OK7jTRH+XSEAJp\nks8drOY4vKgT/35OsSexgv/Yqv5SytKhN3Uzo8ccPvkX9B/zadbp7FlFl2sGindm\ngwB88BodpliK2cEcCkekZPKsC/D56aeuHBflqkmOuR+p2Qm/a9lTkF91AiikJNnI\nOAMDTLERutsOibmbJTM0KhIaqAYLctqpZgx5ZO+TZb1BDM9kA4vFT5cSAbGcK/r/\nTtOgeh7JE2pcIGurJBntF++OzJISDk+CgrHMZeEKFaJAneGmyWWPkycvaoJFa0/a\n8O9Zxmjr/BqIGcVunAA7fSVDEKq62Q3uh/RajVRMaUX9KPCo6aO4SN17S8U6hUsi\nMoOX3I9axH/lDdBRxxmlLQg2fpiemhclMjbeiYZDqzdNxaMT4XXrFMtMYQH+rtrD\nyVkYcR8yJFq2qQuossJyN4o0DDbbJZF9Fc5an0bg5ZlTpa8lgG+YhR9O9f7NFt8/\n+RAsyyUJmO2OGx3ZZiEgXwUCAwEAAQ==\n-----END PUBLIC KEY-----\n', 'act': 'accept'}
2016-09-14 11:26:59,226 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag salt/auth
2016-09-14 11:26:59,236 [salt.utils.reactor                                   ][DEBUG   ][31469] Gathering reactors for tag syndic/YH_SedMaster_05/salt/auth

.....
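
To get a feel for how large that auth-event storm is, one rough way is to count the salt/auth reactor lines per minute in the master debug log. A small sketch (the log path is an assumption based on this master's log_file setting; adjust it to your setup):

# Rough sketch (not Salt code): count salt/auth reactor lines per minute in
# the master debug log to size the auth-event storm.
LOG = "/opt/log/salt/master"

per_minute = {}
with open(LOG) as fh:
    for line in fh:
        if "Gathering reactors for tag" in line and "salt/auth" in line:
            minute = line[:16]                     # e.g. "2016-09-14 11:26"
            per_minute[minute] = per_minute.get(minute, 0) + 1

for minute in sorted(per_minute):
    print("%s  %6d salt/auth events" % (minute, per_minute[minute]))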

Versions Report

The master of masters and the syndics are all the same version.

[root@Zabbix_Server_16 ~]# salt --versions-report
Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: 1.4.1
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.20.2
           Mako: 0.3.4
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3c1
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: redhat 6.3 Santiago
        machine: x86_64
        release: 2.6.32-279.el6.x86_64
         system: Linux
        version: Red Hat Enterprise Linux Server 6.3 Santiago
@czhong111
Author

czhong111 commented Sep 14, 2016

Additional info
The master of masters hardware has 256 GB of memory and a 24-core CPU.
The config file on the master of masters, /etc/salt/master:
order_masters: true
id: top_master
master_id: top_master
interface: top_master
worker_threads: 50
cachedir: /opt/cache/salt/master
log_file: /opt/log/salt/master
log_level_logfile: debug
timeout: 60
user: saltadmin
token_expire: 90000
keep_jobs: 5

The /etc/salt/master on the syndic masters:
syndic_master: top_master
zmq_filtering: true
worker_threads: 30
cachedir: /opt/cache/salt/master
keep_jobs: 5

@czhong111
Author

czhong111 commented Sep 14, 2016

It seems similar to #35503.
Can anyone help?

@czhong111 czhong111 changed the title salt-master EventPublisher crashes every few days salt-master EventPublisher crashes every day Sep 14, 2016
@czhong111 czhong111 changed the title salt-master EventPublisher crashes every day salt-master EventPublisher crashes every few days Sep 14, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 15, 2016

@czhong111 just a couple of follow-up questions that will help us dive in and try to replicate this.

Do you notice when this behavior occurs if you are doing something particular at the time? Like running a highstate for example?

Just to verify, are your salt-minions all 2016.3.3 as well?

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P4 Priority 4 labels Sep 15, 2016
@Ch3LL Ch3LL added this to the Approved milestone Sep 15, 2016
@meggiebot meggiebot modified the milestones: C 3, Approved Sep 15, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 15, 2016

This might be a long shot, but I'm wondering if you are also running into issue #35480, where restarting the master does not kill all the master processes when you have a custom cachedir and increased worker threads. Can you test whether running service salt-master stop kills all the processes or leaves some behind? In my testing of that issue I did notice increased CPU usage because of the extra processes, but I don't remember seeing EventPublisher errors, so it might not be related to this issue; I just wanted to give it a try. Thanks.

@czhong111
Author

czhong111 commented Sep 18, 2016

Do you notice when this behavior occurs if you are doing something particular at the time? Like running a highstate for example?

When using a command like

salt SedMaster cmd.run "salt-key -y -a xx.xx.xx.xx" -t 3

to accept a key on a syndic master, the command itself uses some CPU, but meanwhile the EventPublisher process uses far too much CPU. If I restart the salt-master service and run the same command, the command uses CPU as normal and the EventPublisher's CPU usage stays below 0.1%.

Just to verify, are your salt-minions all 2016.3.3 as well?

No, the salt-minions are mostly salt-2015.5.3-5 and salt-2015.5.3-4.

Can you test whether running service salt-master stop kills all the processes or leaves some behind?
I use crontab to restart salt-master at 6:00 AM every day:

[root@Zabbix_Server_16 ~]# crontab -l
00 06 * * *  /etc/init.d/salt-master restart

Checking the salt-master processes with ps shows that the restart indeed does not kill all of them.
BTW, worker_threads is set to 50.

[root@Zabbix_Server_16 ~]# ps aux|grep salt
root      4844  0.0  0.0 419680 24884 ?        S    Sep16   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520       4904  0.0  0.0 514708 31900 ?        Sl   Sep16   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
root      6557  0.0  0.0 419680 24880 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520       6612  0.0  0.0 514704 31888 ?        Sl   Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
520       8507  0.0  0.0 347940 29564 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-api -d NetAPIProcessManager
520       8660  1.3  0.1 8654008 482096 ?      Sl   Sep14  73:24 /usr/bin/python2.6 /usr/bin/salt-api -d
root     15317  0.0  0.0 419684 24888 ?        S    Sep15   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      15379  0.0  0.0 514712 31908 ?        Sl   Sep15   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
520      21359  0.1  0.0 514708 31456 ?        Sl   06:00   0:12 /usr/bin/python2.6 /usr/bin/salt-master -d ProcessManager
root     21360  0.0  0.0 419680 24884 ?        S    06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      21361  0.0  0.0 432472 27580 ?        Sl   06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d ZeroMQPubServerChannel
520      21364  0.0  0.0 590492 191452 ?       S    06:00   0:06 /usr/bin/python2.6 /usr/bin/salt-master -d EventPublisher
520      21366  3.2  0.0 444424 42804 ?        S    06:00   5:37 /usr/bin/python2.6 /usr/bin/salt-master -d Maintenance
520      21413  1.2  0.0 1112436 34448 ?       Sl   06:00   2:06 /usr/bin/python2.6 /usr/bin/salt-master -d Reactor
520      21415  6.7  0.0 439916 32812 ?        Sl   06:00  11:33 /usr/bin/python2.6 /usr/bin/salt-master -d
520      21447  0.0  0.0 514708 29972 ?        S    06:00   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d ReqServer_ProcessManager
520      21448  0.0  0.0 4152148 37864 ?       Sl   06:00   0:04 /usr/bin/python2.6 /usr/bin/salt-master -d MWorkerQueue
520      21454  0.0  0.0 532244 37300 ?        Sl   06:00   0:02 /usr/bin/python2.6 /usr/bin/salt-master -d MWorker-0
...
520      21625  0.0  0.0 532336 37352 ?        Sl   06:00   0:02 /usr/bin/python2.6 /usr/bin/salt-master -d MWorker-49
root     26418  0.0  0.0 419684 24888 ?        S    Sep17   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d MultiprocessingLoggingQueue
520      26475  0.0  0.0 514712 31900 ?        Sl   Sep17   0:00 /usr/bin/python2.6 /usr/bin/salt-master -d
root     26488  0.0  0.0 590288 32748 ?        Sl   Sep14   3:25 /usr/bin/python2.6 /usr/bin/salt-minion -d KeepAlive MinionProcessManager
root     26489  0.0  0.0 494172 23760 ?        S    Sep14   0:00 /usr/bin/python2.6 /usr/bin/salt-minion -d KeepAlive MultiprocessingLoggingQ
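
As a quick way to see which of those salt-master processes are strays from earlier restarts, here is a sketch that lists them by start time (it assumes psutil >= 2.0 is installed, which is an extra dependency and not something Salt ships). Anything started before the 06:00 restart is a candidate for a manual kill after service salt-master stop.

# Sketch (assumes psutil >= 2.0): list salt-master processes sorted by
# start time so leftovers from previous restarts stand out.
import time
import psutil

rows = []
for proc in psutil.process_iter():
    try:
        cmd = " ".join(proc.cmdline())
        if "salt-master" in cmd:
            rows.append((proc.create_time(), proc.pid, cmd))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue

for started, pid, cmd in sorted(rows):
    stamp = time.strftime("%b %d %H:%M", time.localtime(started))
    print("%s  pid=%-6d %s" % (stamp, pid, cmd))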

@meggiebot meggiebot modified the milestones: C 2, C 3 Sep 19, 2016
@Ch3LL
Contributor

Ch3LL commented Sep 19, 2016

Thanks for all the info! So if you stop the master process, then manually kill all the remaining processes and start the master back up, does that help?

Also, any chance you could try to upgrade the minions and see if it helps the situation? I know you have a lot, so I understand this might not be feasible.

@czhong111
Author

czhong111 commented Sep 20, 2016

Thanks for all the info! So if you stop the master process, then manually kill all the remaining processes and start the master back up, does that help?
Yes, manually killing all the processes did help.

Also, any chance you could try to upgrade the minions and see if it helps the situation? I know you have a lot, so I understand this might not be feasible.
Sorry, the minions are pinned to a stable version and we are not allowed to upgrade that many of them to 2016.3.3, because the VMs running the minions are used by another department.
Also, yesterday I rolled the top master back to 2015.8.12 (the syndics are still on 2016.3.3), and that seems to work well.

The process info is as follows:

[root@salt]# ll /proc/eventPublisher_PID/fd | grep socket | wc -l
705
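
The same socket count can also be collected programmatically, which makes it easy to graph over time. A small sketch that reads /proc directly (the PID below is a placeholder for the EventPublisher process id):

# Sketch: count socket fds held by a process, equivalent to
# `ls -l /proc/<pid>/fd | grep socket | wc -l`. The PID is a placeholder.
import os

def count_socket_fds(pid):
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:          # fd closed between listdir() and readlink()
            pass
    return count

print(count_socket_fds(12345))   # replace with the EventPublisher PID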

[root@salt]# strace -p eventPublisher_PID

write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, -1)   = 1 ([{fd=12, revents=POLLIN}])
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=11, events=POLLIN}], 1, 0)    = 0 (Timeout)
...

@DmitryKuzmenko
Contributor

There is a known issue that looks related. It's fixed in #36024 and will be released in 2016.3.4.
@czhong111 have you checked the salt-master memory consumption right before it crashes? Without the fix, the EventPublisher process leaks memory because one of the master subprocesses subscribes to the event bus but never reads the messages. The salt-key command you mentioned sends one event per key added to the master, so maybe there are a lot of events and the master is being killed by the OOM killer?
Another issue related to master shutdown is #36184.
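
In case it helps with confirming that, here is a minimal sketch that logs the EventPublisher's resident set size once a minute, so the memory trend right before a crash is captured. It reads plain /proc with no extra dependencies; the PID and the output path are placeholders.

# Sketch: append the EventPublisher's RSS to a file once a minute.
# PID and OUT are placeholders; adjust them for your host.
import time

PID = 12345                                    # current EventPublisher PID
OUT = "/var/tmp/eventpublisher_rss.log"

def rss_kb(pid):
    with open("/proc/%d/status" % pid) as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])    # reported in kB
    return 0

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(OUT, "a") as log:
        log.write("%s %d kB\n" % (stamp, rss_kb(PID)))
    time.sleep(60)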

@DmitryKuzmenko
Contributor

@czhong111 BTW, #36024 just removes two lines of code; you could apply the change manually on your master and try it.

@czhong111
Author

Thanks for the reply. It did leak memory before the master crashed. I will remove those lines manually and see if it happens again.

@DmitryKuzmenko
Contributor

@czhong111 thank you!

@DmitryKuzmenko DmitryKuzmenko modified the milestones: C 1, C 2 Nov 1, 2016
@meggiebot meggiebot added this to the C 0 milestone Nov 28, 2016
@meggiebot meggiebot removed this from the C 1 milestone Nov 28, 2016
@meggiebot meggiebot modified the milestones: Nitrogen 2, Nitrogen 1 Dec 19, 2016
@meggiebot meggiebot modified the milestones: Nitrogen 3, Nitrogen 2 Jan 18, 2017
@meggiebot meggiebot modified the milestones: Nitrogen 4, Nitrogen 3 Feb 6, 2017
@Ch3LL Ch3LL modified the milestones: Nitrogen 5, Nitrogen 4 Feb 27, 2017
@DmitryKuzmenko
Contributor

@czhong111 I've tried for a number of days to reproduce your issue with different combinations of states, but without success.
Reviewing the code, I found that some significant issues have been fixed in Salt since 2016.3.3. Could you please try upgrading your master to 2016.3.4 or higher and check whether the issue is gone?

@DmitryKuzmenko DmitryKuzmenko added cannot-reproduce cannot be replicated with info/context provided Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged labels Apr 10, 2017
@cachedout
Contributor

Closed for lack of response. If we do get a response, we can easily re-open this. Thanks.
