salt-api automatically restart caused by "opening too many files" #40245

czhong111 · 2017-03-23T07:58:50Z

Description of Issue/Question
After running salt-api for several weeks, I found that some jobs returned no result. I checked the salt-master log and found the following error:

2017-03-20 15:58:15,317 [salt.utils.event                         ][DEBUG   ][28048] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
2017-03-20 15:58:15,318 [salt.utils.event                         ][DEBUG   ][28048] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/netapi/rest_cherrypy/app.py", line 507, in hypermedia_handler
  File "/usr/lib/python2.6/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
  File "/usr/lib/python2.6/site-packages/salt/netapi/rest_cherrypy/app.py", line 933, in POST
  File "/usr/lib/python2.6/site-packages/salt/netapi/rest_cherrypy/app.py", line 775, in exec_lowstate
  File "/usr/lib/python2.6/site-packages/salt/netapi/__init__.py", line 70, in run
  File "/usr/lib/python2.6/site-packages/salt/netapi/__init__.py", line 98, in local
  File "/usr/lib/python2.6/site-packages/salt/client/__init__.py", line 535, in cmd
  File "/usr/lib/python2.6/site-packages/salt/client/__init__.py", line 290, in run_job
SaltClientError: [Errno 24] Too many open files
2017-03-20 15:58:15,751 [salt.utils.process                       ][INFO    ][28040] Process <function start at 0x1ce6b18> (28048) died with exit status None, restarting...
2017-03-20 15:58:16,757 [salt.utils.process                       ][DEBUG   ][28040] Started 'salt.loaded.int.netapi.rest_cherrypy.start' with pid 20716

Steps to Reproduce Issue

After salt-api restarted automatically, I started monitoring the salt-api process.
At first the number of open file descriptors is small:

ll /proc/${salt-api_id}/fd|wc -l
300

But as salt-api jobs run, the number of open fds grows rapidly:

ll /proc/${salt-api_id}/fd|wc -l
6112

After a few days of executing salt-api jobs (cmd.run/cmd.script/test.ping on minions), the fd count goes even higher. Almost all of the fds are eventpoll or eventfd:

#ll /proc/${salt-api_id}/fd

lrwx------ 1 root root 64 Mar 22 01:59 990 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Mar 22 01:59 991 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 Mar 22 00:01 992 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Mar 22 00:01 993 -> socket:[444176582]
lrwx------ 1 root root 64 Mar 22 01:59 994 -> socket:[444269960]
lrwx------ 1 root root 64 Mar 22 01:59 995 -> anon_inode:[eventfd]
...
lrwx------ 1 root root 64 Mar 22 01:58 997 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Mar 22 01:58 998 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Mar 22 01:58 999 -> anon_inode:[eventpoll]
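
The manual check above can be automated by tallying the link targets under /proc/<pid>/fd by kind. The following is a minimal sketch, assuming Linux with /proc mounted and permission to read the target process; the function names are illustrative and not part of Salt.

```python
import os
import re
from collections import Counter


def classify_target(target):
    """Reduce a /proc/<pid>/fd link target to a coarse kind."""
    match = re.match(r"anon_inode:\[?([^\]]+)\]?$", target)
    if match:
        return match.group(1)  # e.g. "eventfd" or "eventpoll"
    match = re.match(r"(socket|pipe):\[\d+\]$", target)
    if match:
        return match.group(1)
    return "file"  # anything else is treated as a regular path


def count_fd_kinds(pid):
    """Count the open descriptors of a process, grouped by kind."""
    fd_dir = "/proc/{0}/fd".format(pid)
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        counts[classify_target(target)] += 1
    return counts


if __name__ == "__main__":
    # Inspect this script's own process as a harmless demonstration.
    for kind, number in count_fd_kinds(os.getpid()).most_common():
        print("{0:10d} {1}".format(number, kind))
```

Running this periodically against the salt-api pid would show whether the eventfd and eventpoll counts keep climbing while the socket count stays flat.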

The system ulimit info

#ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066283
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 51200
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Salt holds nearly all of the open files on the system:

#lsof | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn
   opening files   pid
   6220            ${salt-api_pid}  /usr/bin/
   1645            ${salt-master_EventPublisher_pid}   /usr/bin/

Versions Report
And here is the salt version info:

# salt --versions-report
Salt Version:
           Salt: 2015.8.12
 
Dependency Versions:
         Jinja2: 2.7.3
       M2Crypto: 0.20.2
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
         Python: 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed
 
System Versions:
           dist: redhat 6.3 Santiago
        machine: x86_64
        release: 2.6.32-279.el6.x86_64
         system: Red Hat Enterprise Linux Server 6.3 Santiago


gtmanfred · 2017-03-23T14:39:50Z

Can you also provide the output of cat /proc/<API PID>/limits ?

How many minions do you have that the api is running commands against?
Even with that information, it shouldn't be leaking connections like that. Unfortunately 2015.8 is in CVE support only, so we won't be able to fix it.

I do remember something similar to this being fixed at some point. Would you be able to update your master to a newer version of 2016.3, or 2016.11?

Thanks,
Daniel

czhong111 · 2017-03-24T01:12:53Z

Thanks for the reply!
I have added the limits info for the current environment below. There are more than 10000 minions in my environment, arranged in a 3-level topology, and about 3000 jobs per day (each job targets a single minion).
I will try version 2016.x. Is it necessary to upgrade both the syndic nodes and the master node to 2016.x,
or is it enough to upgrade only the master node and leave the syndic nodes on 2015.8.12?

# cat /proc/<API PID>/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             10240                2066283              processes 
Max open files            65535                65535                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       2066283              2066283              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066283
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655360
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
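
Note that the two outputs above report different "open files" values (65535 versus 655360): ulimit -a shows the limits of the shell that runs it, while /proc/<API PID>/limits shows the limits that actually apply to the running salt-api process. A small sketch for reading the effective value, assuming the /proc limits layout shown above; the helper names are illustrative.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row; None means unlimited."""
    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max", "open", "files", soft limit, hard limit, units
            fields = line.split()
            return to_int(fields[3]), to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid):
    """Read the effective open-file limits of a running process (Linux only)."""
    with open("/proc/{0}/limits".format(pid)) as handle:
        return parse_max_open_files(handle.read())
```

Comparing the soft limit with the current fd count of the process gives the remaining headroom before "Too many open files" is raised again.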

rickh563 · 2017-12-20T16:03:39Z

ZD-2084

Extending the idea used in saltstack#32145: when _check_pub_data is called it will create jid subscriptions, regardless of whether anyone will ever come back to retrieve them; in the case of local_async calls no one ever does. In addition to the above, we use the listen kwarg provided by c59a5ad to know whether we need to subscribe to events in addition to ensuring the ioloop is listening before a call is made. This should fix saltstack#40245, saltstack#20639, saltstack#36374

gtmanfred added the Bug broken, incorrect, or confusing behavior label Mar 23, 2017

gtmanfred added this to the Approved milestone Mar 23, 2017

gtmanfred added the Core relates to code central or existential to Salt label Mar 23, 2017

rickh563 added the ZD The issue is related to a Zendesk customer support ticket. label Dec 20, 2017

mattp- mentioned this issue Apr 2, 2018

LocalClient "del" method don't call self.event.destroy() #38876

Closed

mattp- mentioned this issue Apr 2, 2018

address filehandle/event leak in async run_job invocations #46817

Merged

rallytime closed this as completed in #46817 Apr 3, 2018
