
200 processes of salt-master (2016.3.2) #35480

Closed
jelenak opened this issue Aug 16, 2016 · 7 comments
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt P1 Priority 1 Regression The issue is a bug that breaks functionality known to work in previous releases. severity-high 2nd top severity, seen by most users, causes major problems
Comments


jelenak commented Aug 16, 2016

Description of Issue/Question

$ salt-master --version
salt-master 2016.3.2 (Boron)
$ ps aux | grep salt-master | wc -l
200

Steps to Reproduce Issue

  1. Install salt-master 2016.3.2
  2. Salt master config:
worker_threads: 96

cachedir: /tmp/salt/master
timeout: 45
minion_data_cache: True
file_roots:
  base:
    - /salt/states

pillar_roots:
  base:
    - /salt/pillars

Versions Report

root@b621993b659e:/# salt-master --versions-report
Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-32-generic
         system: Linux
        version: Ubuntu 14.04 trusty

Ch3LL commented Aug 16, 2016

@jelenak I am only seeing this behavior when restarting the salt-master process. From what I can tell, running service salt-master stop sometimes does not kill all the processes when worker_threads is increased. I was able to reproduce this intermittently with worker_threads set to 15-20. Can you try running service salt-master stop and check whether there are still salt-master processes left? I think what might be happening is that each time you restart the salt-master, the processes that were never killed stick around, so more and more get added to that list.

Here is my test case:

master config:

root@vagrant-ubuntu-trusty-64:~# grep -E -v '^$|#' /etc/salt/master                                                                                                       
worker_threads: 20
cachedir: /tmp/salt/master
  1. service salt-master start
  2. Check processes; everything looks good so far:
root@vagrant-ubuntu-trusty-64:~# ps aux  | grep -i salt-mas | wc -l
28
root@vagrant-ubuntu-trusty-64:~# service salt-master stop
salt-master stop/waiting
  3. Check the master log:
2016-08-16 18:18:52,237 [salt.log.setup   ][ERROR   ][4372] An un-handled exception was caught by salt's global exception handler:
OSError: [Errno 3] No such process
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
  4. Check salt-master processes:
root@vagrant-ubuntu-trusty-64:~# ps aux  | grep -i salt-mas 
root      5526  0.1  0.9 326804 36644 ?        Sl   18:19   0:00 /usr/bin/python /usr/bin/salt-master
root      5528  0.1  0.9 326816 36652 ?        Sl   18:19   0:00 /usr/bin/python /usr/bin/salt-master
root      7775  0.0  0.0  10460   952 pts/0    S+   18:27   0:00 grep --color=auto -i salt-mas
root@vagrant-ubuntu-trusty-64:~# 

As shown above, even after running stop on the salt-master service, not all of the processes are killed, and the error above is logged every time I stop the master process.

Also, to clarify: I can only reproduce this with both the cachedir setting present and worker_threads raised to about 15-20.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around P4 Priority 4 severity-high 2nd top severity, seen by most users, causes major problems and removed severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Aug 16, 2016
@Ch3LL Ch3LL added this to the Approved milestone Aug 16, 2016
@Ch3LL Ch3LL added the Core relates to code central or existential to Salt label Aug 16, 2016

Ch3LL commented Aug 16, 2016

I can't seem to reproduce this on 2015.8.11, so this seems to be a regression.

@Ch3LL Ch3LL added the Regression The issue is a bug that breaks functionality known to work in previous releases. label Aug 16, 2016
@meggiebot meggiebot modified the milestones: C 5, Approved Aug 16, 2016
@meggiebot meggiebot added TEAM Core and removed P1 Priority 1 labels Aug 16, 2016
@meggiebot meggiebot added P1 Priority 1 and removed P4 Priority 4 labels Aug 18, 2016
@meggiebot meggiebot modified the milestones: C 4, C 5 Aug 22, 2016
@meggiebot meggiebot removed this from the C 4 milestone Aug 25, 2016
@meggiebot meggiebot added this to the C 4 milestone Aug 26, 2016
DmitryKuzmenko commented:

In addition to the PR I've just added, there are still two more issues:

  1. MultiprocessingLoggingQueue is started as a daemon: https://github.com/saltstack/salt/blob/develop/salt/log/setup.py#L801

    In this case Python tries to kill this process during the atexit procedure, where it gets a "No such process" error:
Error in atexit._run_exitfuncs: 
Traceback (most recent call last): 
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs 
    func(*targs, **kargs) 
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function 
    p._popen.terminate() 
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate 
    os.kill(self.pid, signal.SIGTERM) 
OSError: [Errno 3] No such process

I've tried removing it and the error is gone. In my tests the logging queue shuts down correctly without it. @s0undt3ch what do you think about it?

  2. Even more fun: the ProcessManager (PM) process manages subprocesses, including the ReqServer_ProcessManager (RSPM) process, which manages subprocesses of its own. PM receives TERM first and starts the child-termination procedure, which initiates the same procedure in RSPM. The procedure is a terminate/wait/kill sequence: if a worker subprocess doesn't exit on the TERM signal, RSPM waits some time before killing it. Meanwhile PM is waiting on RSPM, and PM's timeout fires a bit earlier than RSPM's. So RSPM never kills its stubborn children, because it is itself killed by PM first. Thus we are left with hung worker processes.
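The race above can be sketched in miniature. This is a hypothetical, minimal model of the terminate/wait/kill sequence (the names stubborn and terminate_wait_kill are mine, not Salt's): a child that ignores SIGTERM survives the terminate and wait steps and is only removed by the final SIGKILL — exactly the step RSPM never reaches when PM kills it first.

```python
import multiprocessing
import os
import signal
import time


def stubborn():
    # A worker that ignores SIGTERM, standing in for a hung worker process.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)


def terminate_wait_kill(proc, wait_timeout=1.0):
    """Minimal sketch of a terminate/wait/kill sequence (illustration only)."""
    proc.terminate()                       # 1. send SIGTERM
    proc.join(wait_timeout)                # 2. wait for it to exit
    if proc.is_alive():
        os.kill(proc.pid, signal.SIGKILL)  # 3. SIGKILL the survivor
        proc.join()


if __name__ == '__main__':
    p = multiprocessing.Process(target=stubborn)
    p.start()
    time.sleep(0.2)  # give the child time to install its SIGTERM handler
    terminate_wait_kill(p, wait_timeout=0.5)
    print(p.is_alive())  # False: only the final SIGKILL removed it
```

If the outer manager's timeout expires and it SIGKILLs the inner manager before step 3 runs, the stubborn child is simply never reaped.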

I'll continue to work on this on Monday.

s0undt3ch commented:

The fun of signal handling...

  1. 👍 we just want it shut down.

DmitryKuzmenko commented:

There are test issues in #36555. I'll take a look.


DmitryKuzmenko commented Oct 28, 2016

I've analyzed the problem that produces the following trace on master shutdown:

OSError: [Errno 3] No such process
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)

and implemented another solution that, as I now see, is more correct.

The problem description:
The main master process manages subprocesses with ProcessManager. It also creates a daemon subprocess for the logging queue. So the main master process has two kinds of subprocesses.

ProcessManager watches all the subprocesses with os.wait() when it runs on Linux in sync mode, so if any child process terminates, ProcessManager gets its pid and return code.

If the logging queue is the first process to exit during master shutdown, ProcessManager gets its status in the os.wait() call. After that the kernel, having already passed the process result to the parent, removes the child's info from the parent process, and subsequent waitpid calls for this subprocess return an ECHILD error. So when the main master process shuts down and the multiprocessing module performs cleanup, it still has a record of the logging queue child, but gets the ECHILD error, which multiprocessing treats as an error and raises as an exception.
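The ECHILD behavior described above can be demonstrated in isolation. A minimal sketch (the helper name reap_then_waitpid is my own): reap a multiprocessing child with a raw os.wait(), then show that a second waitpid on the same pid fails with ECHILD — the condition multiprocessing's cleanup later trips over.

```python
import errno
import multiprocessing
import os


def child():
    pass  # exits immediately


def reap_then_waitpid():
    """Reap a multiprocessing child with raw os.wait(), then demonstrate
    that a second waitpid on the same pid fails with ECHILD."""
    p = multiprocessing.Process(target=child)
    p.start()
    pid, status = os.wait()     # kernel hands the result straight to us,
                                # bypassing multiprocessing's bookkeeping
    try:
        os.waitpid(pid, 0)      # the kernel no longer knows this child
    except OSError as exc:
        return exc.errno == errno.ECHILD
    return False


if __name__ == '__main__':
    print(reap_then_waitpid())  # True: the same error multiprocessing hits
```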

Implemented fix

Do not mix high-level multiprocessing wrappers with destructive low-level os.wait() calls. By destructive I mean calls that change process state.

That is, I've simply removed the event-based child handling from ProcessManager. Now it works the same on Linux and Windows, checking the children every 10 seconds.
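A minimal sketch of what such a polling loop might look like (my illustration, not the actual ProcessManager code; worker and watch are hypothetical names, and the interval is shortened for the demo):

```python
import multiprocessing
import time


def worker():
    time.sleep(0.2)  # stand-in for a real worker's lifetime


def watch(children, interval=0.1):
    """Poll each child's liveness periodically instead of blocking in
    os.wait(); join() lets multiprocessing reap its own children, so its
    atexit bookkeeping stays consistent."""
    while children:
        for proc in list(children):
            if not proc.is_alive():
                proc.join()  # non-destructive from multiprocessing's view
                children.remove(proc)
        time.sleep(interval)


if __name__ == '__main__':
    kids = [multiprocessing.Process(target=worker) for _ in range(3)]
    for k in kids:
        k.start()
    watch(kids, interval=0.05)
    print(len(kids))  # 0: all exits observed without a raw os.wait()
```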

Other possible solutions

I don't see a significant benefit from the event-based mechanism, but I may be wrong.

To keep subprocess management based on os.wait(), we could do the following:
manage the MultiprocessingLoggingQueue with ProcessManager and don't set the daemon flag on it. In that case multiprocessing will not try to terminate() it and will not raise the error.
Possible issue: if a child process dies and gets restarted, the information about the dead process is kept in multiprocessing._current_process._children for the whole lifetime of the parent process. That's wrong, but it won't cause problems, because Salt subprocesses should live for the parent's whole lifetime.


DmitryKuzmenko commented Oct 28, 2016

One more note: there is probably a reason to file a bug with the Python community, at least to see what they say about it.

Here is a simple test reproducing the issue.

import multiprocessing
import os


def daemon_main():
    print('I\'m a daemon')


if __name__ == '__main__':
    print('Starting the daemon')
    d = multiprocessing.Process(target=daemon_main)
    d.daemon = True
    d.start()

    print('Waiting children')
    ret = os.wait()
    print('OS Wait returned: {0}'.format(ret))

The test starts a daemon child with the multiprocessing package, waits for it to finish with os.wait(), and exits.
The output is the following:

Starting the daemon
Waiting children
I'm a daemon
OS Wait returned: (27655, 0)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib64/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib64/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process

By the way, Python 3 has this fixed: at least it doesn't raise the error if the process is not found.
https://github.com/python/cpython/blob/master/Lib/multiprocessing/popen_fork.py#L58-L59
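For Python 2, the same tolerance can be added by hand. A minimal sketch mirroring what the linked Python 3 code does (safe_sigterm is a hypothetical helper name, not a Salt or stdlib function): swallow ESRCH when the target process is already gone, and re-raise anything else.

```python
import errno
import os
import signal


def safe_sigterm(pid):
    """Send SIGTERM, but ignore the case where the process is already gone
    (ESRCH), matching Python 3's multiprocessing behavior."""
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise


if __name__ == '__main__':
    pid = os.fork()
    if pid == 0:
        os._exit(0)        # child exits immediately
    os.waitpid(pid, 0)     # reap it, so the pid is truly gone
    safe_sigterm(pid)      # the unguarded os.kill here would raise OSError
    print('no exception')
```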
