
200 processes of salt-master (2016.3.2) #35480

Closed
jelenak opened this issue Aug 16, 2016 · 7 comments
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt P1 Priority 1 Regression The issue is a bug that breaks functionality known to work in previous releases. severity-high 2nd top severity, seen by most users, causes major problems
Comments


jelenak commented Aug 16, 2016

Description of Issue/Question

$ salt-master --version
salt-master 2016.3.2 (Boron)
$ ps aux | grep salt-master | wc -l
200

Steps to Reproduce Issue

  1. Install salt-master 2016.3.2
  2. Salt master config:
worker_threads: 96

cachedir: /tmp/salt/master
timeout: 45
minion_data_cache: True
file_roots:
  base:
    - /salt/states

pillar_roots:
  base:
    - /salt/pillars

Versions Report

root@b621993b659e:/# salt-master --versions-report
Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-32-generic
         system: Linux
        version: Ubuntu 14.04 trusty

Ch3LL commented Aug 16, 2016

@jelenak I am only seeing this behavior when restarting the salt-master process. From what I can tell, running service salt-master stop sometimes does not kill all the processes when worker_threads is increased. I was able to reproduce this intermittently with worker_threads set to 15-20. Can you try running service salt-master stop and check whether there are still salt-master processes left? I think what might be happening is that each time you restart the salt-master, the processes that were never killed stick around, so more and more get added to that list.

Here is my test case:

master config:

root@vagrant-ubuntu-trusty-64:~# grep -E -v '^$|#' /etc/salt/master                                                                                                       
worker_threads: 20
cachedir: /tmp/salt/master
  1. service salt-master start
  2. Check processes; everything looks good so far:
root@vagrant-ubuntu-trusty-64:~# ps aux  | grep -i salt-mas | wc -l
28
root@vagrant-ubuntu-trusty-64:~# service salt-master stop
salt-master stop/waiting
  3. Check the master log:
2016-08-16 18:18:52,237 [salt.log.setup   ][ERROR   ][4372] An un-handled exception was caught by salt's global exception handler:
OSError: [Errno 3] No such process
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
  4. Check salt-master processes:
root@vagrant-ubuntu-trusty-64:~# ps aux  | grep -i salt-mas 
root      5526  0.1  0.9 326804 36644 ?        Sl   18:19   0:00 /usr/bin/python /usr/bin/salt-master
root      5528  0.1  0.9 326816 36652 ?        Sl   18:19   0:00 /usr/bin/python /usr/bin/salt-master
root      7775  0.0  0.0  10460   952 pts/0    S+   18:27   0:00 grep --color=auto -i salt-mas
root@vagrant-ubuntu-trusty-64:~# 

As shown above, even after running stop on the salt-master service, not all of the processes are killed, and the error above is logged every time I stop the master process.

Also, to clarify: I can only reproduce this with both the cachedir setting present and worker_threads raised to about 15-20.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around P4 Priority 4 severity-high 2nd top severity, seen by most users, causes major problems and removed severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Aug 16, 2016
@Ch3LL Ch3LL added this to the Approved milestone Aug 16, 2016
@Ch3LL Ch3LL added the Core relates to code central or existential to Salt label Aug 16, 2016

Ch3LL commented Aug 16, 2016

I can't seem to reproduce this on 2015.8.11, so this seems to be a regression.

@Ch3LL Ch3LL added the Regression The issue is a bug that breaks functionality known to work in previous releases. label Aug 16, 2016
@meggiebot meggiebot modified the milestones: C 5, Approved Aug 16, 2016
@meggiebot meggiebot added TEAM Core and removed P1 Priority 1 labels Aug 16, 2016
@meggiebot meggiebot added P1 Priority 1 and removed P4 Priority 4 labels Aug 18, 2016
@meggiebot meggiebot modified the milestones: C 4, C 5 Aug 22, 2016
@meggiebot meggiebot removed this from the C 4 milestone Aug 25, 2016
@meggiebot meggiebot added this to the C 4 milestone Aug 26, 2016
DmitryKuzmenko commented:

In addition to the PR I've just added, there are still two more issues:

  1. MultiprocessingLoggingQueue is started as a daemon: https://github.com/saltstack/salt/blob/develop/salt/log/setup.py#L801

    In this case Python tries to kill this process during the atexit procedure, where it gets a "No such process" error:
Error in atexit._run_exitfuncs: 
Traceback (most recent call last): 
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs 
    func(*targs, **kargs) 
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function 
    p._popen.terminate() 
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate 
    os.kill(self.pid, signal.SIGTERM) 
OSError: [Errno 3] No such process

I've tried removing it and the error is gone. In my tests the logging queue shuts down correctly without it. @s0undt3ch what do you think about it?

  2. Even more fun: the ProcessManager (PM) process manages subprocesses, including the ReqServer_ProcessManager (RSPM) process, which manages subprocesses of its own. PM receives TERM first and starts the child-termination procedure, which initiates the same procedure in RSPM. The procedure is a terminate/wait/kill sequence: if a worker subprocess doesn't exit on the TERM signal, RSPM waits some time before killing it. Meanwhile PM is waiting on RSPM, and PM's timeout fires a bit earlier than RSPM's. So RSPM never kills its stubborn children, because it is itself killed by PM first. Thus we are left with hung worker processes.
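The race above can be sketched in miniature. This is a hypothetical, minimal model of the terminate/wait/kill sequence (the names stubborn and terminate_wait_kill are mine, not Salt's): a child that ignores SIGTERM survives the terminate and wait steps and is only removed by the final SIGKILL — exactly the step RSPM never reaches when PM kills it first.

```python
import multiprocessing
import os
import signal
import time


def stubborn():
    # A worker that ignores SIGTERM, standing in for a hung worker process.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)


def terminate_wait_kill(proc, wait_timeout=1.0):
    """Minimal sketch of a terminate/wait/kill sequence (illustration only)."""
    proc.terminate()                       # 1. send SIGTERM
    proc.join(wait_timeout)                # 2. wait for it to exit
    if proc.is_alive():
        os.kill(proc.pid, signal.SIGKILL)  # 3. SIGKILL the survivor
        proc.join()


if __name__ == '__main__':
    p = multiprocessing.Process(target=stubborn)
    p.start()
    time.sleep(0.2)  # give the child time to install its SIGTERM handler
    terminate_wait_kill(p, wait_timeout=0.5)
    print(p.is_alive())  # False: only the final SIGKILL removed it
```

If the outer manager's timeout expires and it SIGKILLs the inner manager before step 3 runs, the stubborn child is simply never reaped.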

I'll continue to work on this on Monday.

s0undt3ch commented:

The fun of signal handling...

  1. 👍 we just want it shut down.

DmitryKuzmenko commented:

There are test issues in #36555. I'll take a look.


DmitryKuzmenko commented Oct 28, 2016

I've analyzed the problem that produces the following trace on master shutdown:

OSError: [Errno 3] No such process
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)

and implemented another solution that, as I now see, is more correct.

The problem description:
The main master process manages subprocesses with ProcessManager. It also creates a daemon subprocess for the logging queue. So the main master process has two kinds of subprocesses.

ProcessManager watches all the subprocesses with os.wait() when it runs on Linux in sync mode, so if any child process terminates, ProcessManager gets its pid and return code.

If the logging queue is the first process to exit during master shutdown, ProcessManager gets its status in the os.wait() call. After that the kernel, having already passed the process result to the parent, removes the child's info from the parent process, and subsequent waitpid calls for this subprocess return an ECHILD error. So when the main master process shuts down and the multiprocessing module performs cleanup, it still has a record of the logging queue child, but gets the ECHILD error, which multiprocessing treats as an error and raises as an exception.
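The ECHILD behavior described above can be demonstrated in isolation. A minimal sketch (the helper name reap_then_waitpid is my own): reap a multiprocessing child with a raw os.wait(), then show that a second waitpid on the same pid fails with ECHILD — the condition multiprocessing's cleanup later trips over.

```python
import errno
import multiprocessing
import os


def child():
    pass  # exits immediately


def reap_then_waitpid():
    """Reap a multiprocessing child with raw os.wait(), then demonstrate
    that a second waitpid on the same pid fails with ECHILD."""
    p = multiprocessing.Process(target=child)
    p.start()
    pid, status = os.wait()     # kernel hands the result straight to us,
                                # bypassing multiprocessing's bookkeeping
    try:
        os.waitpid(pid, 0)      # the kernel no longer knows this child
    except OSError as exc:
        return exc.errno == errno.ECHILD
    return False


if __name__ == '__main__':
    print(reap_then_waitpid())  # True: the same error multiprocessing hits
```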

Implemented fix

Do not mix high-level multiprocessing wrappers with destructive low-level os.wait() calls. By destructive I mean calls that change process state.

That is, I've simply removed the event-based child handling from ProcessManager. Now it works the same on Linux and Windows, checking the children every 10 seconds.
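A minimal sketch of what such a polling loop might look like (my illustration, not the actual ProcessManager code; worker and watch are hypothetical names, and the interval is shortened for the demo):

```python
import multiprocessing
import time


def worker():
    time.sleep(0.2)  # stand-in for a real worker's lifetime


def watch(children, interval=0.1):
    """Poll each child's liveness periodically instead of blocking in
    os.wait(); join() lets multiprocessing reap its own children, so its
    atexit bookkeeping stays consistent."""
    while children:
        for proc in list(children):
            if not proc.is_alive():
                proc.join()  # non-destructive from multiprocessing's view
                children.remove(proc)
        time.sleep(interval)


if __name__ == '__main__':
    kids = [multiprocessing.Process(target=worker) for _ in range(3)]
    for k in kids:
        k.start()
    watch(kids, interval=0.05)
    print(len(kids))  # 0: all exits observed without a raw os.wait()
```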

Other possible solutions

I don't see a significant benefit from the event-based mechanism, but I may be wrong.

To keep subprocess management based on os.wait(), we could do the following:
manage the MultiprocessingLoggingQueue with ProcessManager and don't set the daemon flag on it. In that case multiprocessing will not try to terminate() it and will not raise the error.
Possible issue: if a child process dies and gets restarted, the information about the dead process is kept in multiprocessing._current_process._children for the whole lifetime of the parent process. That's wrong, but it won't cause problems, because Salt subprocesses should live for the parent's whole lifetime.


DmitryKuzmenko commented Oct 28, 2016

One more note: there is probably a reason to file a bug with the Python community, at least to see what they say about it.

Here is a simple test reproducing the issue.

import multiprocessing
import os


def daemon_main():
    print('I\'m a daemon')


if __name__ == '__main__':
    print('Starting the daemon')
    d = multiprocessing.Process(target=daemon_main)
    d.daemon = True
    d.start()

    print('Waiting children')
    ret = os.wait()
    print('OS Wait returned: {0}'.format(ret))

The test starts a daemon child with the multiprocessing package, waits for it to finish with os.wait(), and exits.
The output is the following:

Starting the daemon
Waiting children
I'm a daemon
OS Wait returned: (27655, 0)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib64/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib64/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process

By the way, Python 3 has this fixed: at least it doesn't raise the error if the process is not found.
https://github.com/python/cpython/blob/master/Lib/multiprocessing/popen_fork.py#L58-L59
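For Python 2, the same tolerance can be added by hand. A minimal sketch mirroring what the linked Python 3 code does (safe_sigterm is a hypothetical helper name, not a Salt or stdlib function): swallow ESRCH when the target process is already gone, and re-raise anything else.

```python
import errno
import os
import signal


def safe_sigterm(pid):
    """Send SIGTERM, but ignore the case where the process is already gone
    (ESRCH), matching Python 3's multiprocessing behavior."""
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise


if __name__ == '__main__':
    pid = os.fork()
    if pid == 0:
        os._exit(0)        # child exits immediately
    os.waitpid(pid, 0)     # reap it, so the pid is truly gone
    safe_sigterm(pid)      # the unguarded os.kill here would raise OSError
    print('no exception')
```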
