Remote execution leaves defunct/zombie salt-minion process on minion #3395

Closed
madduck opened this Issue Jan 23, 2013 · 6 comments

Comments

@madduck (Contributor) commented Jan 23, 2013

If I run salt, it will periodically publish saltutil.find_job to keep on top
of things. Each one of those calls results in a zombie process on the minion:

root      5529  0.0  0.0      0     0 ?        Z    04:40   0:00 [salt-minion] <defunct>

This can also be reproduced with e.g. calls to status.procs.

For your convenience: a zombie process is a child process which was not
"reaped" on exit by its parent. The parent should trap SIGCHLD and then
wait() appropriately to clean up the child process properly.

Thus, it seems that salt-minion spawns a worker and does not properly clean up
after the worker exits. A quick look in salt.util.schedule seems to confirm
this suspicion: the subprocess or thread gets started at the end of
Schedule.eval, but there are no calls to waitpid() or join() in the code.

It seems to me that Salt simply does not clean up child processes. That should be fixed.
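The behavior madduck describes can be reproduced outside Salt in a few lines of Python. This is a hypothetical illustration of what a zombie is, not Salt code, and it is Linux-specific (it reads the child's state from /proc):

```python
import os
import time

# Fork a child that exits immediately, then inspect its state in the
# process table before the parent has called wait(): the state is
# 'Z' (zombie/defunct) until the parent reaps it.
pid = os.fork()
if pid == 0:
    os._exit(0)              # child: exit immediately
time.sleep(0.2)              # parent: give the child time to die

# The third field of /proc/<pid>/stat is the process state character.
with open('/proc/%d/stat' % pid) as f:
    state = f.read().split()[2]
print(state)                 # 'Z' -- defunct until reaped

os.waitpid(pid, 0)           # reaping removes the zombie entry
```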

@thatch45 (Member) commented Jan 24, 2013
Salt does clean these up after an iteration. Are they still around after 60 seconds (the default iteration interval at which they are cleaned up)?

If we called join() the process would block and Salt would stop accepting future commands, which is why the minion calls multiprocessing.active_children() at each iteration instead.


@madduck (Contributor) commented Jan 24, 2013

Ah, okay. You don't want to call join(), for the reason you
mention, nor does it make sense to call waitpid() unless you know
the PID of the child that just exited (which SIGCHLD does not
tell you; you have to track that down yourself).

But it does make sense to trap SIGCHLD and then poll() the
subprocesses, IMHO.

Leaving zombies around, even if you reap them regularly, is not
really what I expect a well-designed Unix daemon to do.

In fact, my monitoring regularly reports problems on these hosts
due to those zombies (which is how I found out about them in the
first place…)


@thatch45 (Member) commented Jan 24, 2013

So I think the problem is that the poll interval is too long, so the zombies hang around for too long. I will shorten the interval at which we poll for the processes and audit this.


@madduck (Contributor) commented Jan 24, 2013

So I think the problem is that the poll interval is too long, so
the zombies hang around for too long. I will shorten the
interval at which we poll for the processes and audit this.

The parent receives a SIGCHLD and can react by reaping right away;
I don't understand why you want to poll instead.

My point is that zombies should not exist, not even for a short
period of time. They are a sign of broken code.
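The SIGCHLD-driven approach suggested here can be sketched as a standalone example. This is hypothetical, not Salt's implementation; a real daemon would also have to make this coexist with multiprocessing's own bookkeeping:

```python
import os
import signal
import time

reaped = []

def handle_sigchld(signum, frame):
    # Reap every child that has exited. WNOHANG makes waitpid() return
    # (0, 0) instead of blocking once no more dead children remain.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError:        # ECHILD: no children at all
            break
        if pid == 0:
            break
        reaped.append(pid)

signal.signal(signal.SIGCHLD, handle_sigchld)

child = os.fork()
if child == 0:
    os._exit(0)                # child: exit immediately
time.sleep(0.5)                # parent: handler fires on SIGCHLD delivery
print(child in reaped)         # True -- reaped immediately, no polling
```

Note the waitpid(-1, WNOHANG) loop: one SIGCHLD delivery can stand for several dead children, so the handler must keep reaping until there is nothing left.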


thatch45 closed this in a7532f7 on Jan 25, 2013

@dahpgjgamgan (Contributor) commented Feb 2, 2013

Hi - unfortunately this does not work on Windows (stack trace below). I think any logic that is only needed when multiprocessing is in use should check whether multiprocessing is enabled in the minion configuration.

I made a quick fix: #3565

[DEBUG   ] Minion PUB socket URI: tcp://127.0.0.1:4510
[DEBUG   ] Minion PULL socket URI: tcp://127.0.0.1:4511
Traceback (most recent call last):
  File "salt-minion", line 5, in <module>
    pkg_resources.run_script('salt==0.12.0-511-gbdc6c26', 'salt-minion')
  File "(...)\site-packages\setuptools-0.6c11-py2.7.egg\pkg_resources.py", line 489, in run_script
  File "(...)\site-packages\setuptools-0.6c11-py2.7.egg\pkg_resources.py", line 1207, in run_script
  File "(...)\site-packages\salt-0.12.0_511_gbdc6c26-py2.7.egg\EGG-INFO\scripts\salt-minion", line 14, in <module>
    salt_minion()
  File "(...)\site-packages\salt-0.12.0_511_gbdc6c26-py2.7.egg\salt\scripts.py", line 29, in salt_minion
    minion.start()
  File "(...)\site-packages\salt-0.12.0_511_gbdc6c26-py2.7.egg\salt\__init__.py", line 189, in start
    self.minion.tune_in()
  File "(...)\site-packages\salt-0.12.0_511_gbdc6c26-py2.7.egg\salt\minion.py", line 746, in tune_in
    signal.signal(signal.SIGCHLD, self.handle_sigchld)
AttributeError: 'module' object has no attribute 'SIGCHLD'
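The crash comes from the signal module only exposing SIGCHLD on POSIX platforms. One hypothetical way to guard the registration (not necessarily what #3565 does) is a simple attribute check:

```python
import signal

def register_child_reaper(handler):
    # Hypothetical helper: signal.SIGCHLD does not exist on Windows,
    # so registering it unconditionally raises AttributeError there.
    if hasattr(signal, 'SIGCHLD'):
        signal.signal(signal.SIGCHLD, handler)
        return True
    return False

registered = register_child_reaper(signal.SIG_DFL)
print(registered)  # True on POSIX, False on Windows
```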
@thatch45 (Member) commented Feb 2, 2013

Thanks for the fix @dahpgjgamgan

