
How to handle defunct processes in child processes? #34

Closed
sailxjx opened this issue Mar 21, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

sailxjx commented Mar 21, 2022

This issue is not caused by mpire itself, but I thought I'd discuss here how to deal with it or improve the situation.

When I start multiple child processes with the Pool module and call exit() in one of them, or kill a child process with the kill command, that child becomes a defunct process and the parent process fails to exit.

The code to reproduce this problem is very simple:

from time import sleep
from mpire.pool import WorkerPool

def main_naive(i):
    if i == 0:
        sleep(1)
        print("Exiting", i)
        exit()  # If this line is removed, everything works fine.
    else:
        sleep(0.1)
        print("Exiting", i)

def main():
    with WorkerPool(n_jobs=2, start_method="spawn", daemon=False) as pool:
        pool.map(main_naive, list(range(2)))

if __name__ == "__main__":
    main()

So how can I make the parent process exit normally in this case?

sybrenjansen (Owner) commented

This is indeed a situation that isn't handled by MPIRE right now. When a child process dies I would expect the other children to die as well and the main process to throw an exception.

Do you have any suggestions for handling this? I guess on Linux you can set up a signal handler (https://stackoverflow.com/questions/3675675/how-do-i-know-when-a-child-process-died). We would need to check how to do this on Windows.
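
For reference, a minimal sketch of that signal-handler idea on Linux (handler name is made up, this is not MPIRE's actual implementation): the parent registers a SIGCHLD handler and reaps children as they terminate, so it can notice when one dies unexpectedly.

import os
import signal

def on_child_exit(signum, frame):
    # Reap any children that have exited so they don't linger as defunct (zombie) processes.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # remaining children are still running
        print(f"Child {pid} exited with status {status}")

signal.signal(signal.SIGCHLD, on_child_exit)

Note that signal.SIGCHLD and os.waitpid with WNOHANG are Unix-only, which is exactly why Windows needs a separate look.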

sailxjx commented Mar 30, 2022

The use of signal handlers may be an improvement, but there are still many remaining issues. For example, when a subprocess is killed with kill -9 it will not receive a catchable termination signal, so it gets no chance to notify the parent.

Perhaps we could add a heartbeat mechanism between the parent and child processes, such as a heartbeat signal sent by the child process to the parent process every second to show that the child is alive. If the heartbeat stops, the child can be assumed to have exited abnormally, so the parent process can be terminated and an exception thrown.
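
As a rough sketch of what I mean (hypothetical helper names, nothing from mpire): the child refreshes a shared timestamp from a background thread, and the parent treats a stale timestamp as a dead worker.

import time
import threading

def start_heartbeat(last_beat, interval=0.5):
    # Runs inside the worker: refresh the shared timestamp every `interval` seconds.
    def beat():
        while True:
            last_beat.value = time.monotonic()
            time.sleep(interval)
    threading.Thread(target=beat, daemon=True).start()

def worker_seems_alive(last_beat, timeout=2.0):
    # Runs in the parent: no heartbeat within `timeout` seconds means the worker
    # presumably exited abnormally.
    return (time.monotonic() - last_beat.value) < timeout

The parent would create last_beat = multiprocessing.Value('d', time.monotonic()) before spawning the worker and pass it along; the daemon thread dies together with the worker.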

sybrenjansen (Owner) commented

Sounds like a good option to me.

I was thinking about what the implementation would look like. We could set an Event (or boolean value) for each worker and check it at regular intervals. The main process would then reset the Event again, and if the Event wasn't set it would raise an exception. The only thing to keep in mind is that the timing for checking these values should be correct. You could let the workers update the event every 0.1 seconds in a separate thread and only check it in the main process every second or so. In that case, you can be pretty sure the workers had a chance to set the value to True again.
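
A minimal sketch of that check-and-reset scheme (hypothetical names, not the eventual implementation): each worker keeps setting its Event from a daemon thread, and the main process checks and clears it at a slower rate.

import time
import threading

def keep_alive(event, interval=0.1):
    # Worker side: set the event every `interval` seconds from a daemon thread.
    def loop():
        while True:
            event.set()
            time.sleep(interval)
    threading.Thread(target=loop, daemon=True).start()

def monitor(events, check_interval=1.0):
    # Main-process side: `events` is one multiprocessing.Event per worker, created
    # in the parent before spawning. Each worker must have set its event since the
    # previous check.
    while True:
        time.sleep(check_interval)
        for worker_id, event in enumerate(events):
            if not event.is_set():
                raise RuntimeError(f"Worker {worker_id} appears to have died")
            event.clear()  # reset; a live worker will set it again before the next check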

sailxjx commented Apr 1, 2022

Yes, that's exactly what I was thinking. It's better to maintain this state only in the main process, and don't forget to stop those threads in the child processes.

@sybrenjansen sybrenjansen self-assigned this Apr 15, 2022
@sybrenjansen sybrenjansen added the enhancement New feature or request label Apr 15, 2022
towr commented Apr 19, 2022

I'm not sure heartbeats would work very well, because a worker may not be able to send a heartbeat in time if it's stuck in a long-running operation that doesn't release the GIL (either in CPython itself or in an external C module).

sybrenjansen (Owner) commented

You're right, @towr. I already experienced this when I tried to implement it. I have a different approach which does seem to work, at least 99% of the time. Still working on that 1%.

I'm using process.is_alive() together with an event object I already use in each worker. Basically, when a worker is supposed to be alive, the event is set. During that time process.is_alive() should also be true. If it isn't, it means the worker was abruptly killed. I handle exit() calls from within the worker differently, as that just raises a SystemExit exception, which you can catch.
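
Roughly, the idea looks like this (hypothetical names, not the actual MPIRE internals):

def check_worker(process, running_event):
    # Main-process side: `process` is the worker's multiprocessing.Process. If the
    # worker is supposed to be running but the OS says it isn't alive, it must have
    # been killed abruptly (e.g. kill -9).
    if running_event.is_set() and not process.is_alive():
        raise RuntimeError(f"Worker {process.pid} died unexpectedly")

def worker_entry(running_event, target, *args):
    # Worker side: mark ourselves as running and turn exit() into a clean shutdown.
    running_event.set()
    try:
        target(*args)
    except SystemExit:
        # exit() inside the task raises SystemExit; catching it lets the worker
        # report back instead of silently disappearing.
        pass
    finally:
        running_event.clear()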

sailxjx commented Apr 20, 2022

Another request: provide a mode of automatic recovery when a child process dies (restarting the dead child process), so that we can implement a supervisor pattern similar to Erlang's, which is a bit more practical than throwing exceptions.

sybrenjansen added a commit that referenced this issue Apr 25, 2022
* MPIRE now handles defunct child processes properly, instead of deadlocking (`#34`_)
* Added benchmark highlights to README (`#38`_)

Co-authored-by: sybrenjansen <sybren.jansen@gmail.com>
sybrenjansen (Owner) commented

@sailxjx Changing this in mpire would require adding quite a few assumptions about how the end user would want to handle it. You can catch the exception that is thrown and restart things manually if you need to.
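
For example, something along these lines (the exact exception type raised when a worker dies isn't pinned down here, so a broad except is used purely for illustration):

from mpire import WorkerPool

def run_with_restarts(task, items, n_jobs=2, max_retries=3):
    for attempt in range(max_retries):
        try:
            with WorkerPool(n_jobs=n_jobs, daemon=False) as pool:
                return pool.map(task, items)
        except Exception as exc:  # e.g. a worker was killed or called exit()
            print(f"Pool failed on attempt {attempt + 1} ({exc!r}), restarting")
    raise RuntimeError("Task kept failing after restarts")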

If there's more interest in changing this, I will reconsider. For now, I'll leave it to the end user.

sybrenjansen (Owner) commented

A new v2.3.5 release that deals with defunct processes is now available.
