Popen.terminate fails with ProcessLookupError under certain conditions #84730
The following program frequently raises a ProcessLookupError exception when calling proc.terminate():
I'm reproducing this with Python 3.8.2 on Arch Linux by saving the script and rapidly executing it like this:

$ bash -e -c "while true; do python3 test.py; done"

The (unused) multiprocessing.Queue seems to play a role here because the problem vanishes when removing that one line.
This is the backtrace I get:

Traceback (most recent call last):
  File "/home/anubis/test/multiprocessing-error.py", line 16, in <module>
    proc.terminate()
  File "/home/anubis/git/cpython/Lib/subprocess.py", line 2069, in terminate
    self.send_signal(signal.SIGTERM)
  File "/home/anubis/git/cpython/Lib/subprocess.py", line 2064, in send_signal
    os.kill(self.pid, sig)
ProcessLookupError: [Errno 3] No such process

Is yours the same?

This is expected: the process exited before proc.terminate() was called. You should wrap proc.terminate() in a try..except block. I am not sure we want to suppress this.
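A try..except wrapper of the kind suggested above might look like the following. This is a minimal sketch, not code from the thread: the helper name `terminate_quietly` and the sleeping child are assumptions made for illustration.

```python
import subprocess
import sys


def terminate_quietly(proc):
    """Call proc.terminate(); return False if the OS reports no such process.

    Hypothetical helper for illustration only.
    """
    try:
        proc.terminate()
    except ProcessLookupError:
        # The child was already gone by the time the signal was sent.
        return False
    return True


# A child that would otherwise run for a while, so terminate() has a target.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
sent = terminate_quietly(proc)
proc.wait()  # reap the child so no zombie is left behind
```

Note that this only hides the exception; it does nothing about the race discussed further down the thread.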
I'm not sure that it is expected since Popen.send_signal does contain the following check:
Additionally, the following program does not raise a ProcessLookupError despite the program already having exited:
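A minimal sketch of such a program (a reconstruction for illustration, not the original listing), assuming the child is simply a Python interpreter that exits immediately:

```python
import subprocess
import sys

# Start a child that exits right away.
proc = subprocess.Popen([sys.executable, "-c", "pass"])

# wait() reaps the child and stores its exit status in proc.returncode.
proc.wait()

# Because returncode is already set, terminate() does not signal anything
# and does not raise, even though the process no longer exists.
proc.terminate()
```

The difference from the failing case is that here the Popen object already knows the child has exited before terminate() is called.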
This is a simple time-of-check/time-of-action issue, which is why I suggested that it shouldn't be fixed. I was not aware send_signal had this check, which tells us it is supposed to be supported(?). Anyway, to fix it we need to catch the exception and check for errno == 3 so that we can ignore it. Ideally we would have an atomic operation here, but no such thing exists. There is still the very faint possibility that, after your process exits, a new process will take its ID and we kill that one instead. We should keep the returncode check and just ignore the exception when errno == 3. This is the best option.
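The approach described above (keep the returncode check, ignore errno 3) could be sketched as follows. This is an illustration under stated assumptions, not the submitted patch: the function name `send_signal_ignoring_esrch` is hypothetical, the code is Unix-only, and the child is reaped behind the Popen object's back purely to force the "no such process" path in a repeatable way.

```python
import errno
import os
import signal
import subprocess
import sys


def send_signal_ignoring_esrch(proc, sig):
    """Signal proc unless it is known or found to be gone (hypothetical helper).

    Returns True if os.kill() succeeded, False otherwise.
    """
    # Time of check: skip a process we already know has exited.
    if proc.returncode is not None:
        return False
    try:
        # Time of action: the process may have vanished since the check.
        os.kill(proc.pid, sig)
    except OSError as exc:
        if exc.errno != errno.ESRCH:  # ESRCH is errno 3, "No such process"
            raise
        return False
    return True


proc = subprocess.Popen([sys.executable, "-c", "pass"])
# Reap the child directly, so proc.returncode is still None afterwards.
os.waitpid(proc.pid, 0)
delivered = send_signal_ignoring_esrch(proc, signal.SIGTERM)
# Let the Popen object notice that the child is gone.
proc.poll()
```

As the comment says, this narrows the window but does not close it: if the PID were reused between the check and os.kill(), the signal would reach an unrelated process.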
I submitted a patch. As explained above, this only mitigates the time-of-check/time-of-action issue in case the process doesn't exist; it *is not* a proper fix for the issue. But I don't see any way to properly fix it.
I understand that it's not a perfect solution, but at least it's a little bit closer. Thanks for your patch :)
Thanks for the patch! PRs are in or on their way in for 3.10 and 3.9. The 3.8 auto-backport failed; if anyone wants it in a future 3.8.x, please follow up with a manual cherry-pick to make a PR for the 3.8 branch.
I'm late to the party, but I want to explain what's going on here in case it's helpful to folks. The issue you're seeing here has to do with whether a child process has been "reaped". (Windows is different from Unix here, because the parent keeps an open handle to the child, so this is mostly a Unix thing.) In short, when a child exits, it leaves a "zombie" process whose only job is to hold some metadata and keep the child's PID reserved. When the parent calls wait/waitpid/waitid or similar, that zombie process is cleaned up. That means that waiting has important correctness properties apart from just blocking the parent: signaling after wait returns is unsafe, and forgetting to wait also leaks kernel resources. Here's a short example demonstrating this:
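A minimal sketch of the idea (not the commenter's original listing), assuming a Unix system. The helper `pid_exists` is a hypothetical name; it probes a PID with signal 0, which performs the existence and permission checks without delivering anything.

```python
import os
import subprocess
import sys


def pid_exists(pid):
    """Return True if the PID is currently in use (hypothetical helper)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        # The PID exists but belongs to a process we may not signal.
        return True
    return True


proc = subprocess.Popen([sys.executable, "-c", "pass"])

# Before waiting, the child is either still running or a zombie.
# Either way its PID is still reserved for it.
before_wait = pid_exists(proc.pid)

# wait() reaps the child; the kernel is now free to hand the PID out again.
proc.wait()
after_wait = pid_exists(proc.pid)

print(before_wait, after_wait)
```

Once the second check reports that the PID is free, any later os.kill() on that number can only fail or hit some other process, which is why signaling after wait returns is unsafe.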
With that in mind, the original behavior with communicate() that started this bug is expected. The docs say that communicate() "waits for process to terminate and sets the returncode attribute." That means internally it calls waitpid, so your terminate() thread is racing against process exit. Catching the exception thrown by terminate() will hide the problem, but the underlying race condition means your program might end up killing an unrelated process that just happens to reuse the same PID at the wrong time. Doing this properly requires using waitid(WNOWAIT), which is...tricky.
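To give a rough feel for what waitid(WNOWAIT) offers, here is a single-threaded sketch, assuming a Unix platform where Python exposes os.waitid. It only shows the building block (observe the exit without reaping), not a full thread-safe replacement for communicate() plus terminate(), which is where the tricky part lies.

```python
import os
import signal
import subprocess
import sys

proc = subprocess.Popen([sys.executable, "-c", "pass"])

# Block until the child has exited, but leave the zombie in place.
# With WNOWAIT the child stays waitable, so its PID remains reserved.
info = os.waitid(os.P_PID, proc.pid, os.WEXITED | os.WNOWAIT)

# The PID cannot have been reused yet, so this cannot hit an unrelated
# process. Signaling a zombie succeeds and has no effect.
os.kill(proc.pid, signal.SIGTERM)

# Only now reap the child and release the PID.
proc.wait()
```

The design point is the ordering: every signal is sent while the PID is still pinned by the unreaped child, and the reaping wait happens last.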