-
-
Notifications
You must be signed in to change notification settings - Fork 33k
Description
Feature or enhancement
Proposal:
Background
When using a concurrent.futures.ProcessPoolExecutor
, if a subprocess crashes/terminates, the library gives you an exception like:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
This doesn't tell you two very important bits of information:
- Which subprocess terminated
- What exit code it returned
When running a nontrivial application, odds are that you'll run into a BrokenProcessPool, whether because a subprocess segfaulted, OOMed, or something else, and having some basic information about what exactly died and why would be incredibly helpful for correlating crashes with logs/telemetry/etc and aiding further investigation.
As a possible but suboptimal workaround, you can inspect the private _processes
list upon failure to try and figure out what went wrong, e.g.
from concurrent.futures import as_completed, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
import ctypes
import os
import traceback
def worker_ok() -> str:
print(f"worker_ok {os.getpid()=}", flush=True)
return "ok"
def worker_segfault():
print(f"worker_segfault {os.getpid()=}", flush=True)
ctypes.string_at(0)
if __name__ == "__main__":
with ProcessPoolExecutor(max_workers=2) as executor:
futures = [
executor.submit(worker_ok),
executor.submit(worker_segfault),
]
for future in as_completed(futures):
try:
future.result()
except BrokenProcessPool as e:
# Need to look at private _processes to figure out what died and why
for pid, p in executor._processes.items():
if p.exitcode:
print(f"Process {pid} failed with exitcode {p.exitcode}")
traceback.print_exception(e)
worker_ok os.getpid()=37334
worker_segfault os.getpid()=37335
Process 37334 failed with exitcode -15
Process 37335 failed with exitcode -11
Traceback (most recent call last):
File "/users/me/projects/repo/example.py", line 26, in <module>
future.result()
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Accessing _processes
works, but it means reading a nonpublic part of the api, and writing slightly nontrivial exception handling code. It would be much better if the ProcessPoolExecutor automatically reported this kind of crash info when raising a BrokenProcessPool exception.
Proposal
There is already a concept in the codebase of a bpe.__cause__
, which contains the cause of the BrokenProcessPool if one is known. When _ExecutorManagerThread._terminate_broken
is called, it should loop through the processes and report any nonzero exit codes.
I've implemented the above approach, and running the example script above with my change, I now see:
worker_ok os.getpid()=37202
worker_segfault os.getpid()=37202
Lib.concurrent.futures.process._RemoteTraceback: Process 37202 terminated abruptly with exit code -11
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jberg/Projects/cpython/example.py", line 26, in <module>
future.result()
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
Lib.concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Note the Lib.concurrent.futures.process._RemoteTraceback
above - imo this is a much more informative traceback.
If this looks good to everyone, I'll raise a PR shortly.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response