Description
Bug report
Bug description:
This forkserver load test program fails reliably when run with 8 jobs on my 8-core laptop:
#!/usr/bin/env python3.14
import asyncio
import multiprocessing
import os
import sys


def target(count):
    print("success", count)


async def main(argv):
    max_jobs = int(argv[1])
    jobs = {}
    count = 0
    while True:
        count += 1
        if len(jobs) >= max_jobs:
            done, pending = await asyncio.wait(
                list(jobs), return_when=asyncio.FIRST_COMPLETED
            )
            for waiter in done:
                proc = jobs.pop(waiter)
                if proc.exitcode != os.EX_OK:
                    print("failure with exitcode", proc.exitcode)
                    if jobs:
                        await asyncio.wait(
                            list(jobs), return_when=asyncio.ALL_COMPLETED
                        )
                    return
        proc = multiprocessing.Process(target=target, args=(count,))
        proc.start()
        waiter = asyncio.ensure_future(asyncio.to_thread(proc.join))
        jobs[waiter] = proc


if __name__ == "__main__":
    asyncio.run(main(sys.argv))

Result:
$ ./forkserver_load_test.py 8
success 1
success 2
success 4
success 5
success 6
success 3
success 7
success 8
failure with exitcode 255
success 9
success 10
success 11
success 12
I patched forkserver.py for debugging and found that the test case does not cause SystemExit to be raised here until the main function of the test case returns:
https://github.com/python/cpython/blob/v3.14.0/Lib/multiprocessing/forkserver.py#L272
I patched popen_forkserver.py for debugging and found that read_signed raises EOFError here:
https://github.com/python/cpython/blob/v3.14.0/Lib/multiprocessing/forkserver.py#L394
Meanwhile, it seems that the corresponding write_signed call was successful on the forkserver side.
I did some more debugging and found that the underlying issue is a thread-safety problem in how Popen.poll() sets self.returncode. The load test runs better with this patch:
--- a/Lib/multiprocessing/popen_forkserver.py
+++ b/Lib/multiprocessing/popen_forkserver.py
@@ -1,5 +1,6 @@
 import io
 import os
+import threading
 
 from .context import reduction, set_spawning_popen
 if not reduction.HAVE_SEND_HANDLE:
@@ -32,6 +33,7 @@
 
     def __init__(self, process_obj):
         self._fds = []
+        self._poll_lock = threading.Lock()
         super().__init__(process_obj)
 
     def duplicate_for_child(self, fd):
@@ -65,7 +67,9 @@
             if not wait([self.sentinel], timeout):
                 return None
             try:
-                self.returncode = forkserver.read_signed(self.sentinel)
+                with self._poll_lock:
+                    if self.returncode is None:
+                        self.returncode = forkserver.read_signed(self.sentinel)
             except (OSError, EOFError):
                 # This should not happen usually, but perhaps the forkserver
                 # process itself got killed

I created the test case because the same issue was triggered by a program called egencache here:
https://bugs.gentoo.org/965132
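
To illustrate the suspected race outside of CPython, here is a minimal sketch. It assumes the failure mode is two threads both seeing returncode as None and then both reading the single exit status from the same pipe, so that one gets the status and the other hits EOF. FakePopen, read_status, run, and the barrier that forces the interleaving are hypothetical names made up for this sketch; they are not the real multiprocessing code.

```python
import os
import struct
import threading


def read_status(fd):
    """Read one 8-byte signed status; raise EOFError if the pipe is drained."""
    data = b""
    while len(data) < 8:
        chunk = os.read(fd, 8 - len(data))
        if not chunk:
            raise EOFError("unexpected EOF")
        data += chunk
    return struct.unpack("q", data)[0]


class FakePopen:
    """Hypothetical stand-in for the forkserver Popen (illustration only)."""

    def __init__(self, use_lock):
        self.returncode = None
        self.eof_hits = 0
        self._lock = threading.Lock() if use_lock else None
        # The pipe carries exactly one status message, then the writer closes.
        self.sentinel, writer = os.pipe()
        os.write(writer, struct.pack("q", 0))
        os.close(writer)
        # Assumption: forces both pollers past the outer None check together.
        self._barrier = threading.Barrier(2)

    def poll(self):
        if self.returncode is None:
            self._barrier.wait()
            try:
                if self._lock is None:
                    # Unpatched shape: both threads read, one finds EOF.
                    self.returncode = read_status(self.sentinel)
                else:
                    # Patched shape: re-check under the lock before reading.
                    with self._lock:
                        if self.returncode is None:
                            self.returncode = read_status(self.sentinel)
            except EOFError:
                self.eof_hits += 1
                self.returncode = 255
        return self.returncode


def run(use_lock):
    """Poll one FakePopen from two threads; return how many reads hit EOF."""
    popen = FakePopen(use_lock)
    threads = [threading.Thread(target=popen.poll) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    os.close(popen.sentinel)
    return popen.eof_hits


if __name__ == "__main__":
    print("EOF hits without lock:", run(use_lock=False))
    print("EOF hits with lock:", run(use_lock=True))
```

In this sketch the unlocked variant always produces exactly one EOFError, because only one status message exists in the pipe, while the locked variant reads it once and the second thread skips the read. Whether the real load test hits the same interleaving through Process.join() in a worker thread and a concurrent poll from the main thread is an assumption based on the debugging above.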
CPython versions tested on:
3.14
Operating systems tested on:
Linux