forkserver fails silently under load causing Popen.poll() to return 255 #140867

@zmedico

Bug report

Bug description:

This forkserver load test program fails reliably when run with 8 jobs on my 8-core laptop:

#!/usr/bin/env python3.14

import asyncio
import multiprocessing
import os
import sys


def target(count):
    print("success", count)


async def main(argv):

    max_jobs = int(argv[1])

    jobs = {}
    count = 0
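    while True:
        count += 1

        # Once the job limit is reached, wait for at least one job to finish.
        if len(jobs) >= max_jobs:
            done, pending = await asyncio.wait(
                list(jobs), return_when=asyncio.FIRST_COMPLETED
            )
            for waiter in done:
                proc = jobs.pop(waiter)
                if proc.exitcode != os.EX_OK:
                    print("failure with exitcode", proc.exitcode)
                    # Let the remaining jobs finish before stopping the test.
                    if jobs:
                        await asyncio.wait(
                            list(jobs), return_when=asyncio.ALL_COMPLETED
                        )
                    return

        # Start the next job from the main thread; proc.join() runs in a
        # worker thread so that the event loop is not blocked.
        proc = multiprocessing.Process(target=target, args=(count,))
        proc.start()
        waiter = asyncio.ensure_future(asyncio.to_thread(proc.join))
        jobs[waiter] = proc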


if __name__ == "__main__":
    asyncio.run(main(sys.argv))

Result:

 $ ./forkserver_load_test.py 8
success 1
success 2
success 4
success 5
success 6
success 3
success 7
success 8
failure with exitcode 255
success 9
success 10
success 11
success 12
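
The reproducer does not select a start method, so it relies on forkserver being the default (which, as far as I know, is the case on Linux starting with Python 3.14). A minimal sketch of requesting it explicitly with `multiprocessing.get_context()`; the helper name `start_job` is hypothetical and not part of the original script:

```python
import multiprocessing

# Request the forkserver start method explicitly instead of relying on the
# platform default. This raises ValueError where forkserver is unavailable.
ctx = multiprocessing.get_context("forkserver")


def start_job(target, count):
    """Hypothetical helper: start one job through the forkserver context."""
    proc = ctx.Process(target=target, args=(count,))
    proc.start()
    return proc
```

In the reproducer, this would replace the `multiprocessing.Process(...)` and `proc.start()` pair inside the loop.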

I patched forkserver.py for debugging and found that SystemExit is not raised here until the test case's main function returns:

https://github.com/python/cpython/blob/v3.14.0/Lib/multiprocessing/forkserver.py#L272

I patched popen_forkserver.py for debugging and found that read_signed raises EOFError here:

https://github.com/python/cpython/blob/v3.14.0/Lib/multiprocessing/forkserver.py#L394

Meanwhile, it seems that the corresponding write_signed call was successful on the forkserver side.
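
A successful write followed by an EOFError on the read side is what one would see if the status were read twice. The following is a minimal sketch of that effect, not the stdlib code: `write_status`, `read_status`, and `demo` are hypothetical stand-ins, and the assumption is that the status is a single packed integer on a pipe whose write end is closed after the write.

```python
import os
import struct

# Hypothetical stand-in for the status encoding: one signed 64-bit integer.
STATUS = struct.Struct("q")


def write_status(fd, value):
    """Write one packed status to fd."""
    os.write(fd, STATUS.pack(value))


def read_status(fd):
    """Read exactly one packed status; raise EOFError if the pipe ends first."""
    data = b""
    while len(data) < STATUS.size:
        chunk = os.read(fd, STATUS.size - len(data))
        if not chunk:
            raise EOFError("unexpected EOF")
        data += chunk
    return STATUS.unpack(data)[0]


def demo():
    """Return the outcome of a first and a second read of the same status."""
    read_fd, write_fd = os.pipe()
    try:
        write_status(write_fd, 0)
    finally:
        os.close(write_fd)  # assumption: the writer closes its end afterwards
    try:
        first = read_status(read_fd)
        try:
            second = read_status(read_fd)
        except EOFError:
            second = "EOFError"
    finally:
        os.close(read_fd)
    return first, second
```

Here the first read returns the real status and the second finds the pipe drained, even though the write succeeded. That would match a poll() that reports 255 for a child that actually exited with 0.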

After more debugging, I found that the underlying problem is a thread-safety issue in how Popen.poll() sets self.returncode. The load test runs better with this patch:

--- a/Lib/multiprocessing/popen_forkserver.py
+++ b/Lib/multiprocessing/popen_forkserver.py
@@ -1,5 +1,6 @@
 import io
 import os
+import threading
 
 from .context import reduction, set_spawning_popen
 if not reduction.HAVE_SEND_HANDLE:
@@ -32,6 +33,7 @@
 
     def __init__(self, process_obj):
         self._fds = []
+        self._poll_lock = threading.Lock()
         super().__init__(process_obj)
 
     def duplicate_for_child(self, fd):
@@ -65,7 +67,9 @@
             if not wait([self.sentinel], timeout):
                 return None
             try:
-                self.returncode = forkserver.read_signed(self.sentinel)
+                with self._poll_lock:
+                    if self.returncode is None:
+                        self.returncode = forkserver.read_signed(self.sentinel)
             except (OSError, EOFError):
                 # This should not happen usually, but perhaps the forkserver
                 # process itself got killed
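
The locking pattern in the patch can be tried outside of multiprocessing. This is a minimal sketch under stated assumptions: `StatusReader` and `poll_from_threads` are hypothetical names, the pipe stands in for the sentinel, and a short read is mapped to 255 to mimic the failure reported above.

```python
import os
import struct
import threading

STATUS = struct.Struct("q")  # hypothetical status encoding for this sketch


class StatusReader:
    """Hypothetical stand-in for a Popen whose exit status arrives on a pipe."""

    def __init__(self, read_fd):
        self.sentinel = read_fd
        self.returncode = None
        self._poll_lock = threading.Lock()

    def poll(self):
        # Only the first caller reads the pipe; later callers reuse the
        # cached value instead of reading a drained pipe.
        with self._poll_lock:
            if self.returncode is None:
                data = os.read(self.sentinel, STATUS.size)
                if len(data) < STATUS.size:
                    self.returncode = 255  # short read: status was lost
                else:
                    self.returncode = STATUS.unpack(data)[0]
        return self.returncode


def poll_from_threads(n_threads=8):
    """Poll one StatusReader from several threads and collect the results."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, STATUS.pack(0))
    os.close(write_fd)
    reader = StatusReader(read_fd)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(reader.poll()))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    os.close(read_fd)
    return results
```

With the lock and the `is None` check, every thread sees the status that was written. Without them, a second reader could reach the pipe after it has been drained and record 255 instead.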

I created the test case because the same issue was triggered by a program called egencache here:

https://bugs.gentoo.org/965132

CPython versions tested on:

3.14

Operating systems tested on:

Linux

Labels: 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), stdlib (Standard Library Python modules in the Lib/ directory), topic-multiprocessing, type-bug (An unexpected behavior, bug, or error)
