Use `multiprocessing` `start_method` `"forkserver"` #4306

rwgk · 2022-11-01T22:28:54Z

Description

Please see #4105 (comment) for full background.

Alternative to PR #4305

Quoting from

https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

fork

... Note that safely forking a multithreaded process is problematic.

Available on Unix only. The default on Unix.

The idea to try "spawn" or "forkserver"is the result of deadlock debugging. After reviewing tracebacks @jbms wrote:

This actually looks like a general fork thread-safety issue, not related to pybind11. Using fork in a multithreaded program is generally not safe.

This stackoverflow posting provides some details:

https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread

Excerpts:

The problem is that fork() only copies the calling thread, and any mutexes held in child threads will be forever locked in the forked child.
...
The problem is that the order that pthread_atfork handlers are called for unrelated libraries is undefined (it depends on the order that the libraries are loaded by the program). So this means that technically a deadlock can happen inside of a prefork handler because of a race condition.

Note that this PR only changes how test_gil_scoped.py is run, therefore a changelog entry is not needed.

Suggested changelog entry:

rwgk · 2022-11-01T23:04:45Z

Both failures here are unrelated: test_iostream flake, eigen download problem. Trying again.

…similar to pybind#4306, the only difference between the two PRs is the `"spawn"` vs `"forkserver"` argument).

rwgk · 2022-11-02T18:17:53Z

Windows PyPy is still giving us DEADLOCK grief:

https://github.com/pybind/pybind11/actions/runs/3379884767/jobs/5612023977

@mattip Do you have an interest and the tools for debugging these deadlocks? I think I've only ever seen this under Windows, but with with different PyPy versions (3.7, 3.8, 3.9).

test_gil_scoped.py on master (or in the 2.10.1 release) is already set up for easy deadlock debugging, all you'd need to do is change these lines:

-ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS + (_intentional_deadlock,)
+ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS # (_intentional_deadlock,)
-SKIP_IF_DEADLOCK = True  # See PR #4216
+SKIP_IF_DEADLOCK = False  # See PR #4216
...
-    timeout = 0.1 if test_fn is _intentional_deadlock else 10
+    timeout = 0.1 if test_fn is _intentional_deadlock else 1000

Then run test_gil_scoped.py and wait until it hits a deadlock, then attach a debugger to inspect the stack traces for the two threads waiting on the GIL.

I don't have a suitable Windows machine and don't know anything about debugging under Windows.

rwgk · 2022-11-02T18:20:22Z

@Skylion007 Sorry I still need to put back the skipif for Windows PyPy, until we find someone who can help debugging the deadlocks for that platform.

rwgk · 2022-11-02T18:30:13Z

For completeness: There was another PyPy Windows deadlock under #4305, for PyPy 3.9 (the flake here was for 3.8):

https://github.com/pybind/pybind11/actions/runs/3379830331/jobs/5611902870

For Windows there is no difference between that PR and this one.

I'm not updating #4305 anymore.

rwgk · 2022-11-02T19:35:04Z

The CI passed now without any issues.

I'm not expecting any deadlocks, except for PyPy Windows, which are now reported as such via pytest.skip().

After this PR is merged, deadlocks for other platforms will show up as CI failures, rather than getting skipped. I hope not seeing such CI failures will confirm over time that we're good. I don't know of another way to get that confirmation quicker. See #4105 (comment).

rwgk · 2022-11-03T13:56:36Z

@henryiii could you please help reviewing this tiny PR? I think it will take less than 5 minutes. I'm really keen to restore more rigorous testing after the 2.10.1 release went out.

I wasn't completely comfortable effectively silencing the deadlock detection, it was mainly to support the 2.10.1 release.

2 days later I learned that multiprocessing "fork" is generally not safe in combination with threads, IOW we were using it wrong the whole time. The fix is super easy. Getting this merged soon means that we will easily see again when deadlocks are detected. I have high hopes that there will be none. Being sure about it will be very useful.

rwgk · 2022-11-12T00:39:56Z

@henryiii I'd appreciate if you could please comment on this PR.

rwgk · 2022-11-21T10:55:46Z

@eacousineau It would be great if you could help out here as well. It's a really tiny PR. As I wrote above, I'm still very keen to restore more rigorous testing after the 2.10.1 release went out.

EricCousineau-TRI

Sounds good! First pass, PTAL!

EricCousineau-TRI · 2022-11-22T01:06:52Z

tests/conftest.py

@@ -15,6 +17,10 @@
 # Early diagnostic for failed imports
 import pybind11_tests

+if os.name != "nt":


nit Can you add a comment as to why we're doing this here?
git blame is decent for more context, but I think two small sentences can help explain, e.g.:

# We want to shy away from Python's default choice for Linux of "fork", as it can incur # post-fork resource contention issues (such as thread ids, etc.).

I'm vaguely familiar with this, and with our codebase we hit it with either sledgehammers of:

"yes fork, but pin down threading for fast-but-reckless multiprocessing spawns"

"no fork, please use spawn, and don't think about it" (I still need to read about forkserver and how it may interact with imports like numpy or cv2 that may configure / spawn threads at import)
EricCousineau-TRI/repro@2cff386

Can you add a comment as to why we're doing this here?

Done. I tried to keep it concise.

Hm... While this and the issue describes the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?
(feel free to dismiss if you wanna add to issue discussion itself, or punt)

tests/test_gil_scoped.py

tests/conftest.py

Alternative to PR pybind#4305

… will resolve the deadlocks for good.

…-TRI

…-TRI

EricCousineau-TRI

Thanks! Remaining comment, but feel free to dismiss / punt / what not.

EricCousineau-TRI · 2022-11-23T01:01:32Z

tests/conftest.py

@@ -15,6 +17,10 @@
 # Early diagnostic for failed imports
 import pybind11_tests

+if os.name != "nt":


Hm... While this and the issue describes the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?
(feel free to dismiss if you wanna add to issue discussion itself, or punt)

rwgk · 2022-11-23T01:10:53Z

is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?

TBH I was going only by the Python documentation (link in PR description: "Note that safely forking a multithreaded process is problematic.") and what @jbms told me. But I just found this:

https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread

I'll add that to the PR description here.

Thanks for the review!

rwgk · 2022-11-23T01:16:31Z

The one CI failure is a well-known PyPy flake: expected refcount mismatch.

rwgk mentioned this pull request Nov 1, 2022

[BUG]: ABI incompatibility between 2.9.2 and 2.10 #4105

Closed

3 tasks

rwgk added a commit to rwgk/pybind11 that referenced this pull request Nov 2, 2022

Use multiprocessing.set_start_method("spawn") in conftest.py (very …

14e6371

…similar to pybind#4306, the only difference between the two PRs is the `"spawn"` vs `"forkserver"` argument).

rwgk marked this pull request as ready for review November 2, 2022 17:51

rwgk requested review from henryiii and Skylion007 November 2, 2022 17:53

rwgk requested a review from EricCousineau-TRI November 21, 2022 10:53

EricCousineau-TRI requested changes Nov 22, 2022

View reviewed changes

rwgk added 5 commits November 22, 2022 15:54

Use multiprocessing start_method "forkserver"

0c38251

Alternative to PR pybind#4305

Add link to comment under PR pybind#4105

0b089f3

Unconditionally pytest.skip("DEADLOCK") for PyPy Windows

2bbea88

Remove SKIP_IF_DEADLOCK entirely, for simplicity. Hopefully this PR…

29cea64

… will resolve the deadlocks for good.

Add "In a nutshell" comment, in response to request by @EricCousineau…

74bbc04

…-TRI

rwgk force-pushed the conftest_multiprocessing_forkserver branch from d1b1edf to 74bbc04 Compare November 23, 2022 00:19

EricCousineau-TRI approved these changes Nov 23, 2022

View reviewed changes

rwgk merged commit 9c18a74 into pybind:master Nov 23, 2022

rwgk deleted the conftest_multiprocessing_forkserver branch November 23, 2022 01:17

github-actions bot added the needs changelog Possibly needs a changelog entry label Nov 23, 2022

rwgk removed the needs changelog Possibly needs a changelog entry label Nov 23, 2022

rwgk mentioned this pull request Nov 23, 2022

Use multiprocessing start_method "spawn" #4305

Closed

rwgk mentioned this pull request Dec 1, 2022

Tracking test_gil_scoped.py DEADLOCKs #4373

Open

Skylion007 mentioned this pull request Dec 1, 2022

chore: Convert direct multiprocessing.set_start_method("forkserver") call to a pytest fixture. #4377

Merged

rwgk mentioned this pull request Feb 11, 2023

FWD pybind11 google/pybind11k#4306

Closed

rwgk mentioned this pull request Mar 16, 2023

Change always_forkserver_on_unix() to use_multiprocessing_forkserver_on_linux() #4577

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `multiprocessing` `start_method` `"forkserver"` #4306

Use `multiprocessing` `start_method` `"forkserver"` #4306

rwgk commented Nov 1, 2022 •

edited

rwgk commented Nov 1, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 3, 2022

rwgk commented Nov 12, 2022

rwgk commented Nov 21, 2022

EricCousineau-TRI left a comment

EricCousineau-TRI Nov 22, 2022

rwgk Nov 23, 2022

EricCousineau-TRI Nov 23, 2022

EricCousineau-TRI left a comment

EricCousineau-TRI Nov 23, 2022

rwgk commented Nov 23, 2022

rwgk commented Nov 23, 2022

Use multiprocessing start_method "forkserver" #4306

Use multiprocessing start_method "forkserver" #4306

Conversation

rwgk commented Nov 1, 2022 • edited

Description

Suggested changelog entry:

rwgk commented Nov 1, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 2, 2022

rwgk commented Nov 3, 2022

rwgk commented Nov 12, 2022

rwgk commented Nov 21, 2022

EricCousineau-TRI left a comment

Choose a reason for hiding this comment

EricCousineau-TRI Nov 22, 2022

Choose a reason for hiding this comment

rwgk Nov 23, 2022

Choose a reason for hiding this comment

EricCousineau-TRI Nov 23, 2022

Choose a reason for hiding this comment

EricCousineau-TRI left a comment

Choose a reason for hiding this comment

EricCousineau-TRI Nov 23, 2022

Choose a reason for hiding this comment

rwgk commented Nov 23, 2022

rwgk commented Nov 23, 2022

Use `multiprocessing` `start_method` `"forkserver"` #4306

Use `multiprocessing` `start_method` `"forkserver"` #4306

rwgk commented Nov 1, 2022 •

edited