Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multiprocessing start_method "forkserver" #4306

Merged
merged 5 commits into from
Nov 23, 2022

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Nov 1, 2022

Description

Please see #4105 (comment) for full background.

Alternative to PR #4305

Quoting from

https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

  • fork

    ... Note that safely forking a multithreaded process is problematic.

    Available on Unix only. The default on Unix.

The idea to try "spawn" or "forkserver"is the result of deadlock debugging. After reviewing tracebacks @jbms wrote:

This actually looks like a general fork thread-safety issue, not related to pybind11. Using fork in a multithreaded program is generally not safe.

This stackoverflow posting provides some details:

Excerpts:

The problem is that fork() only copies the calling thread, and any mutexes held in child threads will be forever locked in the forked child.
...
The problem is that the order that pthread_atfork handlers are called for unrelated libraries is undefined (it depends on the order that the libraries are loaded by the program). So this means that technically a deadlock can happen inside of a prefork handler because of a race condition.


Note that this PR only changes how test_gil_scoped.py is run, therefore a changelog entry is not needed.

Suggested changelog entry:

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 1, 2022

Both failures here are unrelated: test_iostream flake, eigen download problem. Trying again.

rwgk added a commit to rwgk/pybind11 that referenced this pull request Nov 2, 2022
…similar to pybind#4306, the only difference between the two PRs is the `"spawn"` vs `"forkserver"` argument).
@rwgk rwgk marked this pull request as ready for review November 2, 2022 17:51
@rwgk
Copy link
Collaborator Author

rwgk commented Nov 2, 2022

Windows PyPy is still giving us DEADLOCK grief:

https://github.com/pybind/pybind11/actions/runs/3379884767/jobs/5612023977

@mattip Do you have an interest and the tools for debugging these deadlocks? I think I've only ever seen this under Windows, but with with different PyPy versions (3.7, 3.8, 3.9).

test_gil_scoped.py on master (or in the 2.10.1 release) is already set up for easy deadlock debugging, all you'd need to do is change these lines:

-ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS + (_intentional_deadlock,)
+ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS # (_intentional_deadlock,)
-SKIP_IF_DEADLOCK = True  # See PR #4216
+SKIP_IF_DEADLOCK = False  # See PR #4216
...
-    timeout = 0.1 if test_fn is _intentional_deadlock else 10
+    timeout = 0.1 if test_fn is _intentional_deadlock else 1000

Then run test_gil_scoped.py and wait until it hits a deadlock, then attach a debugger to inspect the stack traces for the two threads waiting on the GIL.

I don't have a suitable Windows machine and don't know anything about debugging under Windows.

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 2, 2022

@Skylion007 Sorry I still need to put back the skipif for Windows PyPy, until we find someone who can help debugging the deadlocks for that platform.

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 2, 2022

For completeness: There was another PyPy Windows deadlock under #4305, for PyPy 3.9 (the flake here was for 3.8):

https://github.com/pybind/pybind11/actions/runs/3379830331/jobs/5611902870

For Windows there is no difference between that PR and this one.

I'm not updating #4305 anymore.

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 2, 2022

The CI passed now without any issues.

I'm not expecting any deadlocks, except for PyPy Windows, which are now reported as such via pytest.skip().

After this PR is merged, deadlocks for other platforms will show up as CI failures, rather than getting skipped. I hope not seeing such CI failures will confirm over time that we're good. I don't know of another way to get that confirmation quicker. See #4105 (comment).

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 3, 2022

@henryiii could you please help reviewing this tiny PR? I think it will take less than 5 minutes. I'm really keen to restore more rigorous testing after the 2.10.1 release went out.

I wasn't completely comfortable effectively silencing the deadlock detection, it was mainly to support the 2.10.1 release.

2 days later I learned that multiprocessing "fork" is generally not safe in combination with threads, IOW we were using it wrong the whole time. The fix is super easy. Getting this merged soon means that we will easily see again when deadlocks are detected. I have high hopes that there will be none. Being sure about it will be very useful.

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 12, 2022

@henryiii I'd appreciate if you could please comment on this PR.

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 21, 2022

@eacousineau It would be great if you could help out here as well. It's a really tiny PR. As I wrote above, I'm still very keen to restore more rigorous testing after the 2.10.1 release went out.

Copy link
Collaborator

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! First pass, PTAL!

@@ -15,6 +17,10 @@
# Early diagnostic for failed imports
import pybind11_tests

if os.name != "nt":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Can you add a comment as to why we're doing this here?
git blame is decent for more context, but I think two small sentences can help explain, e.g.:

# We want to shy away from Python's default choice for Linux of "fork", as it can incur
# post-fork resource contention issues (such as thread ids, etc.).

I'm vaguely familiar with this, and with our codebase we hit it with either sledgehammers of:

  • "yes fork, but pin down threading for fast-but-reckless multiprocessing spawns"
  • "no fork, please use spawn, and don't think about it" (I still need to read about forkserver and how it may interact with imports like numpy or cv2 that may configure / spawn threads at import)
    EricCousineau-TRI/repro@2cff386

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment as to why we're doing this here?

Done. I tried to keep it concise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... While this and the issue describes the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?
(feel free to dismiss if you wanna add to issue discussion itself, or punt)

tests/test_gil_scoped.py Outdated Show resolved Hide resolved
tests/conftest.py Show resolved Hide resolved
@rwgk rwgk force-pushed the conftest_multiprocessing_forkserver branch from d1b1edf to 74bbc04 Compare November 23, 2022 00:19
Copy link
Collaborator

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Remaining comment, but feel free to dismiss / punt / what not.

@@ -15,6 +17,10 @@
# Early diagnostic for failed imports
import pybind11_tests

if os.name != "nt":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... While this and the issue describes the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?
(feel free to dismiss if you wanna add to issue discussion itself, or punt)

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 23, 2022

is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?

TBH I was going only by the Python documentation (link in PR description: "Note that safely forking a multithreaded process is problematic.") and what @jbms told me. But I just found this:

https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread

I'll add that to the PR description here.

Thanks for the review!

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 23, 2022

The one CI failure is a well-known PyPy flake: expected refcount mismatch.

@rwgk rwgk merged commit 9c18a74 into pybind:master Nov 23, 2022
@rwgk rwgk deleted the conftest_multiprocessing_forkserver branch November 23, 2022 01:17
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Nov 23, 2022
@rwgk rwgk removed the needs changelog Possibly needs a changelog entry label Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants