Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing's default posix start method of 'fork' is broken: change to 'spawn' #84559

Open
itamarst mannequin opened this issue Apr 24, 2020 · 32 comments
Open

multiprocessing's default posix start method of 'fork' is broken: change to 'spawn' #84559

itamarst mannequin opened this issue Apr 24, 2020 · 32 comments
Assignees
Labels
3.13 bugs and security fixes topic-multiprocessing type-feature A feature request or enhancement

Comments

@itamarst
Copy link
Mannequin

itamarst mannequin commented Apr 24, 2020

BPO 40379
Nosy @pitrou, @mgorny, @Julian, @wimglenn, @applio, @itamarst

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-04-24.18:22:23.389>
labels = ['3.8', 'type-bug', '3.7', '3.9']
title = "multiprocessing's default start method of fork()-without-exec() is broken"
updated_at = <Date 2022-02-11.16:13:53.872>
user = 'https://bugs.python.org/itamarst'

bugs.python.org fields:

activity = <Date 2022-02-11.16:13:53.872>
actor = 'mgorny'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2020-04-24.18:22:23.389>
creator = 'itamarst'
dependencies = []
files = []
hgrepos = []
issue_num = 40379
keywords = []
message_count = 11.0
messages = ['367210', '367211', '368173', '380478', '392358', '392501', '392503', '392506', '392507', '392508', '413081']
nosy_count = 8.0
nosy_names = ['pitrou', 'mgorny', 'Julian', 'wim.glenn', 'itamarst', 'davin', 'itamarst2', 'aduncan']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40379'
versions = ['Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

Linked PRs

@itamarst
Copy link
Mannequin Author

itamarst mannequin commented Apr 24, 2020

By default, multiprocessing uses fork() without exec() on POSIX. For a variety of reasons this can lead to inconsistent state in subprocesses: module-level globals are copied, which can mess up logging, threads don't survive fork(), etc..

The end results vary, but quite often are silent lockups.

In real world usage, this results in users getting mysterious hangs they do not have the knowledge to debug.

The fix for these people is to use "spawn" by default, which is the default on Windows.

Just a small sample:

  1. Today I talked to a scientist who spent two weeks stuck, until she found my article on the subject (https://codewithoutrules.com/2018/09/04/python-multiprocessing/). Basically multiprocessing locked up, doing nothing forever. Switching to "spawn" fixed it.
  2. Default multiprocessing context is broken and should never be used dask/dask#3759 (comment) is someone who had issues fixed by "spawn".
  3. matmul operator @ can freeze / hang when used with default python multiprocessing using fork context instead of spawn numpy/numpy#15973 is a NumPy issue which apparently impacted scikit-learn.

I suggest changing the default on POSIX to match Windows.

@itamarst itamarst mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes type-bug An unexpected behavior, bug, or error labels Apr 24, 2020
@itamarst
Copy link
Mannequin Author

itamarst mannequin commented Apr 24, 2020

Looks like as of 3.8 this only impacts Linux/non-macOS-POSIX, so I'll amend the above to say this will also make it consistent with macOS.

@itamarst
Copy link
Mannequin Author

itamarst mannequin commented May 5, 2020

Just got an email from someone for whom switching to "spawn" fixed a problem. Earlier this week someone tweeted about this fixing things. This keeps hitting people in the real world.

@itamarst
Copy link
Mannequin Author

itamarst mannequin commented Nov 6, 2020

Another person with the same issue: https://twitter.com/volcan01010/status/1324764531139248128

@aduncan
Copy link
Mannequin

aduncan mannequin commented Apr 29, 2021

I just ran into and fixed (thanks to itamarst's blog post) a problem likely related to this.

Multiprocessing workers performing work and sending a logging message back with success/fail info. I had a few intermittent deadlocks that became a recurring problem when I sped up the process that skipped tasks which had previously completed (I think this shortened the time between forking and attempting to send messages causing the third process to deadlock). After changing that it deadlocked *every time*.

Switching to "spawn" at the top of the main function has fixed it.

@pitrou
Copy link
Member

pitrou commented Apr 30, 2021

The problem with changing the default is that this will break any application that depends on passing non-picklable data to the child process (in addition to the potentially unexpected performance impact).

The docs already contain a significant elaboration on the matter, but feel free to submit a PR that would make the various caveats more explicit:
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

@itamarst
Copy link
Mannequin Author

itamarst mannequin commented Apr 30, 2021

This change was made on macOS at some point, so why not Linux? "spawn" is already the default on macOS and Windows.

@pitrou
Copy link
Member

pitrou commented Apr 30, 2021

The macOS change was required before "fork" simply ceased to work.
Windows has always used "spawn", because no other method can be implemented on Windows.

@itamarst
Copy link
Mannequin Author

itamarst mannequin commented Apr 30, 2021

Given people's general experience, I would not say that "fork" works on Linux either. More like "99% of the time it works, 1% it randomly breaks in mysterious way".

@pitrou
Copy link
Member

pitrou commented Apr 30, 2021

Agreed, but again, changing will break some applications.

We could switch to forkserver, but we should have a transition period where a FutureWarning will be displayed if people didn't explicitly set a start method.

@mgorny
Copy link
Mannequin

mgorny mannequin commented Feb 11, 2022

After updating PyPy3 to use Python 3.9's stdlib, we hit very bad hangs because of this — literally compiling a single file with "parallel" compileall could hang. In the end, we had to revert the change in how Python 3.9 starts workers because otherwise multiprocessing would be impossible to use:

https://foss.heptapod.net/pypy/pypy/-/commit/c594b6c48a48386e8ac1f3f52d4b82f9c3e34784

This is a very bad default and what's even worse is that it often causes deadlocks that are hard to reproduce or debug. Furthermore, since "fork" is the default, people are unintentionally relying on its support for passing non-pickleable projects and are creating non-portable code. The code often becomes complex and hard to change before they discover the problem.

Before we managed to figure out how to workaround the deadlocks in PyPy3, we were experimenting with switching the default to "spawn". Unfortunately, we've hit multiple projects that didn't work with this method, precisely because of pickling problems. Furthermore, they were surprised to learn that their code wouldn't work on macOS (in the end, many people perceive Python as a language for writing portable software).

Finally, back in 2018 I've made one of my projects do parallel work using multiprocessing. It gave its users great speedup but for some it caused deadlocks that I couldn't reproduce nor debug. In the end, I had to revert it. Now that I've learned about this problem, I'm wondering if this wasn't precisely because of "fork" method.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
davidselassie added a commit to bytewax/bytewax that referenced this issue Apr 18, 2022
Provide a way for the calling code to specify which "multiprocessing
context" to use to spawn subprocesses. See
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

I'm using this to allow us to mock out multiprocessing with
multithreading in doctests. This will also let you more easily test
differences between "spawn" and "fork" modes.

I'm defaulting to using "spawn" because I think "fork" mode was the
cause of some mysterious hanging in tests. General consensus seems to
be "spawn" is less buggy:
python/cpython#84559 I've felt like tests
are consistently faster with it.

Also uses the `multiprocessing.Manager` as a context manager so it
gets cleaned up correctly. This might have been the cause of other
hanging in local cluster execution.
davidselassie added a commit to bytewax/bytewax that referenced this issue Apr 18, 2022
Provide a way for the calling code to specify which "multiprocessing
context" to use to spawn subprocesses. See
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

I'm using this to allow us to mock out multiprocessing with
multithreading in doctests. This will also let you more easily test
differences between "spawn" and "fork" modes.

I'm defaulting to using "spawn" because I think "fork" mode was the
cause of some mysterious hanging in tests. General consensus seems to
be "spawn" is less buggy:
python/cpython#84559 I've felt like tests
are consistently faster with it.

Also uses the `multiprocessing.Manager` as a context manager so it
gets cleaned up correctly. This might have been the cause of other
hanging in local cluster execution.
@itamarst
Copy link

Another example: Nelson Elhage reports that "as of recently(?) pytorch silently deadlocks (even without GPUs involved at all) using method=fork so that's been fun to debug".

Examples he provided:

@ravwojdyla
Copy link

After updating a couple of libraries in a project we are working on, the code would hang without much explanation. After much debugging, I think one of the reasons for our issues is the forking default (this issue). Our business logic does not use multiprocessing, but the underlying execution engine does (in our case Luigi). Turns out that gRPC client (which was buried deep into one of our dependencies) can hang in some cases when forked grpc/grpc#18075. This was the case for us, and was very tricky to debug.

@gpshead gpshead changed the title multiprocessing's default start method of fork()-without-exec() is broken multiprocessing's default posix start method of fork()-without-exec() is broken: change the default so spawn Dec 13, 2022
@gpshead gpshead added type-feature A feature request or enhancement and removed 3.9 only security fixes 3.8 only security fixes 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error labels Dec 13, 2022
@gpshead
Copy link
Member

gpshead commented Dec 13, 2022

general plan:

  • A DeprecationWarning in 3.12 and 3.13 when the default not-explicitly-specified start method of fork is used on platforms where that is the default.
  • 3.14: flip the default for all platforms to spawn.

per https://discuss.python.org/t/switching-default-multiprocessing-context-to-spawn-on-posix-as-well/21868

@gpshead gpshead self-assigned this Dec 13, 2022
@guillaumematheron
Copy link

This issue can cause duplicate uuids to be generated when using uuid1.
It looks like the first call to uuid1 sets up a global state, which is then copied to all processes when spawning a multiprocessing Pool.
Then all processes generate the same sequence of uuids.

See ClickHouse/clickhouse-connect#194 for more context on the issue and an example.

@pitrou
Copy link
Member

pitrou commented Jun 5, 2023

@guillaumematheron Which issue? The fact that fork is being used?

Regardless, uuid is an inefficient way to generate random ids.

@guillaumematheron
Copy link

Yes, as outlined in this comment, switching to spawn instead of fork prevents duplicate uuids from being generated.

Of course uuid1 is not a good way to generate random ids, since it's not random at all if a network address and counter are available.

But it should be safe to assume that it is unique, especially if it explicitly returns a flag saying that the value is "generated by the platform in a multiprocessing-safe way". Maybe it's worth adding a caveat to the uuid1 documentation ?

@pitrou
Copy link
Member

pitrou commented Jun 5, 2023

Ping @warsaw on the UUID multiprocessing-safety issue.

@gpshead
Copy link
Member

gpshead commented Jun 5, 2023

Please file a separate issue for the uuid module. It could be a reasonable decision to refresh global state like that upon fork, but that isn't going to happen buried in this issue. You could probably do it yourself today via os.register_at_fork().

@erlend-aasland erlend-aasland added 3.13 bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
zmedico added a commit to zmedico/portage that referenced this issue Feb 1, 2024
In order to migrate away from unsafe os.fork() usage in threaded
processes (python/cpython#84559), add a
returnproc parameter that is similar to returnpid, which causes
spawn to return Process objects instead of pids. The Process
API is a subset of asyncio.subprocess.Process.

In the future, spawn will return objects of a different type but
with a compatible interface to Process, in order to encapsulate
implementation-dependent objects like multiprocessing.Process which
are designed to manage the process lifecycle and need to persist
until it exits.

Trigger a UserWarning when the returnpid parameter is used, in
order to encourage migration to returnproc (do not use
DeprecationWarning since it is hidden by default). This warning
will be temporarily suppressed for portage internals, until they
finish migrating to returnproc. There are probably very few if
any external consumers of spawn with the returnpid parameter,
so it seems safe to move quickly with this deprecation.

Bug: https://bugs.gentoo.org/916566
Signed-off-by: Zac Medico <zmedico@gentoo.org>
zmedico added a commit to zmedico/portage that referenced this issue Feb 1, 2024
In order to migrate away from unsafe os.fork() usage in threaded
processes (python/cpython#84559), add a
returnproc parameter that is similar to returnpid, which causes
spawn to return Process objects instead of pids. The Process
API is a subset of asyncio.subprocess.Process.

In the future, spawn will return objects of a different type but
with a compatible interface to Process, in order to encapsulate
implementation-dependent objects like multiprocessing.Process which
are designed to manage the process lifecycle and need to persist
until it exits.

Trigger a UserWarning when the returnpid parameter is used, in
order to encourage migration to returnproc (do not use
DeprecationWarning since it is hidden by default). This warning
will be temporarily suppressed for portage internals, until they
finish migrating to returnproc. There are probably very few if
any external consumers of spawn with the returnpid parameter,
so it seems safe to move quickly with this deprecation.

Bug: https://bugs.gentoo.org/916566
Signed-off-by: Zac Medico <zmedico@gentoo.org>
zmedico added a commit to zmedico/portage that referenced this issue Feb 1, 2024
In order to migrate away from unsafe os.fork() usage in threaded
processes (python/cpython#84559), add a
returnproc parameter that is similar to returnpid, which causes
spawn to return a single Process object instead of a list of pids.
The Process API is a subset of asyncio.subprocess.Process. The
returnproc parameter conflicts with the logfile parameter, since
the caller is expected to use the fd_pipes parameter to implement
logging (this was also true for the returnpid parameter).

In the future, spawn will return objects of a different type but
with a compatible interface to Process, in order to encapsulate
implementation-dependent objects like multiprocessing.Process which
are designed to manage the process lifecycle and need to persist
until it exits.

Trigger a UserWarning when the returnpid parameter is used, in
order to encourage migration to returnproc (do not use
DeprecationWarning since it is hidden by default). This warning
will be temporarily suppressed for portage internals, until they
finish migrating to returnproc. There are probably very few if
any external consumers of spawn with the returnpid parameter,
so it seems safe to move quickly with this deprecation.

Bug: https://bugs.gentoo.org/916566
Signed-off-by: Zac Medico <zmedico@gentoo.org>
@mcepl
Copy link
Contributor

mcepl commented Mar 28, 2024

eltoder added a commit to eltoder/green that referenced this issue Apr 19, 2024
Use forkserver when available to avoid issues with forking multi-threaded
processes. See python/cpython#84559 for more
context.

This also removes a warning when running on python 3.12:

```
/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=1991) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
```
eltoder added a commit to eltoder/green that referenced this issue Apr 19, 2024
Use forkserver when available to avoid issues with forking
multi-threaded processes. See [1] for more context.

This also removes a warning when running on python 3.12 that can be seen
in CI:

> /opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=1991) is multi-threaded, use of fork() may lead to deadlocks in the child.
>   self.pid = os.fork()

[1] python/cpython#84559
eltoder added a commit to eltoder/green that referenced this issue Apr 19, 2024
Use forkserver when available to avoid issues with forking
multi-threaded processes. See [1] for more context.

This also removes a warning when running on python 3.12 that can be seen
in CI:

> /opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=1991) is multi-threaded, use of fork() may lead to deadlocks in the child.
>   self.pid = os.fork()

[1] python/cpython#84559
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes topic-multiprocessing type-feature A feature request or enhancement
Projects
Status: In Progress
Development

No branches or pull requests