tests with multiprocessing hang on Python 3.8 on macOS #11835

Closed
andyfaff opened this issue Apr 10, 2020 · 16 comments · Fixed by #11836
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected

@andyfaff
Contributor

andyfaff commented Apr 10, 2020

I'm running the whole test suite on macOS with Python 3.8 (via GitHub Actions and on my local machine, which runs Catalina). I'm seeing several issues with multiprocessing, such as #11827.

They're related to https://bugs.python.org/issue38501. The tests that hang are:

I think the best thing to do at the moment is to use the skipif decorator (possibly also conditioned on the platform being Darwin):

    import multiprocessing

    import pytest

    @pytest.mark.skipif(multiprocessing.get_start_method() != 'fork',
                        reason=('multiprocessing with spawn method is not'
                                ' compatible with pytest.'))
@andyfaff andyfaff added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Apr 10, 2020
@rgommers
Member

So much fun. I'll ping again on the CPython PR that seems to have introduced the issue.

@andyfaff
Contributor Author

@larsoner @rgommers, there's something fishy going on here.
I copied the failing test into a separate file, test_vec.py:

import numpy as np
from numpy.testing import assert_allclose
from scipy.integrate import quad_vec


def _lorenzian(x):
    return 1 / (1 + x**2)


def test_quad_vec_pool():
    from multiprocessing.dummy import Pool

    f = _lorenzian
    res, err = quad_vec(f, -np.inf, np.inf, norm='max', epsabs=1e-4, workers=4)
    assert_allclose(res, np.pi, rtol=0, atol=1e-4)

    with Pool(10) as pool:
        f = lambda x: 1 / (1 + x**2)
        res, err = quad_vec(f, -np.inf, np.inf, norm='max', epsabs=1e-4, workers=pool.map)
        assert_allclose(res, np.pi, rtol=0, atol=1e-4)

If I run pytest test_vec.py, it passes. This is the same test for which I see a failure if I run the scipy tests via python runtests.py.

I tried to see if it was because a dummy Pool was being used, so I made it into a real Pool. Then I got a complaint about pickling the lambda, but when I made it into a regular function the cut-down test_vec.py still passed. Rather than the problem being in Python, I'm wondering if there's some other problem with our pytest and runtests.py setup.

@andyfaff
Contributor Author

I can go into optimize.tests.test__differential_evolution, remove the known fails (which are supposed to be 'due to cleaning up MapWrapper on Python 3.8'), and that individual test file (after moving it somewhere else) runs to completion without issue if I call it via pytest test__differential_evolution.py.

(this is with Python 3.8.2 on macOS).

Also pinging @pv and @peterbell10, because I think we've added too many skips for tests with multiprocessing (thinking it's the fault of Python / MapWrapper / "using 'spawn' on macOS"), when I think the fault lies somewhere in runtests.py or pytest.

@andyfaff
Contributor Author

test_mapwrapper_parallel in scipy._lib.test__util also works on macOS + Python 3.8, where 'spawn' is the default start method.

--> multiprocessing with the spawn method is compatible with pytest (I've just done it), just not with how we test it.
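For reference, a minimal sketch (not from the thread) of exercising a real process pool with the 'spawn' start method requested explicitly, which is what macOS + Python 3.8 does implicitly; `square` and `check_spawn_pool` are hypothetical names:

```python
import multiprocessing


def square(x):
    return x * x


def check_spawn_pool():
    # Explicitly request the 'spawn' start method instead of relying on the
    # platform default, mirroring what macOS + Python 3.8 does by default.
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == '__main__':
    # Guarded so the module can safely be re-imported by spawned workers.
    print(check_spawn_pool())  # [1, 4, 9]
```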

@larsoner
Member

From the Python issue it seems like if/when `__del__` is called can matter, which in turn can depend on garbage-collection timing. So I'm not sure running an individual test necessarily rules that test in or out as the problem unless we check deletion states.

Maybe one option is to `atexit.register` the closing of the pool to ensure the threads are dealt with properly.

@larsoner
Member

And by "the pool" I mean the one created by quad_vec when "workers" is supplied. Basically we need to ensure that it's closed, and just relying on garbage collection is not safe enough

@larsoner
Member

I can confirm that my MWE still fails on my system (running python whatever.py just hangs instead of exiting):

from multiprocessing import Pool

class A(object):
    def __init__(self):
        self.pool = Pool(processes=2)

solver = A()

But modifying it with atexit fixes things (python whatever.py runs and exits without issue):

import atexit
from multiprocessing import Pool

class A(object):
    def __init__(self):
        self.pool = Pool(processes=2)
        atexit.register(self.pool.close)

solver = A()

So I think doing this sort of thing wherever we create a Pool (without having a context manager to automatically close it for us) is a good idea.

@larsoner
Member

Actually, it looks like MapWrapper, when used as a context manager, should already close the pool, and quad_vec uses it in this way (?). So maybe this isn't the same bug...
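For context, a minimal sketch of the context-manager usage being described. `MapWrapper` is a private scipy API whose import path reflects scipy at the time of this issue and may move between versions; `cube` is a hypothetical function:

```python
def cube(x):
    return x ** 3


try:
    # Private scipy API; the import path may change between versions.
    from scipy._lib._util import MapWrapper
except ImportError:  # scipy not installed, or the private path moved
    MapWrapper = None

if __name__ == '__main__' and MapWrapper is not None:
    # On exiting the with-block, MapWrapper closes and joins its underlying
    # Pool, which is the cleanup behaviour under discussion.
    with MapWrapper(pool=2) as mapper:
        print(list(mapper(cube, [1, 2, 3])))  # [1, 8, 27]
```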

@andyfaff
Contributor Author

I've just been trying to investigate this. In most of our test cases either MapWrapper or Pool is being used in a situation where the resources are being closed.

@andyfaff
Contributor Author

andyfaff commented Apr 11, 2020

There is a test in _lib.tests.test__util called test_mapwrapper_parallel. It currently has a skipif decorator on it. This is because the test hangs on Python 3.8 (macOS) if 'spawn' is used. Note that the MapWrapper implementation does attempt to clean up all resources.

To check whether it's the scipy MapWrapper implementation being faulty, I then created another test function that uses multiprocessing directly:

# place in test__util.py
def test_pool():
    with Pool(2) as p:
        p.map(np.sin, [1, 2, 3, 4])

When I try to run this via runtests.py the test stalls as well (i.e. the fault isn't with MapWrapper). When I run it in an interpreter (from scipy._lib.tests import test__util; test__util.test_pool()), or run it directly via pytest XXXX, there is no hang.

when the test is:

def test_pool():
    with Pool(2) as p:
        pass

there is no hang when run via runtests.py.

So:

  • the hang is in the map call of a Pool object, not in the creation or destruction of resources.
  • the hang only occurs if a test uses multiprocessing, the default start method is 'spawn' (Python 3.8 and darwin), and the test is started via runtests.py.
  • if runtests.py isn't used the tests pass.

Ergo, it's how we use pytest in runtests.py.

@andyfaff
Contributor Author

andyfaff commented Apr 11, 2020

Well...
It seems that if I add multiprocessing.freeze_support() to the __main__ block of runtests.py, then the multiprocessing hangs don't occur; see #11836
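For reference, a sketch of the guard pattern the multiprocessing docs describe for freeze_support; the helper names `double` and `main` are illustrative, not taken from runtests.py:

```python
import multiprocessing


def double(x):
    return 2 * x


def main():
    with multiprocessing.Pool(2) as pool:
        return pool.map(double, [1, 2, 3])


if __name__ == '__main__':
    # The docs say freeze_support() should be called directly after the
    # __main__ check, before any other multiprocessing use; on non-Windows
    # platforms it is documented as a no-op.
    multiprocessing.freeze_support()
    print(main())  # [2, 4, 6]
```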

@peterbell10
Member

peterbell10 commented Apr 11, 2020

It seems that if I add multiprocessing.freeze_support() in runtests.py.__main__, then the multiprocessing hangs don't occur

That's very weird. Compare to the documentation:

Calling freeze_support() has no effect when invoked on any operating system other than Windows.


it's how we use pytest in runtests.py

Are you using the -n flag with runtests.py? That runs the tests in subprocesses, so it could be relevant.

@andyfaff
Contributor Author

It's not the first time, and it certainly won't be the last, that documentation doesn't keep up.

The story goes as follows... In a GUI app I'm developing, whenever I started multiprocessing the GUI would generate a cascade of replicas of the main window. On investigating further, the suggestion was that freeze_support was what was needed, because the script was recursively re-importing the main script, at which point another window would appear. freeze_support fixed the issue.

When I was interrupting the hung runtests.py yesterday, at some point I saw a stack trace showing an attempt to recursively import the runtests.py script. The recursion rang a bell, so I tried using freeze_support and then everything worked.

The Windows thing in the Python documentation may be referring to programs that spawn rather than fork. macOS with Python < 3.8 used 'fork' by default; from Python 3.8 onwards 'spawn' is the default.

I'm not using -n.
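A quick way to check the active and available start methods (an illustrative snippet, not from the thread):

```python
import multiprocessing

# 'fork' was the macOS default before Python 3.8; from 3.8 onwards the
# default is 'spawn', because fork() is unsafe with some macOS frameworks.
print(multiprocessing.get_start_method())       # e.g. 'spawn' on macOS 3.8+
print(multiprocessing.get_all_start_methods())  # e.g. ['fork', 'spawn', 'forkserver']
```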

@peterbell10
Member

The Windows thing in the Python documentation may be referring to programs that spawn rather than fork. macOS with Python < 3.8 used 'fork' by default; from Python 3.8 onwards 'spawn' is the default

Looking at the implementation for freeze_support, it literally does nothing if sys.platform != 'win32':
https://github.com/python/cpython/blob/61511488cf4e7a1cb57a38efba7e0a84a387fe58/Lib/multiprocessing/context.py#L148

@peterbell10
Member

Pure speculation, but is it possible that adding import multiprocessing to runtests.py was what made the difference? It might change how the python module is initialised on the spawned process.

@andyfaff
Contributor Author

@peterbell10, you're correct. Just tested it on my local machine:

  • just the presence of import multiprocessing is enough to prevent the hanging (freeze_support is not required).
  • the tests also don't hang if there's no import of multiprocessing but runtests.py uses parallelisation (I used -j 4).

It's a bit crufty that just the presence of the import is enough...
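To illustrate why top-level code in the entry-point script matters under 'spawn' (a sketch; `work` is a hypothetical function):

```python
import multiprocessing


def work(x):
    return x + 1


# Under 'spawn', each worker starts a fresh interpreter and re-imports the
# parent's __main__ module, so any unguarded top-level code runs again in
# every child. Keeping pool creation behind the guard is what prevents the
# kind of recursive-import cascade seen with runtests.py.
if __name__ == '__main__':
    with multiprocessing.get_context('spawn').Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))  # [2, 3, 4]
```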

rht added a commit to rht/mesa that referenced this issue Mar 21, 2022
tylerjereddy added a commit to tylerjereddy/scipy that referenced this issue Jan 20, 2023
* the `setup.py` MacOS Python `3.8` CI job is starting
to fail with timeouts for multiprocessing `Pool`-related
code in recent PRs like scipygh-17777 and scipygh-17829

* nothing stands out in the build logs when I do a side-by-side
diff, and the GHA job history seems to suggest the failure doesn't
happen every time the CI is flushed, but perhaps more often lately

* make a few cleanups here in an attempt to fix

* first off, we can simplify the Pythran config--there's only one
job left in this file after the splitting with `meson`, and it is
Python `3.8` (`3.8` support may be on the way out too, but I'm assuming we
delay that a tiny bit more perhaps)

* previous discussions like scipy#11835 with
MacOS Python 3.8 multiprocessing Pool hangs were complicated, but
sometimes using `runtests.py` (which is also on the way out I think..)
contributed to hangs b/c of multiprocessing import orders. So, try to
use `pytest` directly instead to see if it helps

* I suspect these changes are "just fine" to apply even
if we're going to switch to 3.9 minimum sooner than later
tylerjereddy added a commit to tylerjereddy/scipy that referenced this issue Jan 21, 2023
* the `setup.py` MacOS Python `3.8` CI job is starting
to fail with timeouts for `multiprocessing` `Pool`-related
code in recent PRs like scipygh-17777 and scipygh-17829

* nothing stands out in the build logs when I do a side-by-side
diff vs. succeeding versions of the job, and the GHA job history
seems to suggest the failure doesn't happen every time the CI is flushed,
but perhaps more often lately

* make a few cleanups here in an attempt to fix
  - previous discussions like scipygh-11835 with MacOS Python 3.8
    `multiprocessing` `Pool` hangs were complicated, but
    sometimes using `runtests.py` contributed to hangs b/c
    of `multiprocessing` import orders. So, try to
    use `pytest` directly instead to see if it helps.
  - we can simplify the Pythran config--there's only one
    job left in this file after the splitting with `meson`, and it is
    Python `3.8`
  - I've also extended the timeout limit a little bit, and since
    GHA MacOS runners have 3 cores I've tried using all 3 for the
    testsuite here because I once had the job pass on my fork during
    testing, but it took 66 minutes: #65

* I think there's still something a bit weird here, so it is a
  combination of measures that seemed to help a bit on my fork
  but perhaps long-term the removal of Python 3.8 support along with
  sunsetting `runtests.py` means that this mess may disappear before
  we need to dig much deeper

[skip azp] [skip circle] [skip cirrus]
tylerjereddy added a commit to tylerjereddy/scipy that referenced this issue Feb 12, 2023