Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_multiprocessing_forkserver.test_processes: ResourceTracker.ensure_running() calls itself and hangs indirectly via the GC #109593

Closed
vstinner opened this issue Sep 20, 2023 · 8 comments
Labels
tests Tests in the Lib/test dir topic-multiprocessing

Comments

@vstinner
Copy link
Member

vstinner commented Sep 20, 2023

PPC64LE Fedora Stable Refleaks 3.x: test_wrapped_exception() of test_multiprocessing_forkserver.test_processes hanged while calling SemLock._cleanup():

    @staticmethod
    def _cleanup(name):
        from .resource_tracker import unregister
        sem_unlink(name)
        unregister(name, "semaphore")    # <=========== HERE

It seems like there is a reentrant call to multiprocessing ResourceTracker.ensure_running():

  • The first ensure_running() call triggers indirectly a SemLock._cleanup() call (via the GC?)
  • SemLock._cleanup() calls indirectly ensure_running() again: reentrant call

Sadly, ensure_running() is not reentrant, on purpose, there is a lock.

Traceback:

Thread 0x00007fff900c3560 (most recent call first):
  File "multiprocessing/resource_tracker.py", line 83 in ensure_running
  File "multiprocessing/resource_tracker.py", line 162 in _send
  File "multiprocessing/resource_tracker.py", line 159 in unregister
  File "multiprocessing/synchronize.py", line 88 in _cleanup
  File "multiprocessing/util.py", line 224 in __call__
  File "multiprocessing/resource_tracker.py", line 142 in _check_alive
  File "multiprocessing/resource_tracker.py", line 86 in ensure_running
  File "multiprocessing/resource_tracker.py", line 162 in _send
  File "multiprocessing/resource_tracker.py", line 155 in register
  File "multiprocessing/synchronize.py", line 80 in __init__
  File "multiprocessing/synchronize.py", line 169 in __init__
  File "multiprocessing/context.py", line 68 in Lock
  File "multiprocessing/queues.py", line 348 in __init__
  File "multiprocessing/context.py", line 113 in SimpleQueue
  File "multiprocessing/pool.py", line 196 in __init__
  File "multiprocessing/context.py", line 119 in Pool
  File "test/_test_multiprocessing.py", line 2759 in test_wrapped_exception

Logs:

3:13:36 load avg: 0.00 running (1): test.test_multiprocessing_forkserver.test_processes (3 hour 13 min)
3:14:06 load avg: 0.00 running (1): test.test_multiprocessing_forkserver.test_processes (3 hour 13 min)
3:14:36 load avg: 0.00 running (1): test.test_multiprocessing_forkserver.test_processes (3 hour 14 min)
3:15:06 load avg: 0.00 running (1): test.test_multiprocessing_forkserver.test_processes (3 hour 14 min)
3:15:14 load avg: 0.00 [463/463/1] test.test_multiprocessing_forkserver.test_processes process crashed (Exit code 1)
beginning 6 repetitions
123456
.Timeout (3:15:00)!
Thread 0x00007fff900c3560 (most recent call first):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 83 in ensure_running
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 162 in _send
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 159 in unregister
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/synchronize.py", line 88 in _cleanup
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/util.py", line 224 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 142 in _check_alive
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 86 in ensure_running
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 162 in _send
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py", line 155 in register
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/synchronize.py", line 80 in __init__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/synchronize.py", line 169 in __init__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/context.py", line 68 in Lock
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/queues.py", line 348 in __init__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/context.py", line 113 in SimpleQueue
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/pool.py", line 196 in __init__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/context.py", line 119 in Pool
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/_test_multiprocessing.py", line 2759 in test_wrapped_exception
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/case.py", line 589 in _callTestMethod
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/case.py", line 634 in run
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/case.py", line 690 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/testresult.py", line 146 in run
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1137 in _run_suite
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1264 in run_unittest
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 36 in run_unittest
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 92 in test_func
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/refleak.py", line 98 in runtest_refleak
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 44 in regrtest_runner
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 95 in _load_run_test
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 138 in _runtest_env_changed_exc
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 238 in _runtest
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/single.py", line 266 in run_single_test
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/worker.py", line 82 in worker_process
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/worker.py", line 105 in main
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/libregrtest/worker.py", line 109 in <module>
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/runpy.py", line 88 in _run_code
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/runpy.py", line 198 in _run_module_as_main
/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown: {'/mp-ua2e965e', '/mp-f77qvmf8', '/mp-d90nu2aq', '/mp-cy9wze2d', '/mp-aaymn_uv', '/mp-2084fimf', '/mp-uikfm3xh', '/mp-tnyb9wy8', '/mp-25wh4qnj', '/mp-dqgkqocn', '/mp-t975recw', '/mp-_xlso5jp', '/mp-kpyjiqja', '/mp-8vxz_2x7', '/mp-ff194eer', '/mp-scudx30c', '/mp-5h88loxe', '/mp-vxtproxg', '/mp-_fsq5wjy', '/mp-c7vg72xx', '/mp-sh9waipj', '/mp-pyh_3m14', '/mp-g9xvzc65'}
  warnings.warn(
/home/buildbot/buildarea/3.x.cstratak-fedora-stable-ppc64le.refleak/build/Lib/multiprocessing/resource_tracker.py:238: UserWarning: resource_tracker: '/mp-d90nu2aq': [Errno 2] No such file or directory
  warnings.warn('resource_tracker: %r: %s' % (name, e))

build: https://buildbot.python.org/all/#/builders/280/builds/824

Linked PRs

@vstinner vstinner added tests Tests in the Lib/test dir topic-multiprocessing labels Sep 20, 2023
@vstinner
Copy link
Member Author

cc @pitrou

@vstinner
Copy link
Member Author

Last time I had to fix a reentrant call involving the GC, it was in threading.enumerate(). I fixed it by replacing _allocate_lock() with RLock(): use a reentrant lock.

See commit 243fd01 of issue #88588.

@pitrou
Copy link
Member

pitrou commented Sep 20, 2023

A reentrant lock would probably fix the hang, but it wouldn't make the code correct against reentrancy...

@pitrou
Copy link
Member

pitrou commented Sep 20, 2023

What we need is a lock that is reentrant and tells us that we've entered a reentrant call.
It would probably be easy to enhance the RLock API with that.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 20, 2023
multiprocessing: Reduce the risk of reentrant calls to
ResourceTracker.ensure_running() by running explicitly all finalizers
before acquiring the ResourceTracker lock.
@vstinner
Copy link
Member Author

@pitrou:

A reentrant lock would probably fix the hang, but it wouldn't make the code correct against reentrancy...

Right, I agree.

What we need is a lock that is reentrant and tells us that we've entered a reentrant call. It would probably be easy to enhance the RLock API with that.

Sure, something like that would be doable, but it looks a little bit complicated.

I came up with a simpler approach: run explicitly all finalizers before acquiring the lock: PR #109620.

@vstinner
Copy link
Member Author

Script to trigger the bug:

import multiprocessing
import multiprocessing.synchronize
import multiprocessing.resource_tracker
import faulthandler
import gc

# Try values from 1 to 10 to trigger the bug
GC_THRESHOLD = 3

# spawn uses SemLock._cleanup()
ctx = multiprocessing.get_context('spawn')

faulthandler.dump_traceback_later(10, exit=True)

class RefCycle:
    pass

if 1:
    # Create a lock which will register its SemLock._cleanup() function
    # in util._finalizer_registry. This _cleanup() will be called
    # indirectly by the garbage collector.
    sem = multiprocessing.synchronize.Semaphore(ctx=ctx)

    cycle = RefCycle()
    cycle.cycle = cycle
    cycle.sem = sem
    sem = None
    cycle = None

# Start the process
print("ensure_running")
gc.set_threshold(GC_THRESHOLD)
multiprocessing.resource_tracker.ensure_running()
print("ensure_running--")

gc.collect()

# Stop explicitly the process
print("stop resource tracker")
multiprocessing.resource_tracker._resource_tracker._stop()

print("exit")

vstinner added a commit to vstinner/cpython that referenced this issue Sep 20, 2023
multiprocessing: Reduce the risk of reentrant calls to
ResourceTracker.ensure_running() by running explicitly a garbage
collection, to call pending finalizers, before acquiring the
ResourceTracker lock.
vstinner added a commit to vstinner/cpython that referenced this issue Sep 20, 2023
multiprocessing: Reduce the risk of reentrant calls to
ResourceTracker.ensure_running() by running explicitly a garbage
collection, to call pending finalizers, before acquiring the
ResourceTracker lock.
pitrou added a commit to pitrou/cpython that referenced this issue Sep 20, 2023
@vstinner
Copy link
Member Author

Another failure on x86-64 macOS 3.x: https://buildbot.python.org/all/#/builders/366/builds/5426

pitrou added a commit to pitrou/cpython that referenced this issue Sep 22, 2023
pitrou added a commit that referenced this issue Sep 26, 2023
…109629)

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 26, 2023
…cker (pythonGH-109629)

---------

(cherry picked from commit 0eb9883)

Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 26, 2023
…cker (pythonGH-109629)

---------

(cherry picked from commit 0eb9883)

Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
pitrou added a commit that referenced this issue Sep 26, 2023
…acker (GH-109629) (#109897)

gh-109593: Fix reentrancy issue in multiprocessing resource_tracker (GH-109629)

---------

(cherry picked from commit 0eb9883)

Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
csm10495 pushed a commit to csm10495/cpython that referenced this issue Sep 28, 2023
…cker (python#109629)

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
@vstinner vstinner changed the title multiprocessing. ResourceTracker.ensure_running() calls itself and hangs indirectly via the GC test_multiprocessing_forkserver.test_processes: ResourceTracker.ensure_running() calls itself and hangs indirectly via the GC Oct 2, 2023
Yhg1s pushed a commit that referenced this issue Oct 2, 2023
…acker (GH-109629) (#109898)

gh-109593: Fix reentrancy issue in multiprocessing resource_tracker (GH-109629)

---------

(cherry picked from commit 0eb9883)

Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
@pitrou
Copy link
Member

pitrou commented Oct 2, 2023

This has been fixed in #109629 and backported to 3.12 and 3.11.

@pitrou pitrou closed this as completed Oct 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir topic-multiprocessing
Projects
None yet
Development

No branches or pull requests

2 participants