
test_concurrent_futures: test_gh105829_should_not_deadlock_if_wakeup_pipe_full() hangs on ARM64 macOS 3.x #109917

Closed

vstinner opened this issue Sep 26, 2023 · 9 comments
Labels: tests (Tests in the Lib/test dir)

vstinner (Member) commented Sep 26, 2023

ARM64 macOS 3.x:

0:19:31 load avg: 2.16 [460/460/2] test.test_concurrent_futures.test_deadlock process crashed (Exit code 1)
Timeout (0:15:00)!
Thread 0x000000016ff7b000 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 348 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/multiprocessing/queues.py", line 231 in _feed
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1003 in run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1066 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1023 in _bootstrap

Thread 0x000000016ef6f000 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/selectors.py", line 398 in select
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/multiprocessing/connection.py", line 1135 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 419 in wait_result_broken_or_wakeup
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 346 in run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 276 in mock_run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1066 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1023 in _bootstrap

Thread 0x00000001e2671300 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 348 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/queue.py", line 140 in put
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 285 in wakeup
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 857 in shutdown
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/_base.py", line 647 in __exit__
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 299 in test_gh105829_should_not_deadlock_if_wakeup_pipe_full
  (...)

The test passed when re-run in verbose mode:

0:19:45 load avg: 2.10 [2/2/1] test.test_concurrent_futures.test_deadlock passed

build: https://buildbot.python.org/all/#/builders/725/builds/5749

@vstinner vstinner added the tests Tests in the Lib/test dir label Sep 26, 2023
@vstinner vstinner changed the title test_concurrent_futures.test_deadlock failed with 15 min timeout (hang?) on ARM64 macOS 3.x test_concurrent_futures: test_gh105829_should_not_deadlock_if_wakeup_pipe_full() hangs on ARM64 macOS 3.x Sep 26, 2023
elfstrom (Contributor) commented

I have only looked very briefly but I suspect this is a symptom of a race due to the wakeup mock that is used to force blocking behaviour. I will investigate further.

elfstrom (Contributor) commented

If my guess for what goes wrong is correct then I think this would fix it:

--- a/Lib/test/test_concurrent_futures/test_deadlock.py
+++ b/Lib/test/test_concurrent_futures/test_deadlock.py
@@ -280,11 +280,12 @@ def wakeup(self):
                 super().wakeup()
 
             def clear(self):
+                super().clear()
                 try:
                     while True:
                         self._dummy_queue.get_nowait()
                 except queue.Empty:
-                    super().clear()
+                    pass
 
         with (unittest.mock.patch.object(futures.process._ExecutorManagerThread,
                                          'run', mock_run),

I can't reproduce the failure locally though, so I might be completely wrong about what's going on.

vstinner (Member, Author) commented

The only purpose of _ThreadWakeup is to notify _ExecutorManagerThread when something bad happens:

    def wait_result_broken_or_wakeup(self):
        ...
        wakeup_reader = self.thread_wakeup._reader
        ...
        ready = mp.connection.wait(readers + worker_sentinels)

        ...
        elif wakeup_reader in ready:
            is_broken = False

The number of wakeup() calls doesn't matter: 1 or 1000 is the same.

I don't get test_gh105829_should_not_deadlock_if_wakeup_pipe_full(). It blocks if wakeup() is called more than once? Why?

_ThreadWakeup.wakeup() has a strange implementation: it writes 0 bytes into the writer side of the pipe:

class _ThreadWakeup:
    def __init__(self):
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        if not self._closed:
            self._writer.send_bytes(b"")

Why not send a byte and ignore the error when the pipe is full?

asyncio has a similar "self-pipe" pattern, but it just ignores when the pipe is full:

class BaseSelectorEventLoop(base_events.BaseEventLoop):
    def _write_to_self(self):
        try:
            self._csock.send(b'\0')
        except OSError:
            if self._debug:
                logger.debug("Fail to write a null byte into the "
                             "self-pipe socket",
                             exc_info=True)

Also, there is a simple way to not fill the pipe: write a single byte and then record in an attribute that a byte was written.
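
That "single pending byte" idea could look roughly like this. This is only a sketch of the approach described above, not the real CPython code; the class name, attribute names, and the use of a raw os.pipe() are my own (and a real version would still need a lock around the check-and-set of the flag to be thread safe):

```python
import os


class SelfPipeWakeup:
    """Hypothetical sketch: write at most one byte into the self-pipe and
    record in an attribute that a byte is pending, so the pipe can never
    fill up no matter how many times wakeup() is called."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()
        self._awaken = False  # True while one wakeup byte is in the pipe

    def wakeup(self):
        # NOTE: this check-and-set is racy without a lock; a real
        # implementation would serialize wakeup()/clear() calls.
        if not self._awaken:
            self._awaken = True
            os.write(self._write_fd, b"\x00")  # exactly one byte in flight

    def clear(self):
        if self._awaken:
            os.read(self._read_fd, 1)  # drain the single pending byte
            self._awaken = False
```

A selector waiting on `_read_fd` still becomes ready after the first wakeup() call, and repeated calls are no-ops until clear() drains the byte.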

vstinner (Member, Author) commented

See also issue gh-105829: concurrent.futures.ProcessPoolExecutor pool deadlocks when submitting many tasks.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023
…p.wakeup()

Remove test_gh105829_should_not_deadlock_if_wakeup_pipe_full() of
test_concurrent_futures.test_deadlock. The test is no longer
relevant.
vstinner (Member, Author) commented

I proposed PR gh-110129 to fix the issue.

elfstrom (Contributor) commented

The only purpose of _ThreadWakeup is to notify _ExecutorManagerThread when something bad happens:

    def wait_result_broken_or_wakeup(self):
        ...
        wakeup_reader = self.thread_wakeup._reader
        ...
        ready = mp.connection.wait(readers + worker_sentinels)

        ...
        elif wakeup_reader in ready:
            is_broken = False

The number of wakeup() calls doesn't matter: 1 or 1000 is the same.

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.
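
That pattern can be sketched in miniature (hypothetical names and structure, not the real executor code): one wakeup token per submitted item, and a full drain of the pipe each time the manager thread wakes, so no submission can be missed:

```python
import multiprocessing as mp
import queue
import threading

work = queue.Queue()           # stands in for the executor's pending work items
done = []                      # stands in for the worker call queue
reader, writer = mp.Pipe(duplex=False)  # the wakeup pipe

def submit(item):
    work.put(item)             # 1) make the work item visible first
    writer.send_bytes(b"")     # 2) then push one wakeup token, like wakeup()

def manager(n_expected):
    while len(done) < n_expected:
        reader.poll(None)      # block until at least one token arrives
        while reader.poll():   # fully drain the wakeup pipe...
            reader.recv_bytes()
        while True:            # ...then move every pending work item
            try:
                done.append(work.get_nowait())
            except queue.Empty:
                break

t = threading.Thread(target=manager, args=(3,))
t.start()
for i in range(3):
    submit(i)
t.join()
```

Because the item is queued before its token is sent, any token the manager observes guarantees the corresponding item is already visible, in every interleaving.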

I don't get test_gh105829_should_not_deadlock_if_wakeup_pipe_full(). It blocks if wakeup() is called more than once? Why?

_ThreadWakeup.wakeup() has a strange implementation: it writes 0 bytes into the writer side of the pipe:

class _ThreadWakeup:
    def __init__(self):
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        if not self._closed:
            self._writer.send_bytes(b"")

Why not send a byte and ignore the error when the pipe is full?

We started with that type of fix in #108513; please read through the commits and discussion for the full background on why we ended up with the current state.

asyncio has a similar "self-pipe" pattern, but it just ignores when the pipe is full:

class BaseSelectorEventLoop(base_events.BaseEventLoop):
    def _write_to_self(self):
        try:
            self._csock.send(b'\0')
        except OSError:
            if self._debug:
                logger.debug("Fail to write a null byte into the "
                             "self-pipe socket",
                             exc_info=True)

Also, there is a simple way to not fill the pipe: write a single byte and then record in an attribute that a byte was written.

The tradeoff was to prefer removing the lock over avoiding the pipe write. @tomMoral thinks it is possible to do both, and that is likely true, but we went with the safer option for the time being to get the deadlock fix out the door. There are some interesting corner cases surrounding what happens when the call queue is full and the management thread cannot actually move any new work at that time. It is likely ok (assuming it is ok without the change), but at least I didn't feel like risking breaking anything.

One thing is very clear, however: there is a lot of contention while new work is submitted from the main thread, and the workers can easily get starved during this period because the _ExecutorManagerThread cannot move work into their queue fast enough. Improvements here are probably worthwhile but would require better tests and benchmarks.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
…p.wakeup()

Replace test_gh105829_should_not_deadlock_if_wakeup_pipe_full() test
which was mocking too many concurrent.futures internals with a new
test_wakeup() functional test.

Co-Authored-By: elfstrom <elfstrom@users.noreply.github.com>
vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
Add a lock to _ThreadWakeup which is used internally by _ThreadWakeup
methods to serialize method calls to make the API thread safe.

No longer use the shutdown lock to access _ThreadWakeup.
vstinner (Member, Author) commented

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.

Hum. You are describing a bad synchronization between two threads. Usually the solution to a synchronization problem is adding a new lock. I wrote PR gh-110137 to add a lock to _ThreadWakeup to make it thread safe. Currently, the shutdown lock is used to access _ThreadWakeup, but IMO it's a bad idea to attempt to protect too many things with a single lock.
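
A minimal sketch of that direction (the class name and details here are mine, not necessarily what PR gh-110137 actually contains): an internal lock serializes all method calls, so the object is thread safe on its own instead of relying on the executor's shutdown lock:

```python
import multiprocessing as mp
import threading


class LockedThreadWakeup:
    """Hypothetical sketch: a _ThreadWakeup-like object whose own lock
    makes wakeup()/clear()/close() safe to call from any thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        with self._lock:
            if not self._closed:
                self._writer.send_bytes(b"")

    def clear(self):
        with self._lock:
            if not self._closed:
                while self._reader.poll():  # drain everything pending
                    self._reader.recv_bytes()

    def close(self):
        with self._lock:
            if not self._closed:
                self._closed = True
                self._writer.close()
                self._reader.close()
```

The design point is that each method does its own check of `_closed` under the same lock, so a wakeup() racing with close() can never write to a closed pipe.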

elfstrom (Contributor) commented Sep 30, 2023

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.

Hum. You are describing a bad synchronization between two threads. Usually the solution to a synchronization problem is adding a new lock. I wrote PR gh-110137 to add a lock to _ThreadWakeup to make it thread safe. Currently, the shutdown lock is used to access _ThreadWakeup, but IMO it's a bad idea to attempt to protect too many things with a single lock.

The purpose of the wakeup object is not synchronization (atomicity or mutual exclusion). The only goal is to communicate that new work may be available. This is currently done by sending messages on a Pipe, which is thread safe. The only time the main thread and the _ExecutorManagerThread read/write shared state is during shutdown (the _closed bool), and this is protected by the shutdown lock, so it is also thread safe. I may have missed something, but to me it looks like there are no threading issues with the current wakeup implementation.

elfstrom added a commit to elfstrom/cpython that referenced this issue Oct 3, 2023
The test had an instability issue due to the ordering of the dummy
queue operation and the real wakeup pipe operations. Both primitives
are thread safe but not done atomically as a single update and may
interleave arbitrarily. With the old order of operations this can lead
to an incorrect state where the dummy queue is full but the wakeup
pipe is empty. By swapping the order in clear() I think this can no
longer happen in any possible operation interleaving (famous last
words).
cjw296 pushed a commit that referenced this issue Oct 3, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 3, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 3, 2023
cjw296 pushed a commit that referenced this issue Oct 3, 2023 (#110316)

gh-109917: Fix test instability in test_concurrent_futures (GH-110306)
cjw296 pushed a commit that referenced this issue Oct 3, 2023 (#110315)
vstinner (Member, Author) commented Oct 4, 2023

Fixed by a376a72

@vstinner vstinner closed this as completed Oct 4, 2023