
test_concurrent_futures: test_gh105829_should_not_deadlock_if_wakeup_pipe_full() hangs on ARM64 macOS 3.x #109917

Closed

vstinner opened this issue Sep 26, 2023 · 9 comments
Labels: tests (Tests in the Lib/test dir)

vstinner (Member) commented Sep 26, 2023

ARM64 macOS 3.x:

0:19:31 load avg: 2.16 [460/460/2] test.test_concurrent_futures.test_deadlock process crashed (Exit code 1)
Timeout (0:15:00)!
Thread 0x000000016ff7b000 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 348 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/multiprocessing/queues.py", line 231 in _feed
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1003 in run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1066 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1023 in _bootstrap

Thread 0x000000016ef6f000 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/selectors.py", line 398 in select
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/multiprocessing/connection.py", line 1135 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 419 in wait_result_broken_or_wakeup
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 346 in run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 276 in mock_run
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1066 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 1023 in _bootstrap

Thread 0x00000001e2671300 (most recent call first):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/threading.py", line 348 in wait
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/queue.py", line 140 in put
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 285 in wakeup
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/process.py", line 857 in shutdown
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/concurrent/futures/_base.py", line 647 in __exit__
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_concurrent_futures/test_deadlock.py", line 299 in test_gh105829_should_not_deadlock_if_wakeup_pipe_full
  (...)

The test passed when re-run in verbose mode:

0:19:45 load avg: 2.10 [2/2/1] test.test_concurrent_futures.test_deadlock passed

build: https://buildbot.python.org/all/#/builders/725/builds/5749

@vstinner vstinner added the tests Tests in the Lib/test dir label Sep 26, 2023
@vstinner vstinner changed the title test_concurrent_futures.test_deadlock failed with 15 min timeout (hang?) on ARM64 macOS 3.x test_concurrent_futures: test_gh105829_should_not_deadlock_if_wakeup_pipe_full() hangs on ARM64 macOS 3.x Sep 26, 2023
elfstrom (Contributor) commented

I have only looked very briefly but I suspect this is a symptom of a race due to the wakeup mock that is used to force blocking behaviour. I will investigate further.

elfstrom (Contributor) commented

If my guess for what goes wrong is correct then I think this would fix it:

--- a/Lib/test/test_concurrent_futures/test_deadlock.py
+++ b/Lib/test/test_concurrent_futures/test_deadlock.py
@@ -280,11 +280,12 @@ def wakeup(self):
                 super().wakeup()
 
             def clear(self):
+                super().clear()
                 try:
                     while True:
                         self._dummy_queue.get_nowait()
                 except queue.Empty:
-                    super().clear()
+                    pass
 
         with (unittest.mock.patch.object(futures.process._ExecutorManagerThread,
                                          'run', mock_run),

I can't reproduce the failure locally though, so I might be completely wrong about what's going on.

vstinner (Member, Author) commented

The only purpose of _ThreadWakeup is to notify _ExecutorManagerThread when something bad happens:

    def wait_result_broken_or_wakeup(self):
        ...
        wakeup_reader = self.thread_wakeup._reader
        ...
        ready = mp.connection.wait(readers + worker_sentinels)

        ...
        elif wakeup_reader in ready:
            is_broken = False

The number of wakeup() calls doesn't matter: 1 or 1000 is the same.

I don't get test_gh105829_should_not_deadlock_if_wakeup_pipe_full(). It blocks if wakeup() is called more than once? Why?

_ThreadWakeup.wakeup() has a strange implementation: it writes 0 bytes into the writer side of the pipe:

class _ThreadWakeup:
    def __init__(self):
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        if not self._closed:
            self._writer.send_bytes(b"")

Why not send a byte and ignore the error when the pipe is full?

asyncio has a similar "self-pipe" pattern, but it just ignores when the pipe is full:

class BaseSelectorEventLoop(base_events.BaseEventLoop):
    def _write_to_self(self):
        try:
            self._csock.send(b'\0')
        except OSError:
            if self._debug:
                logger.debug("Fail to write a null byte into the "
                             "self-pipe socket",
                             exc_info=True)

Also, there is a simple way to not fill the pipe: write a single byte and then record in an attribute that a byte was written.
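
That "single pending byte" idea could look roughly like this. This is only a sketch of the approach described above, not the real CPython code; the class name, attribute names, and the use of a raw os.pipe() are my own (and a real version would still need a lock around the check-and-set of the flag to be thread safe):

```python
import os


class SelfPipeWakeup:
    """Hypothetical sketch: write at most one byte into the self-pipe and
    record in an attribute that a byte is pending, so the pipe can never
    fill up no matter how many times wakeup() is called."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()
        self._awaken = False  # True while one wakeup byte is in the pipe

    def wakeup(self):
        # NOTE: this check-and-set is racy without a lock; a real
        # implementation would serialize wakeup()/clear() calls.
        if not self._awaken:
            self._awaken = True
            os.write(self._write_fd, b"\x00")  # exactly one byte in flight

    def clear(self):
        if self._awaken:
            os.read(self._read_fd, 1)  # drain the single pending byte
            self._awaken = False
```

A selector waiting on `_read_fd` still becomes ready after the first wakeup() call, and repeated calls are no-ops until clear() drains the byte.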

vstinner (Member, Author) commented

See also issue gh-105829: concurrent.futures.ProcessPoolExecutor pool deadlocks when submitting many tasks.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 29, 2023
…p.wakeup()

Remove test_gh105829_should_not_deadlock_if_wakeup_pipe_full() of
test_concurrent_futures.test_deadlock. The test is no longer
relevant.
vstinner (Member, Author) commented

I proposed PR gh-110129 to fix the issue.

elfstrom (Contributor) commented

The only purpose of _ThreadWakeup is to notify _ExecutorManagerThread when something bad happens:

    def wait_result_broken_or_wakeup(self):
        ...
        wakeup_reader = self.thread_wakeup._reader
        ...
        ready = mp.connection.wait(readers + worker_sentinels)

        ...
        elif wakeup_reader in ready:
            is_broken = False

The number of wakeup() calls doesn't matter: 1 or 1000 is the same.

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.
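
That pattern can be sketched in miniature (hypothetical names and structure, not the real executor code): one wakeup token per submitted item, and a full drain of the pipe each time the manager thread wakes, so no submission can be missed:

```python
import multiprocessing as mp
import queue
import threading

work = queue.Queue()           # stands in for the executor's pending work items
done = []                      # stands in for the worker call queue
reader, writer = mp.Pipe(duplex=False)  # the wakeup pipe

def submit(item):
    work.put(item)             # 1) make the work item visible first
    writer.send_bytes(b"")     # 2) then push one wakeup token, like wakeup()

def manager(n_expected):
    while len(done) < n_expected:
        reader.poll(None)      # block until at least one token arrives
        while reader.poll():   # fully drain the wakeup pipe...
            reader.recv_bytes()
        while True:            # ...then move every pending work item
            try:
                done.append(work.get_nowait())
            except queue.Empty:
                break

t = threading.Thread(target=manager, args=(3,))
t.start()
for i in range(3):
    submit(i)
t.join()
```

Because the item is queued before its token is sent, any token the manager observes guarantees the corresponding item is already visible, in every interleaving.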

I don't get test_gh105829_should_not_deadlock_if_wakeup_pipe_full(). It blocks if wakeup() is called more than once? Why?

_ThreadWakeup.wakeup() has a strange implementation: it writes 0 bytes into the writer side of the pipe:

class _ThreadWakeup:
    def __init__(self):
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        if not self._closed:
            self._writer.send_bytes(b"")

Why not send a byte and ignore the error when the pipe is full?

We started with that type of fix in #108513; please read through the commits and discussion for the full background on why we ended up with the current state.

asyncio has a similar "self-pipe" pattern, but it just ignores when the pipe is full:

class BaseSelectorEventLoop(base_events.BaseEventLoop):
    def _write_to_self(self):
        try:
            self._csock.send(b'\0')
        except OSError:
            if self._debug:
                logger.debug("Fail to write a null byte into the "
                             "self-pipe socket",
                             exc_info=True)

Also, there is a simple way to not fill the pipe: write a single byte and then record in an attribute that a byte was written.

The tradeoff was to prefer removing the lock over avoiding the pipe write. @tomMoral thinks it is possible to do both, and that is likely true, but we went with the safer option for the time being to get the deadlock fix out the door. There are some interesting corner cases surrounding what happens when the call queue is full and the management thread cannot actually move any new work at that time. It is likely ok (assuming it is ok without the change), but at least I didn't feel like risking breaking anything.

One thing is very clear, however: there is a lot of contention while new work is submitted from the main thread, and the workers can easily get starved during this period because the _ExecutorManagerThread cannot move work into their queue fast enough. Improvements here are probably worthwhile but would require better tests and benchmarks.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
…p.wakeup()

Replace test_gh105829_should_not_deadlock_if_wakeup_pipe_full() test
which was mocking too many concurrent.futures internals with a new
test_wakeup() functional test.

Co-Authored-By: elfstrom <elfstrom@users.noreply.github.com>
vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Sep 30, 2023
Add a lock to _ThreadWakeup which is used internally by _ThreadWakeup
methods to serialize method calls to make the API thread safe.

No longer use the shutdown lock to access _ThreadWakeup.
vstinner (Member, Author) commented

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.

Hum. You are describing a bad synchronization between two threads. Usually the solution to a synchronization problem is adding a new lock. I wrote PR gh-110137 to add a lock to _ThreadWakeup to make it thread safe. Currently, the shutdown lock is used to access _ThreadWakeup, but IMO it's a bad idea to attempt to protect too many things with a single lock.
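
A minimal sketch of that direction (the class name and details here are mine, not necessarily what PR gh-110137 actually contains): an internal lock serializes all method calls, so the object is thread safe on its own instead of relying on the executor's shutdown lock:

```python
import multiprocessing as mp
import threading


class LockedThreadWakeup:
    """Hypothetical sketch: a _ThreadWakeup-like object whose own lock
    makes wakeup()/clear()/close() safe to call from any thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def wakeup(self):
        with self._lock:
            if not self._closed:
                self._writer.send_bytes(b"")

    def clear(self):
        with self._lock:
            if not self._closed:
                while self._reader.poll():  # drain everything pending
                    self._reader.recv_bytes()

    def close(self):
        with self._lock:
            if not self._closed:
                self._closed = True
                self._writer.close()
                self._reader.close()
```

The design point is that each method does its own check of `_closed` under the same lock, so a wakeup() racing with close() can never write to a closed pipe.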

elfstrom (Contributor) commented Sep 30, 2023

This is not true. When the main thread pushes a new work item, it needs to make sure the _ExecutorManagerThread wakes up and moves the work item into the worker process call queue. The push of the work item alone is not sufficient for that: the _ExecutorManagerThread only waits on results, worker process sentinels, and this wakeup pipe, which can be used to "manually" trigger a wakeup. In the current implementation the main thread pushes one object into the wakeup pipe for every new work item, and the wakeup pipe is then fully drained when the _ExecutorManagerThread wakes up. This makes sure new work items are moved into the call queue and made available to the workers.

Hum. You are describing a bad synchronization between two threads. Usually the solution to a synchronization problem is adding a new lock. I wrote PR gh-110137 to add a lock to _ThreadWakeup to make it thread safe. Currently, the shutdown lock is used to access _ThreadWakeup, but IMO it's a bad idea to attempt to protect too many things with a single lock.

The purpose of the wakeup object is not synchronization (atomicity or mutual exclusion). The only goal is to communicate that new work may be available. This is currently done by sending messages on a Pipe, which is thread safe. The only time the main thread and the _ExecutorManagerThread read/write shared state is during shutdown (the _closed bool), and this is protected by the shutdown lock, so it is also thread safe. I may have missed something, but to me it looks like there are no threading issues with the current wakeup implementation.

elfstrom added a commit to elfstrom/cpython that referenced this issue Oct 3, 2023
The test had an instability issue due to the ordering of the dummy
queue operation and the real wakeup pipe operations. Both primitives
are thread safe but not done atomically as a single update and may
interleave arbitrarily. With the old order of operations this can lead
to an incorrect state where the dummy queue is full but the wakeup
pipe is empty. By swapping the order in clear() I think this can no
longer happen in any possible operation interleaving (famous last
words).
cjw296 pushed a commit that referenced this issue Oct 3, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 3, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 3, 2023
cjw296 pushed a commit that referenced this issue Oct 3, 2023 (#110316)

gh-109917: Fix test instability in test_concurrent_futures (GH-110306)
cjw296 pushed a commit that referenced this issue Oct 3, 2023 (#110315)
vstinner (Member, Author) commented Oct 4, 2023

Fixed by a376a72

@vstinner vstinner closed this as completed Oct 4, 2023