Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor #84176

Closed
vstinner opened this issue Mar 17, 2020 · 34 comments
Assignees
Labels
3.9 tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

vstinner commented Mar 17, 2020

BPO 39995
Nosy @pitrou, @vstinner, @tomMoral, @corona10, @aeros
PRs
  • bpo-39995: Split test_concurrent_futures.test_crash() into sub-tests #19739
  • bpo-39995: Fix race condition in ProcessPoolExecutor._ThreadWakeup #19751
  • bpo-39995: Fix concurrent.futures._ThreadWakeup race condition #19758
  • bpo-39995: Fix concurrent.futures _ThreadWakeup #19760
  • bpo-39995: CLN remove some locks in ProcessPoolExecutor #19788
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2021-05-26.22:49:56.216>
    created_at = <Date 2020-03-17.17:01:54.705>
    labels = ['tests', '3.9']
    title = 'test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor'
    updated_at = <Date 2021-05-29.02:25:33.225>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-05-29.02:25:33.225>
    actor = 'aeros'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2021-05-26.22:49:56.216>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2020-03-17.17:01:54.705>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39995
    keywords = ['patch']
    message_count = 34.0
    messages = ['364451', '364546', '364552', '367434', '367449', '367450', '367451', '367453', '367455', '367458', '367460', '367462', '367463', '367465', '367486', '367487', '367503', '367508', '367512', '367513', '367539', '367540', '367542', '367543', '367544', '367605', '367608', '367616', '367617', '367631', '367632', '367635', '394494', '394707']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'vstinner', 'tomMoral', 'corona10', 'aeros']
    pr_nums = ['19739', '19751', '19758', '19760', '19788']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39995'
    versions = ['Python 3.9']

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 17, 2020

    AMD64 Ubuntu Shared 3.x:
    https://buildbot.python.org/all/#/builders/101/builds/532

    test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ...
    Stdout:
    15.51s

    Stderr:
    Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
    Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
    Dangling thread: <_MainThread(MainThread, started 140540525950528)>
    Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

    (...)

    ======================================================================
    ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [exit at task unpickle]
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
        executor.shutdown(wait=True)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 721, in shutdown
        self._executor_manager_thread_wakeup.wakeup()
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 93, in wakeup
        self._writer.send_bytes(b"")
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    Stdout:
    15.51s

    Stderr:
    Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
    Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
    Dangling thread: <_MainThread(MainThread, started 140540525950528)>
    Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

    --

    On the same build, test_concurrent_futures timed out after 15 min, while running test_ressources_gced_in_workers():

    0:29:08 load avg: 1.46 Re-running test_concurrent_futures in verbose mode
    (...)
    test_map_exception (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
    test_map_timeout (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
    test_max_workers_negative (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
    test_max_workers_too_large (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... skipped 'Windows-only process limit'
    test_no_stale_references (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
    Timeout (0:15:00)!
    Thread 0x00007f38bf766700 (most recent call first):
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/queues.py", line 227 in _feed
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 882 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007f38bff67700 (most recent call first):
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 415 in select
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 936 in wait
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 372 in wait_result_broken_or_wakeup
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 319 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007f38c7128640 (most recent call first):
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/base.py", line 434 in result
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 955 in test_ressources_gced_in_workers
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 616 in _callTestMethod
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 659 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 719 in __call
    _
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 176 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/init.py", line 2079 in _run_suite
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/init.py", line 2201 in run_unittest
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 234 in _runtest_inner2
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 318 in rerun_failed_tests
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 691 in _main
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 634 in main
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 712 in main
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/main.py", line 2 in <module>
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 87 in _run_code
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 194 in _run_module_as_main
    test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... Makefile:1171: recipe for target 'buildbottest' failed
    make: *** [buildbottest] Error 1

    command timed out: 1200 seconds without output running [b'make', b'buildbottest', b'TESTOPTS=-j2 --junit-xml test-results.xml ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=900'], attempting to kill
    program finished with exit code 2
    elapsedTime=3849.667593

    @vstinner vstinner added 3.9 tests Tests in the Lib/test dir labels Mar 17, 2020
    @vstinner vstinner changed the title test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor Mar 17, 2020
    @vstinner vstinner changed the title test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor Mar 17, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 18, 2020

    Same bug on AMD64 FreeBSD Non-Debug 3.x:
    https://buildbot.python.org/all/#/builders/214/builds/472

    ======================================================================
    ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
        executor.shutdown(wait=True)
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 721, in shutdown
        self._executor_manager_thread_wakeup.wakeup()
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 93, in wakeup
        self._writer.send_bytes(b"")
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    Stdout:
    8.35s

    Stderr:
    Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
    Dangling thread: <_MainThread(MainThread, started 34369908736)>
    Dangling thread: <_ExecutorManagerThread(Thread-114, stopped daemon 34396434432)>
    Dangling thread: <Thread(QueueFeederThread, stopped daemon 34377508352)>

    ----------------------------------------------------------------------

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 18, 2020

    Same bug on AMD64 FreeBSD Non-Debug 3.x:
    https://buildbot.python.org/all/#/builders/214/builds/472

    Oh, test_crash failed twice, but not on the same test case:

    • ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
    • ERROR: test_crash (test.test_concurrent_futures.ProcessPoolForkExecutorDeadlockTest) [crash during func execution on worker]

    The second failure was when test_concurrent_futures was re-run sequentially.

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    New changeset 5d1f32d by Victor Stinner in branch 'master':
    bpo-39995: Split test_concurrent_futures.test_crash() into sub-tests (GH-19739)
    5d1f32d

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    AMD64 Ubuntu Shared 3.x:
    https://buildbot.python.org/all/#/builders/101/builds/809

    ======================================================================
    ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkserverProcessPoolShutdownTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
        executor.shutdown(wait=False)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
        self._executor_manager_thread_wakeup.wakeup()
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
        self._writer.send_bytes(b"")
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 188, in send_bytes
        self._check_closed()
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 141, in _check_closed
        raise OSError("handle is closed")
    OSError: handle is closed

    (...)
    0:32:37 load avg: 1.64 Re-running test_concurrent_futures in verbose mode
    (...)

    ======================================================================
    ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
        executor.shutdown(wait=False)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
        self._executor_manager_thread_wakeup.wakeup()
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
        self._writer.send_bytes(b"")
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".

    I pushed a commit 1a27501:
    "Process.shutdown(wait=True) of concurrent.futures now closes
    explicitly the result queue."

    test_shutdown_deadlock_pickle() still rely on the queue to be closed implicitly. Queue created at:

    (...)
    File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", lineno 1196
    with self.executor_type(max_workers=2,
    File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", lineno 637
    self._result_queue = mp_context.SimpleQueue()
    File "/home/vstinner/python/master/Lib/multiprocessing/context.py", lineno 113
    return SimpleQueue(ctx=self.get_context())
    File "/home/vstinner/python/master/Lib/multiprocessing/queues.py", lineno 341
    self._reader, self._writer = connection.Pipe(duplex=False)
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", lineno 539
    c2 = Connection(fd2, readable=False)

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    AMD64 Fedora Stable Clang Installed 3.x:
    https://buildbot.python.org/all/#/builders/127/builds/679

    0:04:21 load avg: 1.29 [423/423/1] test_concurrent_futures failed (2 min 39 sec)
    Warning -- threading_cleanup() failed to cleanup -1 threads (count: 0, dangling: 3)
    Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
    Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
    Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
    Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
    Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
    Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
    Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
    /home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
    Warning -- threading._dangling was modified by test_concurrent_futures
    test_cancel (test.test_concurrent_futures.FutureTests) ... ok
    test_cancelled (test.test_concurrent_futures.FutureTests) ... ok
    test_done (test.test_concurrent_futures.FutureTests) ... ok
    (...)
    test_first_exception_some_already_complete (test.test_concurrent_futures.ThreadPoolWaitTests) ... 1.60s ok
    test_pending_calls_race (test.test_concurrent_futures.ThreadPoolWaitTests) ... 0.11s ok
    test_timeout (test.test_concurrent_futures.ThreadPoolWaitTests) ... 6.11s ok
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
        sem_unlink(name)
    FileNotFoundError: [Errno 2] No such file or directory
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
        sem_unlink(name)
    FileNotFoundError: [Errno 2] No such file or directory
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
        sem_unlink(name)
    FileNotFoundError: [Errno 2] No such file or directory
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
        sem_unlink(name)
    FileNotFoundError: [Errno 2] No such file or directory
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
        sem_unlink(name)
    FileNotFoundError: [Errno 2] No such file or directory

    ======================================================================
    ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/test/test_concurrent_futures.py", line 130, in tearDown
        self.executor.shutdown(wait=True)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 724, in shutdown
        self._executor_manager_thread_wakeup.wakeup()
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
        self._writer.send_bytes(b"")
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    Ran 193 tests in 159.805s

    FAILED (errors=1, skipped=6)
    Before: set()
    After: {<weakref at 0x7f0828b17bd0; to 'SpawnProcess' at 0x7f082ace7400>, <weakref at 0x7f0828b17b80; to 'SpawnProcess' at 0x7f082ace7880>, <weakref at 0x7f0828b17b30; to 'SpawnProcess' at 0x7f0828bbc0a0>, <weakref at 0x7f0828b17720; to 'SpawnProcess' at 0x7f082ace72e0>, <weakref at 0x7f0828b17a90; to 'SpawnProcess' at 0x7f082ace7910>}
    Before: {<weakref at 0x7f082b5ad400; to '_MainThread' at 0x7f082bcd8850>}
    After: {<weakref at 0x7f0828b4d900; to 'Thread' at 0x7f0828bbc4c0>, <weakref at 0x7f0828b4d0e0; to '_MainThread' at 0x7f082bcd8850>, <weakref at 0x7f0828b17a40; to '_ExecutorManagerThread' at 0x7f0828bbc8e0>}
    test test_concurrent_futures failed

    == Tests result: FAILURE ==

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    x86 Gentoo Installed with X 3.x:
    https://buildbot.python.org/all/#/builders/128/builds/726

    test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ...

    Warning -- Unraisable exception
    Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0xb5067898>
    Traceback (most recent call last):
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 281, in weakref_cb
        thread_wakeup.wakeup()
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
        self._writer.send_bytes(b"")
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    0.04s ok

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
    (...)
    File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
    OSError: [Errno 9] Bad file descriptor

    It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

    • self._handle was equal to 4 at the function entry
    • self._handle was equal to None when write() was called

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

    The connection was closed by terminate_broken() called by _ExecutorManagerThread.run() thread:

    test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... close handle 4
    File "/home/vstinner/python/master/Lib/threading.py", line 908, in _bootstrap
    self._bootstrap_inner()
    File "/home/vstinner/python/master/Lib/threading.py", line 950, in _bootstrap_inner
    self.run()
    File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 313, in run
    self.terminate_broken(cause)
    File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 456, in terminate_broken
    self.join_executor_internals()
    File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 503, in join_executor_internals
    self.thread_wakeup.close()
    File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 75, in close
    self._writer.close()

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    terminate_broken() method was added by:

    commit 0e89076
    Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
    Date: Sun Mar 1 21:49:14 2020 +0100

    bpo-39678: refactor queue manager thread (GH-18551)
    

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)

    The patch below makes this test failure more likely:

    diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
    index 510e4b5aba..63518e55d9 100644
    --- a/Lib/multiprocessing/connection.py
    +++ b/Lib/multiprocessing/connection.py
    @@ -370,6 +370,7 @@ class Connection(_ConnectionBase):
         def _send(self, buf, write=_write):
             remaining = len(buf)
             while True:
    +            time.sleep(0.050)
                 n = write(self._handle, buf)
                 remaining -= n
                 if remaining == 0:

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 27, 2020

    It seems like test_killed_child() race condition was introduced by:

    commit a5cbab5 (refs/bisect/bad)
    Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
    Date: Sun Feb 16 19:09:26 2020 +0100

    bpo-39104: Fix hanging ProcessPoolExecutor on shutdown nowait with pickling failure (GH-17670)
    

    @aeros
    Copy link
    Member

    aeros commented Apr 28, 2020

    I can't be certain for the other failures, but I'm currently exploring a potential solution for addressing the test_killed_child failure. To me, it seems to be a race condition with attempting to call _ThreadWakeup.close() while there are still bytes being sent. IMO, we should wait until closing the pipe's reader and writer until it's finished sending or receiving bytes. Here's one way to implement that w/ threading.Event:

    diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
    index 8e9b69a8f0..9bf073fc34 100644
    --- a/Lib/concurrent/futures/process.py
    +++ b/Lib/concurrent/futures/process.py
    @@ -68,21 +68,30 @@ class _ThreadWakeup:
         def __init__(self):
             self._closed = False
             self._reader, self._writer = mp.Pipe(duplex=False)
    +        # Used to ensure pipe is not closed while sending or receiving bytes
    +        self._not_running = threading.Event()
    +        # Initialize event as True
    +        self._not_running.set()
     
         def close(self):
             if not self._closed:
    +            self._not_running.wait()
                 self._closed = True
                 self._writer.close()
                 self._reader.close()
     
         def wakeup(self):
             if not self._closed:
    +            self._not_running.clear()
                 self._writer.send_bytes(b"")
    +            self._not_running.set()
     
         def clear(self):
             if not self._closed:
    +            self._not_running.clear()
                 while self._reader.poll():
                     self._reader.recv_bytes()
    +            self._not_running.set()

    From using Victor's method of replicating the failure with inserting a time.sleep(0.050) in multiprocessing.Connection._send(), it appears to fix the failure in test_killed_child. I believe it would also fix the other failures since they appear to be caused by the same core issue, but I've been unable to replicate them locally so I'm not 100% certain.

    @aeros
    Copy link
    Member

    aeros commented Apr 28, 2020

    After writing the above out and a bit of further consideration, I think it might make more sense to wait for the event after setting self._closed = True so that it prevents future wakeup() and clear() calls from reading/writing to the pipe, while still allowing ones that are currently ongoing to finish.

    Thoughts?

    @tomMoral
    Copy link
    Mannequin

    tomMoral mannequin commented Apr 28, 2020

    Sorry I just saw this. It seems that I introduced this regression.

    One of the goal of having a ThreadWakeup and not a SimpleQueue is to avoid using locks that can hinder the performance of the executor. I don't remember the exact details but I think I did some benchmark and it was giving lower performances for large set of small tasks (not so sure anymore). If a fully synchronous method is chosen, maybe it is safer to rely on a SimpleQueue as it will be lower maintenance. If the solution proposed by aeros is chosen, the event can probably be replaced by a lock no? It would be easier to understand I guess.

    From the failures, it seems to be a race condition between shutdown and terminate_broken. The race condition should not be possible in submit as in principle, the join_executor_internals is only called when no new task can be submitted.
    One solution would be to use the self._shutdown_lock from the executor to protect the call to close in terminate_broken and the call to self._thread_wakeup.wakeup in shutdown. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding lock=True/False to only lock the potentially dangerous calls.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2020

    How about the following (untested):

    diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
    index 8e9b69a8f0..c0c2eb3032 100644
    --- a/Lib/concurrent/futures/process.py
    +++ b/Lib/concurrent/futures/process.py
    @@ -66,23 +66,29 @@ _global_shutdown = False
     
     class _ThreadWakeup:
         def __init__(self):
    -        self._closed = False
             self._reader, self._writer = mp.Pipe(duplex=False)
     
         def close(self):
    -        if not self._closed:
    -            self._closed = True
    -            self._writer.close()
    -            self._reader.close()
    +        r, w = self._reader, self._writer
    +        self._reader = self._writer = None
    +        if r is not None:
    +            r.close()
    +            w.close()
     
         def wakeup(self):
    -        if not self._closed:
    +        try:
                 self._writer.send_bytes(b"")
    +        except AttributeError:
    +            # Closed
    +            pass
     
         def clear(self):
    -        if not self._closed:
    +        try:
                 while self._reader.poll():
                     self._reader.recv_bytes()
    +        except AttributeError:
    +            # Closed
    +            pass
     
     
     def _python_exit():

    @aeros
    Copy link
    Member

    aeros commented Apr 28, 2020

    Oops, it seems that I opened PR-19751 a bit preemptively. When I get the chance, I'll see if Antoine's implementation can address the failures and do some comparisons.

    @aeros
    Copy link
    Member

    aeros commented Apr 28, 2020

    I decided to close PR-19751. Both because it does not correctly address the race condition (due to an oversight on my part) and it would add substantial overhead to _ThreadWakeup. Instead, I agree that we should explore a non-locking solution.

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 28, 2020

    Antoine Pitrou: "How about the following (untested): (...)"

    Using Antoine's patch, test_killed_child() still fails (I used my msg367463 patch to make the failure more likely).

    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2020

    With the same traceback and error message?

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 28, 2020

    With my msg367463 patch (add sleep), test_cancel_futures() fails. Example:

    ======================================================================
    FAIL: test_cancel_futures (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", line 353, in test_cancel_futures
        self.assertTrue(len(cancelled) >= 35, msg=f"{len(cancelled)=}")
    AssertionError: False is not true : len(cancelled)=0

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 28, 2020

    Thomas Moreau: "One solution would be to use the self._shutdown_lock from the executor to protect the call to close in terminate_broken and the call to self._thread_wakeup.wakeup in shutdown. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding lock=True/False to only lock the potentially dangerous calls."

    I wrote a conservative PR 19760 which always lock ProcessPoolExecutor._shutdown_lock while accessing _ThreadWakeup.

    PR 19760 fix test_killed_child(): it doesn't fail anymore, even with my msg367463 patch (add sleep).

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 28, 2020

    With my msg367463 patch (add sleep), test_cancel_futures() fails.

    The test uses sleep() as a synchronization primitive:

    executor.submit(time.sleep, .1).

    That's bad, but it doesn't *have to* be fixed now. My msg367463 patch adds an artifical sleep: the test looks fine in practice. I prefer to wait until it fails on a buildbot worker before spending time to make the test more reliable.

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 29, 2020

    New changeset a4dfe8e by Victor Stinner in branch 'master':
    bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
    a4dfe8e

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 29, 2020

    I'm still getting more and more buildbot emails about test_concurrent_futures, so I merged my PR 19760 to fix buildbots.

    Please revert or modify my PR 19760 if you have a better approach, but please check that test_killed_child() and ProcessPoolForkExecutorDeadlockTest tests don't fail with my msg367463 patch.

    I would still appreciated a post-commit review of my change, since I don't know well concurrent.futures code :

    bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
    a4dfe8e

    @aeros
    Copy link
    Member

    aeros commented Apr 29, 2020

    I am a bit concerned about the performance implication of accessing the thread wakeup through a lock in the call queue, but for now I think it's reasonable to address the buildbot failure with a locking solution while we try to find a better one.

    I'm not certain if we'll be able to find one that correctly addresses the failures with zero additional locking, but I think we may be able to reduce the number of times we use the lock compared to the implementation in #63959. I'll spend some time exploring that as I find the time to, and report back with any significant findings.

    @aeros
    Copy link
    Member

    aeros commented Apr 29, 2020

    Also, thanks to Victor for coming up with an interim solution, and for the feedback from Antoine and Thomas. #63959 is a significant improvement from my initial proposal in #63950.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 29, 2020

    I looked at the change and it seemed ok to me. Perhaps Thomas can give it a look too.

    @pitrou pitrou assigned vstinner and unassigned aeros Apr 29, 2020
    @tomMoral
    Copy link
    Mannequin

    tomMoral mannequin commented Apr 29, 2020

    I think this is a reasonable way to move on.Some of the locks can probably be removed but this needs careful investigation and in the mean time, it hinders everyone. Thanks victor for the fast fix up!

    To me, an interesting observation is that the failure seems to only happen when the executor is in broken state. If we can find a way to adapt the behavior to be more conservative on these states (which are not impacting perf) that would be nice.

    I will try to look at it today and see if I can remove some of the locks while still not failing with Victor's patch. We can probably remove the lock on clear safely. For the others, it might be more complex.

    @tomMoral
    Copy link
    Mannequin

    tomMoral mannequin commented Apr 29, 2020

    I did GH 19788 with a few modifications. There is only one lock that seems to mater for the perf, and I actually added one other (the one in _python_exit, which necessitate another bug fix for fork context).

    I did not benchmark to see if it was worth it in term of perf.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 26, 2021

    "test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor"

    I didn't see this failure recently, I close the issue. Since changes were pushed, I mark the issue as fixed.

    If someone has ideas to enhance the code, I suggest to open a new more specific issue. I consider the initial issue (buildot failure) as fixed.

    @aeros
    Copy link
    Member

    aeros commented May 29, 2021

    Thanks for closing up the issue, Victor :)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants