Temporarily Disable OpenMP support for libsox #1026

Merged: 7 commits into pytorch:master from mthrok:disable-openmp-sox on Nov 20, 2020

Conversation

@mthrok (Collaborator) commented Nov 13, 2020

Sox effect functions, when run in subprocesses, cause issues like #1021.
In this PR, I added tests that mimic this issue. The tests fail properly in my development environment (see the log below), but somehow do not fail in our CI. Furthermore, all the tests in test/torchaudio_unittest/sox_effect/dataset_test.py can fail with a segmentation fault in my local environment, yet this does not happen in our CI. My environment is very similar to the CI setup (Docker/Ubuntu + Anaconda), and I can reproduce the issue in my working environment, but I have not figured out the key difference between my local environment and CI.

What is common to these failure cases:

  • sox_effects is run in a subprocess.
  • The subprocess is launched with the `fork` start method.

The above suggests that there is some sort of interference between the OpenMP runtime used by PyTorch (or MKL) and the one used by libsox.
This PR disables OpenMP support in libsox so that, at the very least, users won't run into this issue when using these functionalities while we investigate further.
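
For reference, a minimal sketch of the failing pattern (this mirrors the `test_executor` case in the log below; the file path and effect parameters are illustrative, not the actual test code):

```python
import torch  # ensure PyTorch's OpenMP runtime is loaded before forking
import torchaudio
from concurrent.futures import ProcessPoolExecutor


def speed(path):
    # "speed" + "rate" exercise libsox's OpenMP-enabled resampler,
    # which is where the crash was observed (see #1021).
    waveform, sample_rate = torchaudio.load(path)
    effects = [["speed", "1.1"], ["rate", str(sample_rate)]]
    return torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)


if __name__ == "__main__":
    # On Linux, worker processes are started with the "fork" method by
    # default, which is the condition under which the segfault appears.
    with ProcessPoolExecutor(1) as executor:
        future = executor.submit(speed, "test.wav")
        future.result()  # crashes with SIGSEGV in the affected environments
```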

Error report
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.8.3, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /scratch/moto/torchaudio
plugins: hypothesis-5.18.0
collected 3 items

test/torchaudio_unittest/sox_effect/dataset_test.py FFFatal Python error: Segmentation fault

Current thread 0x00007f55ba56a740 (most recent call first):
  File "/scratch/moto/torchaudio/torchaudio/sox_effects/sox_effects.py", line 150 in apply_effects_tensor
  File "/scratch/moto/torchaudio/test/torchaudio_unittest/sox_effect/dataset_test.py", line 134 in speed
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/process.py", line 239 in _process_worker
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/process.py", line 108 in run
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/process.py", line 315 in _bootstrap
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 75 in _launch
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 19 in __init__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/context.py", line 276 in _Popen
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/process.py", line 121 in start
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/process.py", line 608 in _adjust_process_count
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/process.py", line 584 in _start_queue_management_thread
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/process.py", line 645 in submit
  File "/scratch/moto/torchaudio/test/torchaudio_unittest/sox_effect/dataset_test.py", line 156 in <listcomp>
  File "/scratch/moto/torchaudio/test/torchaudio_unittest/sox_effect/dataset_test.py", line 156 in test_executor
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/unittest/case.py", line 676 in run
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/unittest.py", line 231 in runtest
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 244 in from_call
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_runtest_hook
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/main.py", line 247 in _main
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/_pytest/config/__init__.py", line 124 in main
  File "/home/moto/conda/envs/PY3.8-cuda101/bin/pytest", line 11 in <module>
F                                                                                                                                                                                [100%]

================================================================================================================== FAILURES ==================================================================================================================
_______________________________________________________________________________________________ TestSoxEffectsDataset.test_apply_effects_file ________________________________________________________________________________________________

self = <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f559e9a4430>, timeout = 5.0

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
>           data = self._data_queue.get(timeout=timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:956:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.queues.Queue object at 0x7f559e9a4550>, block = True, timeout = 4.999965248629451

    def get(self, block=True, timeout=None):
        if self._closed:
            raise ValueError(f"Queue {self!r} is closed")
        if block and timeout is None:
            with self._rlock:
                res = self._recv_bytes()
            self._sem.release()
        else:
            if block:
                deadline = time.monotonic() + timeout
            if not self._rlock.acquire(block, timeout):
                raise Empty
            try:
                if block:
                    timeout = deadline - time.monotonic()
>                   if not self._poll(timeout):

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/queues.py:107:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x7f559e9a45e0>, timeout = 4.999965248629451

    def poll(self, timeout=0.0):
        """Whether there is any input available to be read"""
        self._check_closed()
        self._check_readable()
>       return self._poll(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x7f559e9a45e0>, timeout = 4.999965248629451

    def _poll(self, timeout):
>       r = wait([self], timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:424:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

object_list = [<multiprocessing.connection.Connection object at 0x7f559e9a45e0>], timeout = 4.999965248629451

    def wait(object_list, timeout=None):
        '''
        Wait till an object in object_list is ready/readable.

        Returns list of those objects in object_list which are ready/readable.
        '''
        with _WaitSelector() as selector:
            for obj in object_list:
                selector.register(obj, selectors.EVENT_READ)

            if timeout is not None:
                deadline = time.monotonic() + timeout

            while True:
>               ready = selector.select(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:931:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <selectors.PollSelector object at 0x7f559e9617c0>, timeout = 5000

    def select(self, timeout=None):
        # This is shared between poll() and epoll().
        # epoll() has a different signature and handling of timeout parameter.
        if timeout is None:
            timeout = None
        elif timeout <= 0:
            timeout = 0
        else:
            # poll() has a resolution of 1 millisecond, round away from
            # zero to wait *at least* timeout seconds.
            timeout = math.ceil(timeout * 1e3)
        ready = []
        try:
>           fd_event_list = self._selector.poll(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/selectors.py:415:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

signum = 17, frame = <frame at 0x7f559e96a640, file '/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/selectors.py', line 417, code select>

    def handler(signum, frame):
        # This following call uses `waitid` with WNOHANG from C side. Therefore,
        # Python can still get and update the process status successfully.
>       _error_if_any_worker_fails()
E       RuntimeError: DataLoader worker (pid 17123) is killed by signal: Segmentation fault.

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py:66: RuntimeError

The above exception was the direct cause of the following exception:

self = <torchaudio_unittest.sox_effect.dataset_test.TestSoxEffectsDataset testMethod=test_apply_effects_file>

    def test_apply_effects_file(self):
        sample_rate = 12000
        flist = self._generate_dataset()
        dataset = RandomPerturbationFile(flist, sample_rate)
        loader = torch.utils.data.DataLoader(
            dataset, batch_size=32, num_workers=16,
            worker_init_fn=init_random_seed,
        )
>       for batch in loader:

test/torchaudio_unittest/sox_effect/dataset_test.py:104:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:519: in __next__
    data = self._next_data()
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1152: in _next_data
    idx, data = self._get_data()
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1118: in _get_data
    success, data = self._try_get_data()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f559e9a4430>, timeout = 5.0

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)
            return (True, data)
        except Exception as e:
            # At timeout and error, we manually check whether any worker has
            # failed. Note that this is the only mechanism for Windows to detect
            # worker failures.
            failed_workers = []
            for worker_id, w in enumerate(self._workers):
                if self._workers_status[worker_id] and not w.is_alive():
                    failed_workers.append(w)
                    self._mark_worker_as_unavailable(worker_id)
            if len(failed_workers) > 0:
                pids_str = ', '.join(str(w.pid) for w in failed_workers)
>               raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
E               RuntimeError: DataLoader worker (pid(s) 17123, 17124) exited unexpectedly

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:969: RuntimeError
------------------------------------------------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------------------------------------------------
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.

______________________________________________________________________________________________ TestSoxEffectsDataset.test_apply_effects_tensor _______________________________________________________________________________________________

self = <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f559c8abb80>, timeout = 5.0

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
>           data = self._data_queue.get(timeout=timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:956:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.queues.Queue object at 0x7f559c8abd00>, block = True, timeout = 4.999982295557857

    def get(self, block=True, timeout=None):
        if self._closed:
            raise ValueError(f"Queue {self!r} is closed")
        if block and timeout is None:
            with self._rlock:
                res = self._recv_bytes()
            self._sem.release()
        else:
            if block:
                deadline = time.monotonic() + timeout
            if not self._rlock.acquire(block, timeout):
                raise Empty
            try:
                if block:
                    timeout = deadline - time.monotonic()
>                   if not self._poll(timeout):

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/queues.py:107:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x7f559c8abd60>, timeout = 4.999982295557857

    def poll(self, timeout=0.0):
        """Whether there is any input available to be read"""
        self._check_closed()
        self._check_readable()
>       return self._poll(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x7f559c8abd60>, timeout = 4.999982295557857

    def _poll(self, timeout):
>       r = wait([self], timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:424:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

object_list = [<multiprocessing.connection.Connection object at 0x7f559c8abd60>], timeout = 4.999982295557857

    def wait(object_list, timeout=None):
        '''
        Wait till an object in object_list is ready/readable.

        Returns list of those objects in object_list which are ready/readable.
        '''
        with _WaitSelector() as selector:
            for obj in object_list:
                selector.register(obj, selectors.EVENT_READ)

            if timeout is not None:
                deadline = time.monotonic() + timeout

            while True:
>               ready = selector.select(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/multiprocessing/connection.py:931:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <selectors.PollSelector object at 0x7f559c624ee0>, timeout = 5000

    def select(self, timeout=None):
        # This is shared between poll() and epoll().
        # epoll() has a different signature and handling of timeout parameter.
        if timeout is None:
            timeout = None
        elif timeout <= 0:
            timeout = 0
        else:
            # poll() has a resolution of 1 millisecond, round away from
            # zero to wait *at least* timeout seconds.
            timeout = math.ceil(timeout * 1e3)
        ready = []
        try:
>           fd_event_list = self._selector.poll(timeout)

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/selectors.py:415:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

signum = 17, frame = <frame at 0x7f559c82e440, file '/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/selectors.py', line 417, code select>

    def handler(signum, frame):
        # This following call uses `waitid` with WNOHANG from C side. Therefore,
        # Python can still get and update the process status successfully.
>       _error_if_any_worker_fails()
E       RuntimeError: DataLoader worker (pid 17143) is killed by signal: Segmentation fault.

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py:66: RuntimeError

The above exception was the direct cause of the following exception:

self = <torchaudio_unittest.sox_effect.dataset_test.TestSoxEffectsDataset testMethod=test_apply_effects_tensor>

    def test_apply_effects_tensor(self):
        sample_rate = 12000
        signals = self._generate_signals()
        dataset = RandomPerturbationTensor(signals, sample_rate)
        loader = torch.utils.data.DataLoader(
            dataset, batch_size=32, num_workers=16,
            worker_init_fn=init_random_seed,
        )
>       for batch in loader:

test/torchaudio_unittest/sox_effect/dataset_test.py:124:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:519: in __next__
    data = self._next_data()
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1152: in _next_data
    idx, data = self._get_data()
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1118: in _get_data
    success, data = self._try_get_data()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f559c8abb80>, timeout = 5.0

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)
            return (True, data)
        except Exception as e:
            # At timeout and error, we manually check whether any worker has
            # failed. Note that this is the only mechanism for Windows to detect
            # worker failures.
            failed_workers = []
            for worker_id, w in enumerate(self._workers):
                if self._workers_status[worker_id] and not w.is_alive():
                    failed_workers.append(w)
                    self._mark_worker_as_unavailable(worker_id)
            if len(failed_workers) > 0:
                pids_str = ', '.join(str(w.pid) for w in failed_workers)
>               raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
E               RuntimeError: DataLoader worker (pid(s) 17143, 17144, 17145, 17146) exited unexpectedly

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/site-packages/torch/utils/data/dataloader.py:969: RuntimeError
------------------------------------------------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------------------------------------------------
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.

___________________________________________________________________________________________________ TestProcessPoolExecutor.test_executor ____________________________________________________________________________________________________

self = <torchaudio_unittest.sox_effect.dataset_test.TestProcessPoolExecutor testMethod=test_executor>

    def test_executor(self):
        """Test that apply_effects_tensor with speed + rate does not crush

        https://github.com/pytorch/audio/issues/1021
        """
        executor = ProcessPoolExecutor(1)
        futures = [executor.submit(speed, path) for path in self.flist]
        for future in futures:
>           future.result()

test/torchaudio_unittest/sox_effect/dataset_test.py:158:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/_base.py:439: in result
    return self.__get_result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Future at 0x7f559c6ef6d0 state=finished raised BrokenProcessPool>

    def __get_result(self):
        if self._exception:
>           raise self._exception
E           concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

/home/moto/conda/envs/PY3.8-cuda101/lib/python3.8/concurrent/futures/_base.py:388: BrokenProcessPool
========================================================================================================== short test summary info ===========================================================================================================
FAILED test/torchaudio_unittest/sox_effect/dataset_test.py::TestSoxEffectsDataset::test_apply_effects_file - RuntimeError: DataLoader worker (pid(s) 17123, 17124) exited unexpectedly
FAILED test/torchaudio_unittest/sox_effect/dataset_test.py::TestSoxEffectsDataset::test_apply_effects_tensor - RuntimeError: DataLoader worker (pid(s) 17143, 17144, 17145, 17146) exited unexpectedly
FAILED test/torchaudio_unittest/sox_effect/dataset_test.py::TestProcessPoolExecutor::test_executor - concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pe...
============================================================================================================= 3 failed in 1.42s =============================================================================================================

This should go into the 0.7.1 release.

@mthrok mthrok marked this pull request as ready for review November 16, 2020 14:28
@mthrok mthrok marked this pull request as draft November 16, 2020 14:28
@mthrok mthrok changed the title from "Disable OpenMP support for libsox" to "Temporarily Disable OpenMP support for libsox" Nov 17, 2020
@mthrok mthrok marked this pull request as ready for review November 17, 2020 00:22
@vincentqb (Contributor) left a comment


Thanks for adding a test to catch this locally. As we discussed, disabling OpenMP in sox for now seems like the right temporary workaround.

```diff
@@ -76,5 +76,5 @@ ExternalProject_Add(libsox
   DOWNLOAD_DIR ${ARCHIVE_DIR}
   URL https://downloads.sourceforge.net/project/sox/sox/14.4.2/sox-14.4.2.tar.bz2
   URL_HASH SHA256=81a6956d4330e75b5827316e44ae381e6f1e8928003c6aa45896da9041ea149c
-  CONFIGURE_COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/build_codec_helper.sh ${CMAKE_CURRENT_SOURCE_DIR}/src/libsox/configure ${COMMON_ARGS} --with-lame --with-flac --with-mad --with-oggvorbis --without-alsa --without-coreaudio --without-png --without-oss --without-sndfile --with-opus
+  CONFIGURE_COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/build_codec_helper.sh ${CMAKE_CURRENT_SOURCE_DIR}/src/libsox/configure ${COMMON_ARGS} --with-lame --with-flac --with-mad --with-oggvorbis --without-alsa --without-coreaudio --without-png --without-oss --without-sndfile --with-opus --disable-openmp
```
Contributor

Can you add a comment linking to this issue here?

Collaborator Author (@mthrok)

Good point. Added a comment linking to this issue.
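
As an aside, `--disable-openmp` is a flag of sox's own configure script, not something specific to our CMake wrapper; a standalone build without OpenMP would look roughly like this (commands are illustrative):

```sh
# Build a standalone sox 14.4.2 without OpenMP, mirroring the flag
# added in the diff above (other configure options omitted).
curl -LO https://downloads.sourceforge.net/project/sox/sox/14.4.2/sox-14.4.2.tar.bz2
tar xjf sox-14.4.2.tar.bz2
cd sox-14.4.2
./configure --disable-openmp
make -j"$(nproc)"
```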

@cpuhrsch (Contributor)

When we bring this back, we need to figure out how to link against the version of OpenMP that PyTorch links against.


mthrok commented Nov 20, 2020

When we bring this back, we need to figure out how to link against the version of OpenMP that PyTorch links against.

I am not fully aware of how to do this, but there is a similar conversation going on in vision; this might be the way:
pytorch/vision#2783 (comment)
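
For illustration, one possible shape of such a fix would be to locate the OpenMP runtime shipped with the installed PyTorch package and link libsox against it instead of the toolchain default. Everything below (library names, the `sox_effects` target) is a hypothetical sketch, not what vision actually landed:

```cmake
# Hypothetical sketch: prefer the OpenMP runtime that ships with the
# installed PyTorch package (libiomp5 for the binary distribution) over
# the toolchain default (libgomp), so that libsox and PyTorch load the
# same runtime in one process.
find_package(Torch REQUIRED)  # sets TORCH_INSTALL_PREFIX
find_library(TORCH_OPENMP_LIB
  NAMES iomp5 omp gomp
  HINTS "${TORCH_INSTALL_PREFIX}/lib")
if(TORCH_OPENMP_LIB)
  # "sox_effects" is a placeholder for torchaudio's actual extension target.
  target_link_libraries(sox_effects PRIVATE "${TORCH_OPENMP_LIB}")
endif()
```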

@mthrok mthrok merged commit 7580485 into pytorch:master Nov 20, 2020
@mthrok mthrok deleted the disable-openmp-sox branch November 20, 2020 16:52
mthrok added a commit to mthrok/audio that referenced this pull request Nov 20, 2020
Currently `libsox` on Linux is compiled with GNU OpenMP, which interferes with the OpenMP runtime PyTorch uses (Intel's, in the case of the binary distribution). This PR disables OpenMP support for `libsox` while we investigate how to use the same OpenMP runtime as PyTorch.
mthrok added a commit that referenced this pull request Dec 3, 2020
Currently `libsox` on Linux is compiled with GNU OpenMP, which interferes with the OpenMP runtime PyTorch uses (Intel's, in the case of the binary distribution). This PR disables OpenMP support for `libsox` while we investigate how to use the same OpenMP runtime as PyTorch.