SoX effect "rate" crashing or hanging in multiprocessing #1021
Comments
Hi @pzelasko, I confirm I could reproduce the issue. I will take a look into it.
Could you check which initialization method is used to launch the subprocesses? (In my env, …)
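A minimal sketch of how to check and switch the multiprocessing start method, assuming a ProcessPoolExecutor like the one in the bug report; the worker count and the placeholder job are illustrative only:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Report the current start method ("fork" is the default on Linux).
print(multiprocessing.get_start_method())

# Workaround discussed in this thread: use "spawn" so child processes
# do not inherit libsox/OpenMP state from the parent process.
ctx = multiprocessing.get_context("spawn")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
    ...  # submit the torchaudio jobs here
```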
Thanks @mthrok, that is a valid work-around that un-blocks Lhotse :) FYI, you mentioned in that other thread that libsox was not built with OpenMP, but the top three frames of the call stack I reported suggest otherwise. Note that it broke at line 190 in update_fft_cache, which calls a macro that only does anything when …
Glad it helped.
In pytorch/pytorch#46409, the issue focused on the macOS environment, and the binary distribution of torchaudio for macOS does not include OpenMP. The binary distributions for Linux do. I was not sure why there was MKL in the stack trace you shared, but that makes sense now. Thanks for letting me know.
I will add the workaround to the documentation.
On Ubuntu, disabling OpenMP support for libsox seems to resolve the issue.
As a first step, you could try OMP_NUM_THREADS=1 and see if it still causes the segfault. This disables any OpenMP parallelization, which is especially important in a multiprocessing environment. If the segfault is caused by sox itself, it should also be reproducible outside of a multiprocessing environment.
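A small sketch of that suggestion in Python, assuming the variable is set before torch/torchaudio initialize their OpenMP runtime (setting it later may have no effect); it can equally be exported in the shell that launches the test:

```python
import os

# Pin OpenMP to a single thread; must happen before torch/torchaudio import.
os.environ["OMP_NUM_THREADS"] = "1"

import torch        # noqa: E402
import torchaudio   # noqa: E402
```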
Another thing we should make sure of is that torchaudio's sox is using PyTorch's OpenMP (PyTorch statically links OpenMP and ships with it). Or maybe we decide to disable OpenMP for sox entirely, but let's do some perf investigation before we do that.
@cpuhrsch I tested with …
I talked with @malfet, and it is most likely that Intel's OpenMP and GNU OpenMP are conflicting.
@pzelasko I have disabled the OpenMP support for libsox.
@pzelasko Can we close the issue? I believe it now works fine with the "fork" method too.
🐛 Bug
This time I'm pretty sure it's a bug :P
When running a torchaudio speed + rate SoX effect chain inside of a ProcessPoolExecutor on the CLSP grid, the subprocess experiences a segmentation fault inside the apply_effects_tensor function. I managed to make the subprocess wait and attached gdb to it to capture the native stack trace. The same test seems to be hanging in Lhotse's GitHub Actions CI: https://github.com/lhotse-speech/lhotse/pull/124/checks?check_run_id=1391378614
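For reference, a minimal sketch of the failing setup (not the actual Lhotse test); the fake waveform, effect values, and worker count are assumptions made for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

import torch
import torchaudio


def speed_and_rate(_):
    # 1 second of fake audio at 16 kHz standing in for real recordings.
    waveform = torch.randn(1, 16000)
    effects = [["speed", "1.1"], ["rate", "16000"]]
    augmented, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
        waveform, 16000, effects
    )
    return augmented.shape, sample_rate


if __name__ == "__main__":
    # With the default "fork" start method on Linux this is where the
    # crash/hang was observed; "spawn" avoids it (see the comments above).
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(speed_and_rate, range(4))))
```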
To Reproduce
Steps to reproduce the behavior:
git clone https://github.com/lhotse-speech/lhotse && cd lhotse && git checkout feature/augmentation-refactoring && pip install -e '.[dev]'
pytest test/known_issues/test_augment_with_executor.py
Expected behavior
No crash
Environment
What does torchaudio.__version__ print? (If applicable)
Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 9.13 (stretch) (x86_64)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Clang version: 3.8.1-24 (tags/RELEASE_381/final)
CMake version: version 3.7.2
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.7.0
[pip3] torchaudio==0.7.0a0+ac17b64
[pip3] torchvision==0.8.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.18.5 py37ha1c710e_0
[conda] numpy-base 1.18.5 py37hde5b4d6_0
[conda] pytorch 1.7.0 py3.7_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.5.1 pypi_0 pypi
[conda] torchvision 0.8.1 py37_cu102 pytorch
Additional context