
Segmentation fault in DataLoader worker in PyTorch 1.8.0 if set_num_threads is called beforehand #54752

Closed
zhenlin opened this issue Mar 26, 2021 · 15 comments
Labels: high priority, module: dataloader, module: multiprocessing, module: multithreading, triaged

Comments

zhenlin commented Mar 26, 2021

🐛 Bug

A segmentation fault occurs if one uses DataLoader with num_workers > 0 after calling set_num_threads with a sufficiently high value.
I observed this behaviour in PyTorch 1.8.0 and 1.8.1, but I am unable to reproduce it with PyTorch 1.7.1.

To Reproduce

import torch

def main():
    torch.set_num_threads(4)

    dataloader = torch.utils.data.DataLoader([1, 2, 3], num_workers=1)
    next(iter(dataloader))

    return

if __name__ == '__main__':
    main()

The above code crashes when set_num_threads is called with 4 or more as its argument.
Incidentally (or perhaps not coincidentally), 4 is also the number of vCPUs on the AWS EC2 instance I am using.
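
A small probe like the following can check how the crash threshold relates to the machine's vCPU count (an illustrative sketch, not part of the original report; it assumes a Linux host where the worker segfault makes the child run exit non-zero):

import os
import subprocess
import sys
import textwrap

# Run the repro in a fresh subprocess per thread count, so a crash
# in one run does not take down the probe itself.
REPRO = textwrap.dedent("""
    import sys, torch
    torch.set_num_threads(int(sys.argv[1]))
    loader = torch.utils.data.DataLoader([1, 2, 3], num_workers=1)
    next(iter(loader))
""")

print(f"os.cpu_count() = {os.cpu_count()}")
for n in range(1, (os.cpu_count() or 4) + 2):
    result = subprocess.run([sys.executable, "-c", REPRO, str(n)])
    print(f"set_num_threads({n}) -> exit code {result.returncode}")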

Expected behavior

No crash.

Environment

PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[conda] Could not collect

Additional context

Perhaps this issue is related to #53894 and #54716.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ssnl @VitalyFedyunin @ejguan

VitalyFedyunin added the high priority and module: dataloader labels Mar 26, 2021
VitalyFedyunin self-assigned this Mar 26, 2021
ezyang added the triaged label Mar 26, 2021
ezyang (Contributor) commented Mar 26, 2021

Isn't this just the thing where you can't fork after spawning threads? cc @malfet

malfet (Contributor) commented Mar 26, 2021

DataLoader calls torch.set_num_threads(1) right after the fork, so that should not be the issue.
@zhenlin, can you please share a backtrace from 1.8.1?
Update: the above example does not crash with the latest nightly on macOS.
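
For context, the worker-side sequence looks roughly like the sketch below (illustrative only, not the actual torch/utils/data/_utils/worker.py code); the torch.set_num_threads(1) call mentioned above is the first thing the freshly forked child does before serving fetch requests:

import torch

def worker_loop(dataset, index_queue, data_queue):
    # Runs in the forked worker process; in this issue, this very call is
    # where the segfault happens.
    torch.set_num_threads(1)
    while True:
        indices = index_queue.get()
        if indices is None:  # sentinel from the parent: shut down
            break
        data_queue.put((indices, [dataset[i] for i in indices]))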

maxidl commented Mar 26, 2021

Thanks for posting this! I was stuck the whole day on it. After removing torch.set_num_threads(32), the segfaults are gone.
My env:

conda create -n wav2vec python=3.8
conda install -c pytorch -c nvidia pytorch torchaudio cudatoolkit=11.1

The above example also crashes for me with: ERROR: Unexpected segmentation fault encountered in worker.
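
In other words, the workaround amounts to not issuing a large set_num_threads call before a multi-worker DataLoader forks. A rough sketch (USE_DATALOADER_WORKERS and REQUESTED_THREADS are made-up names for this example):

import torch

USE_DATALOADER_WORKERS = True
REQUESTED_THREADS = 32

if not USE_DATALOADER_WORKERS:
    # Safe: no forked workers will inherit a grown intra-op thread pool.
    torch.set_num_threads(REQUESTED_THREADS)

loader = torch.utils.data.DataLoader(range(8), num_workers=2 if USE_DATALOADER_WORKERS else 0)
for batch in loader:
    pass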

zhenlin (Author) commented Mar 26, 2021

@malfet Sorry, I don't have a (useful) backtrace. The Python backtrace starts in _try_get_data on the receiving side. I tried to get a gdb backtrace from the worker process core dump but it seems the stack is corrupted – the displayed caller is 0x00...00.

@ezyang I thought so too, but torch.set_num_threads(3) doesn't segfault for me.

zhenlin (Author) commented Mar 26, 2021

I tried to reproduce the issue on macOS. No segfault, but instead this warning appears:

[W ParallelNative.cpp:206] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

The message is repeated num_workers times. I guess this is related to what @malfet said about the worker process calling set_num_threads(1) after forking. This message does not appear on the Linux machine I was using.

zhenlin (Author) commented Mar 29, 2021

I managed to obtain a backtrace of the segfault by following the instructions in #53894. The crash happens in set_num_threads. I'm guessing that pthreadpool_destroy is being called on a thread pool that no longer exists after forking... but if that's the case, why does it not crash when the initial set_num_threads is called with a lower number? Anyway, I hope this helps narrow down the problem.
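
One lightweight way to at least see which Python frame the worker dies in (complementary to the native gdb backtrace below) is to enable faulthandler before the workers fork; the forked child inherits the SIGSEGV handler and dumps its traceback on crash. Sketch:

import faulthandler
import torch

faulthandler.enable()  # must run before the DataLoader forks its workers

torch.set_num_threads(4)
loader = torch.utils.data.DataLoader([1, 2, 3], num_workers=1)
next(iter(loader))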

#0  0x00007ffff7fa1aab in __pthread_clockjoin_ex (threadid=140735666181888, thread_return=0x0, clockid=0, abstime=0x0, block=true) at pthread_join_common.c:89
        pd = 0x7fff9363d700
        self = <optimized out>
        result = <optimized out>
        pd_result = <optimized out>
#1  0x00007fffe70b88db in pthreadpool_destroy () from /home/ubuntu/.local/share/virtualenvs/test-2nMYMFG7/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
No symbol table info available.
#2  0x00007fffe4cc6a07 in caffe2::PThreadPool::set_thread_count(unsigned long) () from /home/ubuntu/.local/share/virtualenvs/test-2nMYMFG7/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
No symbol table info available.
#3  0x00007fffe38dacbf in at::set_num_threads(int) () from /home/ubuntu/.local/share/virtualenvs/test-2nMYMFG7/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
No symbol table info available.
#4  0x00007ffff571a9d6 in THPModule_setNumThreads(_object*, _object*) () from /home/ubuntu/.local/share/virtualenvs/test-2nMYMFG7/lib/python3.7/site-packages/torch/lib/libtorch_python.so
No symbol table info available.
#5  0x000055555566e803 in _PyCFunction_FastCallKeywords ()
No symbol table info available.
#6  0x0000555555703ed4 in ?? ()
No symbol table info available.
#7  0x0000555555700fe2 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#8  0x00005555556716ba in _PyFunction_FastCallDict ()
No symbol table info available.
#9  0x00005555556fe0fd in _PyEval_EvalFrameDefault ()
No symbol table info available.
#10 0x0000555555670a0a in _PyFunction_FastCallKeywords ()
No symbol table info available.
#11 0x0000555555703dcb in ?? ()
No symbol table info available.
#12 0x00005555556fcc78 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#13 0x0000555555670a0a in _PyFunction_FastCallKeywords ()
No symbol table info available.
#14 0x0000555555703dcb in ?? ()
No symbol table info available.
#15 0x00005555556fcc78 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#16 0x0000555555670a0a in _PyFunction_FastCallKeywords ()
No symbol table info available.
#17 0x0000555555703dcb in ?? ()
No symbol table info available.
#18 0x00005555556fcc78 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#19 0x0000555555671cfa in _PyObject_Call_Prepend ()
No symbol table info available.
#20 0x00005555556c6d1c in ?? ()
No symbol table info available.
#21 0x00005555556c2f19 in ?? ()
No symbol table info available.
#22 0x000055555566fd65 in _PyObject_FastCallKeywords ()
No symbol table info available.
#23 0x0000555555703f51 in ?? ()
No symbol table info available.
#24 0x00005555556fcbe8 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#25 0x0000555555670a0a in _PyFunction_FastCallKeywords ()
No symbol table info available.
#26 0x0000555555703dcb in ?? ()
No symbol table info available.
#27 0x0000555555700fe2 in _PyEval_EvalFrameDefault ()
No symbol table info available.
#28 0x0000555555670a0a in _PyFunction_FastCallKeywords ()
No symbol table info available.
#29 0x0000555555703dcb in ?? ()
No symbol table info available.
#30 0x0000555555700fe2 in _PyEval_EvalFrameDefault ()
No symbol table info available.

imaginary-person (Contributor) commented Mar 29, 2021

why does it not crash when the initial set_num_threads is called with a low number?

I'm fairly sure the thread data structures responsible for this segfault have their first three entries in a virtual memory page that is still mapped to the same physical page in both the parent and the child, due to the copy-on-write technique used by fork. The entries for the parent's remaining threads (the fourth and beyond) appear to live in another virtual page that is not shared between the two processes, since not all pages are shared in the current fork implementation. So when the child reads the entry for the parent's 4th thread, it sees gibberish: its virtual page is backed by a different physical page than the parent's.

I haven't verified this hypothesis, but since the segfault occurs in nptl, a workaround would have to be used to resolve this issue anyway. Perhaps a new variant of set_num_threads in ParallelOpenMP.cpp for a newly forked process, one that only handles its initial set_num_threads(1)?

ezyang (Contributor) commented Mar 29, 2021

@malfet says that when we do the fork, we should destroy all thread pools. (In this case, the caffe2 threadpool is the thing causing the problem.)
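
The underlying constraint, shown with a minimal sketch (Linux/macOS): after fork, only the calling thread exists in the child, so thread handles inherited from the parent refer to threads that no longer exist, and trying to join or destroy them (as pthreadpool_destroy does) goes wrong.

import os
import threading
import time

def idle():
    time.sleep(60)

workers = [threading.Thread(target=idle, daemon=True) for _ in range(3)]
for t in workers:
    t.start()

print("parent threads:", threading.active_count())  # 4: main thread + 3 workers

pid = os.fork()
if pid == 0:
    # In the child, only the forking thread is running; the parent's worker
    # threads do not exist here, even though their handles were inherited.
    print("child threads:", threading.active_count())  # 1
    os._exit(0)
os.waitpid(pid, 0)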

ezyang assigned malfet and unassigned VitalyFedyunin Mar 29, 2021
ezyang added the module: multiprocessing and module: multithreading labels and removed the triage review label Mar 29, 2021
imaginary-person (Contributor) commented Mar 29, 2021

when we do the fork, we should destroy all thread pools. (In this case, the caffe2 threadpool is the thing causing the problem.)

@ezyang, please clarify whether you mean that a fix would entail somehow resetting Caffe2's thread pool in the child process before calling set_num_threads(1) there, or that the parent process would first call set_num_threads(1), then fork, and afterwards restore set_num_threads(old_value) in the parent. Thank you!

EDIT: Removed info duplicated from my previous comment.

hiyyg commented Mar 29, 2021

I hit the same error with 1.8.0/1.8.1; for now I have to stick with 1.7.1.

ezyang (Contributor) commented Mar 30, 2021

@malfet should clarify, but I think the idea is to get rid of the threads before forking.

malfet (Contributor) commented Mar 30, 2021

Yes, from the crash it looks like we are trying to call pthread_join from the child process, where those threads do not exist.
The idea for a fix is to register a pthread_atfork handler that leaks the pthread pool in https://github.com/pytorch/pytorch/blob/master/caffe2/utils/threadpool/pthreadpool-cpp.cc, or to propose a similar fix directly against https://github.com/Maratyszcza/pthreadpool.git
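
To illustrate the shape of that idea (a conceptual sketch only, using Python's os.register_at_fork as a stand-in for pthread_atfork; the actual fix would be C++ in pthreadpool-cpp.cc): the child-side hook abandons the inherited pool instead of trying to destroy threads that no longer exist, and a fresh pool is created lazily on first use.

import os
from concurrent.futures import ThreadPoolExecutor

_pool = None

def get_pool():
    # Return the process-wide pool, creating it lazily on first use.
    global _pool
    if _pool is None:
        _pool = ThreadPoolExecutor(max_workers=4)
    return _pool

def _leak_pool_in_child():
    # After fork, the pool's threads do not exist in the child. Do not try to
    # join or shut them down; just drop the reference ("leak" it) so the child
    # builds a fresh pool the next time get_pool() is called.
    global _pool
    _pool = None

os.register_at_fork(after_in_child=_leak_pool_in_child)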
