Random hangs and failures when sending tensors that are split using torch.split in a JoinableQueue #95606
Comments
Might be related: #95278. The outputs of torch.split are views onto the original, larger storage, so it is worth checking whether any copying of the full storage is happening. If that is the case, it might be worth opening a feature request for something like split_copy / chunk_copy, or for a copy=True argument, so that individual chunks can be copied without holding onto the original storage.
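A quick illustration of that suggestion (my own sketch, not an existing split_copy API): cloning each chunk already gives it its own small storage, which is roughly what a copy=True variant would provide built in.

```python
import torch

x = torch.zeros(64, dtype=torch.float64)
chunks = torch.split(x, 16)

# torch.split returns views: a write through the original tensor is
# visible through the first chunk, because they alias the same storage.
x[0] = 1.0
assert chunks[0][0].item() == 1.0

# Cloning a chunk gives it its own small storage, so it no longer keeps
# the original storage alive (or shares it when sent to another process).
independent = [c.clone() for c in chunks]
assert independent[0].data_ptr() != x.data_ptr()
```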
Maybe I don't fully understand, but why would it hang even if it were copying the entire tensor? The tensor is tiny (64 doubles).
Inserting
Is it possible for PyTorch to make
This issue was caused by PR #85389. That change improved performance in some cases, but this issue shows it has also caused crashes in other code that worked in PyTorch 1.12 and earlier. (Mentioning @ezhang887 and @albanD who apparently discussed that PR.) Reverting the change (as below) makes the code in this issue work again (tested on current git HEAD).
Thanks for the detailed repro. I can reproduce it locally, but I'm not sure what the root cause is, to be honest. I re-checked, and storage_copy is safe to do without the GIL.
How are you checking that storage_copy is thread-safe? (Coincidentally, our team hit the same issue recently too, lol)
It is a pure C++ implementation that doesn't rely on any Python object: pytorch/aten/src/ATen/StorageUtils.cpp, lines 23 to 33 (at 1e69615).
I mean that even in pure C++, perhaps set_() is not a thread-safe operation? The GIL was implicitly locking access to this C++ code, so perhaps this code was never thread-safe and we're only now seeing it because of the GIL release. It's not obvious to me why it would be unsafe, though.
The issue is that the storage is shared (and modified in place) while it can still be used elsewhere. In the original code in this issue, the outputs of torch.split are views of a single storage, so the sharing code ends up being called more than once for the same storage. Another way to trigger the issue is to send the same tensor more than once. However, the views and the share code being called twice are probably a distraction: any attempt to use the storage while it is being shared can race. A small sketch of the aliasing is shown below.
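For concreteness, here is my own illustration (not from the thread) of that aliasing: every output of torch.split reports the same underlying storage, so queueing two chunks, or the same tensor twice, runs the sharing code more than once for a single StorageImpl.

```python
import torch

t = torch.rand(64, dtype=torch.float64)
a, b = torch.split(t, 32)

# Both halves are views of t's one storage (untyped_storage() is the
# PyTorch 2.x accessor), so sending either of them through a
# torch.multiprocessing queue shares the whole storage, and sending
# both shares the *same* storage twice.
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()
```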
I suggest we yank the optimization PR for now. |
Note that the root problem here is that sharing a storage mutates it in place, so any concurrent use of that storage, including sharing it a second time, can race with the move. While it is not fixing the entire problem, I propose #96664 to keep the benefits of parallel per-storage reduction without the issue raised at the top of this issue.
To achieve this, I have a per-StorageImpl lock (it was keyed on data_ptr in the previous version of this PR, but moved to StorageImpl to ensure the key is stable before and after sharing) that is created when we are about to share a storage, and all other calls that share memory wait on this lock before moving forward. This does NOT make the call generally thread-safe: any concurrent call that is not sharing memory will still race and lead to UB. It does ensure that the sample from @RobertoLat in #95606 works fine. It does NOT fix the example from @imurray in that same issue, as the call still races with the `.sum()` call. That race is expected and there is no easy way for us to make it work, I'm afraid (see the issue for more details). Pull Request resolved: #96664 Approved by: https://github.com/colesbury
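One way to side-step the remaining race with `.sum()` described above might be to move the storage to shared memory explicitly, in the calling thread, before putting the tensor on the queue. This is my own untested suggestion, not part of #96664, and it assumes the reduction path short-circuits for storages that are already in shared memory.

```python
import torch
import torch.multiprocessing as mp

def send_then_use(q, t):
    # Assumption: share_memory_() moves t's storage to shared memory
    # synchronously here, so the queue's background feeder thread only
    # serializes handles and the read below no longer overlaps with the
    # move of the underlying data.
    t.share_memory_()
    q.put(t)
    return t.sum()
```

Whether this fully removes the race depends on the sharing path really being a no-op for already-shared storages, which I have not verified.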
Would #83623 be relevant, too?
🐛 Random hangs and failures when sending tensors that are split using torch.split in a JoinableQueue
Splitting tensors using torch.split and sending them to processes using a JoinableQueue seems to cause random errors and hangs on 2.0.0.dev20230130+cu116, while it works perfectly fine on 1.9.1+cu102.
I tried to make the reproduction code as small as I could. The key ingredients are torch.split and JoinableQueue.
The following script hangs on my CUDA machine using PyTorch 2.0, while it completes successfully on PyTorch 1.9.
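The script itself did not survive extraction; below is a rough reconstruction of the kind of repro described (a small tensor split with torch.split, with the chunks pushed through a JoinableQueue to worker processes). It is my sketch, not the author's exact code, and the iteration count and chunk sizes are guesses.

```python
import torch
import torch.multiprocessing as mp

def worker(q):
    while True:
        chunk = q.get()
        if chunk is None:
            q.task_done()
            break
        _ = chunk.sum()  # touch the received chunk
        q.task_done()

def main():
    q = mp.JoinableQueue()
    procs = [mp.Process(target=worker, args=(q,)) for _ in range(2)]
    for p in procs:
        p.start()

    for _ in range(1000):
        x = torch.rand(64, dtype=torch.float64)
        for chunk in torch.split(x, 16):  # chunks are views of one storage
            q.put(chunk)
        q.join()  # hangs or crashes intermittently on the affected builds

    for _ in procs:
        q.put(None)  # sentinel to stop each worker
    q.join()
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```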
On CPU the behaviour is more random: I sometimes observe the following error after some runtime,
while at other times the code runs successfully.
I verified that the code runs fine on 1.9.1+cu102 on both CPU and GPU, but I don't know about other versions.
Versions
CUDA environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2700.202
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.03
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi
CPU environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.49-linuxkit-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping: 10
CPU MHz: 2591.608
BogoMIPS: 5183.21
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 arat
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi
cc @ezyang @gchanan @zou3519 @VitalyFedyunin @ejguan