Multi-gpu example freeze and is not killable #24081
Comments
I've just tested with PyTorch 1.2.0; it freezes in the same way.
If someone wants to help us debug this, trying to reproduce the issue would be helpful; it would also be useful to know whether the problem goes away if you (1) upgrade your CUDA version and (2) run on different GPUs. BTW, I notice in your environment you have
You should delete the old torch 0.4.1 at some point.
Same on
Upgrading priority.
We face a similar issue. The environment below is with a single GPU mounted into the container, which is what is currently being tested; before, 2 GPUs were mounted in the container, and we saw the same kind of hang when some GPU-specific code was run. Though I am not completely sure whether this issue is related or whether we have a hardware failure, it sounds so similar that I thought I'd provide some information. I hope it helps.
Environment
OS: Ubuntu 16.04.6 LTS
Python version: 3.6
Versions of relevant libraries:
I've run strace on this script: https://gist.github.com/Dubrzr/b058b54947b2688b7e02e64f6bdf78b8 It froze after these lines:
Hope this will help you :)
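For anyone else capturing the same kind of trace, here is a small sketch (not from the original thread) that attaches strace to an already-hung training process; it assumes strace is installed and that the calling user is allowed to ptrace the target PID:

```python
# Sketch: attach strace to an already-frozen training process to see which
# syscall it is stuck in, as in the gist above.
# Assumes strace is installed and the user may ptrace the target process.
import subprocess
import sys

pid = sys.argv[1]  # PID of the hung python process, e.g. taken from `ps` or `nvidia-smi`
subprocess.run(["strace", "-f", "-p", pid, "-o", f"strace_{pid}.log"])
print(f"trace written to strace_{pid}.log")
```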
Is this one related: #1637? The solution: #1637 (comment)
Keeping high pri as investigation is required.
I'm encountering the same issue as well; not even SIGKILL will stop a script using PyTorch. Honestly, if SIGKILL isn't working, I doubt that this issue is specific to PyTorch. I'd be more inclined to believe that the NVIDIA driver / tainted Linux kernel is at fault; anything outside kernel space really shouldn't be able to interfere with SIGKILL. Similar issues have occurred in the past with nvidia-smi (see here, here, here, and here). [edit]: fastai also seemed to exhibit a similar problem around October 2018 (see here). For added context, my strace output is similar to @Dubrzr's:
The NVIDIA driver version is 418.87.00, CUDA is 10.0.130, cuDNN is 7.6.1.34-10.0, and PyTorch is 1.1.0. I'm not the chief developer of the script producing this behavior, so I don't know many details about how PyTorch is being used.
For quick reference, here's a set of tallies for some of the relevant software/version number pairings reported in this thread:
NVIDIA Driver Summary Count:
CUDA Summary Count:
cuDNN Summary Count:
PyTorch Summary Count:
I was able to replicate this issue using an environment identical to my earlier one except for the CUDA version, which was 9.2.88.
As another data point, the version of fastai exhibiting a similar issue that I linked to in my earlier comment would have also used CUDA 9.2 (based on a contemporary copy of the README here).
Well, the syscall where PyTorch is getting messed up is in our strace output already:
ioctl is used for performing I/O calls outside the universal file I/O calls and is documented here. File descriptor 5 was pointing at /dev/nvidia-uvm on the host I'm working with. This is the unified virtual memory module, which seems to be described in more detail here and here. Based on section 6.1.1 here: 0 is the direction, defined as _IOC_NONE, which means no data transfer is occurring. The number (25) seems to correspond to the following operation (from the MIT-licensed UVM source code, located at /usr/src/nvidia-*/nvidia-uvm/uvm_ioctl.h on a Linux install):
As near as I can tell, this means that there's some issue with allocating blocks of RAM on the GPU. I'm not a kernel programmer (and I'm definitely not intimately familiar with GPU programming beyond this), so take all of what I just said with a grain of salt.
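As a side note, the dir/type/nr/size fields mentioned above can be pulled apart programmatically. The sketch below is purely illustrative: it hard-codes the standard asm-generic/ioctl.h bit layout (x86/x86_64), and the nvidia-uvm driver defines its own request constants on top of that encoding.

```python
# Minimal sketch: decode a raw Linux ioctl request number into its
# (direction, type, number, size) fields using the asm-generic/ioctl.h layout.
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS, _IOC_DIRBITS = 8, 8, 14, 2
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS        # 8
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS    # 16
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS     # 30
_DIR_NAMES = {0: "_IOC_NONE", 1: "_IOC_WRITE", 2: "_IOC_READ", 3: "_IOC_READ|_IOC_WRITE"}

def decode_ioctl(request: int) -> dict:
    """Split a raw ioctl request into its dir/type/nr/size fields."""
    return {
        "dir": _DIR_NAMES[(request >> _IOC_DIRSHIFT) & ((1 << _IOC_DIRBITS) - 1)],
        "type": (request >> _IOC_TYPESHIFT) & ((1 << _IOC_TYPEBITS) - 1),
        "nr": (request >> _IOC_NRSHIFT) & ((1 << _IOC_NRBITS) - 1),
        "size": (request >> _IOC_SIZESHIFT) & ((1 << _IOC_SIZEBITS) - 1),
    }

# Example: a request with direction _IOC_NONE and number 25, as in the comment above.
print(decode_ioctl(25))
```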
As yet another bit of info, I ran memtestG80 on each of the GPUs on my system. Pretty much all GPUs were fine except for the first one (index 0). On this one, memtestG80 hangs. When I run strace on it, I get some familiar output:
File descriptor 4, in this case, is also /dev/nvidia-uvm. After a reboot, memtestG80 runs fine on this GPU. Then I run the PyTorch program and it gets stuck again at the same syscall. Consequently, if I run memtestG80 again on GPU 0 after PyTorch gets stuck, it also gets stuck at ioctl.
I ran pdb on one of the consistently unkillable PyTorch programs. The point at which the script became unkillable was when it ran something similar to
Here's the exact output of pdb before it became unkillable:
From the SuperMicro X11DPG-OT-CPU documentation: "Select Enable to program Access Control Services (ACS) to Chipset PCI-E Root Port Bridges. Select Disable to program Access Control Services to all PCI-E Root Port Bridges. The options are Enable and Disable." Are you saying that for this board it doesn't work? Just curious; I have several X10DRG-OT+-CPU boards and it works fine there. BIOS Version: 3.1, Release Date: 07/13/2018
In case this provides more info, it also freezes with this motherboard: ProLiant XL270d Gen10.
Is your ProLiant XL270d Gen10 with the NVLink tray or the PCIe tray? Is it p2pBandwidthLatencyTest that is hanging? I think this is no longer really a PyTorch issue at this point, but a system-setup-for-p2p issue...
The "Select Enable to program Access Control Services (ACS)" option only shows up if VT-d is enabled, which it isn't on my system. It would seem logical that since the ACS toggle is nested under the VT-d option (and disappears when VT-d is disabled), the ACS toggle would inherit the VT-d toggle's disabled state. This doesn't seem to be the case, however. This is confirmed by a fresh boot with my configuration (VT-d disabled, ACS shows as enabled when VT-d enabled) indicating that the ACS register values are populated:
If I go into the UEFI/BIOS, temporarily enable VT-d to make the ACS toggle visible, disable ACS, and then toggle VT-d off again, I get the expected results after reboot:
Overall, it seems to be a mixture of user error and interface confusion :)
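For anyone wanting to run the same check on their own board, here is a rough sketch (not from the thread) that prints the ACS control bits lspci reports per device; it assumes pciutils is installed and the script runs with root privileges, since lspci hides capability blocks otherwise:

```python
# Sketch: list the ACS capability/control lines from `lspci -vvv` so you can
# see whether ACS is actually active on each PCIe bridge after a reboot.
import subprocess

out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True, check=True).stdout
device = None
for line in out.splitlines():
    if line and not line.startswith(("\t", " ")):
        device = line            # device header, e.g. "17:00.0 PCI bridge: ..."
    elif "ACSCtl" in line:
        print(device)
        print("   ", line.strip())
```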
Same problem. I've tested on a Ryzen 3700 with two GTX 1060 6GB cards.
I'm using
I managed to solve this issue in my case by disabling IOMMU in the BIOS.
I have the same issue with 2x 1080 Ti + an AMD CPU.
Works for me!
I'm downgrading the priority of this issue, as it looks like people have found a workaround and it seems there is not much we can do about it on the PyTorch side. I'll keep it open for visibility, though.
Can confirm the workaround of disabling IOMMU also works for me.
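If it helps others verify that the BIOS change took effect, here is a quick userspace check (a sketch assuming a reasonably recent kernel): when an IOMMU is active, /sys/kernel/iommu_groups is populated with one directory per group.

```python
# Sketch: report whether the kernel sees an active IOMMU after the BIOS change.
from pathlib import Path

groups = sorted(p.name for p in Path("/sys/kernel/iommu_groups").glob("*"))
if groups:
    print(f"IOMMU appears enabled ({len(groups)} groups): {groups[:5]} ...")
else:
    print("No IOMMU groups found -- IOMMU appears to be disabled.")
```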
Disabling IOMMU can be a security issue: by disabling it, you authorize GPUs to access the memory addresses of all other GPUs. This bug is probably in the NVIDIA drivers. Thanks.
DMA attacks aren't (at present) as much of a concern in our environment (the machines have very limited access and aren't publicly accessible), although it's certainly far from an ideal scenario. I closed out the NVBug post a while ago. I'd suggest that you re-open an issue with them and frame this as a security issue.
I've posted an issue; here is the answer from NVIDIA:
¯\_(ツ)_/¯ Lock up your Linux gamer workstations in a closet, I guess. If you're concerned about a DMA attack from some sort of rogue peripheral, it's probably easier to block peripheral ports in some way, either via epoxy resin on the low-tech end of things or at the firmware/driver level (I believe this is fairly common practice in fintech and some healthcare institutions). It seems that there are a couple of other steps you could take, even if you can't do anything about IOMMU.
Works for me. Many thanks!
The concern is not only about plugging in a rogue device; it's also about the device itself failing to respect the host system's security policy. We have no information on the NVIDIA GPU internals and how it enforces security constraints on user-submitted GPU workloads. If not enforced correctly, a user-submitted GPU task could write to kernel memory at arbitrary addresses and escalate privileges. I guess the security-minded solution here is to use VMs as the isolation mechanism: a VM for each user, or a VM for several users, depending on the information that can or can't be shared between them per their access rights.
Not all other GPUs, but rather all available main system memory; GPUs have their own separate memory subsystem.
On my HP Envy dual-boot laptop, I had to disable virtualization in the BIOS settings. Seems to work now.
Is there a better solution here than disabling IOMMU? Does upgrading CUDA/NCCL help?
Try
Example:
Thanks, setting NCCL_P2P_DISABLE=1 solved a similar issue I was having with RTX A5000s.
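For reference, a minimal sketch of setting that variable from inside the script rather than the shell; the model and tensor shapes below are placeholders, and the variable must be set before NCCL is first initialized (i.e., before the first multi-GPU broadcast or collective):

```python
# Sketch: disable NCCL peer-to-peer transfers for this process.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"  # NCCL falls back to shared-memory / PCIe copies

import torch
import torch.nn as nn

# Hypothetical model and input, purely for illustration.
model = nn.DataParallel(nn.Linear(128, 10)).cuda()
out = model(torch.randn(64, 128, device="cuda"))
print(out.shape)
```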
I faced the same problem, and none of the solutions described here work for me. I have an AMD FX(tm)-4100 Quad-Core Processor. Any help is much appreciated!
🐛 Bug
Running PyTorch with multiple P40 GPUs freezes, and the process is not killable (even with kill -9 as root). Only a reboot removes the process.
Inside a Docker container (with nvidia-docker2) it freezes Docker. NVIDIA/nvidia-docker#1010
To Reproduce
Steps to reproduce the behavior:
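(The original repro steps are not preserved in this copy of the issue. The snippet below is only an assumed minimal multi-GPU example of the kind described in the report, using nn.DataParallel across all visible GPUs; it is not the reporter's actual script.)

```python
# Assumed minimal multi-GPU training loop for illustration only.
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(1024, 1024)).cuda()   # replicates across all visible GPUs
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    x = torch.randn(256, 1024, device="cuda")
    loss = model(x).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# On affected systems the process reportedly hangs during multi-GPU work like
# this and cannot be killed, even with kill -9.
print("finished all steps without hanging")
```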
Expected behavior
The training runs without freezing.
Environment
Collecting environment information...
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.2 LTS
GCC version: (crosstool-NG fa8859cb) 7.2.0
CMake version: Could not collect
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P40
GPU 1: Tesla P40
GPU 2: Tesla P40
GPU 3: Tesla P40
GPU 4: Tesla P40
GPU 5: Tesla P40
GPU 6: Tesla P40
GPU 7: Tesla P40
Nvidia driver version: 410.79
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
Versions of relevant libraries:
[pip3] numpy==1.15.2
[conda] mkl 2018.0.3 1 defaults
[conda] mkl_fft 1.0.6 py35_0 conda-forge
[conda] mkl_random 1.0.1 py35_0 conda-forge
[conda] nomkl 2.0 0 defaults
[conda] numexpr 2.6.5 py35_nomklhaa809a4_0 [nomkl] defaults
[conda] pytorch 1.0.1 py3.5_cuda10.0.130_cudnn7.4.2_2 pytorch
[conda] torch 0.4.1
[conda] torchvision 0.2.2 py_3 pytorch
cc @ezyang @gchanan @zou3519 @ngimel