Description
🐛 Bug
Running PyTorch on multiple P40 GPUs freezes the process, which then cannot be killed (not even with kill -9 as root). Only a reboot removes the process.
Inside a Docker container (with nvidia-docker2), the hang freezes Docker as well. NVIDIA/nvidia-docker#1010
To Reproduce
Steps to reproduce the behavior:
- Install PyTorch 1.0.1.post2 (the version reported under Environment)
- Run the following code on multiple P40 GPUs
import os
###tutorial from https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
###no error with only 1 gpu
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'
#### to reproduce error allow multi gpu
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
import torch
torch.cuda.device_count()
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# Parameters and DataLoaders
input_size = 5000 #increased input size (works with 500 on multi gpu)
output_size = 2000 #increased output size (works with 200 on multi gpu)
batch_size = 300
data_size = 100
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len
rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)
class Model(nn.Module):
    # Our model
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        return output
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)
for i in range(10000):
    for data in rand_loader:
        input = data.to(device)
        output = model(input)
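For readers trying to reason about what each GPU receives in the script above, here is a minimal sketch of the batch split along dim 0. It assumes nn.DataParallel scatters its input with the same semantics as torch.chunk (equal chunks of ceil(rows / GPUs), with a smaller final chunk). The helper name chunk_sizes is hypothetical and uses no PyTorch calls, so it runs without a GPU.

```python
import math


def chunk_sizes(num_rows, num_chunks):
    """Row counts per chunk along dim 0.

    Hypothetical helper: mimics the assumed torch.chunk behavior
    (chunks of ceil(num_rows / num_chunks), last one possibly smaller).
    """
    if num_rows == 0:
        return []
    size = math.ceil(num_rows / num_chunks)
    sizes = []
    remaining = num_rows
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# Values taken from the reproduction script.
data_size = 100
batch_size = 300
num_gpus = 4  # CUDA_VISIBLE_DEVICES = '0,1,2,3'

# The dataset holds fewer rows than batch_size, so the DataLoader
# yields a single batch of data_size rows per pass.
rows_in_batch = min(batch_size, data_size)
per_gpu = chunk_sizes(rows_in_batch, num_gpus)
print(per_gpu)  # [25, 25, 25, 25]
```

Under these assumptions, each pass sends one 25 x 5000 slice to each of the four visible GPUs, so the hang is reached with fairly small per-device inputs.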
Expected behavior
The training loop should run to completion without freezing.
Environment
Collecting environment information...
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.2 LTS
GCC version: (crosstool-NG fa8859cb) 7.2.0
CMake version: Could not collect
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P40
GPU 1: Tesla P40
GPU 2: Tesla P40
GPU 3: Tesla P40
GPU 4: Tesla P40
GPU 5: Tesla P40
GPU 6: Tesla P40
GPU 7: Tesla P40
Nvidia driver version: 410.79
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
Versions of relevant libraries:
[pip3] numpy==1.15.2
[conda] mkl 2018.0.3 1 defaults
[conda] mkl_fft 1.0.6 py35_0 conda-forge
[conda] mkl_random 1.0.1 py35_0 conda-forge
[conda] nomkl 2.0 0 defaults
[conda] numexpr 2.6.5 py35_nomklhaa809a4_0 [nomkl] defaults
[conda] pytorch 1.0.1 py3.5_cuda10.0.130_cudnn7.4.2_2 pytorch
[conda] torch 0.4.1
[conda] torchvision 0.2.2 py_3 pytorch