Multi-GPU example freezes and is not killable #24081

@Dubrzr

Description

🐛 Bug

Running PyTorch with multiple P40 GPUs freezes, and the resulting process is not killable (not even with kill -9 as root). Only a reboot gets rid of the process.

Inside a Docker container (with nvidia-docker2) it also freezes Docker: NVIDIA/nvidia-docker#1010
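
A process that survives kill -9 is almost always stuck in uninterruptible sleep (D state) inside a kernel or driver call; SIGKILL is only honored once that call returns, which here never happens, so only a reboot clears it. As a quick diagnostic sketch (not part of the original report; the PID argument is whatever ps shows for the hung python process), the process state can be read from /proc:

import sys

# Hedged diagnostic sketch: print the scheduler state of a given PID.
# 'D' (uninterruptible sleep) would explain why kill -9 has no effect.
def process_state(pid):
    with open("/proc/{}/stat".format(pid)) as f:
        stat = f.read()
    # The state letter follows the closing ')' of the comm field,
    # which may itself contain spaces or parentheses.
    return stat.rsplit(")", 1)[1].split()[0]

if __name__ == "__main__":
    print(process_state(int(sys.argv[1])))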

To Reproduce

Steps to reproduce the behavior:

  1. Install PyTorch 1.0.1 (as shown in the Environment section below)
  2. Run the following code on multiple P40 GPUs:
import os


# Tutorial from https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
# No error with only 1 GPU:
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# To reproduce the error, allow multiple GPUs:
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

import torch


torch.cuda.device_count()

import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5000  # increased input size (works with 500 on multi-GPU)
output_size = 2000  # increased output size (works with 200 on multi-GPU)

batch_size = 300
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

for i in range(10000):
    for data in rand_loader:
        input = data.to(device)
        output = model(input)
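
Since nn.DataParallel scatters each batch across the visible GPUs and broadcasts/gathers tensors between them, a hang like this usually happens during GPU-to-GPU (peer-to-peer) transfers rather than in Python code. A minimal sketch, assuming the same CUDA_VISIBLE_DEVICES setting as above and a PyTorch build that exposes torch.cuda.can_device_access_peer, for listing which device pairs report peer access (not part of the original report):

import itertools

import torch

# Hedged diagnostic sketch: report P2P reachability between all visible GPU pairs.
# Pairs that report peer access but still hang on real copies typically point to
# an IOMMU/ACS or driver-level problem rather than a PyTorch bug.
def report_peer_access():
    n = torch.cuda.device_count()
    for src, dst in itertools.permutations(range(n), 2):
        ok = torch.cuda.can_device_access_peer(src, dst)
        print("GPU {} -> GPU {}: peer access {}".format(src, dst, "yes" if ok else "no"))

if __name__ == "__main__":
    report_peer_access()

If peer access is reported everywhere and the DataParallel run still freezes, the problem most likely sits below PyTorch (driver or IOMMU/ACS configuration), which would be consistent with the dependency-bug label applied below.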

Expected behavior

The training loop runs to completion without hanging, as it does when only one GPU is visible.

Environment

Collecting environment information...
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (crosstool-NG fa8859cb) 7.2.0
CMake version: Could not collect

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P40
GPU 1: Tesla P40
GPU 2: Tesla P40
GPU 3: Tesla P40
GPU 4: Tesla P40
GPU 5: Tesla P40
GPU 6: Tesla P40
GPU 7: Tesla P40

Nvidia driver version: 410.79
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip3] numpy==1.15.2
[conda] mkl 2018.0.3 1 defaults
[conda] mkl_fft 1.0.6 py35_0 conda-forge
[conda] mkl_random 1.0.1 py35_0 conda-forge
[conda] nomkl 2.0 0 defaults
[conda] numexpr 2.6.5 py35_nomklhaa809a4_0 [nomkl] defaults
[conda] pytorch 1.0.1 py3.5_cuda10.0.130_cudnn7.4.2_2 pytorch
[conda] torch 0.4.1
[conda] torchvision 0.2.2 py_3 pytorch

cc @ezyang @gchanan @zou3519 @ngimel


    Labels

    has workaround
    module: cuda - Related to torch.cuda, and CUDA support in general
    module: data parallel
    module: deadlock - Problems related to deadlocks (hang without exiting)
    module: dependency bug - Problem is not caused by us, but caused by an upstream library we use
    module: multi-gpu - Problem is related to running on multiple GPUs
    module: multiprocessing - Related to torch.multiprocessing
    quansight-nack - High-prio issues that have been reviewed by Quansight and are judged to be not actionable
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
