🐛 Bug
I am not sure whether this reproduces in every environment, but I hit the following error when calling torch.cuda.set_device in child processes. Oddly, the error disappears if I remove the line x = torch.rand(20, 2).cuda(), which is the first statement inside the for loop.
Traceback (most recent call last):
  File "/home/shenli/local/miniconda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/shenli/local/miniconda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "dist_bug.py", line 5, in run
    torch.cuda.set_device(rank)
  File "/home/shenli/project/pytorch/torch/cuda/__init__.py", line 265, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (3) : initialization error at ../torch/csrc/cuda/Module.cpp:33
Edit: based on the discussion below, the fix should be to repair the "bad fork" error detection, which makes this issue a duplicate of #17359.
To Reproduce
import torch
from torch.multiprocessing import Process


def run(rank):
    torch.cuda.set_device(rank)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        # it would work fine without the line below
        x = torch.rand(20, 2).cuda()
        p = Process(target=run, args=(rank,))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
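
A possible workaround, separate from the "bad fork" detection fix discussed in this issue: the CUDA runtime generally cannot be reused in a child that was forked after the parent initialized CUDA, so starting the children with the "spawn" start method should sidestep the error. The sketch below is a variant of the script above under that assumption. The launch helper, its start_method parameter, and the rank % device_count() wrap-around are illustrative additions, not part of the original report.

```python
import torch
import torch.multiprocessing as mp


def run(rank):
    # Each spawned child initializes CUDA on its own; no CUDA state is
    # inherited from the parent process.
    if torch.cuda.is_available():
        # Assumption for this sketch: wrap the rank so it also runs on
        # machines with fewer GPUs than processes.
        torch.cuda.set_device(rank % torch.cuda.device_count())


def launch(size, start_method="spawn"):
    # "spawn" starts a fresh interpreter per child instead of forking,
    # so the parent's initialized CUDA context is not copied over.
    ctx = mp.get_context(start_method)
    processes = []
    for rank in range(size):
        p = ctx.Process(target=run, args=(rank,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    # A non-zero exit code means the child raised an exception.
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    if torch.cuda.is_available():
        # Touching CUDA in the parent before starting the children is the
        # step that triggers the error under the default fork start method.
        x = torch.rand(20, 2).cuda()
    print(launch(size=2))
```

With "spawn", the target function must be importable from the main module, which is why run is defined at module level and the launch code sits under the __main__ guard.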
Environment
PyTorch version: 1.1.0a0+63214b5
Is debug build: No
CUDA used to build PyTorch: 9.2.88
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
CMake version: version 3.12.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Tesla M40
GPU 1: Tesla M40
Nvidia driver version: 396.26
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.1.0a0+63214b5
[conda] blas 1.0 mkl
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl_fft 1.0.6 py37hd81dba3_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
[conda] torch 1.1.0a0+63214b5 dev_0