
CUDA error (3): initialization error (multiprocessing) #2517

Closed
xiahouzuoxin opened this issue Aug 23, 2017 · 32 comments · Fixed by #2811
@xiahouzuoxin

xiahouzuoxin commented Aug 23, 2017

Summary by @ezyang. If you are using multiprocessing code and are on Python 3, you can work around this problem by adding mp.set_start_method('spawn') to your script. Otherwise, to work around it, you need to make sure there are no CUDA calls prior to starting your subprocesses; torch.cuda.is_available() and torch.manual_seed both count as CUDA calls in PyTorch 0.2.0.
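A minimal sketch of that workaround (the worker function and tensor sizes here are just illustrative, not from the original report):

import torch
import torch.multiprocessing as mp


def worker(rank):
    # Each spawned process initializes its own CUDA context, so this is safe.
    x = torch.rand(4, 4)
    if torch.cuda.is_available():
        x = x.cuda()
    print("worker %d done" % rank)


if __name__ == "__main__":
    # Must run before any CUDA call happens in the parent (Python 3 only).
    mp.set_start_method('spawn')

    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()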


Running https://github.com/ikostrikov/pytorch-a3c with pytorch 0.2.0 and python 2.7 fails with the error

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA error (3): initialization error

The error occurs on these lines:
https://github.com/ikostrikov/pytorch-a3c/blob/842ec4da82df3d2dda02943d881c6b577f078ab9/train.py#L105-L111

while the same code works fine with pytorch 0.12.0.

@soumith soumith added this to Uncategorized in Issue Status Aug 23, 2017
@floringogianu

I got this too with pytorch 0.2.0 and python 3.6.2, also with code involving multiprocessing and shared torch objects. I am working on a smaller example to post here.

@soumith
Member

soumith commented Aug 25, 2017

hmmm maybe our lazy cuda initialization is leaking somewhere. @colesbury this is worth looking into.

@floringogianu

The strange thing is that I don't think I am actually initializing any CUDA objects; everything is on the CPU. It's an old codebase, though; I will post something a bit clearer as soon as possible.

@floringogianu

floringogianu commented Aug 25, 2017

OK so I narrowed it down to torch.manual_seed of all things. Here is a minimal script reproducing the issue.

import torch
import torch.nn as nn
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.autograd import Variable


def task(pid, model):
    x = Variable(torch.rand(64, 10))
    y = model(x)
    t = y.clone() * 0.99
    loss = F.smooth_l1_loss(y, t)

    # here it breaks
    loss.backward()

    print("Process %d finished" % pid)


if __name__ == "__main__":

    # comment manual_seed and the CUDA initialization error is gone.
    torch.manual_seed(23)

    net = nn.Linear(10, 4)
    net.share_memory()

    processes = []
    for pid in range(8):
        p = mp.Process(target=task, args=(pid, net))
        p.start()
        processes.append(p)  # keep a handle so we can join below

    for p in processes:
        p.join()

    print("Done.")

edit: this can be solved by calling mp.set_start_method('spawn') before setting the RNG seed (which in turn calls into CUDA), although I am not sure it is ideal.
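For reference, a sketch of that change applied to the script above (Python 3 only); everything else stays the same:

if __name__ == "__main__":

    # Switch to 'spawn' before torch.manual_seed triggers any CUDA call.
    mp.set_start_method('spawn')

    torch.manual_seed(23)

    # ... rest of the script unchanged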

@colesbury
Member

The problem is that torch.manual_seed initializes the CUDA driver, which breaks horribly when re-initialized across forks. Any cuda call, including cudaGetDeviceCount, seems to initialize the driver.

We need to avoid the cudaGetDeviceCount call in manual_seed. We should check whether we've already initialized CUDA (i.e. THCState *state is not null). If we haven't created a THCState yet, we should just store the seed somewhere internally.

There's a similar problem in engine.cpp because it calls cudaGetDeviceCount. We'll need to avoid that call too if the THCState* is null.
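Conceptually, the fix looks something like the following Python-level sketch; this is only an illustration of the deferred-seeding idea, not the actual THC internals:

class LazyCudaSeed:
    """Remember the seed instead of touching the CUDA driver eagerly."""

    def __init__(self):
        self.pending_seed = None
        self.cuda_initialized = False

    def manual_seed(self, seed):
        if not self.cuda_initialized:
            # No THCState yet: just record the seed, make no driver calls.
            self.pending_seed = seed
        else:
            self._seed_device_rng(seed)

    def initialize_cuda(self):
        # Runs on the first real CUDA operation.
        self.cuda_initialized = True
        if self.pending_seed is not None:
            self._seed_device_rng(self.pending_seed)
            self.pending_seed = None

    def _seed_device_rng(self, seed):
        # Placeholder for the real per-device RNG seeding call.
        print("seeding CUDA RNG with", seed)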

@btekin

btekin commented Sep 20, 2017

I've had the same problem with Python 2.7 and multi-GPU training after I had to ctrl-C a running script; this somehow crashed the NVIDIA driver. Running "sudo fuser -v /dev/nvidia*" and killing whatever processes it listed resolved the problem.

@ezyang ezyang self-assigned this Sep 20, 2017
@ezyang
Contributor

ezyang commented Sep 20, 2017

Note: it seems that somehow our test initialization code also accidentally initializes the CUDA driver, so we may need to write a regression test for this in a file that gets its own Python process.

ezyang added a commit to ezyang/pytorch that referenced this issue Sep 21, 2017
Instead of initializing CUDA immediately and executing these calls eagerly,
we wait until CUDA is actually initialized before executing them.

To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).

Fixes pytorch#2517

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang
Contributor

ezyang commented Sep 21, 2017

@colesbury Did you want to fix torch.cuda.is_available() to avoid initializing the CUDA driver as well? Because we actually have to return yes or no in this case (unlike seeding, which doesn't have to return anything), it will be difficult to do unless we're willing to spawn a subprocess to do the test for us (and if that subprocess dies, we assume CUDA was already initialized and go ahead and call cudaGetDeviceCount).

@ezyang ezyang changed the title CUDA error (3): initialization error CUDA error (3): initialization error (multiprocessing) Sep 21, 2017
soumith pushed a commit that referenced this issue Sep 22, 2017
@soumith
Member

soumith commented Sep 22, 2017

this is fixed in master.

@Amir-Arsalan
Contributor

Amir-Arsalan commented Nov 13, 2018

@mimoralea
Contributor

I was still having this issue even while avoiding cuda or manual_seed calls. Since I'm not using CUDA for this notebook, a quick workaround is to set CUDA_VISIBLE_DEVICES to an empty string.

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]=""

@sbelharbi

sbelharbi commented Jan 7, 2019

Hello,
     I ran into the same issue using Python 3.7.1 and PyTorch 1.0.0.
Error: RuntimeError: CUDA error: initialization error.

Here is my setup:

  1. I created a class that inherits from torch.utils.data.Dataset to make my own dataset. Inside this class, I do some preprocessing on samples on the GPU. This preprocessing is performed by a neural network that I created, which is instantiated inside the class and sent to the GPU. torch.cuda.is_available() is called inside the class. The class gets the device, self.DEVICE = torch.device(device), and keeps it for future use (to send samples to be processed to the GPU). The class was tested (alone) and works fine. The issue starts when using this class with torch.utils.data.DataLoader (see (2)).
  2. My dataset class is instantiated, gets the device, creates the model that does the preprocessing, and does some preprocessing on validation set samples. It works fine. Then the PyTorch data loader is created. No issue until now. The issue is raised when I start looping over the samples:
for i, (data, labels) in enumerate(train_loader):
    pass

--> It raises the error at the moment my dataset class tries to send the sample to its device:

x.to(self.DEVICE)
RuntimeError: CUDA error: initialization error

I have read all the above comments, and other forums (1, 2, 3).
I tried removing torch.cuda.is_available() from my dataset class to avoid CUDA initialization, and using torch.multiprocessing.set_start_method("spawn"), but it didn't help (though I am not sure if I am missing something). However, I think CUDA needs to be initialized in that class, because that is where it starts being used.

My dataset class is the first class to use CUDA. Somehow, the dataloader may have re-initialized CUDA and broken it for the dataset class. I am not sure whether the dataset class should get the device a second time in case CUDA has been re-initialized.

Note: I do not explicitly use torch.multiprocessing anywhere in the code. I do not modify torch.manual_seed for now. I have only one GPU on the computer, so it is the same device whenever the GPU is used.

Not really an expert in CUDA. Any suggestions? Thank you!

Updates:

  1. (P.S.) The code is split into many files, with one file as the main entry point. The dataset is in a different file than the main.
  2. After adding torch.multiprocessing.set_start_method('spawn', force=True) at the top of the main file, wrapping the main code in if __name__ == "__main__": (otherwise the main entry calls itself, i.e., the main code executes twice, which is not expected), and setting num_workers=1 for the data loader (i.e., no multiprocessing in the data loader), things seem to work fine as expected. However, once num_workers > 1, things went south when arriving at:
for i, (data, labels) in enumerate(train_loader):
    pass

nvidia-smi starts showing new processes (I suppose one per worker, resulting from forking in the data loader) and the GPU memory use increases (each process takes up to 450 MiB). On a GPU with little memory, it runs out of memory quickly. On a GPU with large memory, after a while (creating the subprocesses takes time and is extremely slow), things work fine, but it ends with the warning: python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown.
3. I am not really sure what happens when the data loader starts forking (here). I assume that the whole process is duplicated, including the dataset class (which uses CUDA) and everything it owns, including the network that performs the preprocessing. I am not sure if this is a dead end (3).
4. One option is to switch to the CPU; forking seems way faster there.
5. There is some work that uses the GPU for data processing (here).

Still looking for a solution.

Updates:

Currently, I use only one worker in the data loader (reducing the time to create the worker and the GPU memory usage). This seems practical. The worker does the preprocessing on the GPU.
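For reference, a rough sketch of this setup; the dataset, the preprocessing network, and the shapes below are placeholders, not the real code:

import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader


class GPUPreprocDataset(Dataset):
    def __init__(self, samples, device):
        self.samples = samples
        self.DEVICE = torch.device(device)
        # Hypothetical preprocessing network, kept on the chosen device.
        self.preproc = nn.Linear(10, 10).to(self.DEVICE)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx].to(self.DEVICE)  # the call that used to fail
        with torch.no_grad():
            x = self.preproc(x)
        return x.cpu(), 0


if __name__ == "__main__":
    # Use spawn so the worker does not inherit a forked CUDA context.
    mp.set_start_method('spawn', force=True)

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    train_set = GPUPreprocDataset([torch.rand(10) for _ in range(100)], device)
    # A single worker keeps startup time and per-process GPU memory manageable.
    train_loader = DataLoader(train_set, batch_size=8, num_workers=1)

    for i, (data, labels) in enumerate(train_loader):
        pass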

How to hide the warning below? See here.

python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

@ezyang
Contributor

ezyang commented Jan 8, 2019

python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown

This is benign, don't worry about it.

It sounds like you got PyTorch to use spawn for multiprocess invocations, which solves the CUDA initialization problem. Regarding the subsequent memory errors, PyTorch can't design your multiprocessing pipeline for you. If you don't have that much GPU memory, you'll have to design it so that each worker doesn't use too much. One basic tip is to create your worker processes only once (rather than repeatedly), because each new process needs to initialize CUDA, which takes time.

@sbelharbi

sbelharbi commented Jan 8, 2019

Thank you for the pointers!!! very helpful!

For now, I decided to use only one worker in the data loader (num_workers=0 or num_workers=1), which seems practical in terms of the time to create the worker and the memory usage. I updated my comment above.
P.S.
The worker does the preprocessing described in the dataset class (when called via dataset.__getitem__()) on the GPU.

@rabeehk

rabeehk commented Jan 17, 2019

I do not think any of the solutions provided were helpful or actually work!

@fehiepsi
Contributor

Thank you @mimoralea. Your solution works like a charm.

@EvenOldridge

I'm in a similar situation: I do preprocessing on the GPU via RAPIDS and then pass the already-on-GPU tensors to the dataloader, and I run into this same issue. I've tried spawn (and forkserver) but haven't been able to get either working. Everything works fine with num_workers=0, but the throughput is less than half that of a dataloader using CPU memory and 4 workers.

It would be nice if spawn or forkserver worked in this context, or if there were some other way of doing in-GPU-memory dataloading. I'm surprised it's so much slower than the CPU multiprocessed version, but that's been my experience. My next step is to write my own dataloader routine that passes the batches directly, in the hope that it's faster, but I'm guessing the pytorch devs compared options for dataloading early on and found that multi-CPU is faster than keeping cached batches on the GPU?

@ezyang
Contributor

ezyang commented May 6, 2019

@EvenOldridge What you describe should work. If you can make a small script that reproduces the issue, please feel free to post it in a new issue. I did a quick search for dataloader spawn/forkserver issues but didn't see anything relevant.

@EvenOldridge

I'll try to strip down the code and put something together. Thanks @ezyang.

@tshrjn

tshrjn commented May 16, 2019

Hi @EvenOldridge, were you able to solve it or reproduce it with stripped-down code?

I'm having a similar issue when using a custom collate_fn with num_workers>0. I get the following error on a line that just creates a zero tensor on the appropriate device, like this: src = torch.zeros((bs, max_src_len), dtype=torch.long, device=self.device):

  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/path/to/mycodebase/src/data.py", line 636, in _batch_helper
    (bs, max_src_len), dtype=torch.long, device=self.device)
RuntimeError: CUDA error: initialization error

It works with num_workers>0 on the CPU, and also on the GPU with num_workers=0, but not with num_workers>0 on the GPU, which makes the throughput quite low.
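In line with the earlier comments about avoiding CUDA calls inside forked workers, one workaround is to build the batch on the CPU inside collate_fn and move it to the device in the main process. A rough sketch; the sequence handling and names are made up:

import torch


def collate_on_cpu(batch, max_src_len=128):
    # Allocate on the CPU inside the worker process; no CUDA call happens here.
    bs = len(batch)
    src = torch.zeros((bs, max_src_len), dtype=torch.long)
    for i, seq in enumerate(batch):
        n = min(len(seq), max_src_len)
        src[i, :n] = torch.as_tensor(seq[:n], dtype=torch.long)
    return src


# Usage sketch:
# loader = DataLoader(dataset, batch_size=32, num_workers=4,
#                     collate_fn=collate_on_cpu)
# device = torch.device("cuda:0")
# for src in loader:
#     src = src.to(device, non_blocking=True)  # move to the GPU here instead
#     ...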

@Santiago810

Alternative solution: use threading instead of multiprocessing for the inference function (which calls model.forward()); other data pre/post-processing functions can still use the multiprocessing module.
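A rough sketch of that pattern; the model and batch variables are placeholders. The forward passes run in threads, so the CUDA context stays within a single process:

from concurrent.futures import ThreadPoolExecutor

import torch


def infer(model, batch, device):
    # Threads share the parent's CUDA context, so nothing gets re-initialized.
    with torch.no_grad():
        return model(batch.to(device)).cpu()


# model = MyModel().to("cuda:0").eval()   # hypothetical model
# batches = [...]                         # pre/post-processing can keep using
#                                         # the multiprocessing module elsewhere
# with ThreadPoolExecutor(max_workers=4) as pool:
#     outputs = list(pool.map(lambda b: infer(model, b, "cuda:0"), batches))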

@tshrjn

tshrjn commented May 25, 2019

Thanks for the alternative solution, @Santiago810. Though I'm not using multiprocessing or threading explicitly, only pytorch's dataloader, and I would expect it to handle the multiprocessing.

@kalufinnle

This can be solved by setting mp.set_start_method('spawn').

@hainow

hainow commented Dec 6, 2019

In my case mp.set_start_method('spawn') works, but not with Python 3.7 and PyTorch 1.1.0; I finally had to downgrade Python to 3.6.9, and then things worked.

@ezyang
Contributor

ezyang commented Dec 6, 2019

@hainow Can you file a new issue for this?

@hainow

hainow commented Dec 6, 2019

@ezyang yes, please take a look at #30900

@Haydnspass

Maybe this is helpful for somebody: I found out that in my case the tensor was already on a CUDA device by accident. As soon as one of the workers tried to access the tensor, the dataloader failed with the initialization error.

@vodanhbk95

For me, switching from multiple GPUs to a single GPU made the process work. I don't really know much about this issue, but if multi-GPU is not necessary, you can try my approach.

@vdraceil

vdraceil commented Jun 22, 2021

For anyone facing this issue with celery, setting worker_pool = 'solo' in celeryconfig should help.
With this setting, celery will not use "fork" to spin off workers.

Alternatively, you can do the same via the CLI - ex. celery -A app worker -Q queue-name -P solo
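For example, a minimal celeryconfig.py sketch:

# celeryconfig.py -- run tasks in the worker's main process, so nothing forks
# before CUDA is initialized
worker_pool = 'solo'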

@ParamsRaman

ParamsRaman commented Nov 11, 2021

I am facing a related issue. Details here: https://discuss.pytorch.org/t/runtimeerror-cuda-error-initialization-error-when-calling-torch-distributed-init-process-group-using-torch-multiprocessing/136625
Wondering if anybody has suggestions on how to go about fixing this problem?

@soumith @Amir-Arsalan - I found that you had some earlier discussion threads around this problem. Do you have any resolutions/tips?

Thanks in advance!

@f4z3k4s

f4z3k4s commented Dec 1, 2021

#2517 (comment)

@vdraceil Thanks, exactly what I needed.

@albro96

albro96 commented Feb 29, 2024

I was facing a similar issue with a torch dataset that does some preprocessing on the GPU using CUDA. If num_workers in the DataLoader was set to 0, there was no problem. However, setting it to != 0 left me with this error:
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

The problem was caused by import open3d as o3d at the top of the script. Be aware of your imports!
