
CUDA error (3): initialization error (multiprocessing) #2517

Closed
xiahouzuoxin opened this issue Aug 23, 2017 · 32 comments · Fixed by #2811
@xiahouzuoxin

xiahouzuoxin commented Aug 23, 2017

Summary by @ezyang. If you are using multiprocessing code and are on Python 3, you can work around this problem by adding mp.set_start_method('spawn') to your script. Otherwise, to work around it, you need to make sure there are no CUDA calls prior to starting your subprocesses; torch.cuda.is_available() and torch.manual_seed both count as CUDA calls in PyTorch 0.2.0.
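A minimal sketch of that workaround (the worker function and tensor sizes here are just illustrative, not from the original report):

import torch
import torch.multiprocessing as mp


def worker(rank):
    # Each spawned process initializes its own CUDA context, so this is safe.
    x = torch.rand(4, 4)
    if torch.cuda.is_available():
        x = x.cuda()
    print("worker %d done" % rank)


if __name__ == "__main__":
    # Must run before any CUDA call happens in the parent (Python 3 only).
    mp.set_start_method('spawn')

    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()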


Running https://github.com/ikostrikov/pytorch-a3c with pytorch 0.2.0 and python 2.7 fails with the error

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA error (3): initialization error

The error occurs on these lines:
https://github.com/ikostrikov/pytorch-a3c/blob/842ec4da82df3d2dda02943d881c6b577f078ab9/train.py#L105-L111

while the same code works fine with pytorch 0.12.0.

@soumith soumith added this to Uncategorized in Issue Status Aug 23, 2017
@floringogianu

I got this too with pytorch 0.2.0 and python 3.6.2, also with code involving multiprocessing and shared torch objects. I am working on a smaller example to post here.

@soumith
Member

soumith commented Aug 25, 2017

hmmm maybe our lazy cuda initialization is leaking somewhere. @colesbury this is worth looking into.

@floringogianu

The strange thing is that I don't think I am actually initializing any CUDA objects; everything is on the CPU. It's an old codebase, though; I will post something a bit clearer as soon as possible.

@floringogianu

floringogianu commented Aug 25, 2017

OK so I narrowed it down to torch.manual_seed of all things. Here is a minimal script reproducing the issue.

import torch
import torch.nn as nn
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.autograd import Variable


def task(pid, model):
    x = Variable(torch.rand(64, 10))
    y = model(x)
    t = y.clone() * 0.99
    loss = F.smooth_l1_loss(y, t)

    # here it breaks
    loss.backward()

    print("Process %d finished" % pid)


if __name__ == "__main__":

    # comment manual_seed and the CUDA initialization error is gone.
    torch.manual_seed(23)

    net = nn.Linear(10, 4)
    net.share_memory()

    processes = []
    for pid in range(8):
        p = mp.Process(target=task, args=(pid, net))
        p.start()
        processes.append(p)  # keep a handle so we can join below

    for p in processes:
        p.join()

    print("Done.")

edit: this can be solved by calling mp.set_start_method('spawn') before setting the RNG seed (which in turn calls into CUDA), although I am not sure it is ideal.
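For reference, a sketch of that change applied to the script above (Python 3 only); everything else stays the same:

if __name__ == "__main__":

    # Switch to 'spawn' before torch.manual_seed triggers any CUDA call.
    mp.set_start_method('spawn')

    torch.manual_seed(23)

    # ... rest of the script unchanged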

@colesbury
Member

The problem is that torch.manual_seed initializes the CUDA driver, which breaks horribly when re-initialized across forks. Any cuda call, including cudaGetDeviceCount, seems to initialize the driver.

We need to avoid the cudaGetDeviceCount call in manual_seed. We should check whether we've already initialized CUDA (i.e. THCState *state is not null). If we haven't created a THCState yet, we should just store the seed somewhere internally.

There's a similar problem in engine.cpp because it calls cudaGetDeviceCount. We'll need to avoid that call too if the THCState* is null.
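Conceptually, the fix looks something like the following Python-level sketch; this is only an illustration of the deferred-seeding idea, not the actual THC internals:

class LazyCudaSeed:
    """Remember the seed instead of touching the CUDA driver eagerly."""

    def __init__(self):
        self.pending_seed = None
        self.cuda_initialized = False

    def manual_seed(self, seed):
        if not self.cuda_initialized:
            # No THCState yet: just record the seed, make no driver calls.
            self.pending_seed = seed
        else:
            self._seed_device_rng(seed)

    def initialize_cuda(self):
        # Runs on the first real CUDA operation.
        self.cuda_initialized = True
        if self.pending_seed is not None:
            self._seed_device_rng(self.pending_seed)
            self.pending_seed = None

    def _seed_device_rng(self, seed):
        # Placeholder for the real per-device RNG seeding call.
        print("seeding CUDA RNG with", seed)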

@btekin

btekin commented Sep 20, 2017

I've had the same problem with Python 2.7 and multi-GPU training after I had to ctrl-C a running script; this somehow crashed the NVIDIA driver. Running "sudo fuser -v /dev/nvidia*" and killing whatever processes it listed resolved the problem.

@ezyang ezyang self-assigned this Sep 20, 2017
@ezyang
Contributor

ezyang commented Sep 20, 2017

Note: it seems that somehow our test initialization code also accidentally initializes the CUDA driver, so we may need to write a regression test for this in a file that gets its own Python process.

ezyang added a commit to ezyang/pytorch that referenced this issue Sep 21, 2017
Instead of initializing CUDA immediately and executing these calls eagerly,
we wait until CUDA is actually initialized before executing them.

To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).

Fixes pytorch#2517

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang
Contributor

ezyang commented Sep 21, 2017

@colesbury Did you want to fix torch.cuda.is_available() to avoid initializing the CUDA driver as well? Because we actually have to return yes or no in this case (unlike seeding, which doesn't have to return anything), it will be difficult to do unless we're willing to spawn a subprocess to do the test for us (and if that subprocess dies, we assume CUDA was already initialized and go ahead and call cudaGetDeviceCount).

@ezyang ezyang changed the title CUDA error (3): initialization error CUDA error (3): initialization error (multiprocessing) Sep 21, 2017
soumith pushed a commit that referenced this issue Sep 22, 2017
@soumith
Member

soumith commented Sep 22, 2017

this is fixed in master.

@Amir-Arsalan
Contributor

Amir-Arsalan commented Nov 13, 2018

@mimoralea
Contributor

I was still having this issue even while avoiding cuda or manual_seed calls. Since I'm not using CUDA for this notebook, a quick workaround is to set CUDA_VISIBLE_DEVICES to an empty string.

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]=""

@sbelharbi

sbelharbi commented Jan 7, 2019

Hello,
     I ran into the same issue using Python 3.7.1 and PyTorch 1.0.0.
Error: RuntimeError: CUDA error: initialization error.

Here is my setup:

  1. I created a class that inherits from torch.utils.data.Dataset to make my own dataset. Inside this class, I do some preprocessing on samples on the GPU. This preprocessing is performed by a neural network that I created, which is instantiated inside the class and sent to the GPU. torch.cuda.is_available() is called inside the class. The class gets the device, self.DEVICE = torch.device(device), and keeps it for future use (to send samples to be processed to the GPU). The class was tested (alone) and works fine. The issue starts when using this class with torch.utils.data.DataLoader (see (2)).
  2. My dataset class is instantiated, gets the device, creates the model that does the preprocessing, and does some preprocessing on validation set samples. It works fine. Then the PyTorch data loader is created. No issue until now. The issue is raised when I start looping over the samples:
for i, (data, labels) in enumerate(train_loader):
    pass

--> It raises the error at the moment my dataset class tries to send the sample to its device:

x.to(self.DEVICE)
RuntimeError: CUDA error: initialization error

I have read all the above comments, and other forums (1, 2, 3).
I tried removing torch.cuda.is_available() from my dataset class to avoid CUDA initialization, and using torch.multiprocessing.set_start_method("spawn"), but it didn't help (though I am not sure if I am missing something). However, I think CUDA needs to be initialized in that class, because that is where it starts being used.

My dataset class is the first class to use CUDA. Somehow, the dataloader may have re-initialized CUDA and broken it for the dataset class. I am not sure whether the dataset class should get the device a second time in case CUDA has been re-initialized.

Note: I do not explicitly use torch.multiprocessing anywhere in the code. I do not modify torch.manual_seed for now. I have only one GPU on the computer, so it is the same device whenever the GPU is used.

Not really an expert in CUDA. Any suggestions? Thank you!

Updates:

  1. (P.S.) The code is split into many files, with one file as the main entry point. The dataset is in a different file than the main.
  2. After adding torch.multiprocessing.set_start_method('spawn', force=True) at the top of the main file, wrapping the main code in if __name__ == "__main__": (otherwise the main entry calls itself, i.e., the main code executes twice, which is not expected), and setting num_workers=1 for the data loader (i.e., no multiprocessing in the data loader), things seem to work fine as expected. However, once num_workers > 1, things went south when arriving at:
for i, (data, labels) in enumerate(train_loader):
    pass

nvidia-smi starts showing new processes (I suppose one per worker, resulting from forking in the data loader) and the GPU memory use increases (each process takes up to 450 MiB). On a GPU with little memory, it runs out of memory quickly. On a GPU with large memory, after a while (creating the subprocesses takes time and is extremely slow), things work fine, but it ends with the warning: python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown.
3. I am not really sure what happens when the data loader starts forking (here). I assume that the whole process is duplicated, including the dataset class (which uses CUDA) and everything it owns, including the network that performs the preprocessing. I am not sure if this is a dead end (3).
4. One option is to switch to the CPU; forking seems way faster there.
5. There is some work that uses the GPU for data processing (here).

Still looking for a solution.

Updates:

Currently, I use only one worker in the data loader (reducing the time to create the worker and the GPU memory usage). This seems practical. The worker does the preprocessing on the GPU.
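For reference, a rough sketch of this setup; the dataset, the preprocessing network, and the shapes below are placeholders, not the real code:

import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader


class GPUPreprocDataset(Dataset):
    def __init__(self, samples, device):
        self.samples = samples
        self.DEVICE = torch.device(device)
        # Hypothetical preprocessing network, kept on the chosen device.
        self.preproc = nn.Linear(10, 10).to(self.DEVICE)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx].to(self.DEVICE)  # the call that used to fail
        with torch.no_grad():
            x = self.preproc(x)
        return x.cpu(), 0


if __name__ == "__main__":
    # Use spawn so the worker does not inherit a forked CUDA context.
    mp.set_start_method('spawn', force=True)

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    train_set = GPUPreprocDataset([torch.rand(10) for _ in range(100)], device)
    # A single worker keeps startup time and per-process GPU memory manageable.
    train_loader = DataLoader(train_set, batch_size=8, num_workers=1)

    for i, (data, labels) in enumerate(train_loader):
        pass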

How to hide the warning below? See here.

python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

@ezyang
Contributor

ezyang commented Jan 8, 2019

python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown

This is benign, don't worry about it.

It sounds like you got PyTorch to use spawn for multiprocess invocations, which solves the CUDA initialization problem. Regarding the subsequent memory errors, PyTorch can't design your multiprocessing pipeline for you. If you don't have that much GPU memory, you'll have to design it so that each worker doesn't use too much. One basic tip is to create your worker processes only once (rather than repeatedly), because each new process needs to initialize CUDA, which takes time.

@sbelharbi

sbelharbi commented Jan 8, 2019

Thank you for the pointers!!! very helpful!

For now, I decided to use only one worker in the data loader (num_workers=0 or num_workers=1), which seems practical in terms of the time to create the worker and the memory usage. I updated my comment above.
P.S.
The worker does the preprocessing described in the dataset class (when called via dataset.__getitem__()) on the GPU.

@rabeehk

rabeehk commented Jan 17, 2019

I do not think any of the solutions provided were helpful or actually work!

@fehiepsi
Contributor

Thank you @mimoralea. Your solution works like a charm.

@EvenOldridge

I'm in a similar situation: I do preprocessing on the GPU via RAPIDS and then pass the already-on-GPU tensors to the dataloader, and I run into this same issue. I've tried spawn (and forkserver) but haven't been able to get either working. Everything works fine with num_workers=0, but the throughput is less than half that of a dataloader using CPU memory and 4 workers.

It would be nice if spawn or forkserver worked in this context, or if there were some other way of doing in-GPU-memory dataloading. I'm surprised it's so much slower than the CPU multiprocessed version, but that's been my experience. My next step is to write my own dataloader routine that passes the batches directly, in the hope that it's faster, but I'm guessing the pytorch devs compared options for dataloading early on and found that multi-CPU is faster than keeping cached batches on the GPU?

@ezyang
Contributor

ezyang commented May 6, 2019

@EvenOldridge What you describe should work. If you can make a small script that reproduces the issue, please feel free to post it in a new issue. I did a quick search for dataloader spawn/forkserver issues but didn't see anything relevant.

@EvenOldridge

I'll try to strip down the code and put something together. Thanks @ezyang.

@tshrjn

tshrjn commented May 16, 2019

Hi @EvenOldridge, were you able to solve it or reproduce it with stripped-down code?

I'm having a similar issue when using a custom collate_fn with num_workers>0. I get the following error on a line that just creates a zero tensor on the appropriate device, like this: src = torch.zeros((bs, max_src_len), dtype=torch.long, device=self.device):

  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/path/to/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/path/to/mycodebase/src/data.py", line 636, in _batch_helper
    (bs, max_src_len), dtype=torch.long, device=self.device)
RuntimeError: CUDA error: initialization error

It works with num_workers>0 on the CPU, and also on the GPU with num_workers=0, but not with num_workers>0 on the GPU, which makes the throughput quite low.
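In line with the earlier comments about avoiding CUDA calls inside forked workers, one workaround is to build the batch on the CPU inside collate_fn and move it to the device in the main process. A rough sketch; the sequence handling and names are made up:

import torch


def collate_on_cpu(batch, max_src_len=128):
    # Allocate on the CPU inside the worker process; no CUDA call happens here.
    bs = len(batch)
    src = torch.zeros((bs, max_src_len), dtype=torch.long)
    for i, seq in enumerate(batch):
        n = min(len(seq), max_src_len)
        src[i, :n] = torch.as_tensor(seq[:n], dtype=torch.long)
    return src


# Usage sketch:
# loader = DataLoader(dataset, batch_size=32, num_workers=4,
#                     collate_fn=collate_on_cpu)
# device = torch.device("cuda:0")
# for src in loader:
#     src = src.to(device, non_blocking=True)  # move to the GPU here instead
#     ...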

@Santiago810

Alternative solution: use threading instead of multiprocessing for the inference function (which calls model.forward()); other data pre/post-processing functions can still use the multiprocessing module.
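A rough sketch of that pattern; the model and batch variables are placeholders. The forward passes run in threads, so the CUDA context stays within a single process:

from concurrent.futures import ThreadPoolExecutor

import torch


def infer(model, batch, device):
    # Threads share the parent's CUDA context, so nothing gets re-initialized.
    with torch.no_grad():
        return model(batch.to(device)).cpu()


# model = MyModel().to("cuda:0").eval()   # hypothetical model
# batches = [...]                         # pre/post-processing can keep using
#                                         # the multiprocessing module elsewhere
# with ThreadPoolExecutor(max_workers=4) as pool:
#     outputs = list(pool.map(lambda b: infer(model, b, "cuda:0"), batches))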

@tshrjn

tshrjn commented May 25, 2019

Thanks for the alternative solution, @Santiago810. Though I'm not using multiprocessing or threading explicitly, only pytorch's dataloader, and I would expect it to handle the multiprocessing.

@kalufinnle

This can be solved by setting mp.set_start_method('spawn').

@hainow

hainow commented Dec 6, 2019

In my case mp.set_start_method('spawn') works, but not with Python 3.7 and PyTorch 1.1.0; I finally had to downgrade Python to 3.6.9, and then things worked.

@ezyang
Contributor

ezyang commented Dec 6, 2019

@hainow Can you file a new issue for this?

@hainow

hainow commented Dec 6, 2019

@ezyang yes, please take a look at #30900

@Haydnspass

Maybe this is helpful for somebody: I found out that in my case the tensor was already on a CUDA device by accident. As soon as one of the workers tried to access the tensor, the dataloader failed with the initialization error.

@vodanhbk95

For me, switching from multiple GPUs to a single GPU made the process work. I don't really know much about this issue, but if multi-GPU is not necessary, you can try my approach.

@vdraceil

vdraceil commented Jun 22, 2021

For anyone facing this issue with celery, setting worker_pool = 'solo' in celeryconfig should help.
With this setting, celery will not use "fork" to spin off workers.

Alternatively, you can do the same via the CLI - ex. celery -A app worker -Q queue-name -P solo
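For example, a minimal celeryconfig.py sketch:

# celeryconfig.py -- run tasks in the worker's main process, so nothing forks
# before CUDA is initialized
worker_pool = 'solo'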

@ParamsRaman

ParamsRaman commented Nov 11, 2021

I am facing a related issue. Details here: https://discuss.pytorch.org/t/runtimeerror-cuda-error-initialization-error-when-calling-torch-distributed-init-process-group-using-torch-multiprocessing/136625
Wondering if anybody has suggestions on how to go about fixing this problem?

@soumith @Amir-Arsalan - I found that you had some earlier discussion threads around this problem. Do you have any resolutions/tips?

Thanks in advance!

@f4z3k4s

f4z3k4s commented Dec 1, 2021

#2517 (comment)

@vdraceil Thanks, exactly what I needed.

@albro96

albro96 commented Feb 29, 2024

I was facing a similar issue with a torch dataset that does some preprocessing on the GPU using CUDA. If num_workers in the DataLoader was set to 0, there was no problem. However, setting it to != 0 left me with this error:
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

The problem was caused by import open3d as o3d at the top of the script. Be aware of your imports!
