CUDA error (3): initialization error (multiprocessing) #2517
Comments
Got this too with …
Hmmm, maybe our lazy CUDA initialization is leaking somewhere. @colesbury this is worth looking into.
What's strange is that I don't think I am actually initializing any CUDA objects; everything is on the CPU. Old codebase though, will post something a bit clearer as soon as possible.
OK so I narrowed it down to:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.autograd import Variable


def task(pid, model):
    x = Variable(torch.rand(64, 10))
    y = model(x)
    t = y.clone() * 0.99
    loss = F.smooth_l1_loss(y, t)
    # here it breaks
    loss.backward()
    print("Process %d finished" % pid)


if __name__ == "__main__":
    # comment out manual_seed and the CUDA initialization error is gone
    torch.manual_seed(23)
    net = nn.Linear(10, 4)
    net.share_memory()
    processes = []
    for pid in range(8):
        p = mp.Process(target=task, args=(pid, net))
        p.start()
        processes.append(p)  # was missing originally, so join() below did nothing
    for p in processes:
        p.join()
    print("Done.")
```

edit: this can be solved by setting `mp.set_start_method('spawn')`.
The problem is that `torch.manual_seed` also seeds the CUDA RNG, which initializes CUDA in the parent process. We need to avoid the eager CUDA initialization there until CUDA is actually used. There's a similar problem in the other seeding/state functions.
I've had the same problem with Python 2.7 and multi-GPU training when I had to Ctrl-C a running job. This somehow crashed the NVIDIA driver. Running `sudo fuser -v /dev/nvidia*` and killing whatever processes showed up resolved the problem.
Note: it seems somehow our test initialization code also accidentally initializes the CUDA driver, so we may need to write a regression test for this in a file that gets its own Python process.
@colesbury Did you want to fix …
Instead of initializing CUDA immediately and executing these calls, we wait until CUDA is actually initialized before executing them. To keep things debuggable, we also keep track of the original backtrace when these functions are called, so we can tell users where they actually called the seeding/state functions (as opposed to the first time the RNG was actually initialized). Fixes #2517. Signed-off-by: Edward Z. Yang <ezyang@fb.com>
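A rough sketch of that queuing pattern (not PyTorch's actual implementation; names like `_queued_calls` and `_lazy_init` are illustrative):

```python
import traceback

_initialized = False
_queued_calls = []  # (callable, captured call-site traceback) pairs

def _lazy_call(fn):
    """Run fn now if CUDA is up, otherwise queue it along with its call site."""
    if _initialized:
        fn()
    else:
        # Capture where the user called us, so an error raised later during
        # real initialization can point back to the original seeding call.
        _queued_calls.append((fn, traceback.format_stack()))

def _lazy_init():
    global _initialized
    _initialized = True  # real code would initialize the CUDA driver here
    for fn, stack in _queued_calls:
        try:
            fn()
        except Exception as e:
            raise RuntimeError("queued CUDA call failed; originally called at:\n"
                               + "".join(stack)) from e

def manual_seed(seed):
    # Seeding no longer forces CUDA initialization in the parent process.
    _lazy_call(lambda: print("seeding CUDA RNG with", seed))

manual_seed(23)   # queued; CUDA is untouched, so fork() remains safe
_lazy_init()      # the queued seed call actually runs here
```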
This is fixed in master.
I was still having this issue even while avoiding …
Hello, here is my setup: …

```python
for i, (data, labels) in enumerate(train_loader):
    pass
```

It raises the error at the moment when my dataset class tries to send the sample to its device. I have read all the above comments and other forums (1, 2, 3). My dataset class is the first class to use CUDA. Somehow, the dataloader may have re-initialized CUDA and messed it up for the dataset class. I am not sure whether the dataset class should get the device a second time in case CUDA has been re-initialized. Note: I do not explicitly use … Not really an expert in CUDA. Any suggestions? Thank you!

Update: still looking for a solution.

Update: currently I use only one worker in the data loader (to reduce both the time spent creating workers and the GPU memory usage). This seems practical. The worker does the preprocessing on the GPU. How do I hide this warning?

```
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
```

See here.
This is benign; don't worry about it. It sounds like you got PyTorch to use spawn for multiprocessing invocations, which solves the CUDA initialization problem. Regarding subsequent memory errors, PyTorch can't design your multiprocessing pipeline for you. If you don't have that much GPU memory, you'll have to design appropriately so as not to use too much in each worker. Some basic tips include forking only once (rather than repeatedly, because each new process needs to initialize CUDA, which takes time), as in the sketch below.
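As an illustration of that "fork only once" tip, one common pattern is a pool of long-lived workers that pull work from a queue, so each process pays the CUDA startup cost a single time (a sketch; the doubling "work" stands in for real GPU work):

```python
import torch.multiprocessing as mp

def worker(inq, outq):
    # Started once; loops until it receives the shutdown sentinel.
    while True:
        item = inq.get()
        if item is None:      # sentinel: shut down cleanly
            break
        outq.put(item * 2)    # stand-in for real GPU work

if __name__ == "__main__":
    mp.set_start_method('spawn')
    inq, outq = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(inq, outq)) for _ in range(4)]
    for w in workers:
        w.start()
    for job in range(16):
        inq.put(job)
    results = [outq.get() for _ in range(16)]  # order may vary across workers
    for w in workers:
        inq.put(None)
    for w in workers:
        w.join()
```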
Thank you for the pointers!!! Very helpful! For now, I decided to use only one worker in the data loader: …
I do not think any of the solutions provided were helpful or actually work!
Thank you @mimoralea. Your solution works like a charm.
I'm in a similar situation: trying to do preprocessing on the GPU via RAPIDS and then passing the already-on-GPU tensors to the dataloader, and running into this same issue. I've tried spawn (and forkserver) but haven't been able to get either working. Everything works fine with num_workers=0, but the throughput is less than half that of a dataloader that uses CPU memory and 4 workers. It would be nice if spawn or forkserver worked in this context, or if there were some other way of doing in-GPU-memory data loading. I'm surprised it's so much slower than the CPU multiprocessed version, but that's been my experience. My next step is to write my own dataloader routine that passes the batches directly, in the hope that's faster, but I'm guessing the PyTorch devs went down this route early on when comparing options for data loading and found that multi-CPU is faster than having cached batches on the GPU?
@EvenOldridge What you describe should work. If you can make a small script that repros the issue, please feel free to post it in a new issue. I did a quick search for dataloader spawn/forkserver issues but didn't see anything relevant.
I'll try to strip down the code and put something together. Thanks @ezyang.
Hi @EvenOldridge, were you able to solve it or reproduce it in stripped-down code? I'm having a similar issue when using a custom … It works when …
Alternative solution: use threading instead of multiprocessing in the inference function (which calls model.forward()); the data pre-/post-processing functions can still use the multiprocessing module. A sketch of that split follows.
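A minimal sketch of the inference-on-a-thread idea (the model and names here are illustrative): threads share the parent's already-initialized CUDA context, so calling the model from a thread avoids the re-initialization problem that a forked process would hit.

```python
import threading
import queue
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # illustrative model; move to .cuda() if a GPU is present

def infer_worker(q, results):
    # Runs in a thread, which shares the parent's CUDA context.
    while True:
        item = q.get()
        if item is None:  # sentinel: stop the worker
            break
        with torch.no_grad():
            results.append(model(item))

q, results = queue.Queue(), []
t = threading.Thread(target=infer_worker, args=(q, results))
t.start()
q.put(torch.rand(1, 10))  # enqueue one batch for inference
q.put(None)
t.join()
print(results[0].shape)
```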
Thanks for the alternate solution, @Santiago810. Though I'm not using multiprocessing or threading explicitly, only PyTorch's dataloader, and would expect it to handle the multiprocessing.
Can be solved by setting `mp.set_start_method('spawn')`.
In my case `mp.set_start_method('spawn')` works, but not with Python 3.7 and PyTorch 1.1.0. I finally had to downgrade Python to 3.6.9 and things work.
@hainow Can you file a new issue for this?
Maybe this is helpful for somebody: I found out that in my case the tensor was already on a CUDA device by accident. As soon as one of the workers tried to access the tensor, the dataloader failed with the initialization error. A guard against this mistake is sketched below.
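One cheap safeguard is to assert (or move) all tensors onto the CPU when the dataset is constructed, before any DataLoader workers are started (a sketch; `MyDataset` is a hypothetical name):

```python
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, tensors):
        # Fail fast (or call .cpu() on them) before workers ever fork:
        # a CUDA tensor touched inside a forked worker triggers the error.
        assert all(not t.is_cuda for t in tensors), \
            "dataset tensors must live on the CPU when num_workers > 0"
        self.tensors = tensors

    def __len__(self):
        return len(self.tensors)

    def __getitem__(self, i):
        return self.tensors[i]
```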
For me, switching from multiple GPUs to a single GPU made the process work. I don't really know the root cause of this issue, but if multi-GPU isn't necessary for you, you can try my way.
For anyone facing this issue with Celery, setting … Alternatively, you can do the same via the CLI, e.g. …
I am facing a related issue. Details here: https://discuss.pytorch.org/t/runtimeerror-cuda-error-initialization-error-when-calling-torch-distributed-init-process-group-using-torch-multiprocessing/136625. @soumith @Amir-Arsalan, I found that you had some earlier discussion threads around this problem. Do you have any resolutions/tips? Thanks in advance!
@vdraceil Thanks. Exactly what I've needed.
I was facing a similar issue with a torch dataset that does some preprocessing on the GPU using CUDA. If num_workers in the DataLoader was set to 0, there was no problem. However, setting it to != 0 left me with this error. The problem was caused by …
Summary by @ezyang: If you are using multiprocessing code and are in Python 3, you can work around this problem by adding

```python
mp.set_start_method('spawn')
```

to your script. Otherwise, to work around it, you need to make sure there are no CUDA calls prior to starting your subprocesses; `torch.cuda.is_available()` and `torch.manual_seed` both count as CUDA calls in PyTorch 0.2.0.

Original report: Run https://github.com/ikostrikov/pytorch-a3c using PyTorch 0.2.0, Python 2.7, and the error occurs on these lines: https://github.com/ikostrikov/pytorch-a3c/blob/842ec4da82df3d2dda02943d881c6b577f078ab9/train.py#L105-L111. Using PyTorch 0.12.0 works fine.
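Concretely, the ordering constraint in the summary looks like this under the default 'fork' start method (a sketch; on PyTorch 0.2.0 the commented-out lines would poison the children):

```python
import torch
import torch.multiprocessing as mp

def child(pid):
    x = torch.rand(2, 2)  # CPU-only work in the child
    print(pid, x.sum())

if __name__ == "__main__":
    # BAD on 0.2.0 with 'fork': either of these initializes CUDA in the
    # parent, and the forked children then fail with the init error.
    # torch.cuda.is_available()
    # torch.manual_seed(0)
    procs = [mp.Process(target=child, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    torch.manual_seed(0)  # fine: after the children have been started
```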