RuntimeError: reduce failed to synchronize: unspecified launch failure #4291

Open
naba89 opened this Issue Dec 21, 2017 · 4 comments


naba89 commented Dec 21, 2017

Hi,

I am facing this error after a few epochs of training.

loss_re = loss_fn(output_re, target_re_v)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 329, in forward
return F.mse_loss(input, target, size_average=self.size_average, reduce=self.reduce)
RuntimeError: reduce failed to synchronize: unspecified launch failure

The system configuration is as follows:
OS: Ubuntu 16.04
CUDA: 9.1.85
PyTorch: 0.3.0.post4
GPU: 2x 1080 Ti (training on a single GPU, without DataParallel, selected via CUDA_VISIBLE_DEVICES)

Is this a known issue, and are there any known workarounds?

Regards
Nabarun
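
One way to narrow down an "unspecified launch failure" (a debugging sketch, not something from the original report): CUDA kernel launches are asynchronous, so the Python line shown in the traceback is often not the op that actually failed. Forcing synchronous launches usually pins the error to the real culprit.

# Hypothetical debugging setup; the variable must be set before CUDA is
# initialized, e.g. at the very top of the training script, or when launching:
#   CUDA_LAUNCH_BLOCKING=1 python train.py
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the env var so kernel launches block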

odegeasslbc commented Jan 19, 2018

I got the same error in an identical case.

# BCE loss on the GPU with an all-zero ("fake") label vector
bce = nn.BCELoss().cuda(cuda_id)
fake_label = Variable(torch.FloatTensor(batch_size).fill_(0).cuda(cuda_id))
loss = bce(pred, fake_label)

juliohm commented Feb 20, 2018

I was having a similar error. Make sure your layer outputs values that make sense to BCELoss. If, for example, you output negative values and they get passed to the logarithm, training will fail.

In my case I was missing the final sigmoid activation to squash the outputs between 0 and 1.
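
A minimal sketch of that fix (the layer sizes and names here are hypothetical, not from this thread, and it uses the newer Variable-free API): either end the model with a sigmoid so its outputs land in [0, 1], or feed raw logits to BCEWithLogitsLoss, which applies the sigmoid internally and is numerically more stable.

import torch
import torch.nn as nn

# Option 1: a final sigmoid keeps outputs in [0, 1], as BCELoss requires
net = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())
bce = nn.BCELoss()

# Option 2: keep raw logits and let the loss apply the sigmoid itself
net_logits = nn.Linear(128, 1)
bce_logits = nn.BCEWithLogitsLoss()

x = torch.rand(16, 128)
target = torch.zeros(16, 1)                  # "fake" labels, as in the snippet above
loss = bce(net(x), target)
loss_alt = bce_logits(net_logits(x), target)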

Oracen commented Sep 27, 2018

@juliohm, he's using MSE loss.

I'm seeing similar errors. I have two near-identical matrix-factorization models; one trains fine, the other kills the kernel within a single loop. I'm running within the fast.ai library, but the error comes from PyTorch's functional API.

OS: Ubuntu 16.04
CUDA: V9.0.176
PyTorch: 0.4.1
GPU: GTX 1080 Ti (3 available, but only training on one)

Reproduction code is hard to provide because the failure is sporadic. I'd release the full dataset, but it's PID.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-acdfebe571c4> in <module>()
----> 1 learn.fit(1e-3, 1)
      2 learn.pretrain=False
      3 learn.fit(1e-4, 2, wds=0, cycle_len=1, cycle_mult=2)
      4 learn.fit(3e-5, 1, cycle_len=1)
      5 patient_embed = to_np(learn.model.p.weight)

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    300         self.sched = None
    301         layer_opt = self.get_layer_opt(lrs, wds)
--> 302         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    303 
    304     def warm_up(self, lr, wds=None):

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    247             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    248             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 249             swa_eval_freq=swa_eval_freq, **kwargs)
    250 
    251     def get_layer_groups(self): return self.models.get_layer_groups()

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, visualize, **kwargs)
    139             batch_num += 1
    140             for cb in callbacks: cb.on_batch_begin()
--> 141             loss = model_stepper.step(V(x),V(y), epoch)
    142             avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
    143             debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/model.py in step(self, xs, y, epoch)
     52         if self.fp16: self.m.zero_grad()
     53         else: self.opt.zero_grad()
---> 54         loss = raw_loss = self.crit(output, y)
     55         if self.loss_scale != 1: assert(self.fp16); loss = loss*self.loss_scale
     56         if self.reg_fn: loss = self.reg_fn(output, xtra, raw_loss)

<ipython-input-4-33d3aa22dcd6> in crit_fn(input, target)
      2   print(input.size())
      3   print(target.size())
----> 4   return F.mse_loss(input, target, reduction='sum')
      5 learn.crit = crit_fn

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/torch/nn/functional.py in mse_loss(input, target, size_average, reduce, reduction)
   1714     else:
   1715         reduction = _Reduction.get_enum(reduction)
-> 1716     return _pointwise_loss(lambda a, b: (a - b) ** 2, torch._C._nn.mse_loss, input, target, reduction)
   1717 
   1718 

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/torch/nn/functional.py in _pointwise_loss(lambd, lambd_optimized, input, target, reduction)
   1672         return torch.mean(d) if reduction == 'elementwise_mean' else torch.sum(d)
   1673     else:
-> 1674         return lambd_optimized(input, target, reduction)
   1675 
   1676 

RuntimeError: reduce failed to synchronize: device-side assert triggered

I checked with the wrapper function that the targets and preds are aligned, and I also checked that the error occurs under both 'elementwise_mean' and 'sum' reduction.

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value=''))) 0%| | 0/463933 [00:00<?, ?it/s]
torch.Size([128, 1])
torch.Size([128, 1])
0%| | 1/463933 [00:05<662:31:34, 5.14s/it, loss=1.05]torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
torch.Size([128, 1])
RuntimeError

When I replaced F.mse_loss() with the following:

def crit_fn(input, target):
  # compute the per-element squared errors, then reduce manually
  x = F.mse_loss(input, target, reduction='none')
  x = x.mean()
  return x

learn.crit = crit_fn

I got a few iterations in before the following error was raised:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-acdfebe571c4> in <module>()
----> 1 learn.fit(1e-3, 1)
      2 learn.pretrain=False
      3 learn.fit(1e-4, 2, wds=0, cycle_len=1, cycle_mult=2)
      4 learn.fit(3e-5, 1, cycle_len=1)
      5 patient_embed = to_np(learn.model.p.weight)

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    300         self.sched = None
    301         layer_opt = self.get_layer_opt(lrs, wds)
--> 302         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    303 
    304     def warm_up(self, lr, wds=None):

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    247             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    248             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 249             swa_eval_freq=swa_eval_freq, **kwargs)
    250 
    251     def get_layer_groups(self): return self.models.get_layer_groups()

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, visualize, **kwargs)
    139             batch_num += 1
    140             for cb in callbacks: cb.on_batch_begin()
--> 141             loss = model_stepper.step(V(x),V(y), epoch)
    142             avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
    143             debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/fastai/model.py in step(self, xs, y, epoch)
     52         if self.fp16: self.m.zero_grad()
     53         else: self.opt.zero_grad()
---> 54         loss = raw_loss = self.crit(output, y)
     55         if self.loss_scale != 1: assert(self.fp16); loss = loss*self.loss_scale
     56         if self.reg_fn: loss = self.reg_fn(output, xtra, raw_loss)

<ipython-input-11-b3d6cadfe7f1> in crit_fn(input, target)
      1 def crit_fn(input, target):
      2   print('raw')
----> 3   print(input)
      4   print(target)
      5   x = F.mse_loss(input, target, reduction='none')

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/torch/tensor.py in __repr__(self)
     55         # characters to replace unicode characters with.
     56         if sys.version_info > (3,):
---> 57             return torch._tensor_str._str(self)
     58         else:
     59             if hasattr(sys.stdout, 'encoding'):

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/torch/_tensor_str.py in _str(self)
    254             suffix += ', dtype=' + str(self.dtype)
    255 
--> 256         formatter = _Formatter(get_summarized_data(self) if summarize else self)
    257         tensor_str = _tensor_str(self, indent, formatter, summarize)
    258 

~/.conda/envs/alex-pytorch/lib/python3.6/site-packages/torch/_tensor_str.py in __init__(self, tensor)
     80 
     81         else:
---> 82             copy = torch.empty(tensor.size(), dtype=torch.float64).copy_(tensor).view(tensor.nelement())
     83             copy_list = copy.tolist()
     84             try:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:70

I don't believe it's a problem with the fast.ai library, and I can't seem to narrow down what might be giving rise to this. As mentioned, I can switch to another Jupyter notebook and run a similar model to completion, as well as run TensorFlow models. If I can run any experiments to help narrow this down, let me know.
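
One way to corner a sporadic failure like this (a sketch built on the assumption that a bad value or shape reaches the loss; `checked_mse` is an illustrative name, hooked in the same way as the wrapper above, not code from this thread): validate the operands before the CUDA kernel runs, because once the device-side assert fires the CUDA context is unusable, which is why even printing the tensor failed above.

from torch.nn import functional as F

def checked_mse(input, target):
  # (t != t) is a version-agnostic NaN test that also works on CUDA tensors
  for name, t in (('input', input), ('target', target)):
    if (t != t).any():
      raise ValueError('NaN detected in %s before mse_loss' % name)
  if input.size() != target.size():
    raise ValueError('size mismatch: %s vs %s' % (input.size(), target.size()))
  return F.mse_loss(input, target, reduction='elementwise_mean')

learn.crit = checked_mse  # plugs in exactly like crit_fn above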

micklexqg commented Jan 18, 2019

> I was having a similar error. Make sure your layer outputs values that make sense to BCELoss. If, for example, you output negative values and they get passed to the logarithm, training will fail.
>
> In my case I was missing the final sigmoid activation to squash the outputs between 0 and 1.

Why does it still work when I pass negative values to the loss (one is negative, the other is 1), yet the same script fails on Linux? So strange.
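
If an out-of-range target really is the trigger (an assumption, since the failing code isn't shown here), the CPU and CUDA BCELoss kernels may fail at different points, which could explain why one machine runs and the other dies with a device-side assert. A quick guard makes the violation explicit on either backend; `safe_bce` is just an illustrative name:

import torch
import torch.nn as nn

bce = nn.BCELoss()

def safe_bce(pred, target):
  # BCELoss expects both predictions and targets to lie in [0, 1]
  for name, t in (('pred', pred), ('target', target)):
    if (t < 0).any() or (t > 1).any():
      raise ValueError('%s has values outside [0, 1]' % name)
  return bce(pred, target)

pred = torch.sigmoid(torch.randn(8))  # valid predictions in (0, 1)
target = torch.empty(8).fill_(-1)     # deliberately invalid, mirroring the question
loss = safe_bce(pred, target)         # raises ValueError instead of a CUDA assert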
