
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input #32564

Closed
imransai opened this issue Jan 23, 2020 · 18 comments

@imransai

Most of the issues on this subject seem to be closed, so I am opening a new one.

Issue description

I am facing this issue when calling loss.backward() on my loss.
Some of the loss classes I am using are shown below.

```python
import torch
import torch.nn as nn

class CELoss_auxilary(nn.Module):

    def __init__(self, auxloss=True):
        super(CELoss_auxilary, self).__init__()
        self.logsoftmax = nn.LogSoftmax(dim=1)
        self.crossentropyloss = nn.KLDivLoss(reduction='batchmean')

    def forward(self, pred_auxdc, gt_target, nspatialscales=16, device=None, params=None):
        if device is None:
            device = torch.device("cpu")
        nChannels = params['dce_nChannels'] + 2
        # depth2dc is a project-specific helper (definition not shown in this issue).
        gt_resizeddc_target = depth2dc(gt_target, method='gaussian3hot', device=device)
        gt_resizeddc_target = gt_resizeddc_target.detach().permute(0, 2, 3, 1).view(-1, nChannels)
        valid_auxmask = (gt_target > 0).detach()
        valid_auxmask = valid_auxmask.view(-1)
        """
        # Earlier channels-last variant of the same selection:
        gt_resizeddc_target = gt_resizeddc_target.view(1, nChannels, -1)
        gt_resizeddc_targetsel = gt_resizeddc_target[..., valid_auxmask]
        pred_auxdc = pred_auxdc.view(1, nChannels, -1)
        pred_auxdcsel = pred_auxdc[..., valid_auxmask]
        """
        # Keep only pixels with valid (positive) ground truth.
        gt_resizeddc_targetsel = gt_resizeddc_target[valid_auxmask, ...]
        # Entropy of the target distribution (the constant part of the KL term).
        biased_aux = torch.sum(gt_resizeddc_targetsel * torch.log(gt_resizeddc_targetsel + 1e-7), 1)

        pred_auxdc = pred_auxdc.permute(0, 2, 3, 1).view(-1, nChannels)
        # valid_auxmask = (torch.sum(gt_resizeddc_target, 1) > 0.0).detach()
        pred_auxdcsel = pred_auxdc[valid_auxmask, ...]
        self.lossaux = self.crossentropyloss(self.logsoftmax(pred_auxdcsel), gt_resizeddc_targetsel)
        # self.lossaux = torch.mean(-torch.sum(gt_resizeddc_targetsel * self.logsoftmax(pred_auxdcsel), 1) + biased_aux)

        return self.lossaux
```

```python
class MaskedMSELoss(nn.Module):
    def __init__(self):
        super(MaskedMSELoss, self).__init__()

    def forward(self, pred, target):
        assert pred.dim() == target.dim(), "inconsistent dimensions"
        # Only pixels with a positive target contribute to the loss.
        valid_mask = (target > 0).detach()
        diff = target - pred
        diff = diff[valid_mask]
        self.loss = (diff ** 2).mean()
        return self.loss
```

I also checked with other loss functions.
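Worth checking, given the wording of the error: `permute()` returns a non-contiguous view, and both `.view()` and some cuDNN kernels reject non-contiguous tensors. A minimal standalone sketch of the check (not taken from the code above):

```python
import torch

x = torch.randn(2, 8, 4, 4)   # e.g. an NCHW activation
y = x.permute(0, 2, 3, 1)     # NHWC view; no copy is made, only strides change
print(y.is_contiguous())      # False

# .contiguous() materializes a contiguous copy; .reshape() also accepts
# non-contiguous inputs where .view() would raise a RuntimeError.
flat = y.contiguous().view(-1, 8)
```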

System Info

PyTorch 1.4, CUDA toolkit 10.1, Ubuntu 16.04.

I am facing this problem only during backward computation in training. My evaluation code runs fine.

Interestingly, my code runs fine with these combinations:
pytorch 1.3 + cuda-toolkit 10.0
pytorch 1.1 + cuda-toolkit 9.0

But I need the aforementioned combination (PyTorch 1.4 + CUDA toolkit 10.1) to access some sparse convolution tools that are only available on CUDA 10.1 or higher. Can anyone help in this regard?

@peterjc123
Collaborator

Maybe related to #32395.

@imransai
Author

For now, pytorch 1.4 nightly seems to have solved this problem! Thanks!

@BoltzmannBrain

Previous suggestions in this thread did not resolve my problem; I'm currently on pytorch-nightly (1.6.0.dev20200525) with CUDA 10.1.243. Oddly, reducing my batch size from 64 to 36 worked 🤷‍♂️

@Hadrien-Cornier

It happens to me as well with:
pytorch=1.5.0
py3.8
cuda10.1.243
cudnn7.6.3_0

It happens when I reach a batch-normalization layer with a huge batch size, but when I decrease the batch size the error is gone. It is probably a memory issue triggered when a batch is too big.
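If memory pressure is the suspicion, a quick way to test it (a rough sketch; where to call it is up to you) is to log allocator statistics just before the layer that fails:

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # If these numbers are close to the card's total memory, the cuDNN
    # failure is likely a masked out-of-memory condition.
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")

# e.g. call log_cuda_memory("pre-batchnorm") right before the suspect layer
```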

@DuckJ

DuckJ commented Jun 22, 2020

It happens to me as well with:
pytorch=1.2.0
py3
cuda 10.0

It happens when the network forwards through a conv layer. I use 6 V100 GPUs with Horovod for distributed training, and the batch size is not huge (128).
Strangely, this error did not happen before I expanded the dataset. After expanding it, the batch size stayed the same, but the error appeared.

@TangDL

TangDL commented Jul 22, 2020

Maybe the input size is too large.

@ShoufaChen

Just reduce the batch size and try again. It works for me.

@yarkable

Thanks, reducing the batch_size works. But it still seems like a bug?

@leopd

leopd commented Oct 7, 2020

I'm getting this with pytorch 1.6.0. Definitely seems like a bug. Reducing batch size is not a good workaround. Getting the right batch size is critical to certain algorithms & loss functions, such as when doing negative sampling for contrastive learning.

Perhaps related: sometimes my code instead hits the error `THCudaTensor sizes too large for THCDeviceTensor conversion`, which is logged in #24401.
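When the error only disappears at smaller batch sizes, gradient accumulation can sometimes recover the effective batch size. A sketch under the assumption that the loss is a per-sample mean (`model`, `loader`, `loss_fn`, and `optimizer` are placeholders); note it does not help losses that genuinely couple samples within one forward pass, such as in-batch negative sampling:

```python
accum_steps = 4  # e.g. 4 micro-batches of 16 approximate one step at batch size 64

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    # Scale so the accumulated gradients match a single large-batch step.
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```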

@CoderHHX

It also happens to me with:
pytorch=1.4.0
py3
cuda 10.1

Reducing the batch size did not work. Finally, I changed to PyTorch 1.7 with CUDA 11.0, and it works. lol~~~

@absorbguo

Reducing the batch size works for me.

@nadir121

nadir121 commented Jul 9, 2021

Why is this closed? Reducing the batch size does not solve it for me. It is still a bug.

@tdchua

tdchua commented Nov 19, 2021

I was experiencing this problem as well on pytorch=1.4.0... upgrading to pytorch=1.5 solved the issue 😄

@rongduo

rongduo commented Dec 1, 2021

> Just reduce the batch size and try again. It works for me.

Thanks for your suggestion. It resolves my issue.

@ooodragon94

> Why is this closed? Reducing the batch size does not solve it for me. It is still a bug.

Totally agreed. We ALWAYS need bigger batches.

@Sanqiang

It looks like OOM can cause this problem too.
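One way to tell a plain allocator OOM apart from this cuDNN error (a rough sketch; `loss` is a placeholder for the failing computation) is to inspect the RuntimeError text at the failing step:

```python
import torch

try:
    loss.backward()  # placeholder for the failing call
except RuntimeError as e:
    if "out of memory" in str(e):
        # Allocator OOM: reduce the batch size or free cached blocks.
        torch.cuda.empty_cache()
    elif "CUDNN_STATUS_NOT_SUPPORTED" in str(e):
        # cuDNN limit: try contiguous inputs or disable cuDNN.
        pass
    raise
```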

@meisa233

Add this line after `import torch`:

```python
torch.backends.cudnn.enabled = False
```
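If disabling cuDNN globally is too slow, a narrower variant (a sketch; `model` and `x` are placeholders) is the `torch.backends.cudnn.flags` context manager, which turns cuDNN off only for ops run inside the block:

```python
import torch

with torch.backends.cudnn.flags(enabled=False):
    # Ops executed in this block fall back to PyTorch's native CUDA kernels.
    out = model(x)
    out.sum().backward()
```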

@Sil3ntKn1ght

Any solution for Python 3.10+? It is so slow without it.

