
Gradient is incorrect for torch.nn.functional.nll_loss for CUDA #64163

@pmeier

Description


🐛 Bug

Gradient is incorrect for torch.nn.functional.nll_loss for CUDA.

To Reproduce

import torch
from torch.autograd import gradcheck
from torch.nn.functional import nll_loss, log_softmax

device = "cuda"
reduction = "mean"

torch.manual_seed(0)
# Single sample with two classes; float64 keeps gradcheck's finite differences accurate.
input = log_softmax(torch.rand((1, 2), device=device, dtype=torch.float64), dim=1).requires_grad_(True)
target = torch.randint(0, 2, (1,), device=device, dtype=torch.int64)

gradcheck(lambda input, target: nll_loss(input, target, reduction=reduction), (input, target))

This fails with:

torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[-1.0133],
        [ 0.0000]], device='cuda:0', dtype=torch.float64)
analytical:tensor([[-1.],
        [ 0.]], device='cuda:0', dtype=torch.float64)

The failure only happens on CUDA, and only if reduction="mean" (the default) or reduction="sum" is selected. reduction="none" uses a different code path than the other two and does not exhibit the behavior.
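A minimal sketch (not part of the original report) that runs the same check over all device/reduction combinations to confirm which ones are affected:

import torch
from torch.autograd import gradcheck
from torch.nn.functional import nll_loss, log_softmax

for device in ("cpu", "cuda"):
    for reduction in ("mean", "sum", "none"):
        torch.manual_seed(0)
        input = log_softmax(torch.rand((1, 2), device=device, dtype=torch.float64), dim=1).requires_grad_(True)
        target = torch.randint(0, 2, (1,), device=device, dtype=torch.int64)
        # raise_exception=False makes gradcheck return False instead of raising,
        # so the loop can report every combination.
        passed = gradcheck(
            lambda input, target: nll_loss(input, target, reduction=reduction),
            (input, target),
            raise_exception=False,
        )
        # Per the report, only (cuda, mean) and (cuda, sum) are expected to fail.
        print(f"{device:>4} {reduction:>4}: {'pass' if passed else 'FAIL'}")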

Expected behavior

torch.nn.functional.nll_loss should pass the gradient check. This is a regression: the check passes with torch==1.9.0.

Environment

This was detected in master CI while adding an OpInfo for nll_loss in #63854.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @mruberry


Labels

high priority
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: nn (Related to torch.nn)
module: regression (It used to work, and now it doesn't)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
