
[BE] More informative error messages in THPVariable_set_grad #100174

Closed

awgu opened this issue Apr 27, 2023 · 4 comments
Labels
actionable better-engineering Relatively self-contained tasks for better engineering contributors module: autograd Related to torch.autograd, and the autograd engine in general module: error checking Bugs related to incorrect/lacking error checking triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

awgu commented Apr 27, 2023

int THPVariable_set_grad(THPVariable* self, PyObject* py_grad, void* unused) {

When the gradient metadata does not match the corresponding tensor's metadata, THPVariable_set_grad() raises an error, e.g.:

RuntimeError: assigned grad has data of a different type

Including what leads to the mismatch would be very helpful. For example, instead of the message above, we could show something like:

RuntimeError: assigned grad has data of type torch.float16 that differs from the required torch.float32

A similar idea applies to mismatched devices and sizes.
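
For reference, a minimal repro sketch of the current behavior (assuming the checks quoted in the comments below; the grad differs only in dtype, yet the message does not say which dtypes are involved):

>>> import torch
>>> t = torch.randn(3, dtype=torch.float32)
>>> t.grad = torch.randn(3, dtype=torch.float16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: assigned grad has data of a different type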

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @malfet

awgu added the better-engineering label on Apr 27, 2023
awgu commented Apr 27, 2023

Following up on this, consider the following two asserts:

THPUtils_assertRet(
    -1,
    grad.options().type_equal(var.options()) || gradIsSparse,
    "assigned grad has data of a different type");

if (var.is_cuda()) {
  THPUtils_assertRet(
      -1,
      grad.get_device() == var.get_device(),
      "assigned grad has data located on a different device");
}

Can the second one (which is checked after the first) ever be triggered?

If I look at options(), it includes the device:

TensorOptions options() const {
  return TensorOptions().dtype(dtype())
      .device(device())
      .layout(layout());
}

>>> t = torch.randn((3, 3), device="cuda")
>>> g = torch.randn((3, 3), device="cpu")
>>> t.grad = g
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: assigned grad has data of a different type

zou3519 added the module: autograd and triaged labels on Apr 28, 2023
albanD commented Apr 28, 2023

I would expect type_equal() to only check the dtype? But maybe that's not true.

We would definitely be happy with a PR improving these errors!

albanD added the module: error checking and actionable labels on Apr 28, 2023
awgu commented Apr 28, 2023

bool type_equal(const TensorOptions& other) const {
  return computeDispatchKey() == other.computeDispatchKey() &&
      typeMetaToScalarType(dtype_) == typeMetaToScalarType(other.dtype());
}

Maybe the first clause, the dispatch key equality, is what causes the non-intuitive behavior, since the dispatch key seems to be a function of the device?

DispatchKey computeDispatchKey() const {
  return c10::computeDispatchKey(
      optTypeMetaToScalarType(dtype_opt()), layout_opt(), device_opt());
}

bdhirsh commented Apr 28, 2023

if (var.is_cuda()) {
  THPUtils_assertRet(
      -1,
      grad.get_device() == var.get_device(),
      "assigned grad has data located on a different device");
}

It looks like get_device() checks that the device index matches, e.g. cuda:0 vs cuda:1 (code).

For the first check involving type_equal(): dispatch keys take the device type into account, but not the device index. So that first check will only error if, e.g., the two tensors are on cpu vs cuda, but not if the two tensors have different device indices, like cuda:0 vs cuda:1.
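
To make that concrete, a minimal sketch (hypothetical, assuming a machine with at least two CUDA devices and the checks quoted above): only the device index differs, so the type_equal() check passes and the get_device() check fires:

>>> import torch
>>> t = torch.randn((3, 3), device="cuda:0")
>>> g = torch.randn((3, 3), device="cuda:1")
>>> t.grad = g
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: assigned grad has data located on a different device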
