
Improve error messages in THPVariable_set_grad #100683

Closed
wants to merge 5 commits

Conversation

@kiersten-stokes
Contributor

kiersten-stokes commented May 5, 2023

Fixes #100174

I'm not sure if there's another direction that we had in mind for this issue, but if so I'm happy to make the changes 🙂

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @anijain2305 @soulitzer

@pytorch-bot

pytorch-bot bot commented May 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100683

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6bb414f:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kiersten-stokes
Contributor Author

Will fix the lint error first thing Monday!

@albanD
Collaborator

albanD commented May 8, 2023

I'll let @soulitzer review this one!

@kiersten-stokes
Contributor Author

Apologies for the review request spam - I mistakenly rebased on a different branch. I suppose that's why I should be using the PyTorch bot :)

@soulitzer
Contributor

soulitzer left a comment

Thanks, small nit on the wording of the message

If you are up for it, it would also be cool to see the error messages improved for when the device ids and sizes are different.

torch/csrc/autograd/python_variable.cpp (outdated review thread, resolved)
@ezyang removed their request for review May 9, 2023 15:01
@kiersten-stokes
Contributor Author

If you are up for it, it would also be cool to see the error messages improved for when the device ids and sizes are different.

@soulitzer Gladly! Will add a commit that addresses that. Would the easiest way to tackle this be using TORCH_CHECK asserts to replace the THPUtils_assertRet asserts? I believe THPUtils_setError (called from the latter) only accepts const char * and hence limits the message complexity.

At this point, I believe the below snippet is checking that (1) the layout of the gradient isn't sparse, and (2) the device type is the same (CUDA/CPU). Do we want to go ahead and split those out as well to avoid ambiguity?

bool gradIsSparse =
    (var.dtype() == grad.dtype() &&
     var.device().type() == grad.device().type() && grad.layout() == kSparse);
THPUtils_assertRet(
    -1,
    grad.options().type_equal(var.options()) || gradIsSparse,
    "assigned grad has data of a different type");

@soulitzer
Contributor

@kiersten-stokes Cool! Replacing with TORCH_CHECK sounds good to me.

Btw, now that I think about it, the TORCH_CHECK_TYPE we have currently might also be better off being a TORCH_CHECK, because the Python type is still a Tensor; it's just that the Tensor has a different dtype, which is just a property of the tensor. So it's a subtle difference, but I think that would be more of a ValueError than a TypeError.

Regarding the logic with sparse, I think it is saying that:

  • if the gradient is not sparse, we do one type of check which is more involved
  • if the gradient is sparse, we do a less involved check (because we think the more involved check would fail)

The more involved check is a superset of the less involved check, so we probably just want something like:

<do the same thing we do for dtype, but for device().type() >

if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "error msg");
}
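
For illustration, the device-type placeholder above might expand to roughly the following (error text made up here, not meant as final wording):

// sketch of the device-type counterpart of the dtype check
TORCH_CHECK(
    var.device().type() == grad.device().type(),
    "attempting to assign a gradient located on device type '",
    grad.device().type(),
    "' to a tensor located on device type '", var.device().type(), "'");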

@kiersten-stokes
Contributor Author

Btw, now that I think about it, the TORCH_CHECK_TYPE we have currently might also be better off being a TORCH_CHECK, because the Python type is still a Tensor; it's just that the Tensor has a different dtype, which is just a property of the tensor. So it's a subtle difference, but I think that would be more of a ValueError than a TypeError.

@soulitzer great point - updating that now!

Re: the below, I see what you're saying!

<do the same thing we do for dtype, but for device().type() >

if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "error msg");
}

I believe, now that we have a check for dtype and device type preceding this, that the only remaining check here would be to ensure no layout mismatch between the gradient and the tensor if the gradient layout is sparse? Assuming type_equal behaves as indicated in the issue comment here. I'll push a change assuming this is true for now, but let me know if I'm missing something and we'd like to go a different way with it!

@soulitzer
Contributor

soulitzer commented May 10, 2023

Thanks for the update, looks mostly good!

Re: the checks for layout - I think it's a great observation that checking layout is probably all we need here, given that type_equal is just a function of dtype, device type, and layout (in theory it may be a stricter check now since that function may not be injective).

I think it depends on the intentions of the person who originally wrote the check: did they simply use dispatch key computation as a convenient proxy (e.g. maybe all we care about are dtype, device, and layout)? Or does it matter that the computed dispatch keys are the same?

I don't have any strong opinion on this one. Keeping it the way it is with the type_equal check would be the safer option, but turning it into a layout check would be potentially clearer. Tagging Alban for more thoughts.

cc @albanD

@soulitzer
Contributor

Based on some of Alban's thoughts offline, I am in favor of the safe option - unless we want to add a ton of tests to check that we don't regress things (which we probably don't want to do given that this function is only called in the case someone assigns to .grad)

@kiersten-stokes
Contributor Author

Based on some of Alban's thoughts offline, I am in favor of the safe option - unless we want to add a ton of tests to check that we don't regress things (which we probably don't want to do given that this function is only called in the case someone assigns to .grad)

Completely fair! And I appreciate you sharing your reasoning with me as well - it helps me get a bit more of a feel for the project overall. I'll get a commit in a bit later today to return to type_equal with a more generic error message.
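
Roughly what I have in mind for the safe option - keep the guarded type_equal check and just make the message report both sets of options (sketch only; this assumes TensorOptions is printable via operator<<, and the exact wording is still TBD):

// sketch: keep type_equal, make the message more informative
if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "attempting to assign a gradient with options ", grad.options(),
      " to a tensor with options ", var.options());
}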

@mikaylagawarecki added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 11, 2023
@kiersten-stokes changed the title from "Add error check for dtype match in THPVariable_set_grad" to "Improve error messages in THPVariable_set_grad" on May 11, 2023
@soulitzer
Contributor

soulitzer left a comment

Thanks, looks great!

@soulitzer added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on May 11, 2023
@soulitzer
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: Mac MPS / macos-13-py3-arm64-mps / test (default, 1, 1)

Details for Dev Infra team: raised by workflow job

@soulitzer
Contributor

@pytorchbot merge -f "Unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/inductor, ciflow/mps (Run MPS tests (subset of trunk)), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem (e.g., perf, algorithm)), module: dynamo, module: inductor, open source, release notes: quantization (release notes category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BE] More informative error messages in THPVariable_set_grad
6 participants