Improve error messages in THPVariable_set_grad
#100683
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100683
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 6bb414f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Will fix the lint error first thing Monday!
I'll let @soulitzer review this one!
aae33af to c2cc683 (Compare)
Apologies for the review request spam - I mistakenly rebased on a different branch. I suppose that's why I should be using the PyTorch bot :)
Thanks, small nit on the wording of the message
If you are up for it, it would also be cool to see the error messages improved for when the device ids and sizes are different.
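As a rough sketch of what such split-out checks could look like, with each property reported separately (the helper name and the message wording here are hypothetical, not taken from the PR):

```cpp
#include <torch/torch.h>

// Hypothetical helper: check dtype, device, and size separately so a failure
// reports exactly which property differed between the tensor and the gradient
// being assigned to its .grad field. Names and messages are illustrative only.
void check_grad_matches(const at::Tensor& var, const at::Tensor& grad) {
  TORCH_CHECK(
      grad.dtype() == var.dtype(),
      "attempting to assign a gradient with dtype '", grad.dtype(),
      "' to a tensor with dtype '", var.dtype(), "'");
  TORCH_CHECK(
      grad.device() == var.device(),
      "attempting to assign a gradient located on device ", grad.device(),
      " to a tensor located on device ", var.device());
  TORCH_CHECK(
      grad.sizes().equals(var.sizes()),
      "attempting to assign a gradient of size ", grad.sizes(),
      " to a tensor of size ", var.sizes());
}
```

Whether the device check should compare the full device (including index) or only the device type is one of the details touched on below.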
@soulitzer Gladly! Will add a commit that addresses that. Would the easiest way to tackle this be using At this point, I believe the below snippet is checking that 1. the layout of the gradient isn't sparse, and 2. the device type is the same (cuda/CPU). Do we want to go ahead and split those out as well to avoid ambiguity? pytorch/torch/csrc/autograd/python_variable.cpp Lines 917 to 923 in 31f311a
|
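The snippet at those lines isn't reproduced in this excerpt. As a best-guess sketch of the kind of combined check being described (assumed rather than copied verbatim from 31f311a, and wrapped in a standalone TORCH_CHECK-based function only to keep the example self-contained; the real code reports errors through the Python-binding macros):

```cpp
#include <torch/torch.h>

// Assumed shape of the combined check under discussion: accept the assigned
// grad if its TensorOptions match the variable's exactly, or if it differs
// only by having a sparse layout while dtype and device type still agree.
void check_grad_type(const at::Tensor& var, const at::Tensor& grad) {
  const bool grad_is_sparse =
      var.dtype() == grad.dtype() &&
      var.device().type() == grad.device().type() &&
      grad.layout() == c10::kSparse;

  TORCH_CHECK(
      grad.options().type_equal(var.options()) || grad_is_sparse,
      "assigned grad has data of a different type");
}
```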
@kiersten-stokes Cool! Replacing with
Btw, now that I think about it, the
Regarding the logic with sparse, I think it is saying that:
The more involved check is a superset of the less involved check, so we probably just want something like:
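The code block that originally accompanied "something like:" was not captured in this excerpt. Purely as an illustration of one possible simplification along those lines (not necessarily the suggestion that was actually made; var and grad as in the sketch above):

```cpp
// Illustration only: with dtype and device type already checked separately,
// the sparse special case collapses into a single layout condition.
TORCH_CHECK(
    grad.layout() == var.layout() || grad.layout() == c10::kSparse,
    "assigned grad has a different layout");
```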
@soulitzer great point - updating that now! Re: the below, I see what you're saying!
I believe, now that we have a check for dtype and device type preceding this, that the only remaining check here would be to ensure no layout mismatch between the gradient and the tensor if the gradient layout is sparse? Assuming
Thanks for the update, looks mostly good! Re: the checks for layout - I think it's a great observation that checking layout is probably all we need here given what
I think it depends on what the intentions of the person who originally wrote the check were: did they simply use dispatch key computation as a convenient proxy? E.g. maybe all we care about are dtype, device, and layout. Or does it matter that the computed dispatch keys are the same? I don't have any strong opinion on this one. Keeping it the way it is with type_check would be the safer option, but turning that into a layout check would be potentially clearer. Tagging Alban for more thoughts. cc @albanD
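Concretely, the trade-off is between keeping a check shaped like the type_equal sketch above, which ties the comparison to the options' computed dispatch keys, and replacing it with a plain layout comparison. Roughly (a simplified sketch, not code quoted from the PR; var and grad as before):

```cpp
// Safer option: keep comparing full TensorOptions, i.e. stay tied to
// dispatch-key computation exactly as the original check was.
const bool safer_ok = grad.options().type_equal(var.options());

// Potentially clearer option: compare only the layout, relying on the dtype
// and device-type checks that now precede this point.
const bool clearer_ok = grad.layout() == var.layout();
```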
Based on some of Alban's thoughts offline, I am in favor of the safe option - unless we want to add a ton of tests to check that we don't regress things (which we probably don't want to do, given that this function is only called when someone assigns to .grad).
Completely fair! And I appreciate you sharing your reasoning with me as well - it helps me get a bit more of a feel for the project overall. I'll get a commit in a bit later today to return to type_check.
Thanks, looks great!
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: Mac MPS / macos-13-py3-arm64-mps / test (default, 1, 1). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "Unrelated failures"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #100174
I'm not sure if there's another direction that we had in mind for this issue, but if so I'm happy to make the changes 🙂
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @anijain2305 @soulitzer