
Improve error messages in THPVariable_set_grad #100683

Closed
wants to merge 5 commits

Conversation

@kiersten-stokes
Contributor

kiersten-stokes commented May 5, 2023

Fixes #100174

I'm not sure if there's another direction that we had in mind for this issue, but if so I'm happy to make the changes 🙂

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @anijain2305 @soulitzer

@pytorch-bot

pytorch-bot bot commented May 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100683

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6bb414f:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kiersten-stokes
Contributor Author

Will fix the lint error first thing Monday!

@albanD
Collaborator

albanD commented May 8, 2023

I'll let @soulitzer review this one!

@kiersten-stokes
Contributor Author

Apologies for the review request spam - I mistakenly rebased on a different branch. I suppose that's why I should be using the PyTorch bot :)

@soulitzer
Contributor

soulitzer left a comment

Thanks, small nit on the wording of the message

If you are up for it, it would also be cool to see the error messages improved for when the device ids and sizes are different.

torch/csrc/autograd/python_variable.cpp (outdated review thread, resolved)
@ezyang removed their request for review May 9, 2023 15:01
@kiersten-stokes
Contributor Author

If you are up for it, it would also be cool to see the error messages improved for when the device ids and sizes are different.

@soulitzer Gladly! Will add a commit that addresses that. Would the easiest way to tackle this be using TORCH_CHECK asserts to replace the THPUtils_assertRet asserts? I believe THPUtils_setError (called from the latter) only accepts const char * and hence limits the message complexity.

At this point, I believe the below snippet is checking that (1) the layout of the gradient isn't sparse, and (2) the device type is the same (CUDA/CPU). Do we want to go ahead and split those out as well to avoid ambiguity?

bool gradIsSparse =
    (var.dtype() == grad.dtype() &&
     var.device().type() == grad.device().type() && grad.layout() == kSparse);
THPUtils_assertRet(
    -1,
    grad.options().type_equal(var.options()) || gradIsSparse,
    "assigned grad has data of a different type");

@soulitzer
Contributor

@kiersten-stokes Cool! Replacing with TORCH_CHECK sounds good to me.

Btw, now that I think about it, the TORCH_CHECK_TYPE we have currently might also be better off being a TORCH_CHECK, because the Python type is still a Tensor; it's just that the Tensor has a different dtype, which is just a property of the tensor. So it's a subtle difference, but I think that would be more of a ValueError than a TypeError.

Regarding the logic with sparse, I think it is saying that:

  • if the gradient is not sparse, we do one type of check which is more involved
  • if the gradient is sparse, we do a less involved check (because we think the more involved check would fail)

The more involved check is a superset of the less involved check, so we probably just want something like:

<do the same thing we do for dtype, but for device().type() >

if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "error msg");
}
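
For illustration, the device-type placeholder above might expand to roughly the following (error text made up here, not meant as final wording):

// sketch of the device-type counterpart of the dtype check
TORCH_CHECK(
    var.device().type() == grad.device().type(),
    "attempting to assign a gradient located on device type '",
    grad.device().type(),
    "' to a tensor located on device type '", var.device().type(), "'");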

@kiersten-stokes
Contributor Author

Btw, now that I think about it, the TORCH_CHECK_TYPE we have currently might also be better off being a TORCH_CHECK, because the Python type is still a Tensor; it's just that the Tensor has a different dtype, which is just a property of the tensor. So it's a subtle difference, but I think that would be more of a ValueError than a TypeError.

@soulitzer great point - updating that now!

Re: the below, I see what you're saying!

<do the same thing we do for dtype, but for device().type() >

if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "error msg");
}

I believe, now that we have a check for dtype and device type preceding this, that the only remaining check here would be to ensure no layout mismatch between the gradient and the tensor if the gradient layout is sparse? Assuming type_equal behaves as indicated in the issue comment here. I'll push a change assuming this is true for now, but let me know if I'm missing something and we'd like to go a different way with it!

@soulitzer
Contributor

soulitzer commented May 10, 2023

Thanks for the update, looks mostly good!

Re: the checks for layout - I think it's a great observation that checking layout is probably all we need here, given that type_equal is just a function of dtype, device type, and layout (in theory it may be a stricter check now since that function may not be injective).

I think it depends on the intentions of the person who originally wrote the check: did they simply use dispatch key computation as a convenient proxy (e.g. maybe all we care about are dtype, device, and layout)? Or does it matter that the computed dispatch keys are the same?

I don't have any strong opinion on this one. Keeping it the way it is with the type_equal check would be the safer option, but turning it into a layout check would be potentially clearer. Tagging Alban for more thoughts.

cc @albanD

@soulitzer
Contributor

Based on some of Alban's thoughts offline, I am in favor of the safe option - unless we want to add a ton of tests to check that we don't regress things (which we probably don't want to do given that this function is only called in the case someone assigns to .grad)

@kiersten-stokes
Contributor Author

Based on some of Alban's thoughts offline, I am in favor of the safe option - unless we want to add a ton of tests to check that we don't regress things (which we probably don't want to do given that this function is only called in the case someone assigns to .grad)

Completely fair! And I appreciate you sharing your reasoning with me as well - it helps me get a bit more of a feel for the project overall. I'll get a commit in a bit later today to return to type_equal with a more generic error message.
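
Roughly what I have in mind for the safe option - keep the guarded type_equal check and just make the message report both sets of options (sketch only; this assumes TensorOptions is printable via operator<<, and the exact wording is still TBD):

// sketch: keep type_equal, make the message more informative
if (grad.layout() != kSparse) {
  TORCH_CHECK(
      grad.options().type_equal(var.options()),
      "attempting to assign a gradient with options ", grad.options(),
      " to a tensor with options ", var.options());
}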

@mikaylagawarecki added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 11, 2023
@kiersten-stokes changed the title from "Add error check for dtype match in THPVariable_set_grad" to "Improve error messages in THPVariable_set_grad" on May 11, 2023
@soulitzer
Contributor

soulitzer left a comment

Thanks, looks great!

@soulitzer added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on May 11, 2023
@soulitzer
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: Mac MPS / macos-13-py3-arm64-mps / test (default, 1, 1)

Details for Dev Infra team: raised by workflow job

@soulitzer
Contributor

@pytorchbot merge -f "Unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/inductor, ciflow/mps (Run MPS tests (subset of trunk)), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem (e.g., perf, algorithm)), module: dynamo, module: inductor, open source, release notes: quantization (release notes category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BE] More informative error messages in THPVariable_set_grad
6 participants