
Foreach ops don't follow ATen-level debug asserts #93940

Closed
albanD opened this issue Feb 2, 2023 · 11 comments
Labels
actionable · high priority · module: mta (Issues related to multi-tensor apply kernels and foreach functions) · triage review · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Milestone
2.0.0

Comments

albanD (Collaborator) commented Feb 2, 2023

Since #91846 moved torch.nn.utils.clip_grad_norm_() to use foreach ops, it now throws this error message:
*** RuntimeError: t.storage().use_count() == 1 INTERNAL ASSERT FAILED at "caffe2/torch/csrc/autograd/autograd_not_implemented_fallback.cpp":189, please report a bug to PyTorch.

when PyTorch is built with debug asserts.
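
A minimal repro sketch (my assumption of the trigger, not taken verbatim from the report: it needs a PyTorch build with debug asserts enabled, e.g. DEBUG=1, and CUDA gradients so that clip_grad_norm_ takes the foreach path):

```python
import torch

# Assumes a debug build of PyTorch (DEBUG=1) and a CUDA device, so that
# clip_grad_norm_ dispatches to the foreach-based implementation from #91846.
p = torch.nn.Parameter(torch.randn(4, device="cuda"))
p.grad = torch.randn_like(p)

# On a debug build this is expected to trip the storage use_count assert above.
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
```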

Given the assert, the problem is that this foreach op should be returning brand-new Tensors, but it is actually returning a Tensor that shares storage with at least one other Tensor.

  • If this is done on purpose for a good reason, we should remove this assert.
  • If it is not expected, then it is likely a bug in the implementation, which returns a view where it should not.

cc @ezyang @gchanan @zou3519 @crcrpar @mcarilli @ngimel

albanD added the high priority, triaged, and module: mta labels on Feb 2, 2023
albanD (Collaborator, Author) commented Feb 2, 2023

Setting high priority, as this will prevent anyone with a debug build from using clip_grad_norm by default.
We need to fix this either by reverting the change to use foreach by default or by fixing the problem here.

albanD added this to the 2.0.0 milestone on Feb 2, 2023
crcrpar (Collaborator) commented Feb 2, 2023

The implementation is based on apex's multi_tensor_l2norm, which returns a single Tensor whose elements are the norms of the input Tensors. _foreach_norm returns a list of Tensors, and each of them shares storage with that 1D result Tensor.

The relevant part of the implementation:

```cpp
// ret_per_tensor is a 1D Tensor holding all of the per-tensor norms.
// Indexing into it returns a view, so every element of `result` shares
// storage with ret_per_tensor (and hence with every other element).
std::vector<Tensor> result;
result.reserve(ntensors);
for (const auto& i : c10::irange(ntensors)) {
  result.emplace_back(ret_per_tensor[i]);
}
```
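
As a quick sanity check from Python, a sketch (assuming CUDA tensors, since the shared-buffer fast path is the CUDA multi_tensor_apply implementation):

```python
import torch

# Assumes CUDA tensors so _foreach_norm takes the fast path described above.
tensors = [torch.randn(3, device="cuda") for _ in range(4)]
norms = torch._foreach_norm(tensors)

# If the outputs all view one packed result buffer, they share a single storage.
ptrs = {n.untyped_storage().data_ptr() for n in norms}
print(len(ptrs))  # expected: 1
```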

albanD (Collaborator, Author) commented Feb 2, 2023

OK, that's what I was expecting. We should remove the debug assert for this function then.
Do any other foreach ops use the same "trick" where all the outputs are views into the same buffer?

crcrpar (Collaborator) commented Feb 2, 2023

I'm not aware of any other foreach op using the "trick" :)

ngimel (Collaborator) commented Feb 2, 2023

Should we instead fix the schema to return Tensor(a)[]?

albanD (Collaborator, Author) commented Feb 2, 2023

Well, technically the outputs are not views, as they look at independent parts of the Tensor. And (unless you do shady stuff) you cannot change the other Tensors from one Tensor, so they don't have to be marked as views.

ngimel (Collaborator) commented Feb 2, 2023

Cool, @crcrpar can you please send a PR exempting _foreach_norm from that assert in autograd_not_implemented_fallback.cpp?

albanD (Collaborator, Author) commented Feb 2, 2023

Updating the fallback kernel to special-case this one sounds OK to me: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/autograd_not_implemented_fallback.cpp#L189

ezyang (Contributor) commented Feb 3, 2023

I get nervous, though, because functionalization uses storage to determine views, and these are definitely sharing storage.

albanD (Collaborator, Author) commented Feb 3, 2023

Well, these are views from the point of view of functionalization but not from the point of view of autograd? :)

ngimel (Collaborator) commented Feb 3, 2023

Here, the outputs are views into the intermediate tensor, so no one can modify the base; for the purposes of functionalization it should be fine (unless someone does naughty stuff and reaches into a different tensor via as_strided and offsets).
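
For illustration, a sketch of that kind of naughtiness (again assuming the CUDA fast path, where every _foreach_norm output is a 0-d view at a different offset into one packed 1D result buffer):

```python
import torch

# Assumes the CUDA fast path where the outputs of _foreach_norm are packed
# views, in order, into a single shared 1D buffer.
tensors = [torch.ones(3, device="cuda") for _ in range(4)]
norms = torch._foreach_norm(tensors)

# as_strided lets norms[0] reach past its own element into its siblings:
# writing through this view silently clobbers norms[1:].
window = norms[0].as_strided((len(norms),), (1,))
window.zero_()
print(norms[1])  # now 0, even though norms[1] was never touched directly
```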
