
Foreach ops don't follow ATen-level debug asserts #93940

Closed
albanD opened this issue Feb 2, 2023 · 11 comments
Labels
actionable · high priority · module: mta (Issues related to multi-tensor apply kernels and foreach functions) · triage review · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Milestone
2.0.0

Comments

albanD (Collaborator) commented Feb 2, 2023

Since #91846 moved torch.nn.utils.clip_grad_norm_() to use foreach ops, it now throws this error message:
*** RuntimeError: t.storage().use_count() == 1 INTERNAL ASSERT FAILED at "caffe2/torch/csrc/autograd/autograd_not_implemented_fallback.cpp":189, please report a bug to PyTorch.

when PyTorch is built with debug asserts.
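
A minimal repro sketch (my assumption of the trigger, not taken verbatim from the report: it needs a PyTorch build with debug asserts enabled, e.g. DEBUG=1, and CUDA gradients so that clip_grad_norm_ takes the foreach path):

```python
import torch

# Assumes a debug build of PyTorch (DEBUG=1) and a CUDA device, so that
# clip_grad_norm_ dispatches to the foreach-based implementation from #91846.
p = torch.nn.Parameter(torch.randn(4, device="cuda"))
p.grad = torch.randn_like(p)

# On a debug build this is expected to trip the storage use_count assert above.
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
```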

Given the assert, the problem is that this foreach op should be returning brand-new Tensors, but it is actually returning a Tensor that shares storage with at least one other Tensor.

  • If this is done on purpose for a good reason, we should remove this assert.
  • If it is not expected, then it is likely a bug in the implementation, which returns a view where it should not.

cc @ezyang @gchanan @zou3519 @crcrpar @mcarilli @ngimel

albanD added the high priority, triaged, and module: mta labels on Feb 2, 2023
albanD (Collaborator, Author) commented Feb 2, 2023

Setting high priority, as this will prevent anyone with a debug build from using clip_grad_norm by default.
We need to fix this either by reverting the change to use foreach by default or by fixing the problem here.

albanD added this to the 2.0.0 milestone on Feb 2, 2023
crcrpar (Collaborator) commented Feb 2, 2023

The implementation is based on apex's multi_tensor_l2norm, which returns a single Tensor whose elements are the norms of the input Tensors. _foreach_norm returns a list of Tensors, and each of them shares storage with that 1D result Tensor.

The relevant part of the implementation:

```cpp
// ret_per_tensor is a 1D Tensor holding all of the per-tensor norms.
// Indexing into it returns a view, so every element of `result` shares
// storage with ret_per_tensor (and hence with every other element).
std::vector<Tensor> result;
result.reserve(ntensors);
for (const auto& i : c10::irange(ntensors)) {
  result.emplace_back(ret_per_tensor[i]);
}
```
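
As a quick sanity check from Python, a sketch (assuming CUDA tensors, since the shared-buffer fast path is the CUDA multi_tensor_apply implementation):

```python
import torch

# Assumes CUDA tensors so _foreach_norm takes the fast path described above.
tensors = [torch.randn(3, device="cuda") for _ in range(4)]
norms = torch._foreach_norm(tensors)

# If the outputs all view one packed result buffer, they share a single storage.
ptrs = {n.untyped_storage().data_ptr() for n in norms}
print(len(ptrs))  # expected: 1
```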

albanD (Collaborator, Author) commented Feb 2, 2023

OK, that's what I was expecting. We should remove the debug assert for this function then.
Do any other foreach ops use the same "trick" where all the outputs are views into the same buffer?

crcrpar (Collaborator) commented Feb 2, 2023

I'm not aware of any other foreach op using the "trick" :)

ngimel (Collaborator) commented Feb 2, 2023

Should we instead fix the schema to return Tensor(a)[]?

albanD (Collaborator, Author) commented Feb 2, 2023

Well, technically the outputs are not views, as they look at independent parts of the Tensor. And (unless you do shady stuff) you cannot change the other Tensors from one Tensor, so they don't have to be marked as views.

ngimel (Collaborator) commented Feb 2, 2023

Cool, @crcrpar can you please send a PR exempting _foreach_norm from that assert in autograd_not_implemented_fallback.cpp?

albanD (Collaborator, Author) commented Feb 2, 2023

Updating the fallback kernel to special-case this one sounds OK to me: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/autograd_not_implemented_fallback.cpp#L189

ezyang (Contributor) commented Feb 3, 2023

I get nervous, though, because functionalization uses storage to determine views, and these are definitely sharing storage.

albanD (Collaborator, Author) commented Feb 3, 2023

Well, these are views from the point of view of functionalization but not from the point of view of autograd? :)

ngimel (Collaborator) commented Feb 3, 2023

Here, the outputs are views into the intermediate tensor, so no one can modify the base; for the purposes of functionalization it should be fine (unless someone does naughty stuff and reaches into a different tensor via as_strided and offsets).
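
For illustration, a sketch of that kind of naughtiness (again assuming the CUDA fast path, where every _foreach_norm output is a 0-d view at a different offset into one packed 1D result buffer):

```python
import torch

# Assumes the CUDA fast path where the outputs of _foreach_norm are packed
# views, in order, into a single shared 1D buffer.
tensors = [torch.ones(3, device="cuda") for _ in range(4)]
norms = torch._foreach_norm(tensors)

# as_strided lets norms[0] reach past its own element into its siblings:
# writing through this view silently clobbers norms[1:].
window = norms[0].as_strided((len(norms),), (1,))
window.zero_()
print(norms[1])  # now 0, even though norms[1] was never touched directly
```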
