[fix] vmap: fix segfault on data access #97237
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97237
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1a23bb4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/functorch/test_vmap.py (Outdated)

def test_data_attribute(self):
    def foo(x):
        y = x.data
        y.sum()
Without the fix, this line fails with
RuntimeError: batched == nullptr INTERNAL ASSERT FAILED at "aten/src/ATen/functorch/Interpreter.cpp":98, please report a bug to PyTorch.
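A minimal repro sketch based on the test above: before this patch, the call below hit the internal assert; with the final patch it raises a clean RuntimeError instead.

import torch

def foo(x):
    y = x.data  # accessing `.data` under vmap used to hit the internal assert
    return y.sum()

torch.func.vmap(foo)(torch.randn(3, 3))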
c10::intrusive_ptr<TensorImpl> BatchedTensorImpl::shallow_copy_and_detach(
    const c10::VariableVersion& version_counter,
    bool allow_tensor_metadata_change) const {
  DispatchKeySet key_set = getKeysToPropagateToWrapper(value());
  auto impl = c10::make_intrusive<BatchedTensorImpl>(key_set, value(), bdim(), level());
  impl->set_version_counter(version_counter);
  return impl;
}

c10::intrusive_ptr<TensorImpl> BatchedTensorImpl::shallow_copy_and_detach(
    c10::VariableVersion&& version_counter,
    bool allow_tensor_metadata_change) const {
  DispatchKeySet key_set = getKeysToPropagateToWrapper(value());
  auto impl = c10::make_intrusive<BatchedTensorImpl>(key_set, value(), bdim(), level());
  impl->set_version_counter(version_counter);
  return impl;
}

void BatchedTensorImpl::shallow_copy_from(const c10::intrusive_ptr<TensorImpl>& impl) {
  TORCH_CHECK(false, "mutating directly with `.data` inside functorch transform is not allowed.");
}
Do you have an explanation of why the segfault happened without these? (and do we know the full implications of adding shallow_copy_and_detach to BatchedTensorImpl?)
In shallow_copy_and_detach of the base TensorImpl, we propagate the key_set, which includes the FuncTorchBatched key. So we create a Tensor that pretends to be a batched tensor but in reality isn't, which leads to problems downstream when we try to access a field or call a method exclusive to BatchedTensor.
pytorch/c10/core/TensorImpl.cpp
Lines 782 to 796 in 517a432
auto impl = c10::make_intrusive<TensorImpl>(
    // No need to populate Storage; copy_tensor_metadata will do it for us.
    key_set_,
    data_type_,
    device_opt_);
copy_tensor_metadata(
    /*src_impl=*/this,
    /*dest_impl=*/impl.get(),
    /*version_counter=*/std::forward<VariableVersion>(version_counter),
    /*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
impl->refresh_contiguous();
return impl;
}
For the example in the issue, the code was failing when trying to call batched->value():
pytorch/torch/csrc/functorch/init.cpp
Lines 312 to 317 in 517a432
static Tensor get_unwrapped(const Tensor& tensor) {
  auto* batched = maybeGetBatchedImpl(tensor);
  if (batched) {
    return batched->value();
  }
  auto* wrapped = maybeGetTensorWrapper(tensor);
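To illustrate the mechanism, a toy Python sketch (not PyTorch internals; all names here are made up): a detach that copies the dispatch-key tag but not the subclass payload produces an object that claims to be batched yet cannot honor batched-only accesses.

class PlainImpl:
    def __init__(self, keys):
        self.keys = keys  # stand-in for TensorImpl's key_set_

class BatchedImpl(PlainImpl):
    def __init__(self, value, bdim):
        super().__init__({"FuncTorchBatched"})
        self.value = value  # the wrapped per-example tensor
        self.bdim = bdim    # the batch dimension

def shallow_copy_and_detach(impl):
    # Base-class behavior: propagates the keys but constructs a plain impl,
    # dropping the BatchedImpl payload.
    return PlainImpl(impl.keys)

copy = shallow_copy_and_detach(BatchedImpl(value=[1.0, 2.0], bdim=0))
assert "FuncTorchBatched" in copy.keys  # still claims to be batched...
copy.value  # ...but raises AttributeError: the payload was never copied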
As for the implications, I don't fully understand the complete semantics of these functions (and would like to discuss offline).
As for why I thought it was OK to support shallow_copy_and_detach (the getter for .data): it worked with GradTrackingTensor, so for consistency I thought it made sense to allow it for BatchedTensor as well.
Though on second thought, I think we should disallow it for both of them.
import torch

def foo(x):
    y = x.data
    print(y)
    y.sum()
    return x.sum()

# torch.func.vmap(foo)(torch.randn(3, 3))
torch.func.grad(foo)(torch.randn(3, 3))

Output:
GradTrackingTensor(lvl=1, value=
tensor([[-1.6977, 0.6374, 0.0781],
[-0.4140, 1.5172, 0.0473],
[ 0.8435, -0.2261, 0.0345]])
)
accessing .data under vmap
Hmm, I think we should disallow accessing .data under vmap. Under your PR, I get the following:

import torch

def f(x):
    return x.data

x = torch.randn([3], requires_grad=True)
y = torch.vmap(f)(x)

This returns a Tensor y that has requires_grad=True, which is not correct -- if we ran f in a for loop over the slices of x (the semantics vmap models), it would return Tensors that do not require grad.
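For comparison, a sketch of that for-loop model:

import torch

def f(x):
    return x.data

x = torch.randn([3], requires_grad=True)
ys = [f(x[i]) for i in range(x.shape[0])]  # the per-example loop vmap models
print([y.requires_grad for y in ys])       # [False, False, False]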
setting .data under vmap
We should probably also disallow setting .data under vmap? Under your PR I get the following:

import torch

def f(x, y):
    x.data = y
    return x

x = torch.randn([3])
y = torch.randn([3], requires_grad=True)
res = torch.vmap(f)(x, y)
print(res)

RuntimeError: Batching rule not implemented for aten::_has_compatible_shallow_copy_type. We could not generate a fallback.

So we probably want to improve the error message somehow.
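For reference, a sketch of what eager mode does with the same program: the .data assignment swaps in y's values but keeps x's autograd metadata, so the result still does not require grad.

import torch

def f(x, y):
    x.data = y  # replaces x's data; x's autograd metadata is kept
    return x

x = torch.randn([3])
y = torch.randn([3], requires_grad=True)
res = f(x, y)
print(res.requires_grad)  # False: the assignment did not make x require grad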
setting .data under grad
The output of your script seems fine -- what is wrong with it?
Regarding .data under grad:
I had the following example in mind, where there seem to be conflicting semantics around .data. We disable directly updating .data (the second function) but allow access to .data, which returns a shallow copy. If we mutate the shallow copy, then x is also updated (and this may produce silently incorrect results for x, since the mutation doesn't go through autograd).
.data allowing mutation that is opaque to autograd is the same semantics as PyTorch eager, but currently we have conflicting semantics under grad, vjp, etc.
import torch

# This works.
def foo(x):
    y = x.data
    y.copy_(torch.zeros(3, 3))
    return (x * x).sum()

# FAILS: RuntimeError: false INTERNAL ASSERT FAILED at "aten/src/ATen/functorch/TensorWrapper.cpp":137,
# please report a bug to PyTorch. NYI
def foo(x):
    x.data = torch.zeros(3, 3)
    return (x * x).sum()
print(torch.func.grad(foo)(torch.randn(3, 3)))
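For reference, eager PyTorch already exhibits the mutation-opaque-to-autograd behavior of the first foo; a minimal sketch:

import torch

x = torch.randn(3, 3, requires_grad=True)
loss = (x * x).sum()             # forward saves x for the backward pass
x.data.copy_(torch.zeros(3, 3))  # mutation via .data: autograd cannot see it
loss.backward()
print(x.grad)                    # all zeros: backward silently used the mutated values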
bool allow_tensor_metadata_change) const {
  DispatchKeySet key_set = getKeysToPropagateToWrapper(value());
  auto impl = c10::make_intrusive<BatchedTensorImpl>(key_set, value(), bdim(), level());
  impl->set_version_counter(version_counter);
Suggested change:
-  impl->set_version_counter(version_counter);
+  impl->set_version_counter(std::move(version_counter));
Do we need std::move, given that version_counter is already c10::VariableVersion&&?
@zou3519 As discussed offline, have disabled accessing `data` under the vmap transform.
with self.assertRaisesRegex(RuntimeError, "accessing `data` under vmap transform"):
    torch.func.vmap(foo)(torch.randn(3, 3))
We should add a test for the set_data case and assert that it raises the nice error message.
Please add a test for the mutating-data case; otherwise, this LGTM.
Done. I had to add a batch rule for `aten::_has_compatible_shallow_copy_type`.
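For illustration, a hypothetical sketch of such a test (the test name is made up; the asserted message follows the TORCH_CHECK added above):

def test_data_attribute_mutation(self):
    def foo(x):
        x.data = torch.randn(3, 3)  # mutating `.data` under vmap should error
        return x

    with self.assertRaisesRegex(
        RuntimeError, "mutating directly with `.data` inside functorch transform"
    ):
        torch.func.vmap(foo)(torch.randn(3, 3))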
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / linux-bionic-cuda11.8-py3.10-gcc7 / test (functorch, 1, 1, linux.4xlarge.nvidia.gpu). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #97161