
Conversation

@dagitses (Collaborator) commented Mar 29, 2023

See D44409928 for motivation.

Note that we keep the const-ness of the existing data_ptr() member so
that we don't have to change all references atomically. We just change
the ones here that we have higher confidence in.
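
For context, a minimal sketch of the split being described, assuming only that mutable_data_ptr() is the accessor used in this diff and that data_ptr() keeps its existing signature; the function below is purely illustrative:

```cpp
#include <ATen/ATen.h>

// Illustrative only: a call site that actually writes through the pointer
// opts in to mutable_data_ptr(); unaudited call sites keep using data_ptr().
void zero_first_byte(at::Tensor& t) {
  auto* bytes = static_cast<char*>(t.mutable_data_ptr());  // audited: mutates
  bytes[0] = 0;
}
```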

Differential Revision: [D44492539](https://our.internmc.facebook.com/intern/diff/D44492539/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44492539/)!

[ghstack-poisoned]
@pytorch-bot (bot) commented Mar 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97859

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure as of commit 9ab65b0.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@dagitses added the module: internals label and removed the release notes: quantization label Mar 29, 2023
@dagitses marked this pull request as draft March 29, 2023 06:08
@dagitses marked this pull request as ready for review March 29, 2023 10:14
@github-actions bot added the module: cpu, NNC, and release notes: quantization labels Mar 29, 2023

  auto batch_stride = ndim > 2 && batch_offset >= 0 ? input_strides[ndim - 3] : 0;
- void* values_ptr = static_cast<char*>(input.data_ptr()) +
+ void* values_ptr = static_cast<char*>(input.mutable_data_ptr()) +

Contributor:

This looks wrong. Based on the description here, I doubt this pointer is actually getting mutated. Generally, you can't trust the underlying libraries to have const-correct types.

  cusparseDnVecDescr_t raw_descriptor;
  TORCH_CUDASPARSE_CHECK(cusparseCreateDnVec(
-     &raw_descriptor, input.numel(), input.data_ptr(), value_type));
+     &raw_descriptor, input.numel(), input.mutable_data_ptr(), value_type));

Contributor:

Ditto

  batch_offset * col_indices_batch_stride * col_indices.itemsize(),
  // values of the sparse matrix, size = nnz
- static_cast<char*>(values.data_ptr()) +
+ static_cast<char*>(values.mutable_data_ptr()) +

Contributor:

Ditto x3

- values.data_ptr()));
+ crow_indices.mutable_data_ptr(),
+ col_indices.mutable_data_ptr(),
+ values.mutable_data_ptr()));

Contributor:

Ditto

  AT_ASSERT(options.dtype() == kByte);
  state = at::empty({static_cast<int64_t>(state_size)}, options);
- AT_CUDNN_CHECK(cudnnSetDropoutDescriptor(mut_desc(), handle, dropout, state.data_ptr(), state_size, seed));
+ AT_CUDNN_CHECK(cudnnSetDropoutDescriptor(mut_desc(), handle, dropout, state.mutable_data_ptr(), state_size, seed));

Contributor:

Ditto

  TORCH_INTERNAL_ASSERT(dropout > 0, "dropout must be nonzero; otherwise call set_no_dropout");
  state = state_;
- void *state_ptr = state.data_ptr();
+ void *state_ptr = state.mutable_data_ptr();

Contributor:

Not sure about this one, as the dropout state may get subsequently mutated through the descriptor. You will need to read the docs.

.build();

const auto gW_data = reinterpret_cast<char*>(grad_weight.data_ptr());
const auto gO_data = reinterpret_cast<char*>(grad.data_ptr());

Contributor:

gO should be read-only (gW should be write).
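
A minimal sketch of the access pattern this comment suggests for those two lines, assuming a const_data_ptr() accessor is available alongside mutable_data_ptr() (that availability is an assumption here, not something shown in this diff):

```cpp
#include <ATen/ATen.h>

// Sketch only: the incoming gradient (gO) is read, not written, so it takes a
// pointer-to-const; the weight gradient (gW) is the output, so it takes a
// mutable pointer.
void gradient_pointers_example(const at::Tensor& grad, at::Tensor& grad_weight) {
  const char* gO_data = static_cast<const char*>(grad.const_data_ptr());
  char* gW_data = static_cast<char*>(grad_weight.mutable_data_ptr());
  (void)gO_data;  // only read from in the real kernel
  (void)gW_data;  // written to in the real kernel
}
```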

@ezyang (Contributor) commented Mar 29, 2023

I'm pausing review; please do a re-audit of the rest of your changes. We may also want to discuss what the API for "I promise this is non-mutating but the downstream API is not const-correct" should look like (the most straightforward option is to use const_cast, albeit a bit wordy).
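
For illustration, a hedged sketch of the const_cast route mentioned above; legacy_read_only_api is a made-up stand-in for a C API that is not const-correct, and a const_data_ptr() accessor is assumed:

```cpp
#include <cstdint>

#include <ATen/ATen.h>

// Hypothetical legacy C API: it only reads from `data`, but its signature
// takes a non-const pointer.
extern "C" void legacy_read_only_api(void* data, int64_t numel);

// The caller stays const-correct; the cast is confined (and visible) at the
// boundary of the non-const-correct API.
void call_legacy(const at::Tensor& input) {
  legacy_read_only_api(
      const_cast<void*>(input.const_data_ptr()),  // promised non-mutating
      input.numel());
}
```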

@dagitses (Collaborator, Author) commented:
I propose the following:

  • In the short term, we have tensor.must_audit_mutable_data_ptr(). This can just reflect the status quo of what we have and gives us a convenient string to search for.
  • regarding const_cast, it might be nice to have a debug-only API that helps verify the mutable borrow is not mutated, something like:
{
  auto data_ptr = tensor.borrow_const_data_ptr();
  someCudaApi(data_ptr.subtle_as_non_const_for_naughty_api());
}

Here the result of borrow_const_data_ptr() is an RAII object that hashes the data upon construction in debug mode and hashes on destruction in debug mode and asserts they are identical.

Even if we deem that too excessive, I would vote in favor of having a member function that is explicit about what exactly we are doing, e.g. unsafe_get_non_const_data_ptr_for_non_const_correct_api(). Otherwise, I would feel compelled to add a wordy comment any place I used const_cast to justify it.

Personally, I like the idea of the debug-only hashing. Trust but verify.
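
A hedged sketch of what that debug-only hashing borrow could look like; the class name, the accessor names, and the CPU-only byte hash are all illustrative assumptions, not an existing API:

```cpp
#include <cstddef>
#include <cstdint>

#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hypothetical RAII borrow: in debug builds it hashes the tensor's bytes at
// construction and again at destruction, asserting they match so that a
// "non-mutating" use through a non-const pointer is actually verified.
// Assumes a CPU tensor, since the hash walks host memory directly.
class ConstDataBorrow {
 public:
  explicit ConstDataBorrow(const at::Tensor& t) : tensor_(t) {
#ifndef NDEBUG
    initial_hash_ = hash_bytes();
#endif
  }

  ~ConstDataBorrow() {
#ifndef NDEBUG
    TORCH_INTERNAL_ASSERT(
        hash_bytes() == initial_hash_,
        "tensor was mutated through a const borrow");
#endif
  }

  // Escape hatch for APIs that are not const-correct.
  void* subtle_as_non_const_for_naughty_api() const {
    return const_cast<void*>(tensor_.const_data_ptr());
  }

 private:
  uint64_t hash_bytes() const {
    // FNV-1a over the raw bytes; fine for a debug check, not cryptographic.
    const auto* p = static_cast<const uint8_t*>(tensor_.const_data_ptr());
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < tensor_.nbytes(); ++i) {
      h = (h ^ p[i]) * 1099511628211ull;
    }
    return h;
  }

  const at::Tensor& tensor_;
  uint64_t initial_hash_ = 0;
};
```

Used as in the snippet above: construct the borrow in a scope around the external call; in release builds the hashing compiles away.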

@ezyang (Contributor) commented Mar 29, 2023

There is no logical place to do the debug check. E.g., in the cuDNN APIs, you stash the pointer into the descriptor, and then the actual mutation would only happen later when you actually do an API call.

@dagitses (Collaborator, Author) commented:
> There is no logical place to do the debug check. E.g., in the cuDNN APIs, you stash the pointer into the descriptor, and then the actual mutation would only happen later when you actually do an API call.

Sure, that's generally true, but does that describe all or even most functions? And for functions that are invoked with a separate plan-then-execute sequence: is that sequence typically done within a single function, or is there a separation between the two steps that makes wrapping the call in a checking scope difficult?

Alternatively, could such a check be implemented at the dispatcher level?

@ezyang (Contributor) commented Mar 29, 2023

A dispatcher-level check could look something like this: upon entering a function, we set up some TLS mapping input tensors to whether or not they showed up in mutable or non-mutable argument positions according to the schema. We then error if you access a data pointer on a tensor that is not explicitly mentioned, or access a mutable data pointer on a non-mutable argument.

This check is unlikely to play well with __torch_dispatch__ though, so it will take some thought on how to design it correctly. See also @albanD constantly having to fend off people who want to delete the refcount asserts from autograd.
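
Purely as a sketch of the idea (none of this exists in the dispatcher today; the names and the per-op registration are assumptions):

```cpp
#include <unordered_map>

#include <c10/core/TensorImpl.h>
#include <c10/util/Exception.h>

// How a tensor appeared in the operator schema for the op currently running.
enum class ArgAccess { Const, Mutable };

// Hypothetical TLS table populated by the dispatcher on entry to an op.
thread_local std::unordered_map<const c10::TensorImpl*, ArgAccess>
    current_op_args;

// Hypothetically called from the data-pointer accessors to validate access
// against the schema-derived table.
inline void check_data_access(const c10::TensorImpl* impl, bool wants_mutable) {
  auto it = current_op_args.find(impl);
  TORCH_CHECK(
      it != current_op_args.end(),
      "accessing data of a tensor that is not an argument of the current op");
  TORCH_CHECK(
      !wants_mutable || it->second == ArgAccess::Mutable,
      "mutable data access on a non-mutable (const) operator argument");
}
```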

@dagitses (Collaborator, Author) commented:
> A dispatcher-level check could look something like this: upon entering a function, we set up some TLS mapping input tensors to whether or not they showed up in mutable or non-mutable argument positions according to the schema. We then error if you access a data pointer on a tensor that is not explicitly mentioned, or access a mutable data pointer on a non-mutable argument.
>
> This check is unlikely to play well with __torch_dispatch__ though, so it will take some thought on how to design it correctly. See also @albanD constantly having to fend off people who want to delete the refcount asserts from autograd.

That seems reasonable to me. OK, that's not going to happen in the very short-term. So what do you think about just having a naming convention to address the two cases we're concerned about here:

  • tensor.get_data_as_non_const_for_external_api(): for where we'd use const_cast
  • tensor.must_audit_get_mutable_data(): for where we want to move off of the const-returning accessor but need to scrutinize more closely in the medium term.

For the latter, I'm primarily concerned with being able to communicate areas of uncertainty to local experts. For example, it would be nice to put those changes into the flash attention implementation and defer to Driss to resolve them. I don't want to have to make this call authoritatively for every single function.

@ezyang (Contributor) commented Mar 29, 2023

ok

@dagitses (Collaborator, Author) commented:
> ok

OK, I think I have an even better idea for the first problem. How about we wrap the "unsafe" APIs with const-correct wrappers? Then we do the const_cast inside the wrapper. This makes the most sense to me because we should be auditing APIs for const-correctness, not callers.
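
A hedged sketch of that wrapper idea, using cusparseCreateDnVec from the diff above as the example; the wrapper name and the audit conclusion ("this call only records the pointer") are assumptions for illustration:

```cpp
#include <cstdint>

#include <cusparse.h>

// Hypothetical const-correct wrapper: the const_cast lives here, once, after
// the wrapped API has been audited not to write through `values`.
inline cusparseStatus_t createDnVecConst(
    cusparseDnVecDescr_t* descr,
    int64_t size,
    const void* values,  // const-correct on the caller's side
    cudaDataType value_type) {
  // Audited assumption: cusparseCreateDnVec only records the pointer; any
  // mutation would happen through later calls that take the descriptor.
  return cusparseCreateDnVec(descr, size, const_cast<void*>(values), value_type);
}
```

A call site could then pass a const data pointer directly and stay const-correct end to end.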

@ezyang (Contributor) commented Mar 30, 2023

I'm cool with that too.

@github-actions (bot) commented:
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label Jun 18, 2023
@github-actions bot closed this Jul 18, 2023
@ezyang added the ezyang's list label Jul 18, 2023
@facebook-github-bot deleted the gh/dagitses/35/head branch August 17, 2023 14:16