
Conversation

@kurtamohler (Collaborator) commented Feb 8, 2024

pytorch-bot bot commented Feb 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119507

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 2 Unrelated Failures

As of commit 74061db with merge base d534a49:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kurtamohler added a commit that referenced this pull request Feb 9, 2024
@ezyang (Contributor) commented Feb 9, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: Raised by workflow job

@ezyang added the topic: not user facing label on Feb 9, 2024
@ezyang (Contributor) commented Feb 9, 2024

(Some of) the test failures look real

@kurtamohler (Collaborator, Author) commented Feb 9, 2024

According to the trace for that failing macos job, materialize_cow_storage decrements the COWDeleterContext's refcount to 0, and then before ~COWDeleterContext() is called, the refcount gets incremented somewhere else. So the refcount is nonzero by the time the destructor is invoked, which we have an internal assert to detect. I suppose it must be a race condition. I'll figure out exactly why it's happening

EDIT: Oh I see. The TensorBase::data_ptr call is happening inside of an at::parallel_for.

at::parallel_for(0, args.batch * args.out_channels, 0, [&](int64_t start, int64_t end) {
  for (const auto k : c10::irange(start, end)) {
    const int64_t g = k % args.out_channels;
    const int64_t i = k / (args.out_channels / groups);
    convolution_depthwise3x3_winograd_impl(
        args,
        input.data_ptr<float>() + i * input_hxw,

By the time one of the threads tries to delete the COWDeleterContext, another thread has already changed the refcount. The refcount is atomic, but in this case that doesn't prevent the race since the combined operation of decrementing the refcount and then deleting the context is not mutexed.

auto cow::COWDeleterContext::increment_refcount() -> void {
  auto refcount = ++refcount_;
  TORCH_INTERNAL_ASSERT(refcount > 1);
}

auto cow::COWDeleterContext::decrement_refcount()
    -> std::variant<NotLastReference, LastReference> {
  auto refcount = --refcount_;
  TORCH_INTERNAL_ASSERT(refcount >= 0, refcount);
  if (refcount == 0) {
    std::unique_lock lock(mutex_);
    auto result = std::move(data_);
    lock.unlock();
    delete this;
    return {std::move(result)};
  }

I'll have to add a C++ test case that makes this race happen, and then fix it.

@kurtamohler (Collaborator, Author) commented Feb 15, 2024

It turns out that fixing this race condition is trickier than I thought, unless I'm overlooking something.

Here's a basic example to make sure we're on the same page:

Tensor tensor = at::_lazy_clone(at::randn({10}));
int64_t num_threads = 100;
at::parallel_for(0, num_threads, 1, [&](int64_t begin, int64_t end) {
  void* ptr = tensor.mutable_data_ptr();
});

Inside of the mutable_data_ptr() call, the first thread (thread A) that tries to obtain the mutable pointer will need to materialize the data. When thread A starts doing this, the StorageImpl's DataPtr's UniqueVoidPtr's context is a COWDeleterContext.

To materialize the data, thread A needs to replace the StorageImpl's DataPtr with a new DataPtr that doesn't contain a pointer to the COWDeleterContext. Also, the COWDeleterContext will need to be deleted at some point to avoid a memory leak. Currently, for the example above, thread A decrements the refcount, which reaches 0, and the COWDeleterContext then gets deleted.

During this time, thread A needs to somehow coordinate with the other threads to make them wait until it is finished materializing the StorageImpl. Right now, this isn't being done properly, so other threads start trying to materialize the storage as well, and that's the race condition we're seeing.

I was considering using the mutex in the COWDeleterContext for this coordination, but I realized that since thread A will delete the COWDeleterContext, the mutex would also get deleted while other threads may be waiting to acquire a lock on it. Not good.

I was also considering having each thread increment the refcount of the COWDeleterContext before waiting on the mutex and then decrement it again afterwards. I thought that would prevent the mutex from getting deleted while the other threads are waiting on it, since the refcount seems like it wouldn't ever reach 0 until the last thread waiting on the mutex acquires it. However, there would actually still be another race condition here. Let's say thread A gets to the point where it decrements the refcount to 0 before any other threads catch up, but then right after that, thread B increments the refcount again and starts waiting on the mutex. In that case, thread A will continue to delete the COWDeleterContext, since it saw the refcount reach 0, and thread B is still left waiting for a deleted mutex. And it's also possible that thread B could increment the refcount right after thread A finishes deleting the COWDeleterContext, while the refcount is held in invalid memory. Still no good.

I thought of one solution that works, but it doesn't seem ideal. The idea is to add a mutex to the materialize_cow_storage function, so that only one of the threads can materialize a storage at a time. But that also prevents two threads from materializing two different storages at the same time. I'm not sure how often this happens, but performance would decrease for those cases.
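Roughly, I'm imagining something like this (just a sketch; storage_is_cow() is a placeholder for whatever check tells us the DataPtr is still COW, not an existing function):

#include <mutex>

namespace {
// One process-wide lock: only one thread may materialize any storage at a time.
std::mutex g_materialize_mutex;
} // namespace

C10_API void materialize_cow_storage(StorageImpl& storage) {
  std::lock_guard<std::mutex> guard(g_materialize_mutex);
  if (!storage_is_cow(storage)) {
    // Another thread won the race and already materialized this storage.
    return;
  }
  // ... decrement the COWDeleterContext refcount, copy or take ownership of
  // the data, and install the new DataPtr on the StorageImpl ...
}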

Another option would be to provide a different mutex for each StorageImpl which will remain alive after the COWDeleterContext is deleted. That way, the other threads won't be left waiting on a deleted mutex. But where would this mutex live? I guess the most obvious option is to add the mutex to the StorageImpl. But I'm not sure if we'd want to increase the size of StorageImpl.
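If we did go that route, it would look roughly like this (cow_mutex_ is a hypothetical member; the real StorageImpl obviously has a lot more in it):

#include <mutex>

struct StorageImpl {
  // ... existing members ...
  std::mutex cow_mutex_;  // hypothetical member; lives as long as the StorageImpl

  void maybe_materialize() {
    // Waiters block on a mutex that outlives the COWDeleterContext, so nobody
    // can be left waiting on a deleted mutex.
    std::lock_guard<std::mutex> guard(cow_mutex_);
    if (storage_is_cow(*this)) {
      // ... materialization logic ...
    }
  }
};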

Another option is maybe to create a global container of mutexes. Each mutex would be paired with a reference to the StorageImpl that it is supposed to be used for. But that seems like a very weird solution, and I'm not sure at what point we could decide that one of the mutexes is no longer needed and remove it from the container.
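Something like this, I guess (none of this exists today, and it leaves open exactly that question of when entries can be erased):

#include <memory>
#include <mutex>
#include <unordered_map>

namespace {
std::mutex g_table_mutex;
// Keyed by StorageImpl address; entries are never erased in this sketch, which
// is exactly the cleanup problem mentioned above.
std::unordered_map<const StorageImpl*, std::unique_ptr<std::mutex>> g_cow_mutexes;

std::mutex& cow_mutex_for(const StorageImpl& storage) {
  std::lock_guard<std::mutex> guard(g_table_mutex);
  auto& slot = g_cow_mutexes[&storage];
  if (!slot) {
    slot = std::make_unique<std::mutex>();
  }
  return *slot;
}
} // namespace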

Maybe there's a better solution that I haven't thought of.

What do you think @ezyang ?

@ezyang (Contributor) commented Feb 19, 2024

My proposal will require some upfront work, but I think it is the simplest: let's ban multithreaded access to the mutable data pointer. The reasoning here is that in C++ it's typical that an object is not safe to access mutably concurrently from multiple threads, unless the object has been explicitly synchronized. at::Tensor traditionally falls in this bucket.

Now, in this particular case, we have code accessing the data pointer in parallel from multiple threads inside parallel_for. But this is easy to fix: we should access the mutable pointer once from outside of the parallel_for, and close over the resulting pointer (with the side condition that writes must be non-aliasing; we can't test for this in the compiler).
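Concretely, taking the depthwise conv snippet quoted above, something like this (the remaining arguments were elided in that quote, so they are left as a comment here):

// Fetch the pointer once on the calling thread; any materialization happens
// here, before the worker threads exist.
float* input_data = input.data_ptr<float>();

at::parallel_for(0, args.batch * args.out_channels, 0, [&](int64_t start, int64_t end) {
  for (const auto k : c10::irange(start, end)) {
    const int64_t g = k % args.out_channels;
    const int64_t i = k / (args.out_channels / groups);
    convolution_depthwise3x3_winograd_impl(
        args,
        input_data + i * input_hxw,  // closes over the hoisted pointer
        /* ... remaining arguments as in the original kernel ... */);
  }
});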

This means we have to audit all existing parallel_for calls and ban data_ptr calls from inside them. Not sure if there's a way to enforce this dynamically cheaply, but that would help too.

@kurtamohler (Collaborator, Author) commented Feb 20, 2024

@ezyang, that might be the best idea. However, it seems like it's not actually good enough to only ban direct mutable_data_ptr calls.

I've been looking through the places where parallel_for is called in the codebase, and there are at least a few examples where the loop function doesn't call mutable_data_ptr directly, but it does call an aten operation on a Tensor that is shared across all the threads. If that aten operation hasn't been updated to use const_data_ptr, or if it actually needs to mutate the input, then it will materialize the COW tensor from inside the loop.

So more specifically, we would need to ban anything that causes a materialize_cow_storage call inside parallel_for. If anything in the loop function would cause a materialization, then the client needs to explicitly materialize the data before the parallel_for call.
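In other words, something along these lines (add_ on per-row slices is just an arbitrary stand-in for whatever op in the loop body would end up mutating, and therefore materializing, the shared tensor):

Tensor shared = at::_lazy_clone(at::randn({8, 1024}));

// Materialize up front, on the calling thread, so nothing inside the loop can
// reach materialize_cow_storage from a worker thread.
shared.mutable_data_ptr();

at::parallel_for(0, shared.size(0), 1, [&](int64_t begin, int64_t end) {
  for (const auto i : c10::irange(begin, end)) {
    // Each thread mutates a disjoint row, so the writes don't alias.
    shared.select(0, i).add_(1);
  }
});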

Without a way to automatically detect this, I think it will be very difficult to enforce. It's not obvious to me how we would detect it (EDIT: Actually, I do have a possible idea I've added below). So I think it's worth asking, if we can't think of a good way to detect that something inside a parallel_for loop function is causing a materialization, then would we still be willing to try to enforce this policy?

I do have a potential idea for detecting it:

std::thread::id main_thread_id = std::this_thread::get_id();

C10_API void materialize_cow_storage(StorageImpl& storage) {
  TORCH_INTERNAL_ASSERT(
    std::this_thread::get_id() == main_thread_id,
    "Materializing a storage from a subthread is forbidden");
  ...

However, I tested this a bit and it doesn't seem very good. Because the first thread in parallel_for actually is the main thread, and it takes a while for the other threads to get up and running, a mutable_data_ptr call in the loop function actually seems to have a fairly high probability of only causing the first thread to call materialize_cow_storage--at least if the tensor is relatively small. The first thread often finishes the materialization before the other threads have time to reach the mutable_data_ptr call, so it goes undetected.

I have another idea though: At the beginning of a parallel_for call, before creating the other threads, we could set some shared variable bool is_in_parallel_for to true. Then after the threads are joined up again, set it back to false. In materialize_cow_storage, we can just assert that it's false.
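Roughly like this (is_in_parallel_for is hypothetical, and the real version would need to handle exceptions and nested parallel_for calls properly):

#include <atomic>

// Set by the thread that calls parallel_for before the workers are launched,
// and cleared after they have all joined.
static std::atomic<bool> is_in_parallel_for{false};

C10_API void materialize_cow_storage(StorageImpl& storage) {
  TORCH_INTERNAL_ASSERT(
      !is_in_parallel_for.load(),
      "Materializing a COW storage inside at::parallel_for is forbidden; "
      "materialize it (e.g. with mutable_data_ptr) before the parallel region");
  // ... existing materialization logic ...
}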

@kurtamohler (Collaborator, Author) commented:

> before creating the other threads, we could set some shared variable bool is_in_parallel_for to true. Then after the threads are joined up again, set it back to false. In materialize_cow_storage, we can just assert that it's false.

I tried this and it seems to work. Is this alright with you?

kurtamohler added a commit that referenced this pull request Feb 27, 2024
@kurtamohler (Collaborator, Author) commented:

@pytorchbot merge -f "unrelated CI failures"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@izaitsevfb (Contributor) commented:

Hey @kurtamohler, it looks like XLA job failure is actually caused by your PR.

See hud.

To verify, I retried the job on the previous commit in trunk and it succeeded:
https://github.com/pytorch/pytorch/actions/runs/8073145934/job/22095434317

I'll revert your stack to keep trunk green, unless you have any objections. Thank you.

@kurtamohler (Collaborator, Author) commented:

Thanks for the heads up, @izaitsevfb. Sure, you can revert these two PRs. There are several other failures elsewhere that I need to fix as well.

@izaitsevfb (Contributor) commented:

@pytorchbot revert -m "breaks xla jobs" -c ignoredsignal

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 28, 2024
…)"

This reverts commit 2ebf2c8.

Reverted #119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](#119507 (comment)))
@pytorchmergebot (Collaborator) commented:

@kurtamohler your PR has been successfully reverted.

kurtamohler added a commit that referenced this pull request Mar 1, 2024
Part of #97856

Pull Request resolved: #119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
ghstack-source-id: 3fa863d
@kurtamohler (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@github-actions bot deleted the gh/kurtamohler/10/head branch April 1, 2024 01:59
Labels
ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
open source
Reverted
topic: not user facing (topic category)