
Conversation

@kurtamohler (Collaborator) commented Feb 8, 2024

pytorch-bot bot commented Feb 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119507

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 2 Unrelated Failures

As of commit 74061db with merge base d534a49:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kurtamohler added a commit that referenced this pull request Feb 9, 2024
@ezyang (Contributor) commented Feb 9, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: Raised by workflow job

@ezyang added the topic: not user facing label on Feb 9, 2024
@ezyang (Contributor) commented Feb 9, 2024

(Some of) the test failures look real

@kurtamohler (Collaborator, Author) commented Feb 9, 2024

According to the trace for that failing macos job, materialize_cow_storage decrements the COWDeleterContext's refcount to 0, and then before ~COWDeleterContext() is called, the refcount gets incremented somewhere else. So the refcount is nonzero by the time the destructor is invoked, which we have an internal assert to detect. I suppose it must be a race condition. I'll figure out exactly why it's happening

EDIT: Oh I see. The TensorBase::data_ptr call is happening inside of an at::parallel_for.

at::parallel_for(0, args.batch * args.out_channels, 0, [&](int64_t start, int64_t end) {
  for (const auto k : c10::irange(start, end)) {
    const int64_t g = k % args.out_channels;
    const int64_t i = k / (args.out_channels / groups);
    convolution_depthwise3x3_winograd_impl(
        args,
        input.data_ptr<float>() + i * input_hxw,

By the time one of the threads tries to delete the COWDeleterContext, another thread has already changed the refcount. The refcount is atomic, but in this case that doesn't prevent the race since the combined operation of decrementing the refcount and then deleting the context is not mutexed.

auto cow::COWDeleterContext::increment_refcount() -> void {
  auto refcount = ++refcount_;
  TORCH_INTERNAL_ASSERT(refcount > 1);
}

auto cow::COWDeleterContext::decrement_refcount()
    -> std::variant<NotLastReference, LastReference> {
  auto refcount = --refcount_;
  TORCH_INTERNAL_ASSERT(refcount >= 0, refcount);
  if (refcount == 0) {
    std::unique_lock lock(mutex_);
    auto result = std::move(data_);
    lock.unlock();
    delete this;
    return {std::move(result)};
  }

I'll have to add a C++ test case that makes this race happen, and then fix it.

@kurtamohler (Collaborator, Author) commented Feb 15, 2024

It turns out that fixing this race condition is trickier than I thought, unless I'm overlooking something.

Here's a basic example to make sure we're on the same page:

Tensor tensor = at::_lazy_clone(at::randn({10}));
int64_t num_threads = 100;
at::parallel_for(0, num_threads, 1, [&](int64_t begin, int64_t end) {
  void* ptr = tensor.mutable_data_ptr();
});

Inside of the mutable_data_ptr() call, the first thread (thread A) that tries to obtain the mutable pointer will need to materialize the data. When thread A starts doing this, the StorageImpl's DataPtr's UniqueVoidPtr's context is a COWDeleterContext.

To materialize the data, thread A needs to replace the StorageImpl's DataPtr with a new DataPtr that doesn't contain a pointer to the COWDeleterContext. Also, the COWDeleterContext will need to be deleted at some point to avoid a memory leak. Currently, for the example above, thread A decrements the refcount, which reaches 0, and the COWDeleterContext then gets deleted.

During this time, thread A needs to somehow coordinate with the other threads to make them wait until it is finished materializing the StorageImpl. Right now, this isn't being done properly, so other threads start trying to materialize the storage as well, and that's the race condition we're seeing.

I was considering using the mutex in the COWDeleterContext for this coordination, but I realized that since thread A will delete the COWDeleterContext, the mutex would also get deleted while other threads may be waiting to acquire a lock on it. Not good.

I was also considering having each thread increment the refcount of the COWDeleterContext before waiting on the mutex and then decrement it again afterwards. I thought that would prevent the mutex from getting deleted while the other threads are waiting on it, since the refcount seems like it wouldn't ever reach 0 until the last thread waiting on the mutex acquires it. However, there would actually still be another race condition here. Let's say thread A gets to the point where it decrements the refcount to 0 before any other threads catch up, but then right after that, thread B increments the refcount again and starts waiting on the mutex. In that case, thread A will continue to delete the COWDeleterContext, since it saw the refcount reach 0, and thread B is still left waiting for a deleted mutex. And it's also possible that thread B could increment the refcount right after thread A finishes deleting the COWDeleterContext, while the refcount is held in invalid memory. Still no good.

I thought of one solution that works, but it doesn't seem ideal. The idea is to add a mutex to the materialize_cow_storage function, so that only one of the threads can materialize a storage at a time. But that also prevents two threads from materializing two different storages at the same time. I'm not sure how often this happens, but performance would decrease for those cases.
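Roughly, I'm imagining something like this (just a sketch; storage_is_cow() is a placeholder for whatever check tells us the DataPtr is still COW, not an existing function):

#include <mutex>

namespace {
// One process-wide lock: only one thread may materialize any storage at a time.
std::mutex g_materialize_mutex;
} // namespace

C10_API void materialize_cow_storage(StorageImpl& storage) {
  std::lock_guard<std::mutex> guard(g_materialize_mutex);
  if (!storage_is_cow(storage)) {
    // Another thread won the race and already materialized this storage.
    return;
  }
  // ... decrement the COWDeleterContext refcount, copy or take ownership of
  // the data, and install the new DataPtr on the StorageImpl ...
}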

Another option would be to provide a different mutex for each StorageImpl which will remain alive after the COWDeleterContext is deleted. That way, the other threads won't be left waiting on a deleted mutex. But where would this mutex live? I guess the most obvious option is to add the mutex to the StorageImpl. But I'm not sure if we'd want to increase the size of StorageImpl.
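If we did go that route, it would look roughly like this (cow_mutex_ is a hypothetical member; the real StorageImpl obviously has a lot more in it):

#include <mutex>

struct StorageImpl {
  // ... existing members ...
  std::mutex cow_mutex_;  // hypothetical member; lives as long as the StorageImpl

  void maybe_materialize() {
    // Waiters block on a mutex that outlives the COWDeleterContext, so nobody
    // can be left waiting on a deleted mutex.
    std::lock_guard<std::mutex> guard(cow_mutex_);
    if (storage_is_cow(*this)) {
      // ... materialization logic ...
    }
  }
};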

Another option is maybe to create a global container of mutexes. Each mutex would be paired with a reference to the StorageImpl that it is supposed to be used for. But that seems like a very weird solution, and I'm not sure at what point we could decide that one of the mutexes is no longer needed and remove it from the container.
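Something like this, I guess (none of this exists today, and it leaves open exactly that question of when entries can be erased):

#include <memory>
#include <mutex>
#include <unordered_map>

namespace {
std::mutex g_table_mutex;
// Keyed by StorageImpl address; entries are never erased in this sketch, which
// is exactly the cleanup problem mentioned above.
std::unordered_map<const StorageImpl*, std::unique_ptr<std::mutex>> g_cow_mutexes;

std::mutex& cow_mutex_for(const StorageImpl& storage) {
  std::lock_guard<std::mutex> guard(g_table_mutex);
  auto& slot = g_cow_mutexes[&storage];
  if (!slot) {
    slot = std::make_unique<std::mutex>();
  }
  return *slot;
}
} // namespace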

Maybe there's a better solution that I haven't thought of.

What do you think @ezyang ?

@ezyang (Contributor) commented Feb 19, 2024

My proposal will require some upfront work, but I think it is the simplest: let's ban multithreaded access to the mutable data pointer. The reasoning here is that in C++ it's typical that an object is not safe to access mutably concurrently from multiple threads, unless the object has been explicitly synchronized. at::Tensor traditionally falls in this bucket.

Now, in this particular case, we have code accessing the data pointer in parallel from multiple threads inside parallel_for. But this is easy to fix: we should access the mutable pointer once from outside of the parallel_for, and close over the resulting pointer (with the side condition that writes must be non-aliasing; we can't test for this in the compiler).
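Concretely, taking the depthwise conv snippet quoted above, something like this (the remaining arguments were elided in that quote, so they are left as a comment here):

// Fetch the pointer once on the calling thread; any materialization happens
// here, before the worker threads exist.
float* input_data = input.data_ptr<float>();

at::parallel_for(0, args.batch * args.out_channels, 0, [&](int64_t start, int64_t end) {
  for (const auto k : c10::irange(start, end)) {
    const int64_t g = k % args.out_channels;
    const int64_t i = k / (args.out_channels / groups);
    convolution_depthwise3x3_winograd_impl(
        args,
        input_data + i * input_hxw,  // closes over the hoisted pointer
        /* ... remaining arguments as in the original kernel ... */);
  }
});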

This means we have to audit all existing parallel_for calls and ban data_ptr calls from inside them. Not sure if there's a way to enforce this dynamically cheaply, but that would help too.

@kurtamohler (Collaborator, Author) commented Feb 20, 2024

@ezyang, that might be the best idea. However, it seems like it's not actually good enough to only ban direct mutable_data_ptr calls.

I've been looking through the places where parallel_for is called in the codebase, and there are at least a few examples where the loop function doesn't call mutable_data_ptr directly, but it does call an aten operation on a Tensor that is shared across all the threads. If that aten operation hasn't been updated to use const_data_ptr, or if it actually needs to mutate the input, then it will materialize the COW tensor from inside the loop.

So more specifically, we would need to ban anything that causes a materialize_cow_storage call inside parallel_for. If anything in the loop function would cause a materialization, then the client needs to explicitly materialize the data before the parallel_for call.
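In other words, something along these lines (add_ on per-row slices is just an arbitrary stand-in for whatever op in the loop body would end up mutating, and therefore materializing, the shared tensor):

Tensor shared = at::_lazy_clone(at::randn({8, 1024}));

// Materialize up front, on the calling thread, so nothing inside the loop can
// reach materialize_cow_storage from a worker thread.
shared.mutable_data_ptr();

at::parallel_for(0, shared.size(0), 1, [&](int64_t begin, int64_t end) {
  for (const auto i : c10::irange(begin, end)) {
    // Each thread mutates a disjoint row, so the writes don't alias.
    shared.select(0, i).add_(1);
  }
});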

Without a way to automatically detect this, I think it will be very difficult to enforce. It's not obvious to me how we would detect it (EDIT: Actually, I do have a possible idea I've added below). So I think it's worth asking, if we can't think of a good way to detect that something inside a parallel_for loop function is causing a materialization, then would we still be willing to try to enforce this policy?

I do have a potential idea for detecting it:

std::thread::id main_thread_id = std::this_thread::get_id();

C10_API void materialize_cow_storage(StorageImpl& storage) {
  TORCH_INTERNAL_ASSERT(
    std::this_thread::get_id() == main_thread_id,
    "Materializing a storage from a subthread is forbidden");
  ...

However, I tested this a bit and it doesn't seem very good. Because the first thread in parallel_for actually is the main thread, and it takes a while for the other threads to get up and running, a mutable_data_ptr call in the loop function actually seems to have a fairly high probability of only causing the first thread to call materialize_cow_storage--at least if the tensor is relatively small. The first thread often finishes the materialization before the other threads have time to reach the mutable_data_ptr call, so it goes undetected.

I have another idea though: At the beginning of a parallel_for call, before creating the other threads, we could set some shared variable bool is_in_parallel_for to true. Then after the threads are joined up again, set it back to false. In materialize_cow_storage, we can just assert that it's false.
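Roughly like this (is_in_parallel_for is hypothetical, and the real version would need to handle exceptions and nested parallel_for calls properly):

#include <atomic>

// Set by the thread that calls parallel_for before the workers are launched,
// and cleared after they have all joined.
static std::atomic<bool> is_in_parallel_for{false};

C10_API void materialize_cow_storage(StorageImpl& storage) {
  TORCH_INTERNAL_ASSERT(
      !is_in_parallel_for.load(),
      "Materializing a COW storage inside at::parallel_for is forbidden; "
      "materialize it (e.g. with mutable_data_ptr) before the parallel region");
  // ... existing materialization logic ...
}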

@kurtamohler (Collaborator, Author) commented:

> before creating the other threads, we could set some shared variable bool is_in_parallel_for to true. Then after the threads are joined up again, set it back to false. In materialize_cow_storage, we can just assert that it's false.

I tried this and it seems to work. Is this alright with you?

kurtamohler added a commit that referenced this pull request Feb 27, 2024
@kurtamohler (Collaborator, Author) commented:

@pytorchbot merge -f "unrelated CI failures"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@izaitsevfb (Contributor) commented:

Hey @kurtamohler, it looks like XLA job failure is actually caused by your PR.

See hud.

To verify, I retried the job on the previous commit in trunk and it succeeded:
https://github.com/pytorch/pytorch/actions/runs/8073145934/job/22095434317

I'll revert your stack to keep trunk green, unless you have any objections. Thank you.

@kurtamohler (Collaborator, Author) commented:

Thanks for the heads up, @izaitsevfb. Sure, you can revert these two PRs. There are several other failures elsewhere that I need to fix as well.

@izaitsevfb (Contributor) commented:

@pytorchbot revert -m "breaks xla jobs" -c ignoredsignal

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 28, 2024
…)"

This reverts commit 2ebf2c8.

Reverted #119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](#119507 (comment)))
@pytorchmergebot (Collaborator) commented:

@kurtamohler your PR has been successfully reverted.

kurtamohler added a commit that referenced this pull request Mar 1, 2024
Part of #97856

Pull Request resolved: #119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
ghstack-source-id: 3fa863d
@kurtamohler (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@github-actions bot deleted the gh/kurtamohler/10/head branch April 1, 2024 01:59
Labels
ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
open source
Reverted
topic: not user facing (topic category)