
caffe2: fix PinnedCPUAllocator cudaHostRegister() leak #16340

Closed · wants to merge 1 commit

Conversation

@hartb (Contributor) commented Jan 24, 2019:

In the NUMA case, PinnedCPUAllocator's allocate() would return a
DataPtr constructed by DefaultCPUAllocator, which would reference
the Default... Delete() rather than the Pinned... Delete(). That
meant Pinned... Delete() would never run, so cudaHostUnregister()
would never be called when regions were freed.

See: #16280

This change adds a 'naked_allocate()' method to the Default allocator
that just returns a pointer to the allocated memory rather than
wrapping it in a DataPtr. Pinned allocator uses that then constructs
a DataPtr with reference to its own Delete().

@ezyang (Contributor) commented Jan 24, 2019:

Thanks a lot, and nice catch.

Freshly landed in master, there is a much more compact way to do the equivalent: on DataPtr, use the compare_exchange_deleter method, passing in the old deleter (from the plain CPU allocator) and the new pinned deleter. You might have to make the old deleter public if it isn't already. Would you mind trying that out? Otherwise I will make an attempt.

@ezyang (Contributor) commented Jan 24, 2019:

cc @jerryzh168

@hartb (Contributor, Author) commented Jan 24, 2019:

Thank you for having a look. Sure; we'll try a compare_exchange_deleter version.

@jerryzh168 (Contributor) commented:

Nice catch! Thanks! @hartb

```diff
@@ -357,14 +357,13 @@ struct CAFFE2_CUDA_API PinnedCPUAllocator final : public at::Allocator {
     at::DataPtr data_ptr;
     std::lock_guard<std::mutex> lock(CUDAContext::mutex());
     if (IsNUMAEnabled()) {
-      data_ptr = baseAllocator_.allocate(nbytes);
-      data = data_ptr.get();
+      data = baseAllocator_.naked_allocate(nbytes);
```
@jerryzh168 (Contributor) commented on the diff:
Actually, we are in the process of merging the PyTorch and Caffe2 allocators, and we'll be using Allocator* for baseAllocator_; see: https://github.com/pytorch/pytorch/pull/14517/files#diff-6286b32ea83ee15c66db129928f27c42R343

@hartb (Contributor, Author) commented Jan 25, 2019:

@jerryzh168 A complication with the compare_exchange_deleter method... In the caffe2_report_cpu_memory_usage case the default allocator (original or shifted to c10) will do some reporter setup (including a New()), and will initialize the DataPtr with the ReportAndDelete() deleter.

So I think for the pinned allocator to use compare_exchange, it first has to know about ReportAndDelete() (as well as just Delete()). And in the ReportAndDelete() case it has to choose either to exchange the deleter (leaking Default's Reporter New()) or not to exchange the deleter (leaking Pinned's cudaHostRegister()). Do I have that right?

@ezyang (Contributor) commented Jan 25, 2019:

Hmm, yes, you're right. I guess you have two options:

  1. Union Delete and ReportAndDelete into one function that checks FLAGS_caffe2_report_cpu_memory_usage to see whether it should report.
  2. Branch on FLAGS_caffe2_report_cpu_memory_usage before calling compare_exchange_deleter, so you know which deleter to test for in each case.

FLAGS_caffe2_report_cpu_memory_usage is a big stick in my craw and I want to get rid of it if possible.

@facebook-github-bot left a comment:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jan 28, 2019:

@hartb Are you planning to look at this, or should one of us adopt this patch? Thanks!

@hartb (Contributor, Author) commented Jan 28, 2019:

I hope to update the PR with the suggestion above.

I'm thinking maybe: use baseAllocator_'s raw_deleter() to get the expected Delete function, CAFFE_ENFORCE() that the swap succeeds, and include a comment noting that if the swap does succeed in the Report... case, we leak a Reporter allocation (and probably break reporting in any case), but that's preferable to leaking the cudaHostRegister() registration.

It's taking a bit longer than I'd hoped to swap my build setup from 1.0.0 to master; I'll let you know if that derails me for some reason.

@hartb (Contributor, Author) commented Jan 29, 2019:

@ezyang Here's a proposed fix (one commit) implementing the above, based on PR 14517 (as @jerryzh168 mentioned above):

https://github.com/hartb/pytorch/tree/hartb-pr14517-add

Would you like to pick this over to that PR, or should I modify it to be based on master via this PR?

Note that my worry above about leaking or breaking the reporting case isn't an issue: the pinned Delete() calls the base allocator's deleter to finish the job, so we get the base/Default allocator's clean-up that way.

@dzhulgakov self-requested a review on Feb 8, 2019.
@ezyang (Contributor) commented Feb 10, 2019:

@hartb Feel free to force push! Sorry about the delay responding.

@hartb (Contributor, Author) commented Feb 12, 2019:

Will update this PR once I've tested the fix rebased onto master.

Allocations returned by the PinnedCPUAllocator must carry
that Allocator's Delete() function, or the cudaHostRegister()
registrations made by the pinned allocator will be leaked.

Ensure that in the NUMA case by swapping in the pinned allocator's
Delete() in place of the baseAllocator_'s deleter. The swap should
succeed unless something else already swapped the deleter (in
which case developer attention is required).

In the swap case, the pinned allocator's Delete() will call
baseAllocator_'s deleter explicitly, so any tear down actions to
be done there are preserved.
@ezyang (Contributor) left a review:
Sweet and simple

@facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
