
caffe2: fix PinnedCPUAllocator cudaHostRegister() leak #16340

Closed · wants to merge 1 commit

Conversation

@hartb (Contributor) commented Jan 24, 2019:

In the NUMA case, PinnedCPUAllocator's allocate() would return a
DataPtr constructed by DefaultCPUAllocator, which would reference
the Default... Delete() rather than the Pinned... Delete(). That
meant Pinned... Delete() would never run, so cudaHostUnregister()
would never be called when regions were freed.

See: #16280

This change adds a 'naked_allocate()' method to the Default allocator
that just returns a pointer to the allocated memory rather than
wrapping it in a DataPtr. Pinned allocator uses that then constructs
a DataPtr with reference to its own Delete().

@ezyang (Contributor) commented Jan 24, 2019:

Thanks a lot, and nice catch.

Freshly landed in master, there is a much more compact way to do the equivalent: on DataPtr, use the compare_exchange_deleter method, passing in the old deleter (from the plain CPU allocator) and the new pinned deleter. You might have to make the old deleter public if it isn't already. Would you mind trying that out? Otherwise I will make an attempt.

@ezyang (Contributor) commented Jan 24, 2019:

cc @jerryzh168

@hartb (Contributor, Author) commented Jan 24, 2019:

Thank you for having a look. Sure; we'll try a compare_exchange_deleter version.

@jerryzh168 (Contributor) commented:

Nice catch! Thanks! @hartb

```diff
@@ -357,14 +357,13 @@ struct CAFFE2_CUDA_API PinnedCPUAllocator final : public at::Allocator {
     at::DataPtr data_ptr;
     std::lock_guard<std::mutex> lock(CUDAContext::mutex());
     if (IsNUMAEnabled()) {
-      data_ptr = baseAllocator_.allocate(nbytes);
-      data = data_ptr.get();
+      data = baseAllocator_.naked_allocate(nbytes);
```
@jerryzh168 (Contributor) commented on the diff:
Actually, we are in the process of merging the PyTorch and Caffe2 allocators, and we'll be using Allocator* for baseAllocator_; see: https://github.com/pytorch/pytorch/pull/14517/files#diff-6286b32ea83ee15c66db129928f27c42R343

@hartb (Contributor, Author) commented Jan 25, 2019:

@jerryzh168 A complication with the compare_exchange_deleter method... In the caffe2_report_cpu_memory_usage case the default allocator (original or shifted to c10) will do some reporter setup (including a New()), and will initialize the DataPtr with the ReportAndDelete() deleter.

So I think for the pinned allocator to use compare_exchange, it first has to know about ReportAndDelete() (as well as just Delete()). And in the ReportAndDelete() case it has to choose either to exchange the deleter (leaking Default's Reporter New()) or not to exchange the deleter (leaking Pinned's cudaHostRegister()). Do I have that right?

@ezyang (Contributor) commented Jan 25, 2019:

Hmm, yes, you're right. I guess you have two options:

  1. Union Delete and ReportAndDelete into one function that checks FLAGS_caffe2_report_cpu_memory_usage to see whether it should report.
  2. Branch on FLAGS_caffe2_report_cpu_memory_usage before calling compare_exchange_deleter, so you know which deleter to test for in each case.

FLAGS_caffe2_report_cpu_memory_usage is a big stick in my craw and I want to get rid of it if possible.

@facebook-github-bot left a comment:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jan 28, 2019:

@hartb Are you planning to look at this, or should one of us adopt this patch? Thanks!

@hartb (Contributor, Author) commented Jan 28, 2019:

I hope to update the PR with the suggestion above.

I'm thinking maybe: use baseAllocator_'s raw_deleter() to get the expected Delete function, CAFFE_ENFORCE() that the swap succeeds, and include a comment noting that if the swap does succeed in the Report... case, we leak a Reporter allocation (and probably break reporting in any case), but that's preferable to leaking the cudaHostRegister() registration.

It's taking a bit longer than I'd hoped to swap my build setup from 1.0.0 to master; I'll let you know if that derails me for some reason.

@hartb (Contributor, Author) commented Jan 29, 2019:

@ezyang Here's a proposed fix (one commit) implementing the above, based on PR 14517 (as @jerryzh168 mentioned above):

https://github.com/hartb/pytorch/tree/hartb-pr14517-add

Would you like to pick this over to that PR, or should I modify it to be based on master via this PR?

Note that my worry above about leaking or breaking the reporting case isn't an issue: the pinned Delete() calls the base allocator's deleter to finish the job, so we get the base/Default allocator's clean-up that way.

@dzhulgakov self-requested a review on Feb 8, 2019.
@ezyang (Contributor) commented Feb 10, 2019:

@hartb Feel free to force push! Sorry about the delay responding.

@hartb (Contributor, Author) commented Feb 12, 2019:

Will update this PR once I've tested the fix rebased onto master.

Allocations returned by the PinnedCPUAllocator must carry
that Allocator's Delete() function, or the cudaHostRegister()
registrations made by the pinned allocator will be leaked.

Ensure that in the NUMA case by swapping in the pinned allocator's
Delete() in place of the baseAllocator_'s deleter. The swap should
succeed unless something else already swapped the deleter (in
which case developer attention is required).

In the swap case, the pinned allocator's Delete() will call
baseAllocator_'s deleter explicitly, so any tear down actions to
be done there are preserved.
@ezyang (Contributor) left a review:
Sweet and simple

@facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
