Skip to content

Commit 0c54ea5

Browse files
swolchokfacebook-github-bot
authored andcommitted
[PyTorch] Avoid atomic refcounting in intrusive_ptr::make (#47100)
Summary: Pull Request resolved: #47100 Profiling with Linux `perf` shows that we spend at least 1% of our time doing this increment in our framework overhead benchmark. Here's the inline function breakdown for empty_cpu, which takes 6.91% of the total time: ``` - at::native::empty_cpu - 1.91% at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined) - 0.98% c10::make_intrusive<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl>, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined 0.97% c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> >::make<c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> 0.84% intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> > (inlined) - 1.44% c10::make_intrusive<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl>, c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined) - 1.44% c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::make<c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined) 1.02% std::__atomic_base<unsigned long>::operator++ (inlined) - 0.80% ~DataPtr (inlined) ~UniqueVoidPtr (inlined) ~unique_ptr (inlined) - 0.78% c10::TensorOptions::memory_format (inlined) - c10::TensorOptions::set_memory_format (inlined) - c10::optional<c10::MemoryFormat>::operator bool (inlined) c10::optional<c10::MemoryFormat>::initialized (inlined) ``` This change comes with a caveat: if we have constructors where `this` escapes to another thread before returning, we cannot make this assumption, because that other thread may have called `intrusive_ptr::make` already. I chose to just mandate that `instrusive_ptr_target`s's ctors hand back exclusive ownership of `this`, which seems like a reasonable requirement for a ctor anyway. If that turns out to be unacceptable, we could provide an opt-out from this optimization via a traits struct or similar template metaprogramming shenanigan. ghstack-source-id: 116368592 Test Plan: Run framework overhead benchmark. Results look promising, ranging from a tiny regression (? presumably noise) on the InPlace benchmark, 2.5% - 4% on OutOfPlace, to 9% on the empty benchmarks and 10-12% on the view benchmarks. Reviewed By: ezyang Differential Revision: D24606531 fbshipit-source-id: 1cf022063dab71cd1538535c72c4844d8dd7bb25
1 parent f2b7c38 commit 0c54ea5

File tree

1 file changed

+18
-2
lines changed

1 file changed

+18
-2
lines changed

c10/util/intrusive_ptr.h

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,8 @@ namespace raw {
2828
* performance because it does the refcounting intrusively
2929
* (i.e. in a member of the object itself).
3030
* Your class T needs to inherit from intrusive_ptr_target to allow it to be
31-
* used in an intrusive_ptr<T>.
31+
* used in an intrusive_ptr<T>. Your class's constructor should not allow
32+
*`this` to escape to other threads or create an intrusive_ptr from `this`.
3233
*/
3334

3435
// Note [Stack allocated intrusive_ptr_target safety]
@@ -396,7 +397,22 @@ class intrusive_ptr final {
396397
*/
397398
template <class... Args>
398399
static intrusive_ptr make(Args&&... args) {
399-
return intrusive_ptr(new TTarget(std::forward<Args>(args)...));
400+
auto result = intrusive_ptr(new TTarget(std::forward<Args>(args)...), raw::DontIncreaseRefcount{});
401+
402+
// We just created result.target_, so we know no other thread has
403+
// access to it, so we know we needn't care about memory ordering.
404+
// (On x86_64, a store with memory_order_relaxed generates a plain old
405+
// `mov`, whereas an atomic increment does a lock-prefixed `add`, which is
406+
// much more expensive: https://godbolt.org/z/eKPzj8.)
407+
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
408+
result.target_->refcount_ == 0 && result.target_->weakcount_ == 0,
409+
"intrusive_ptr: Newly-created target had non-zero refcounts. Does its "
410+
"constructor do something strange like incref or create an intrusive_ptr"
411+
"from `this`?");
412+
result.target_->refcount_.store(1, std::memory_order_relaxed);
413+
result.target_->weakcount_.store(1, std::memory_order_relaxed);
414+
415+
return result;
400416
}
401417

402418
/**

0 commit comments

Comments
 (0)