
Roctracer crashes when number of samples too high #894

Closed
aaronenyeshi opened this issue Mar 28, 2024 · 0 comments
Labels
bug Something isn't working

Comments


aaronenyeshi commented Mar 28, 2024

As a workaround, we are currently using a thread_local mutex and thread_local unordered_map in RoctracerLogger.cpp's api_callback() function. See: https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerLogger.cpp#L99-L101

This is strange: we expect Roctracer runtime events to enter and exit api_callback() one at a time per thread, in order of start and end. However, if we remove the mutex, GPU tracing crashes during high QPS (examples/sec).
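
A minimal sketch of the workaround pattern described above (not the exact kineto code; the names `t_mutex` and `t_pendingOps` are illustrative):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sketch only: a thread_local mutex serializes any re-entrant work inside the
// callback on the same thread, and a thread_local map keeps per-thread state
// out of shared containers.
void api_callback_sketch(uint32_t domain, uint32_t cid,
                         const void* callback_data, void* arg) {
  static thread_local std::mutex t_mutex;                                  // hypothetical name
  static thread_local std::unordered_map<uint64_t, uint64_t> t_pendingOps; // hypothetical name

  std::lock_guard<std::mutex> guard(t_mutex);
  // ... inspect callback_data, record enter/exit timestamps, and keep the
  // correlation id in t_pendingOps until the matching exit callback arrives.
  (void)domain; (void)cid; (void)callback_data; (void)arg;
}
```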

Crashing stack trace (internal call stack frames replaced with ...):

*** Aborted at 1711481163 (Unix time, try 'date -d @1711481163') ***
*** Signal 11 (SIGSEGV) (0x659b9001bb778) received by PID 1816440 (pthread TID 0x7fad2bbd0640) (linux TID 1994978) (maybe from PID 1816440, UID 416185) (code: -6), stack trace: ***
    @ 000000004eecfb1d folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/experimental/symbolizer/SignalHandler.cpp:449
    @ 000000000004453f (unknown)
    @ 00000000000c567a std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
    @ 0000000046ed6bd5 std::_Rb_tree_iterator<std::pair<unsigned long const, unsigned long> > std::_Rb_tree<unsigned long, std::pair<unsigned long const, unsigned long>, std::_Select1st<std::pair<unsigned long const, unsigned long> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<unsigned long const, unsigned long> >, std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&)
                       fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_tree.h:2398
    @ 000000005adee4da RoctracerLogger::api_callback(unsigned int, unsigned int, void const*, void*)
                       fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_map.h:501
    @ 000000000002154c (unknown)
    @ 00000000001a4e3a (unknown)
    @ 0000000000173e62 hipMemcpyAsync
    @ 00000000442ef8cb at::native::copy_kernel_cuda(at::TensorIterator&, bool)
    @ 0000000050c5893a at::native::copy_impl(at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/native/DispatchStub.h:158
    @ 0000000050c5734c at::native::copy_(at::Tensor&, at::Tensor const&, bool)
                       ./fbcode/caffe2/aten/src/ATen/native/Copy.cpp:325
    @ 00000000554cb7d9 at::Tensor& c10::Dispatcher::callWithDispatchKeySlowPath<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, at::StepCallbacks&, c10::DispatchKeySet, c10::KernelFunction const&, at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/core/boxing/KernelFunction_impl.h:53
    @ 0000000055bcd146 at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/core/dispatch/Dispatcher.h:676
    @ 00000000510029a2 at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)
                       fbcode/ATen/core/TensorBody.h:2134

...

    @ 00000000664e82da std::thread::_State_impl<std::thread::_Invoker<std::tuple<caffe2::GPUTaskRunner::init()::$_0> > >::_M_run()
                       ./buck-out/v2/gen/fbcode/44a959e3c31b2479/caffe2/caffe2/fb/predictor/predictor_executor/__gpu_task_runner_hip_hipify_gen__/out/GPUTaskRunner.cpp:73
    @ 00000000000df4e4 execute_native_thread_routine
    @ 000000000009abae start_thread
    @ 000000000012d17b __clone3

The call stack seems to point to an RB tree / map used in the HIP runtime. I am guessing there is some race or interleaving of callbacks in the Roctracer code, and adding a mutex either slows things down enough or avoids some sort of contention.

Behavior

Actual:

- Crashes during high QPS without thread_local mutex.
- Runs successfully during high QPS with thread_local mutex.
- Runs successfully during low QPS with or without thread_local mutex.

Expected:

- Runs successfully during high QPS without needing a thread_local mutex.
- Runs successfully during low QPS without needing a thread_local mutex.

cc AMD @mwootton
cc @davidberard98 , @houseroad, @842974287, @xw285cornell, @malfet

@aaronenyeshi aaronenyeshi added bug Something isn't working help wanted Extra attention is needed labels Mar 28, 2024
aaronenyeshi added a commit to aaronenyeshi/kineto that referenced this issue Apr 11, 2024
Summary:
There is a map access at the bottom of the `api_callback()` function that is storing the correlation id and external correlation id relation. This map is shared between threads, and is causing a SIGSEGV crash during RBTree insert_and_rebalance.

To fix this, we add a mutex for the map. The previous thread_local mutex likely avoided crashes only by changing the timing, so it can now be removed.

Resolves: pytorch#894

Differential Revision: D56017970

Pulled By: aaronenyeshi
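
A minimal sketch of the fix described in the commit summary, assuming the correlation map is a tree-based map shared across callback threads (the names `s_correlationMutex`, `s_externalCorrelations`, and `recordExternalCorrelation` are illustrative, not kineto's actual identifiers):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

static std::mutex s_correlationMutex;                        // hypothetical name
static std::map<uint64_t, uint64_t> s_externalCorrelations;  // hypothetical name

void recordExternalCorrelation(uint64_t correlationId, uint64_t externalId) {
  // std::map insertion rebalances a red-black tree; unguarded concurrent
  // inserts from multiple callback threads can corrupt the tree and SIGSEGV,
  // matching the _Rb_tree_insert_and_rebalance frame in the trace above.
  std::lock_guard<std::mutex> lock(s_correlationMutex);
  s_externalCorrelations.emplace(correlationId, externalId);
}
```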
@aaronenyeshi aaronenyeshi removed the help wanted Extra attention is needed label Apr 11, 2024