
Roctracer crashes when number of samples too high #894

Closed
aaronenyeshi opened this issue Mar 28, 2024 · 0 comments
Labels
bug Something isn't working

Comments


aaronenyeshi commented Mar 28, 2024

As a workaround, we are currently using a thread_local mutex and thread_local unordered_map in RoctracerLogger.cpp's api_callback() function. See: https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerLogger.cpp#L99-L101

This is strange: we expect Roctracer runtime events to enter and exit api_callback() one at a time per thread, in order of start and end. However, if we remove the mutex, GPU tracing crashes during high QPS (examples/sec).
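
A minimal sketch of the workaround pattern described above (not the exact kineto code; the names `t_mutex` and `t_pendingOps` are illustrative):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sketch only: a thread_local mutex serializes any re-entrant work inside the
// callback on the same thread, and a thread_local map keeps per-thread state
// out of shared containers.
void api_callback_sketch(uint32_t domain, uint32_t cid,
                         const void* callback_data, void* arg) {
  static thread_local std::mutex t_mutex;                                  // hypothetical name
  static thread_local std::unordered_map<uint64_t, uint64_t> t_pendingOps; // hypothetical name

  std::lock_guard<std::mutex> guard(t_mutex);
  // ... inspect callback_data, record enter/exit timestamps, and keep the
  // correlation id in t_pendingOps until the matching exit callback arrives.
  (void)domain; (void)cid; (void)callback_data; (void)arg;
}
```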

Crashing stack trace (internal call stack frames replaced with ...):

*** Aborted at 1711481163 (Unix time, try 'date -d @1711481163') ***
*** Signal 11 (SIGSEGV) (0x659b9001bb778) received by PID 1816440 (pthread TID 0x7fad2bbd0640) (linux TID 1994978) (maybe from PID 1816440, UID 416185) (code: -6), stack trace: ***
    @ 000000004eecfb1d folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/experimental/symbolizer/SignalHandler.cpp:449
    @ 000000000004453f (unknown)
    @ 00000000000c567a std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
    @ 0000000046ed6bd5 std::_Rb_tree_iterator<std::pair<unsigned long const, unsigned long> > std::_Rb_tree<unsigned long, std::pair<unsigned long const, unsigned long>, std::_Select1st<std::pair<unsigned long const, unsigned long> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<unsigned long const, unsigned long> >, std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&)
                       fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_tree.h:2398
    @ 000000005adee4da RoctracerLogger::api_callback(unsigned int, unsigned int, void const*, void*)
                       fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_map.h:501
    @ 000000000002154c (unknown)
    @ 00000000001a4e3a (unknown)
    @ 0000000000173e62 hipMemcpyAsync
    @ 00000000442ef8cb at::native::copy_kernel_cuda(at::TensorIterator&, bool)
    @ 0000000050c5893a at::native::copy_impl(at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/native/DispatchStub.h:158
    @ 0000000050c5734c at::native::copy_(at::Tensor&, at::Tensor const&, bool)
                       ./fbcode/caffe2/aten/src/ATen/native/Copy.cpp:325
    @ 00000000554cb7d9 at::Tensor& c10::Dispatcher::callWithDispatchKeySlowPath<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, at::StepCallbacks&, c10::DispatchKeySet, c10::KernelFunction const&, at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/core/boxing/KernelFunction_impl.h:53
    @ 0000000055bcd146 at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool)
                       fbcode/ATen/core/dispatch/Dispatcher.h:676
    @ 00000000510029a2 at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)
                       fbcode/ATen/core/TensorBody.h:2134

...

    @ 00000000664e82da std::thread::_State_impl<std::thread::_Invoker<std::tuple<caffe2::GPUTaskRunner::init()::$_0> > >::_M_run()
                       ./buck-out/v2/gen/fbcode/44a959e3c31b2479/caffe2/caffe2/fb/predictor/predictor_executor/__gpu_task_runner_hip_hipify_gen__/out/GPUTaskRunner.cpp:73
    @ 00000000000df4e4 execute_native_thread_routine
    @ 000000000009abae start_thread
    @ 000000000012d17b __clone3

The call stack seems to point to an RB tree / map used in the HIP runtime. I am guessing there is some race or interleaving of callbacks in the Roctracer code, and adding a mutex either slows things down enough or avoids some sort of contention.

Behavior

Actual:

- Crashes during high QPS without thread_local mutex.
- Runs successfully during high QPS with thread_local mutex.
- Runs successfully during low QPS with or without thread_local mutex.

Expected:

- Runs successfully during high QPS without needing a thread_local mutex.
- Runs successfully during low QPS without needing a thread_local mutex.

cc AMD @mwootton
cc @davidberard98 , @houseroad, @842974287, @xw285cornell, @malfet

@aaronenyeshi aaronenyeshi added bug Something isn't working help wanted Extra attention is needed labels Mar 28, 2024
aaronenyeshi added a commit to aaronenyeshi/kineto that referenced this issue Apr 11, 2024
Summary:
There is a map access at the bottom of the `api_callback()` function that is storing the correlation id and external correlation id relation. This map is shared between threads, and is causing a SIGSEGV crash during RBTree insert_and_rebalance.

To fix this, we add a mutex for the map. The previous thread_local mutex likely avoided crashes only by changing the timing, so it can now be removed.

Resolves: pytorch#894

Differential Revision: D56017970

Pulled By: aaronenyeshi
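
A minimal sketch of the fix described in the commit summary, assuming the correlation map is a tree-based map shared across callback threads (the names `s_correlationMutex`, `s_externalCorrelations`, and `recordExternalCorrelation` are illustrative, not kineto's actual identifiers):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

static std::mutex s_correlationMutex;                        // hypothetical name
static std::map<uint64_t, uint64_t> s_externalCorrelations;  // hypothetical name

void recordExternalCorrelation(uint64_t correlationId, uint64_t externalId) {
  // std::map insertion rebalances a red-black tree; unguarded concurrent
  // inserts from multiple callback threads can corrupt the tree and SIGSEGV,
  // matching the _Rb_tree_insert_and_rebalance frame in the trace above.
  std::lock_guard<std::mutex> lock(s_correlationMutex);
  s_externalCorrelations.emplace(correlationId, externalId);
}
```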
@aaronenyeshi aaronenyeshi removed the help wanted Extra attention is needed label Apr 11, 2024