Summary:
There is a map access at the bottom of the `api_callback()` function that stores the mapping from correlation id to external correlation id. This map is shared between threads, and concurrent access to it causes a SIGSEGV crash during RBTree insert_and_rebalance.
To fix this, we add a mutex for the map. A thread_local mutex cannot serialize access across threads, so the previous thread_local mutex most likely avoided the crashes only by delaying timing. We can remove the thread_local mutex now.
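The fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual Kineto change: `CorrelationMap`, its methods, and `runConcurrentInserts` are hypothetical names, and the only point is that a single mutex shared by all threads guards every access to the shared map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the correlation-id bookkeeping (not Kineto's code).
class CorrelationMap {
 public:
  // Record the external id for a correlation id. Safe to call from any thread.
  void insert(std::uint64_t correlationId, std::uint64_t externalId) {
    std::lock_guard<std::mutex> lock(mutex_);
    map_[correlationId] = externalId;
  }

  // Look up an external id; returns false if the correlation id is unknown.
  bool lookup(std::uint64_t correlationId, std::uint64_t& externalId) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(correlationId);
    if (it == map_.end()) return false;
    externalId = it->second;
    return true;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.size();
  }

 private:
  mutable std::mutex mutex_;  // one mutex shared by all threads, NOT thread_local
  std::map<std::uint64_t, std::uint64_t> map_;  // typically a red-black tree
};

// Simulates many threads inserting at once, as concurrent callbacks would.
// Returns the number of entries stored once all threads have finished.
std::size_t runConcurrentInserts(int numThreads, int perThread) {
  assert(numThreads > 0 && perThread > 0);
  CorrelationMap cmap;
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&cmap, t, perThread] {
      for (int i = 0; i < perThread; ++i) {
        // Each thread writes a disjoint range of ids.
        std::uint64_t id = static_cast<std::uint64_t>(t) * perThread + i;
        cmap.insert(id, id + 1000);
      }
    });
  }
  for (auto& w : workers) w.join();
  return cmap.size();
}
```

Without the lock in `insert`, concurrent writers could corrupt the tree during rebalancing, which would be consistent with the crash signature reported here.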
Resolves: pytorch#894
Differential Revision: D56017970
Pulled By: aaronenyeshi
As a workaround, we are currently using a thread_local mutex and thread_local unordered_map in RoctracerLogger.cpp's `api_callback()` function. See: https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerLogger.cpp#L99-L101

This is strange. We expect that Roctracer runtime events will only enter and exit `api_callback()` one event at a time per thread, and in order of start and end. However, if we remove the mutex, GPU tracing will crash during high QPS (examples/sec).

Crashing stack trace (replaced the internal call stack with ...):

The call stack seems to point to an RB tree / set used in the HIP runtime. I am guessing there is some race or interleaving of callbacks in the Roctracer code, and adding a mutex either slows it down enough or avoids some sort of contention.
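One detail that may help when weighing the two guesses above: in C++, a `thread_local` mutex is a separate object in every thread, so locking it cannot exclude other threads. The sketch below is only an illustration of that language rule; `localMutexAddress` and `threadsSeeDistinctMutexes` are hypothetical helpers and are not part of Kineto or Roctracer.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical helper: returns the address of the calling thread's own
// copy of a thread_local mutex.
std::uintptr_t localMutexAddress() {
  thread_local std::mutex m;
  return reinterpret_cast<std::uintptr_t>(&m);
}

// Returns true if the calling thread and a worker thread each see a
// different mutex object. The caller's mutex stays alive while the worker
// runs, so the two objects cannot share an address.
bool threadsSeeDistinctMutexes() {
  std::uintptr_t callerAddr = localMutexAddress();
  std::uintptr_t workerAddr = 0;
  std::thread worker([&workerAddr] { workerAddr = localMutexAddress(); });
  worker.join();
  return callerAddr != workerAddr;
}
```

If that holds in this setting, the "slows it down enough" explanation looks more plausible than the mutex resolving contention between threads.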
Behavior
Actual:
Crashes during high QPS without `thread_local mutex`.
Runs successfully during high QPS with `thread_local mutex`.
Runs successfully during low QPS with or without `thread_local mutex`.
Expected:
Runs successfully during high QPS without needing a `thread_local mutex`.
Runs successfully during low QPS without needing a `thread_local mutex`.

cc AMD @mwootton
cc @davidberard98 , @houseroad, @842974287, @xw285cornell, @malfet