c_api_distributed_test creates huge amount of threads and segfaults #47047
Labels: comp:apis, stat:awaiting tensorflower, TF 2.4, type:bug
System information
Describe the current behavior
When running
bazel test
on a system with a large physical core count, the test //tensorflow/c/eager:c_api_distributed_test
finishes and then segfaults on exit. When I manually set
OMP_NUM_THREADS=80
the test succeeds without a segfault, but at around 85 it crashes again. I'm unable to get a stack trace either through TensorFlow or through gdb, and even valgrind gives up with
It then prints the stacks of 500(!) threads. In GDB I was sometimes able to catch a part of the stack pointing to libiomp from the included llvm-OpenMP, but that was difficult and hard to reproduce. Usually the process would just be terminated even when in GDB.
Something I noticed: The ThreadPool(Device) creates a large number of threads which don't terminate until program exit. I don't think this is intended, and I suspect this is what triggers some limitation in the OpenMP runtime.
Also, the crash does not happen when not all subtests are run (via the GTest filter): excluding any one of the 5 (or 6?) makes the crash disappear.
Describe the expected behavior
Threads exit when ThreadPool is destroyed and no crash happens.
Standalone code to reproduce the issue
CUDA_VISIBLE_DEVICES=-1 gdb /dev/shm//tmpzWGWuq-bazel-tf/fdff6046a749a079864ed2bee7e018bf/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/c_api_distributed_test
Other info / logs
(yes the log ends here, no stack trace!)