Add gdb backtrace to qa tests when server fails to start within timeout #5310

Merged: 3 commits, Feb 4, 2023

@rmccorm4 (Collaborator) commented Feb 3, 2023

This helps debug server hangs going forward by capturing the relevant backtrace when the hang occurs, rather than having to reproduce it locally each time.

To demonstrate that it works, I added an intentional deadlock in model_lifecycle.cc on model load. Here's the output from the test. Note that even for a RELEASE build, this helps point to an issue in core:

Thread 1 (Thread 0x7f80077f3000 (LWP 638148)):
#0  __lll_lock_wait (futex=futex@entry=0x5581ab4f1980, private=0) at lowlevellock.c:52
#1  0x00007f8008ec60a3 in __GI___pthread_mutex_lock (mutex=0x5581ab4f1980) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f80088892c6 in triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
...
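
In a full dump, most threads are idle pool workers parked in `std::condition_variable::wait`, so the thread worth looking at is the one whose innermost frame is a lock wait. A minimal sketch for filtering a saved log, assuming the glibc frame name `__lll_lock_wait` seen in the excerpt above (other libc versions may name it differently); `blocked_on_mutex` is a hypothetical helper, not part of this PR:

```shell
#!/usr/bin/env bash
# Sketch only: assumes the "Thread N (...)" / "#0  func (...)" layout that
# `gdb -batch -ex 'thread apply all bt'` produced in the excerpt above.

# blocked_on_mutex LOG: hypothetical helper; prints the header line of each
# thread whose innermost frame (#0) is a low-level lock wait.
blocked_on_mutex() {
    awk '
        /^Thread [0-9]+ / { header = $0; next }        # remember the current thread header
        /^#0 / && /__lll_lock_wait/ { print header }   # frame #0 is waiting on a mutex
    ' "$1"
}
```

Run against the log from this test (for example `blocked_on_mutex gdb_bt.638148.log`), this should leave only the Thread 1 header, which is the thread stuck in `ModelLifeCycle::AsyncLoad`.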

For issues in shared libraries (backends, etc.) or issues that need more detailed information, I am looking to add a DEBUG build as well.

However, I think it's valuable to have this running in the RELEASE build tests too, in case a hang appears in one build type and not the other in CI:

Test + GDB Backtrace Log
root@rmccormick-dt:/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt# bash -x test.sh 
+ test -f /etc/shinit_v2
+ source /etc/shinit_v2
+++ sed -n 's/^NVRM.*Kernel Module *\([^() ]*\).*$/\1/p' /proc/driver/nvidia/version
+++ sed 's/^$/unknown/'
++ NV_DRIVER_VERS=525.78.01
++ export _CUDA_COMPAT_PATH=/usr/local/cuda/compat
++ _CUDA_COMPAT_PATH=/usr/local/cuda/compat
+++ hostname
++ _CUDA_COMPAT_CHECKFILE=/usr/local/cuda/compat/.525.78.01.rmccormick-dt.checked
++ _CUDA_COMPAT_REALLIB=/usr/local/cuda/compat/lib.real
++ _CUDA_COMPAT_SYMLINK=/usr/local/cuda/compat/lib
++ '[' '(' '(' -n 525.78.01 -a -e /dev/nvidiactl ')' -o -e /dev/nvgpu ')' -a '!' -e /usr/local/cuda/compat/.525.78.01.rmccormick-dt.checked ']'
+++ cat /sys/module/mlx5_core/version
+++ true
++ _DETECTED_MOFED=
++ '[' -n '' ']'
++ unset _DETECTED_MOFED
++ unset _CUDA_COMPAT_CHECKFILE
++ unset _CUDA_COMPAT_REALLIB
++ unset _CUDA_COMPAT_SYMLINK
+ '[' -z '' ']'
+ return
+ SERVER=/opt/tritonserver/bin/tritonserver
+ SERVER_ARGS=--model-repository=/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/models
+ SERVER_LOG=./inference_server.log
+ source ../common/util.sh
++ SERVER_IPADDR=localhost
++ SERVER_LOG=./inference_server.log
++ SERVER_TIMEOUT=120
++ SERVER_LD_PRELOAD=
++ MONITOR_FILE_TIMEOUT=10
+ export SERVER_TIMEOUT=10
+ SERVER_TIMEOUT=10
+ export CUDA_VISIBLE_DEVICES=
+ CUDA_VISIBLE_DEVICES=
+ echo 'Starting server with SERVER_TIMEOUT=10...'
Starting server with SERVER_TIMEOUT=10...
+ run_server
+ SERVER_PID=0
+ '[' -z /opt/tritonserver/bin/tritonserver ']'
+ '[' '!' -f /opt/tritonserver/bin/tritonserver ']'
+ '[' -z '' ']'
+ echo '=== Running /opt/tritonserver/bin/tritonserver --model-repository=/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/models'
=== Running /opt/tritonserver/bin/tritonserver --model-repository=/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/models
+ SERVER_PID=638148
+ wait_for_server_ready 638148 10
+ local spid=638148
+ shift
+ local wait_time_secs=10
+ shift
+ WAIT_RET=0
+ local wait_secs=10
+ LD_PRELOAD=:
+ echo 'In wait_for_server_ready: Waiting for 10 seconds...'
In wait_for_server_ready: Waiting for 10 seconds...
+ /opt/tritonserver/bin/tritonserver --model-repository=/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/models
+ test 10 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 9 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 8 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 7 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 6 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 5 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 4 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 3 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 2 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 1 -eq 0
+ kill -0 638148
+ sleep 1
+ set +e
++ curl -s -w '%{http_code}' localhost:8000/v2/health/ready
+ code=000
+ set -e
+ '[' 000 == 200 ']'
+ (( wait_secs-- ))
+ test 0 -eq 0
+ echo '=== Timeout 10 secs. Server not ready.'
=== Timeout 10 secs. Server not ready.
+ WAIT_RET=1
+ '[' 1 '!=' 0 ']'
+ command -v gdb
/usr/bin/gdb
+ GDB_LOG=gdb_bt.638148.log
+ echo -e '=== WARNING: SERVER HANG DETECTED, DUMPING GDB BACKTRACE TO [/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/gdb_bt.638148.log] ==='
=== WARNING: SERVER HANG DETECTED, DUMPING GDB BACKTRACE TO [/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt/gdb_bt.638148.log] ===
+ gdb -batch -ex 'thread apply all bt' -p 638148
+ tee gdb_bt.638148.log
[New LWP 638150]
[New LWP 638151]
[New LWP 638152]
[New LWP 638153]
[New LWP 638154]
[New LWP 638155]
[New LWP 638156]
[New LWP 638157]
[New LWP 638158]
[New LWP 638159]
[New LWP 638160]
[New LWP 638161]
[New LWP 638162]
[New LWP 638163]
[New LWP 638164]
[New LWP 638165]
[New LWP 638166]
[New LWP 638167]
[New LWP 638168]
[New LWP 638169]
[New LWP 638170]
[New LWP 638171]
[New LWP 638172]
[New LWP 638173]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__lll_lock_wait (futex=futex@entry=0x5581ab4f1980, private=0) at lowlevellock.c:52
52	lowlevellock.c: No such file or directory.

Thread 25 (Thread 0x7f7ffbaf1000 (LWP 638173)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 24 (Thread 0x7f7ffc2f2000 (LWP 638172)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 23 (Thread 0x7f7ffcaf3000 (LWP 638171)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 22 (Thread 0x7f7ffd2f4000 (LWP 638170)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 21 (Thread 0x7f7ffdaf5000 (LWP 638169)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 20 (Thread 0x7f7ffe2f6000 (LWP 638168)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 19 (Thread 0x7f7ffeaf7000 (LWP 638167)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 18 (Thread 0x7f7fff2f8000 (LWP 638166)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 17 (Thread 0x7f7fffaf9000 (LWP 638165)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 16 (Thread 0x7f80002fa000 (LWP 638164)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 15 (Thread 0x7f8000afb000 (LWP 638163)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 14 (Thread 0x7f80012fc000 (LWP 638162)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 13 (Thread 0x7f8001afd000 (LWP 638161)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 12 (Thread 0x7f80022fe000 (LWP 638160)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 11 (Thread 0x7f8002aff000 (LWP 638159)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 10 (Thread 0x7f8003300000 (LWP 638158)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f8003b01000 (LWP 638157)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f8004302000 (LWP 638156)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f8004b03000 (LWP 638155)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f8005304000 (LWP 638154)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f8005b05000 (LWP 638153)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f80067ea000 (LWP 638152)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f8006feb000 (LWP 638151)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f80077ec000 (LWP 638150)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5581ab4c2930) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5581ab4c28e0, cond=0x5581ab4c2908) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5581ab4c2908, mutex=0x5581ab4c28e0) at pthread_cond_wait.c:647
#3  0x00007f80083d4e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80089ca70a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f80083dade4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8008ec3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f80080c5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f80077f3000 (LWP 638148)):
#0  __lll_lock_wait (futex=futex@entry=0x5581ab4f1980, private=0) at lowlevellock.c:52
#1  0x00007f8008ec60a3 in __GI___pthread_mutex_lock (mutex=0x5581ab4f1980) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f80088892c6 in triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#3  0x00007f800889460a in triton::core::ModelRepositoryManager::LoadModelByDependency[abi:cxx11]() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#4  0x00007f800889e0de in triton::core::ModelRepositoryManager::PollAndUpdateInternal(bool*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f800889f9cc in triton::core::ModelRepositoryManager::Create(triton::core::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, bool, bool, triton::core::ModelLifeCycleOptions const&, std::unique_ptr<triton::core::ModelRepositoryManager, std::default_delete<triton::core::ModelRepositoryManager> >*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#6  0x00007f8008904592 in triton::core::InferenceServer::Init() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#7  0x00007f800891874d in TRITONSERVER_ServerNew () from /opt/tritonserver/bin/../lib/libtritonserver.so
#8  0x00005581aa7d89e9 in main ()
[Inferior 1 (process 638148) detached]
+ kill 638148
+ SERVER_PID=0
+ '[' 0 == 0 ']'
+ echo -e '\n***\n*** Failed to start /opt/tritonserver/bin/tritonserver\n***'

***
*** Failed to start /opt/tritonserver/bin/tritonserver
***
+ cat ./inference_server.log
W0203 20:42:29.697729 638148 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0203 20:42:29.697951 638148 cuda_memory_manager.cc:115] CUDA memory pool disabled
+ exit 1

nv-kmcgill53 previously approved these changes Feb 3, 2023
@tanmayv25
Contributor

This looks good. You can also try to capture gdb stack trace for an encountered segmentation fault.

@rmccorm4
Collaborator Author

rmccorm4 commented Feb 3, 2023

> This looks good. You can also try to capture gdb stack trace for an encountered segmentation fault.

@tanmayv25 Reading through the util script logic, I think this actually covers the case where the server exits early (segfault) as well. It looks like this part would return early and set WAIT_RET=1:

    until test $wait_secs -eq 0 ; do
        if ! kill -0 $spid; then
            echo "=== Server not running."
            WAIT_RET=1
            return
        fi

...

So my wording in "SERVER HANG DETECTED" may not be totally accurate, but I think both cases end up there.
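
To make the two cases explicit, here is a minimal sketch that separates a hang from an early exit once the wait has failed. The function name `classify_startup_failure` is hypothetical and not part of the util script; it only reuses the same `kill -0` liveness check shown above.

```shell
# Hypothetical helper, not part of the QA util script.
# Prints "hang" if the process is still alive (it did not become ready
# within the timeout), or "exited" if it is already gone (e.g. a crash).
classify_startup_failure() {
    pid="$1"
    # kill -0 sends no signal; it only checks that the PID can be signaled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "hang"
    else
        echo "exited"
    fi
}
```

Only the "hang" case leaves a live process that gdb can attach to.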


I will double check this.

@tanmayv25
Contributor

tanmayv25 commented Feb 3, 2023

@rmccorm4 You are right. I mistook WAIT_RET for the time remaining for the server to start. You can improve the wording of the message. Also, some changes would be required, since SERVER_PID will no longer be active after a segfault.

@rmccorm4
Collaborator Author

rmccorm4 commented Feb 3, 2023

Update: on a segfault, the process has already exited before reaching this point, so gdb fails with:

ptrace: No such process.

I'm not sure of the best way to handle this. It looks like you'd need a wait $SERVER_PID running to capture the background process's exit code (e.g. 139 for a segfault) and check for a segfault, but that would block the rest of the test from running.

Also, even if this worked, I think it would only capture segfaults during server startup, not segfaults triggered by a test/client request after the server has started successfully.

Some other ideas off the top of my head:

  1. Maybe wait $SERVER_PID could also run in the background, or something similar could live in our common utilities.
  2. Is it possible to add a signal handler for SIGSEGV in core as well? I don't think the boost::stacktrace::stacktrace provided by server/src/main.cc provides anything very useful in a RELEASE build when the segfault happens in core.

Either way, we can look into capturing segfaults separately from these changes. I'll update the language around "hang".
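
One caveat for idea 1: `wait` only works on children of the calling shell, so a `wait $SERVER_PID` issued from a separate background subshell would not see the server. A rough sketch of one possible workaround is to launch the command inside a background subshell that records the exit status to a file. The names `run_and_record`, `died_of_segfault`, and `WRAPPER_PID` are hypothetical and do not exist in the QA utilities.

```shell
# Hypothetical wrapper: runs a command inside a background subshell and
# writes its exit status to a file, so the test itself never blocks.
run_and_record() {
    status_file="$1"
    shift
    (
        status=0
        "$@" || status=$?
        echo "$status" > "$status_file"
    ) &
    # Note: this is the PID of the wrapper subshell, not of the command.
    WRAPPER_PID=$!
}

# Returns 0 if the recorded status is 139, i.e. 128 + 11 (SIGSEGV on Linux).
died_of_segfault() {
    [ -f "$1" ] && [ "$(cat "$1")" -eq 139 ]
}
```

A test could then call `died_of_segfault` at any later point, which would also cover a crash that happens after startup. The trade-off is that `$!` no longer refers to the server itself, so anything that attaches by PID (gdb, gcore) would need the real server PID from elsewhere.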

@rmccorm4 rmccorm4 changed the title Add gdb backtrace to qa tests when server exceeds wait timeout Add gdb backtrace to qa tests when server fails to start within timeout Feb 3, 2023
@rmccorm4
Collaborator Author

rmccorm4 commented Feb 4, 2023

@GuanLuo added core dump with gcore in a similar fashion, seems to work locally:

root@rmccormick-dt:/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt# ls
core.844819  gdb_bt.844819.log  inference_server.log  models2  models_be  models_core  test.sh

root@rmccormick-dt:/mnt/triton/jira/4548-debug-build/server/qa/L0_hang_bt# gdb tritonserver core.844819
...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/opt/tritonserver/bin/tritonserver'.
...
[Current thread is 1 (Thread 0x7fe48ac56000 (LWP 844819))]
(gdb) bt
#0  __lll_lock_wait (futex=futex@entry=0x55c5966ea980, private=0) at lowlevellock.c:52
#1  0x00007fe48c3290a3 in __GI___pthread_mutex_lock (mutex=0x55c5966ea980)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fe48bcec2c6 in triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
...
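
For reference, a minimal sketch of how a test could collect both artifacts for a live process, assuming gdb and gcore are installed. The function name `collect_debug_info` is hypothetical, and the output names are modeled on the files listed above (gdb_bt.<pid>.log, core.<pid>); the exact commands in the merged change may differ.

```shell
# Hypothetical helper; not the code from the merged change.
collect_debug_info() {
    pid="$1"
    # A process that has already exited (e.g. after a segfault) cannot be
    # attached to; gdb would fail with "ptrace: No such process."
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "=== Process $pid is not running, skipping gdb/gcore." >&2
        return 1
    fi
    # Backtrace of every thread, written to gdb_bt.<pid>.log.
    gdb -batch -ex "thread apply all bt" -p "$pid" > "gdb_bt.${pid}.log" 2>&1
    # gcore appends ".<pid>" to the prefix given with -o, giving core.<pid>.
    gcore -o core "$pid" >> "gdb_bt.${pid}.log" 2>&1
}
```

The resulting core file can then be opened offline with `gdb tritonserver core.<pid>`, as in the session above.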

@rmccorm4 rmccorm4 merged commit db529db into main Feb 4, 2023
@rmccorm4 rmccorm4 deleted the rmccormick-debug-build branch February 4, 2023 23:20