Program crashes (segmentation fault) during interrupted load tests using TensorRT/CUDA EP #24601

Open
dat58 opened this issue Apr 30, 2025 · 6 comments
Labels
ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider

Comments

@dat58

dat58 commented Apr 30, 2025

Describe the issue

I have a program written using Rust bindings (ort) for ONNX Runtime (C/C++ backend). The program serves 3 models with the TensorRT/CUDA execution provider, each configured with intra_threads=1 (larger values show the same issue). The service is exposed via a gRPC server built with tonic.

When performing load testing with Locust (a Python load testing tool) at 128 concurrent users (CCU), I observe the following behavior:

  • If I interrupt the Locust test (by clicking the "Stop" button), the program crashes ~80% of the time with a segmentation fault (not a Rust panic, suggesting the issue originates in ONNX Runtime).

  • If I let Locust run the test continuously for a full day (without interruption), the program runs without any errors.

  • As an alternative test, I wrote a separate program using a ThreadPool (128 workers) to call the gRPC server. When allowing this test to finish without interruption, I never encounter any errors. This has been confirmed through multiple test runs.

Stack Trace:

Thread 26 "hermes" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb50a5ff640 (LWP 77247)]
0x00007fb51ca436cb in onnxruntime::ConfigOptions::GetConfigEntry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
(gdb) backtrace
#0  0x00007fb51ca436cb in onnxruntime::ConfigOptions::GetConfigEntry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
#1  0x00007fb51ca4398d in onnxruntime::ConfigOptions::GetConfigOrDefault(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
#2  0x00007fb51c19775d in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) [clone .localalias] () from /lib/libonnxruntime.so
#3  0x00007fb51c19967d in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) () from /lib/libonnxruntime.so
#4  0x00007fb51c199ac9 in std::_Function_handler<void (), onnxruntime::InferenceSession::RunAsync(OrtRunOptions const*, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>, void (*)(void*, OrtValue**, unsigned long, OrtStatus*), void*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /lib/libonnxruntime.so
#5  0x00007fb51ccfe0a8 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop(int) () from /lib/libonnxruntime.so
#6  0x00007fb51ccfeed6 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop(int, Eigen::ThreadPoolInterface*) () from /lib/libonnxruntime.so
#7  0x00007fb51cd0ae9c in onnxruntime::(anonymous namespace)::PosixThread::ThreadMain(void*) () from /lib/libonnxruntime.so
#8  0x00007fb528617ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007fb5286a8a04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Occasionally, I also encounter this error:

Thread 35 "hermes" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fd524dfe640 (LWP 78603)]
0x00007fd53eac46c5 in onnxruntime::ConfigOptions::GetConfigEntry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
(gdb) 
(gdb) backtrace
#0  0x00007fd53eac46c5 in onnxruntime::ConfigOptions::GetConfigEntry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
#1  0x00007fd53eac498d in onnxruntime::ConfigOptions::GetConfigOrDefault(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /lib/libonnxruntime.so
#2  0x00007fd53e21875d in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) [clone .localalias] () from /lib/libonnxruntime.so
#3  0x00007fd53e21a67d in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) () from /lib/libonnxruntime.so
#4  0x00007fd53e21aac9 in std::_Function_handler<void (), onnxruntime::InferenceSession::RunAsync(OrtRunOptions const*, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>, void (*)(void*, OrtValue**, unsigned long, OrtStatus*), void*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /lib/libonnxruntime.so
#5  0x00007fd53ed7f0a8 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop(int) () from /lib/libonnxruntime.so
#6  0x00007fd53ed7fed6 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop(int, Eigen::ThreadPoolInterface*) () from /lib/libonnxruntime.so
#7  0x00007fd53ed8be9c in onnxruntime::(anonymous namespace)::PosixThread::ThreadMain(void*) () from /lib/libonnxruntime.so
#8  0x00007fd54aedbac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007fd54af6ca04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

To reproduce

Let me know if you need further details to reproduce the issue.

My machine information:

OS: Ubuntu 22.04
GPU: Nvidia A40
CPU: AMD EPYC Processor (with IBPB) (2 Socket with 16 core each)
RAM: 256GB

Urgency

This issue prevents the program from running in production.

Platform

Linux

OS Version

Ubuntu 22.04.5 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.21.1

ONNX Runtime API

Other / Unknown

Architecture

X64

Execution Provider

CUDA, TensorRT

Execution Provider Library Version

nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 && Tensorrt@8.6.1.6-1+cuda11.8

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Apr 30, 2025
@dat58
Author

dat58 commented May 1, 2025

I compiled ONNX Runtime with Debug flags, resulting in more informative stack traces:

Log output immediately before the segmentation fault:

2025-05-01T02:52:41.351111Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.351267Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.313849Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.316761Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.358440Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.358510Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.318017Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 38 released
2025-05-01T02:52:41.366365Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 47 released
2025-05-01T02:52:41.366380Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 56 released
2025-05-01T02:52:41.366393Z TRACE ort{id="" location="sequential_executor.cc:577 ExecuteKernel"}: ort::environment: stream 0 launch kernel with idx 42
2025-05-01T02:52:41.375351Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.375397Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.375504Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.375603Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.375623Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.375847Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.377925Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.378068Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.276622Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 37 released
2025-05-01T02:52:41.378743Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 38 released
2025-05-01T02:52:41.378776Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 47 released
2025-05-01T02:52:41.378797Z TRACE ort{id="" location="stream_execution_context.cc:171 RecycleNodeInputs"}: ort::environment: ort value 56 released
2025-05-01T02:52:41.378830Z TRACE ort{id="" location="sequential_executor.cc:577 ExecuteKernel"}: ort::environment: stream 0 launch kernel with idx 42
2025-05-01T02:52:41.380197Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.380404Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.381317Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.381671Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.386848Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.386934Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.387017Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.387130Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.394341Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.394473Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.394502Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.394686Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.402430Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.402598Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.402675Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.402773Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.407477Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 2
2025-05-01T02:52:41.407521Z TRACE ort{id="" location="sequential_executor.cc:593 ExecuteThePlan"}: ort::environment: Number of streams: 1
2025-05-01T02:52:41.407607Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
2025-05-01T02:52:41.407772Z TRACE ort{id="" location="sequential_executor.cc:185 SessionScope"}: ort::environment: Begin execution
Thread 9 "hermes" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff84e3fe640 (LWP 480)]
std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (this=0x7ff70a0be470, __bkt=23461684946685, __k="gpu_graph_id", __code=4284862625839974261) at /usr/include/c++/11/bits/hashtable.h:1833
1833	      __node_base_ptr __prev_p = _M_buckets[__bkt];
(gdb) backtrace
#0  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (this=0x7ff70a0be470, __bkt=23461684946685, __k="gpu_graph_id", __code=4284862625839974261)
    at /usr/include/c++/11/bits/hashtable.h:1833
#1  0x00007ff852e2ff76 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (this=0x7ff70a0be470, __bkt=23461684946685, __key="gpu_graph_id", __c=4284862625839974261)
    at /usr/include/c++/11/bits/hashtable.h:810
#2  0x00007ff852fa20ef in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find (this=0x7ff70a0be470, __k="gpu_graph_id") at /usr/include/c++/11/bits/hashtable.h:1610
#3  0x00007ff852fa0275 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::find (
    this=0x7ff70a0be470, __x="gpu_graph_id") at /usr/include/c++/11/bits/unordered_map.h:880
#4  0x00007ff853d85f1c in onnxruntime::ConfigOptions::GetConfigEntry (this=0x7ff70a0be470, config_key="gpu_graph_id") at /onnxruntime/onnxruntime/core/framework/config_options.cc:11
#5  0x00007ff853d86082 in onnxruntime::ConfigOptions::GetConfigOrDefault (this=0x7ff70a0be470, config_key="gpu_graph_id", default_value="")
    at /onnxruntime/onnxruntime/core/framework/config_options.cc:29
#6  0x00007ff852f359ab in onnxruntime::InferenceSession::Run (this=0x7ff7c4562400, run_options=..., feed_names=..., feeds=..., output_names=..., p_fetches=0x7ff84e3fb110, p_fetches_device_info=0x0)
    at /onnxruntime/onnxruntime/core/session/inference_session.cc:2594
#7  0x00007ff852f379c1 in onnxruntime::InferenceSession::Run (this=0x7ff7c4562400, run_options=..., feed_names=..., feeds=..., fetch_names=..., fetches=...)
    at /onnxruntime/onnxruntime/core/session/inference_session.cc:2841
#8  0x00007ff852f37fe5 in operator() (__closure=0x7ff80ce5a370) at /onnxruntime/onnxruntime/core/session/inference_session.cc:2883
#9  0x00007ff852f76ff4 in std::__invoke_impl<void, onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#10 0x00007ff852f6a9bc in std::__invoke_r<void, onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:111
#11 0x00007ff852f56aa6 in std::_Function_handler<void(), onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/11/bits/std_function.h:290
#12 0x00007ff852f3d08a in std::function<void ()>::operator()() const (this=0x7ff84e3fb520) at /usr/include/c++/11/bits/std_function.h:590
#13 0x00007ff854309078 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop (this=0x7ff8566e9460, thread_id=4)
    at /onnxruntime/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h:1635
#14 0x00007ff854307a6c in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop (id=4, param=0x7ff8566e9460)
    at /onnxruntime/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h:705
#15 0x00007ff854313415 in onnxruntime::(anonymous namespace)::PosixThread::ThreadMain (param=0x7ff85673e240) at /onnxruntime/onnxruntime/core/platform/posix/env.cc:244
#16 0x00007ff860b07ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#17 0x00007ff860b98a04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Thread 13 "hermes" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fe96b3ff640 (LWP 693)]
0x00007fe972800db5 in std::__detail::_Mod_range_hashing::operator() (this=0x7fe96b3fb857, __num=4284862625839974261, __den=0) at /usr/include/c++/11/bits/hashtable_policy.h:430
430	    { return __num % __den; }
(gdb) backtrace
#0  0x00007fe972800db5 in std::__detail::_Mod_range_hashing::operator() (this=0x7fe96b3fb857, __num=4284862625839974261, __den=0) at /usr/include/c++/11/bits/hashtable_policy.h:430
#1  0x00007fe97283b4d0 in std::__detail::_Hash_code_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Select1st, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, true>::_M_bucket_index (this=0x7fe933e3f830, __c=4284862625839974261, __bkt_count=0)
    at /usr/include/c++/11/bits/hashtable_policy.h:1233
#2  0x00007fe97282ff3f in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_bucket_index (this=0x7fe933e3f830, __c=4284862625839974261) at /usr/include/c++/11/bits/hashtable.h:795
#3  0x00007fe9729a20d3 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find (this=0x7fe933e3f830, __k="gpu_graph_id") at /usr/include/c++/11/bits/hashtable.h:1609
#4  0x00007fe9729a0275 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::find (
    this=0x7fe933e3f830, __x="gpu_graph_id") at /usr/include/c++/11/bits/unordered_map.h:880
#5  0x00007fe973785f1c in onnxruntime::ConfigOptions::GetConfigEntry (this=0x7fe933e3f830, config_key="gpu_graph_id") at /onnxruntime/onnxruntime/core/framework/config_options.cc:11
#6  0x00007fe973786082 in onnxruntime::ConfigOptions::GetConfigOrDefault (this=0x7fe933e3f830, config_key="gpu_graph_id", default_value="")
    at /onnxruntime/onnxruntime/core/framework/config_options.cc:29
#7  0x00007fe9729359ab in onnxruntime::InferenceSession::Run (this=0x7fe89f060800, run_options=..., feed_names=..., feeds=..., output_names=..., p_fetches=0x7fe96b3fc110, p_fetches_device_info=0x0)
    at /onnxruntime/onnxruntime/core/session/inference_session.cc:2594
#8  0x00007fe9729379c1 in onnxruntime::InferenceSession::Run (this=0x7fe89f060800, run_options=..., feed_names=..., feeds=..., fetch_names=..., fetches=...)
    at /onnxruntime/onnxruntime/core/session/inference_session.cc:2841
#9  0x00007fe972937fe5 in operator() (__closure=0x7fe9356572a0) at /onnxruntime/onnxruntime/core/session/inference_session.cc:2883
#10 0x00007fe972976ff4 in std::__invoke_impl<void, onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#11 0x00007fe97296a9bc in std::__invoke_r<void, onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:111
#12 0x00007fe972956aa6 in std::_Function_handler<void(), onnxruntime::InferenceSession::RunAsync(const RunOptions*, gsl::span<char const* const>, gsl::span<const OrtValue* const>, gsl::span<char const* const>, gsl::span<OrtValue*>, RunAsyncCallbackFn, void*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/11/bits/std_function.h:290
#13 0x00007fe97293d08a in std::function<void ()>::operator()() const (this=0x7fe96b3fc520) at /usr/include/c++/11/bits/std_function.h:590
#14 0x00007fe973d09078 in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop (this=0x7fe9760e9460, thread_id=7)
    at /onnxruntime/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h:1635
#15 0x00007fe973d07a6c in onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop (id=7, param=0x7fe9760e9460)
    at /onnxruntime/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h:705
#16 0x00007fe973d13415 in onnxruntime::(anonymous namespace)::PosixThread::ThreadMain (param=0x7fe97613e380) at /onnxruntime/onnxruntime/core/platform/posix/env.cc:244
#17 0x00007fe980499ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#18 0x00007fe98052aa04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
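
One possible reading of the two debug traces above: both fail inside the `std::unordered_map` lookup for the key `gpu_graph_id` in `ConfigOptions`, and in both the table's bucket bookkeeping looks invalid. In the SIGFPE trace `__bkt_count=0`, so `__num % __den` divides by zero. In the SIGSEGV trace `__bkt=23461684946685`, which a modulo-based index can only produce if the bucket count is larger than that, so `_M_buckets[__bkt]` reads far out of bounds. This is consistent with, but does not prove, the map's memory having been freed or overwritten while `Run` was still reading it. The sketch below models only the index arithmetic, in Rust to match the reporter's program; `bucket_index` and `index_is_implausible` are hypothetical helpers, not ONNX Runtime or libstdc++ code, and the bucket sizes used are arbitrary illustrative values.

```rust
/// Hypothetical model of a modulo-based bucket lookup (hash % bucket_count).
/// Returns None exactly where the C++ expression `__num % __den` would trap.
fn bucket_index(hash: u64, bucket_count: u64) -> Option<u64> {
    hash.checked_rem(bucket_count)
}

/// True when a bucket index seen in a debugger cannot belong to a table with
/// at most `plausible_max_buckets` buckets, i.e. the table state is garbage.
fn index_is_implausible(bucket: u64, plausible_max_buckets: u64) -> bool {
    bucket >= plausible_max_buckets
}

fn main() {
    // __code for "gpu_graph_id", copied from the traces.
    let hash: u64 = 4284862625839974261;

    // SIGFPE case: the bucket count was read as 0.
    println!("bucket count 0  -> {:?}", bucket_index(hash, 0));

    // A healthy small table (13 buckets is an arbitrary illustrative size).
    println!("bucket count 13 -> {:?}", bucket_index(hash, 13));

    // SIGSEGV case: the index gdb reported, checked against a generous
    // upper bound (2^20 buckets, an assumption) for a small config map.
    println!(
        "index 23461684946685 implausible: {}",
        index_is_implausible(23461684946685, 1 << 20)
    );
}
```

The same hash yielding a zero divisor in one crash and an enormous index in the other suggests the table header held different garbage each time, which is what one would expect if the object's lifetime had already ended.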

@xadupre
Member

xadupre commented May 6, 2025

If I understand correctly, you would like to be able to interrupt Locust tests without onnxruntime crashing?

@dat58
Author

dat58 commented May 6, 2025

Yes, but to clarify: Interrupting a Locust test may resemble real-world scenarios where multiple requests are sent to ONNX Runtime, and network issues between ONNX Runtime and the user could corrupt the connection.

@xadupre
Member

xadupre commented May 6, 2025

But even if the connection is lost, the thread should continue to run. I don't know Locust, but I wonder how the thread running onnxruntime is terminated. Maybe the memory holding the data onnxruntime is working with is freed before onnxruntime finishes.

@yuslepukhin
Member

To gracefully shut down a running inference, use the terminate flag, which is a member of RunOptions.
More recently, we introduced a load cancelation flag, a member of SessionOptions, which helps with big models that take a long time to load.
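
As a rough illustration of the terminate-flag idea: in the C API the flag is set with `RunOptionsSetTerminate`, and a run that observes it is expected to stop early and return an error status. The sketch below is not ONNX Runtime code. It models the same cooperative pattern with a plain `AtomicBool` in std-only Rust, and `RunFlags` and `run_steps` are hypothetical names chosen for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for per-run options that carry a terminate flag.
struct RunFlags {
    terminate: AtomicBool,
}

/// Simulated inference: performs up to `steps` units of work and checks the
/// flag before each one. Returns Ok(steps) when it completes, or Err(done)
/// with the number of finished steps when it was asked to terminate.
fn run_steps(flags: &RunFlags, steps: u32) -> Result<u32, u32> {
    for done in 0..steps {
        if flags.terminate.load(Ordering::Acquire) {
            return Err(done);
        }
        // A real kernel launch would happen here.
    }
    Ok(steps)
}

fn main() {
    let flags = Arc::new(RunFlags {
        terminate: AtomicBool::new(false),
    });

    // The request is cancelled before the run starts, so the outcome below
    // is deterministic.
    flags.terminate.store(true, Ordering::Release);

    // The worker owns its own Arc clone, so the flag outlives the caller.
    let worker_flags = Arc::clone(&flags);
    let worker = thread::spawn(move || run_steps(&worker_flags, 1_000));

    // Wait for the run to return before releasing anything it reads.
    let outcome = worker.join().expect("worker panicked");
    println!("{:?}", outcome);
}
```

Setting the flag only asks the run to stop. The caller still has to wait for the run (or its completion callback) to return before freeing the options or buffers it was given.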

@dat58
Author

dat58 commented May 9, 2025

I have tested many scenarios and realized that when the connection is interrupted on the client side (Locust), the gRPC server (Tonic) terminates the concurrent request. At the same time, the payload is still being processed by ONNX Runtime, and some memory may be released before ONNX Runtime finishes execution, which leads to a program crash.
So I guess this scenario wouldn’t happen in other languages that use garbage collectors. Currently, I’ve found a solution and resolved this issue with very little performance overhead.
Thanks for taking the time!
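
The fix itself is not described in the thread. One common way to avoid this class of crash, offered under that caveat, is to give the in-flight work shared ownership of everything it reads, so that dropping the request handler when the client disconnects cannot free it early. The example is std-only Rust with a thread standing in for the asynchronous run; `RunState` and `spawn_inference` are hypothetical names and no `ort` or tonic API is used.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical bundle of everything the background run reads.
struct RunState {
    input: Vec<f32>,
}

/// Starts a simulated inference on another thread. The worker holds its own
/// Arc clone, so the state stays alive until the worker is done with it,
/// regardless of what the caller does with its handle.
fn spawn_inference(state: Arc<RunState>, start: mpsc::Receiver<()>) -> JoinHandle<f32> {
    thread::spawn(move || {
        // Block until signalled, so the caller can drop its handle first.
        let _ = start.recv();
        state.input.iter().sum()
    })
}

fn main() {
    let state = Arc::new(RunState {
        input: vec![1.0, 2.0, 3.0],
    });
    let (start_tx, start_rx) = mpsc::channel();
    let worker = spawn_inference(Arc::clone(&state), start_rx);

    // Simulate the request being cancelled: the handler's reference goes
    // away before the worker has touched the data.
    drop(state);
    start_tx.send(()).expect("worker exited early");

    // The worker still reads valid memory because it shares ownership.
    let result = worker.join().expect("worker panicked");
    println!("{result}");
}
```

With a raw-pointer FFI call the compiler cannot enforce this, so the equivalent discipline is to move the owning values into whatever object lives until the completion callback has fired.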


3 participants