
GPU build of the runtime/libtorch module: websocket_server_main exits intermittently under load testing #1959

Closed
raycool opened this issue Aug 23, 2023 · 4 comments


raycool commented Aug 23, 2023

With CUDA 11.3, after building with -DGPU=ON, I start the server. As shown in the screenshot below, it starts up normally and the GPU is in use.
[screenshot: server starts normally and uses the GPU]
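
For context, here is a minimal sketch of the build and launch being described (CUDA 11.3, GPU build). The server flags and the port number below are illustrative assumptions, not taken from my logs; check ./build/bin/websocket_server_main --help for the exact options in your checkout.

# Hedged sketch of the build and launch described above (CUDA 11.3, GPU build).
# The server flags and port are assumptions for illustration only.
cd runtime/libtorch
cmake -B build -DGPU=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
./build/bin/websocket_server_main --port 10086 \
    --model_path final.zip --unit_path units.txt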

I then run a load test. The server runs on a V100 (32 GB) with 30 concurrent connections. During the load test the process exits at random moments, with no error message at all.
As the screenshots below show, the service exits inexplicably while it is running, and GPU memory usage is not high at that point.
[screenshot: service exits while running]
[screenshot: GPU memory usage is low at the time of the exit]
Any suggestions on how to troubleshoot this kind of failure would be appreciated.
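
For reference, one way to drive roughly 30 concurrent streams against the server is sketched below. The client binary name (websocket_client_main) and its flags are assumptions based on the wenet runtime tools; substitute whatever load-test client you actually use.

# Illustrative load generator: 30 concurrent websocket clients against the server.
# Binary name and flags are assumptions; adapt to your setup.
for i in $(seq 1 30); do
  ./build/bin/websocket_client_main --hostname 127.0.0.1 --port 10086 \
      --wav_path test.wav &
done
wait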

Update: I opened the generated core dump with gdb.
(gdb) bt
#0 tcache_get (tc_idx=) at malloc.c:2937
#1 __GI___libc_malloc (bytes=8) at malloc.c:3051
#2 0x00007f2eecd16b39 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f2f274010a7 in at::native::expand_param_if_needed(c10::ArrayRef, char const*, long) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#4 0x00007f2f273fb954 in at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#5 0x00007f2f27e84a3a in c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper___convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#6 0x00007f2f27998557 in at::ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#7 0x00007f2f273f539b in at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#8 0x00007f2f27e847da in c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long), &at::(anonymous namespace)::(anonymous namespace)::wrapper__convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#9 0x00007f2f2796f1d6 in at::ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#10 0x00007f2f289acf18 in torch::autograd::VariableType::(anonymous namespace)::convolution(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#11 0x00007f2f289ada66 in c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long), &torch::autograd::VariableType::(anonymous namespace)::convolution>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#12 0x00007f2f279971e1 in at::ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#13 0x00007f2f273f9979 in at::native::conv1d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#14 0x00007f2f27f8e9a1 in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long), &at::(anonymous namespace)::(anonymous namespace)::wrapper__conv1d>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#15 0x00007f2f2971d000 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#16 0x00007f2f293aa689 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#17 0x00007f2f29398e3f in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#18 0x00007f2f2938c2c2 in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#19 0x00007f2f2905c923 in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::string, c10::IValue, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cpu.so
#20 0x000055f7018b9a9e in wenet::TorchAsrModel::ForwardEncoderFunc (this=0x7f2d64001670,
chunk_feats=std::vector of length 64, capacity 64 = {...}, out_prob=0x7f2c6efc5680)
at /home/wenet/runtime/libtorch/decoder/torch_asr_model.cc:153
#21 0x000055f701767e57 in wenet::AsrModel::ForwardEncoder (this=0x7f2d64001670,
chunk_feats=std::vector of length 64, capacity 64 = {...}, ctc_prob=0x7f2c6efc5680)
at /home/wenet/runtime/libtorch/decoder/asr_model.cc:49
#22 0x000055f70175fe45 in wenet::AsrDecoder::AdvanceDecoding (this=0x7f2d640051f0, block=true)
at /home/wenet/runtime/libtorch/decoder/asr_decoder.cc:108
#23 0x000055f70175f9eb in wenet::AsrDecoder::Decode (this=0x7f2d640051f0, block=true)
at /home/wenet/runtime/libtorch/decoder/asr_decoder.cc:77
#24 0x000055f7016b3a1d in wenet::ConnectionHandler::DecodeThreadFunc (this=0x55f7a7838eb8)
at /home/wenet/runtime/libtorch/websocket/websocket_server.cc:123
#25 0x000055f70175de98 in std::__invoke_impl<void, void (wenet::ConnectionHandler::*)(), wenet::ConnectionHandler*> (
__f=@0x7f2d640054c0: (void (wenet::ConnectionHandler::*)(class wenet::ConnectionHandler * const)) 0x55f7016b39de <wenet::ConnectionHandler::DecodeThreadFunc()>, __t=@0x7f2d640054b8: 0x55f7a7838eb8)
at /usr/include/c++/9/bits/invoke.h:73
#26 0x000055f70175d9ae in std::__invoke<void (wenet::ConnectionHandler::*)(), wenet::ConnectionHandler*> (
__fn=@0x7f2d640054c0: (void (wenet::ConnectionHandler::*)(class wenet::ConnectionHandler * const)) 0x55f7016b39de <wenet::ConnectionHandler::DecodeThreadFunc()>) at /usr/include/c++/9/bits/invoke.h:95
#27 0x000055f70175cc13 in std::thread::_Invoker<std::tuple<void (wenet::ConnectionHandler::*)(), wenet::ConnectionHandler*> >::_M_invoke<0ul, 1ul> (this=0x7f2d640054b8) at /usr/include/c++/9/thread:244
#28 0x000055f70175ba65 in std::thread::_Invoker<std::tuple<void (wenet::ConnectionHandler::*)(), wenet::ConnectionHandler*> >::operator() (this=0x7f2d640054b8) at /usr/include/c++/9/thread:251
#29 0x000055f701759924 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (wenet::ConnectionHandler::*)(), wenet::ConnectionHandler*> > >::_M_run (this=0x7f2d640054b0) at /usr/include/c++/9/thread:195
#30 0x00007f2fa1d7771f in execute_native_thread_routine ()
from /home/wenet/runtime/libtorch/fc_base/libtorch-src/lib/libtorch_cuda.so
#31 0x00007f2f260fa609 in start_thread (arg=) at pthread_create.c:477
#32 0x00007f2eeca2d133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
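
For anyone reproducing this, a minimal sketch of how such a core dump is typically captured and opened follows (standard Linux/gdb workflow; the core file name depends on the system's core_pattern):

ulimit -c unlimited                           # allow core dumps in the launching shell
./build/bin/websocket_server_main ...         # run under load until it crashes
gdb ./build/bin/websocket_server_main core    # open the core with the matching binary
# then `bt` inside gdb prints a backtrace like the one above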

raycool changed the title from "GPU build of the runtime-libtorch module: websocket_server_main exits intermittently under load testing" to "GPU build of the runtime/libtorch module: websocket_server_main exits intermittently under load testing" Aug 23, 2023
robin1001 (Collaborator) commented

You could first check whether the same problem occurs on CPU.

raycool (Author)

raycool commented Aug 24, 2023

> You could first check whether the same problem occurs on CPU.

Forgot to add: on CPU there is no problem at all; it survives a full day of load testing.
On GPU, the problem shows up within a minute of load testing.
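
Since the failure happens inside malloc during an otherwise ordinary conv1d call, which often points to earlier memory corruption, a common narrowing step is to rerun with synchronous CUDA launches and a memory checker. A hedged sketch, assuming the CUDA 11.3 toolkit's checkers are on PATH and reusing the same illustrative server flags as above:

# Hedged debugging sketch: surface CUDA errors at the triggering call and
# check for invalid GPU memory accesses under the same load test.
export CUDA_LAUNCH_BLOCKING=1
cuda-memcheck ./build/bin/websocket_server_main --port 10086 \
    --model_path final.zip --unit_path units.txt
# compute-sanitizer (also shipped with CUDA 11.x) can be used instead of cuda-memcheck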


github-actions bot commented Dec 2, 2023

This issue has been automatically closed due to inactivity.

github-actions bot added the Stale label Dec 2, 2023

github-actions bot commented Dec 9, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

github-actions bot closed this as completed Dec 9, 2023