Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage block on exit #3071

Closed
kikimo opened this issue Oct 14, 2021 · 2 comments
Closed

storage block on exit #3071

kikimo opened this issue Oct 14, 2021 · 2 comments
Assignees
Labels
type/bug Type: something is unexpected
Milestone

Comments

@kikimo
Copy link
Contributor

kikimo commented Oct 14, 2021

Please check the FAQ documentation before raising an issue

Please check the FAQ documentation and old issues before raising an issue in case someone has asked the same question that you are asking.

Describe the bug (must be provided)

storage block on exit, stack and log are provided below, this is a different kind of storage stuck, we've never seen before during our previous test.

Your Environments (must be provided)

  • OS: uname -a
  • Compliler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id a460c0e

How To Reproduce(must be provided)

Steps to reproduce the behavior:

  1. start a cluster with 5storage + 1meta + 1graph
  2. make storage change leader quickly and keep inserting edge with 1024 concurrent clients
  3. then kill all the services with SIGTERM

Expected behavior

all nebula services exit gracefully, but we found two storages instance stuck and can only be killed with -9 signal.

Additional context

part of storage stack, the full process stack can be found here :
2927120.txt

Thread 123 (Thread 0x7f7fa91ff700 (LWP 2930639) "hazptr-tpe-0"):
#0  0x00007f7ff809fd19 in syscall () from /lib64/libc.so.6
#1  0x000000000410cc01 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()
#2  0x0000000002a71508 in folly::detail::futexWaitImpl<std::atomic<unsigned int> const, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > > (futex=0x7f7ff2de8488, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:81
#3  0x0000000002a6cf6e in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7f7ff2de8488, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119
#4  0x0000000002a6764d in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164
#5  0x0000000002aa86c9 in folly::SaturatingSemaphore<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7ff2de8488, deadline=..., opt=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/SaturatingSemaphore.h:312
#6  0x0000000002aa5d28 in folly::SaturatingSemaphore<true, std::atomic>::try_wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (opt=..., deadline=..., this=0x7f7ff2de8488) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/SaturatingSemaphore.h:173
#7  folly::detail::LifoSemBase<folly::SaturatingSemaphore<true, std::atomic>, std::atomic>::try_wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7fb18ee2c0, deadline=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/LifoSem.h:512
#8  0x00000000046445e0 in folly::UnboundedBlockingQueue<folly::CPUThreadPoolExecutor::CPUTask>::try_take_for(std::chrono::duration<long, std::ratio<1l, 1000l> >) ()
#9  0x000000000463e427 in folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
#10 0x0000000004586a19 in void folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()
#11 0x0000000002a3e85c in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f7fe7561470) at /root/src/wwl/nebula/build/third-party/install/include/folly/Function.h:400
#12 0x0000000002a4ea06 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f7fe7561470) at /root/src/wwl/nebula/build/third-party/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#13 0x0000000002a60400 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:60
#14 0x0000000002a5994a in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__fn=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:95
#15 0x0000000002aa52c2 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f7fe7561470) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:234
#16 0x0000000002aa47b7 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f7fe7561470) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:243
#17 0x0000000002aa1a54 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f7fe7561460) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:186
#18 0x0000000004b4143f in execute_native_thread_routine ()
#19 0x00007f7ff837cea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f7ff80a59fd in clone () from /lib64/libc.so.6
...
Thread 1 (Thread 0x7f7ff8ec49c0 (LWP 2927120) "nebula-storaged"):
#0  0x00007f7ff809fd19 in syscall () from /lib64/libc.so.6
#1  0x000000000410cc01 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()
#2  0x0000000002a6cf52 in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7f7ff7a53b20, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119
#3  0x0000000002a6764d in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164
#4  0x0000000002b2504c in folly::Baton<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7ff7a53b20, deadline=..., opt=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:305
#5  0x00000000043373eb in folly::DefaultKeepAliveExecutor::joinKeepAlive() ()
#6  0x0000000004331559 in apache::thrift::concurrency::ThreadManager::Impl::stopImpl(bool) ()
#7  0x00000000043376f9 in apache::thrift::concurrency::PriorityThreadManager::PriorityImpl::join() ()
#8  0x0000000004239054 in apache::thrift::ThriftServer::~ThriftServer() ()
#9  0x0000000004239481 in apache::thrift::ThriftServer::~ThriftServer() ()
#10 0x0000000002a645f9 in std::default_delete<apache::thrift::ThriftServer>::operator() (this=0x7f7ff7a323c0, __ptr=0x7f7ff7a3e000) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:78
#11 0x0000000002a66b55 in std::unique_ptr<apache::thrift::ThriftServer, std::default_delete<apache::thrift::ThriftServer> >::reset (this=0x7f7ff7a323c0, __p=0x7f7ff7a3e000) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:371
#12 0x0000000003d9591d in nebula::raftex::RaftexService::waitUntilStop (this=0x7f7ff7a323b0) at /root/src/wwl/nebula/src/kvstore/raftex/RaftexService.cpp:141
#13 0x0000000003cf738b in nebula::kvstore::NebulaStore::~NebulaStore (this=0x7f7ff7a57400, __in_chrg=<optimized out>) at /root/src/wwl/nebula/src/kvstore/NebulaStore.cpp:42
#14 0x0000000003cf75e4 in nebula::kvstore::NebulaStore::~NebulaStore (this=0x7f7ff7a57400, __in_chrg=<optimized out>) at /root/src/wwl/nebula/src/kvstore/NebulaStore.cpp:50
#15 0x0000000002a647b7 in std::default_delete<nebula::kvstore::KVStore>::operator() (this=0x7f7ff7a4f1f0, __ptr=0x7f7ff7a57400) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:78
#16 0x0000000002a60073 in std::unique_ptr<nebula::kvstore::KVStore, std::default_delete<nebula::kvstore::KVStore> >::reset (this=0x7f7ff7a4f1f0, __p=0x7f7ff7a57400) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:371
#17 0x0000000002a59326 in nebula::storage::StorageServer::stop (this=0x7f7ff7a4f180) at /root/src/wwl/nebula/src/storage/StorageServer.cpp:349
#18 0x0000000002a3ded8 in signalHandler (sig=15) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:175
#19 0x0000000002a3dd1c in <lambda(nebula::SignalHandler::GeneralSignalInfo*)>::operator()(nebula::SignalHandler::GeneralSignalInfo *) const (__closure=0x5d9b020 <nebula::SignalHandler::get()::instance+448>, info=0x7ffff8a15630) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:166
#20 0x0000000002a45b2e in std::_Function_handler<void(nebula::SignalHandler::GeneralSignalInfo*), setupSignalHandler()::<lambda(nebula::SignalHandler::GeneralSignalInfo*)> >::_M_invoke(const std::_Any_data &, nebula::SignalHandler::GeneralSignalInfo *&&) (__functor=..., __args#0=@0x7ffff8a155f0: 0x7ffff8a15630) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_function.h:316
#21 0x0000000003c0b842 in std::function<void (nebula::SignalHandler::GeneralSignalInfo*)>::operator()(nebula::SignalHandler::GeneralSignalInfo*) const (this=0x5d9b020 <nebula::SignalHandler::get()::instance+448>, __args#0=0x7ffff8a15630) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_function.h:706
#22 0x0000000003c0b41d in nebula::SignalHandler::handleGeneralSignal (this=0x5d9ae60 <nebula::SignalHandler::get()::instance>, sig=15, info=0x7ffff8a157f0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:107
#23 0x0000000003c0b3b9 in nebula::SignalHandler::doHandle (this=0x5d9ae60 <nebula::SignalHandler::get()::instance>, sig=15, info=0x7ffff8a157f0, uctx=0x7ffff8a156c0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:101
#24 0x0000000003c0b337 in nebula::SignalHandler::handlerHook (sig=15, info=0x7ffff8a157f0, uctx=0x7ffff8a156c0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:82
#25 <signal handler called>
#26 0x00007f7ff837e017 in pthread_join () from /lib64/libpthread.so.0
#27 0x0000000004b414d3 in std::thread::join() ()
#28 0x0000000002a59090 in nebula::storage::StorageServer::waitUntilStop (this=0x7f7ff7a4f180) at /root/src/wwl/nebula/src/storage/StorageServer.cpp:310
#29 0x0000000002a3d9f4 in main (argc=2, argv=0x7ffff8a168c0) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:158

and the log shows that raft is trying hard to catch up log index but no luck for more nearly 5mins(and then we send sigterm to to process but it stuck):

  14278 I1014 14:37:54.739661 2927154 RaftPart.cpp:1549] [Port: 42648, Space: 1, Part: 1] Local is missing logs from id 251343. Need to catch up
  14279 I1014 14:37:54.739706 2930624 ChainAddEdgesProcessorLocal.cpp:404] e0645f4e-23fe-42b5-a63f-4d807d06e1dc 104->118, val: "76->13430" 2021-10-14T06:37:54.836000  commitLocal(), code_ = 
...
8409188 I1014 14:42:35.491660 2927168 MetaClient.cpp:590] Send request to meta "192.168.15.11":9559
8409189 I1014 14:42:35.492377 2927168 MetaClient.cpp:2379] Metad last update time: 1634193507184
8409190 I1014 14:42:35.495230 2927157 RaftPart.cpp:1549] [Port: 42648, Space: 1, Part: 1] Local is missing logs from id 251343. Need to catch up
@critical27
Copy link
Contributor

Perhaps this issue is fixed in #3306

@critical27
Copy link
Contributor

Fixed confirmed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Type: something is unexpected
Projects
None yet
Development

No branches or pull requests

3 participants