TOSS add edge block storaged from exit #3304

Closed
kikimo opened this issue Nov 11, 2021 · 0 comments · Fixed by #3306

kikimo commented Nov 11, 2021

Please check the FAQ documentation before raising an issue

Describe the bug (required)

The leader storage instance blocks while exiting, and two follower instances are blocked as well.

Your Environments (required)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id 74ceaea

How To Reproduce (required)

Deploy 5 storage + 1 meta + 1 graph. Keep stopping and resuming the leader, let this run for a while, then stop the cluster. Three storage instances fail to exit:

root     1431246 10.3  0.0 1465308 98548 ?       Ssl  00:34  98:55 /root/src/nebula/build/bin/nebula-storaged --flagfile /root/src/test/etc/nebula-storaged.conf --pid_file /root/src/test/pids/nebula-storaged.pid.1 --meta_server_addrs 192.168.15.12:9559 --heartbeat_interval_secs 1 --raft_heartbeat_interval_secs 1 --wal_ttl 259200 --clean_wal_interval_secs 259200 --log_dir /root/src/test/logs/storaged.1 --local_ip 192.168.15.12 --port 47513 --ws_http_port 33409 --ws_h2_port 44275 --data_path /root/src/test/data/storaged.1
root     1431404 11.1  0.0 1418732 136368 ?      Ssl  00:34 107:06 /root/src/nebula/build/bin/nebula-storaged --flagfile /root/src/test/etc/nebula-storaged.conf --pid_file /root/src/test/pids/nebula-storaged.pid.2 --meta_server_addrs 192.168.15.12:9559 --heartbeat_interval_secs 1 --raft_heartbeat_interval_secs 1 --wal_ttl 259200 --clean_wal_interval_secs 259200 --log_dir /root/src/test/logs/storaged.2 --local_ip 192.168.15.12 --port 53045 --ws_http_port 41607 --ws_h2_port 33069 --data_path /root/src/test/data/storaged.2
root     1509717  616  0.2 2336260 604160 ?      Ssl  10:54 2077:30 /root/src/nebula/build/bin/nebula-storaged --flagfile /data/src/ntest/test/etc/nebula-storaged.conf --pid_file /data/src/ntest/test/pids/nebula-storaged.pid.4 --meta_server_addrs 192.168.15.12:9559 --heartbeat_interval_secs 1 --raft_heartbeat_interval_secs 1 --wal_ttl 259200 --clean_wal_interval_secs 259200 --log_dir /data/src/ntest/test/logs/storaged.4 --local_ip 192.168.15.12 --port 46161 --ws_http_port 58975 --ws_h2_port 36383 --data_path /data/src/ntest/test/data/storaged.4

The leader, the process with PID 1509717, runs with very high CPU usage:

1509717 root      20   0 2336260 604116  35496 S 109.6   0.2   2078:01 nebula-storaged

The log shows that the leader keeps trying to append logs to the followers without success:

...
W1111 16:22:41.996908 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.013818 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.030689 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.047605 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.064508 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.081408 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.098253 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.115130 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.132002 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1111 16:22:42.148905 1509746 RaftPart.cpp:944] [Port: 46162, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
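
The timestamps above are roughly 17 ms apart, which suggests a retry loop that never gives up. As a rough illustration only, the sketch below shows one way a retry loop can stay responsive to shutdown: it checks a stop flag before each attempt and caps the number of attempts. `retryUntilStopped`, `RetryResult`, and all parameters are hypothetical names for this example. This is not the code in RaftPart.cpp and not necessarily what the fix in #3306 does.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome type for this sketch; not part of Nebula.
enum class RetryResult { kSucceeded, kStopped, kExhausted };

// Hypothetical helper: calls `attempt` until it returns true, the stop flag
// is set, or `maxAttempts` attempts have failed.
RetryResult retryUntilStopped(const std::function<bool()>& attempt,
                              const std::atomic<bool>& stopRequested,
                              int maxAttempts,
                              std::chrono::milliseconds delay) {
  for (int i = 0; i < maxAttempts; ++i) {
    // Check for shutdown before every attempt so the loop cannot
    // outlive a stop request by more than one attempt plus one delay.
    if (stopRequested.load()) {
      return RetryResult::kStopped;
    }
    if (attempt()) {
      return RetryResult::kSucceeded;
    }
    // Pause between attempts instead of spinning.
    std::this_thread::sleep_for(delay);
  }
  return RetryResult::kExhausted;
}
```

A caller would pass the append operation as `attempt` and set the flag from its shutdown path. Whether the real retry path can be interrupted this way is an assumption that would need checking against the source.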

Part of the leader storage stack. It appears to be blocked at ChainResumeProcessor.cpp:57:

#8  0x0000000002e36992 in folly::Future<folly::Unit>::get()+61 in /data/src/nebula/build/bin/nebula-storaged at Future-inl.h:2323
#9  0x0000000002e362bd in nebula::storage::ChainResumeProcessor::process()+2010 in /data/src/nebula/build/bin/nebula-storaged at ChainResumeProcessor.cpp:57
#10 0x0000000002df7dd1 in nebula::storage::TransactionManager::resumeThread()+94 in /data/src/nebula/build/bin/nebula-storaged at TransactionManager.cpp:87
#11 0x0000000002e1d3e7 in std::__invoke_impl<void, void (nebula::storage::TransactionManager::*&)(), nebula::storage::TransactionManager*&>()+107 in /data/src/nebula/build/bin/nebula-storaged at invoke.h:73
#12 0x0000000002e1d346 in std::__invoke<void (nebula::storage::TransactionManager::*&)(), nebula::storage::TransactionManager*&>()+56 in /data/src/nebula/build/bin/nebula-storaged at invoke.h:95
#13 0x0000000002e1d248 in std::_Bind<void (nebula::storage::TransactionManager::*(nebula::storage::TransactionManager*))()>::__call<void, 0>()+91 in /data/src/nebula/build/bin/nebula-storaged at functional:400
#14 0x0000000002e1cfe7 in std::_Bind<void (nebula::storage::TransactionManager::*(nebula::storage::TransactionManager*))()>::operator()<>()+54 in /data/src/nebula/build/bin/nebula-storaged at functional:484
#15 0x0000000002e1bce2 in std::_Function_handler<void(), std::_Bind<void (nebula::storage::TransactionManager::*(nebula::storage::TransactionManager*))()> >::_M_invoke()+33 in /data/src/nebula/build/bin/nebula-storaged at std_function.h:300
#16 0x0000000002d7c4dc in std::function<void()>::operator()()+53 in /data/src/nebula/build/bin/nebula-storaged at std_function.h:688
#17 0x0000000002dfbbe8 in _ZZN6nebula6thread13GenericWorker12addDelayTaskIMNS_7storage18TransactionManagerEFvvEJPS4_EEENSt9enable_ifIXsrSt7is_voidINSt9result_ofIFT_DpT0_EE4typeEE5valueEN5folly10SemiFutureINSI_4UnitEEEE4typeEmOSB_DpOSC_ENKUlvE_clEv!()+51 in /data/src/nebula/build/bin/nebula-storaged at GenericWorker.h:215
#18 0x0000000002e18274 in std::__invoke_impl<void, nebula::thread::GenericWorker::addDelayTask(size_t, F&&, Args&& ...) [with F = void (nebula::storage::TransactionManager::*)(); Args = {nebula::storage::TransactionManager*}]::<lambda()>&>()+33 in /data/src/nebula/build/bin/nebula-storaged at invoke.h:60
#19 0x0000000002e151c1 in std::__invoke<nebula::thread::GenericWorker::addDelayTask(size_t, F&&, Args&& ...) [with F = void (nebula::storage::TransactionManager::*)(); Args = {nebula::storage::TransactionManager*}]::<lambda()>&>()+33 in /data/src/nebula/build/bin/nebula-storaged at invoke.h:95
#20 0x0000000002e11b80 in std::_Bind<nebula::thread::GenericWorker::addDelayTask(size_t, F&&, Args&& ...) [with F = void (nebula::storage::TransactionManager::*)(); Args = {nebula::storage::TransactionManager*}; typename std::enable_if<std::is_void<typename std::result_of<_Functor(_ArgTypes ...)>::type>::value, folly::SemiFuture<folly::Unit> >::type = folly::SemiFuture<folly::Unit>; size_t = long unsigned int]::<lambda()>()>::__call<void>()+29 in /data/src/nebula/build/bin/nebula-storaged at functional:400
#21 0x0000000002e0ed97 in std::_Bind<nebula::thread::GenericWorker::addDelayTask(size_t, F&&, Args&& ...) [with F = void (nebula::storage::TransactionManager::*)(); Args = {nebula::storage::TransactionManager*}; typename std::enable_if<std::is_void<typename std::result_of<_Functor(_ArgTypes ...)>::type>::value, folly::SemiFuture<folly::Unit> >::type = folly::SemiFuture<folly::Unit>; size_t = long unsigned int]::<lambda()>()>::operator()<>()+54 in /data/src/nebula/build/bin/nebula-storaged at functional:484
#22 0x0000000002e0aa85 in std::_Function_handler<void(), std::_Bind<nebula::thread::GenericWorker::addDelayTask(size_t, F&&, Args&& ...) [with F = void (nebula::storage::TransactionManager::*)(); Args = {nebula::storage::TransactionManager*}; typename std::enable_if<std::is_void<typename std::result_of<_Functor(_ArgTypes ...)>::type>::value, folly::SemiFuture<folly::Unit> >::type = folly::SemiFuture<folly::Unit>; size_t = long unsigned int]::<lambda()>()> >::_M_invoke()+33 in /data/src/nebula/build/bin/nebula-storaged at std_function.h:300
#23 0x0000000002d7c4dc in std::function<void()>::operator()()+53 in /data/src/nebula/build/bin/nebula-storaged at std_function.h:688

ChainResumeProcessor.cpp:57

https://github.com/critical27/nebula/blob/74ceaeae356233dfdac044993f950f38c3037f5b/src/storage/transaction/ChainResumeProcessor.cpp#L45-L62
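
Frame #8 in the stack is an unbounded `folly::Future<folly::Unit>::get()`, which can only return once the future completes. To illustrate the general hazard, here is a minimal sketch that uses the standard library's `std::future` rather than folly, so that it stands alone. It waits in short slices and gives up once a stop flag is set. `waitUnlessStopped` and its parameters are hypothetical. This is not the code at ChainResumeProcessor.cpp:57 and not necessarily how #3306 resolves the issue.

```cpp
#include <atomic>
#include <chrono>
#include <future>

// Hypothetical helper: waits for `fut` in short slices instead of making a
// single unbounded get() call. Returns true if the future became ready and
// false if a stop was requested first.
// Assumes `fut` comes from a std::promise or a std::launch::async task;
// a deferred future never reports `ready` from wait_for.
bool waitUnlessStopped(std::future<void>& fut,
                       const std::atomic<bool>& stopRequested,
                       std::chrono::milliseconds slice) {
  while (!stopRequested.load()) {
    // wait_for returns after at most `slice`, so the stop flag is
    // re-checked regularly even if the future never completes.
    if (fut.wait_for(slice) == std::future_status::ready) {
      return true;
    }
  }
  return false;
}
```

With a helper like this, a background task that waits on a result that never arrives can still return when the process is asked to exit, at the cost of waking up once per slice.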

leader storage pstack:
1509717.txt

pstack of the stuck followers:

1431246.txt
1431404.txt

@kikimo kikimo added the type/bug Type: something is unexpected label Nov 11, 2021
@liuyu85cn liuyu85cn mentioned this issue Nov 11, 2021
@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Nov 11, 2021
@yixinglu yixinglu linked a pull request Nov 12, 2021 that will close this issue