
Mutex deadlock / race in API node #3583

Open
ScottSallinen opened this issue Jan 23, 2020 · 4 comments

@ScottSallinen
Contributor

debug for all threads.txt

A heavily used, non-MIRA "full node" API server deadlocked; the backtrace for all threads is attached above.

The main thread is stuck as follows:

#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3281228)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x32811d8, cond=0x3281200) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x3281200, mutex=0x32811d8) at pthread_cond_wait.c:655
#3  0x000000000183a913 in boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) ()
#4  0x000000000204e43d in boost::thread::join_noexcept() ()
#5  0x00000000016e169a in steem::plugins::webserver::detail::webserver_plugin_impl::stop_webserver() ()
#6  0x000000000119f5f9 in appbase::application::shutdown() ()
#7  0x00000000011a0228 in appbase::application::exec() ()
#8  0x000000000116e564 in main ()

Locks should have timeouts.
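For example, something like a timed acquire, so a stuck lock surfaces as an error instead of hanging forever (rough sketch only, illustrative names, not actual steemd code):

// Rough sketch -- illustrative names, not the actual steemd code.
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
#include <iostream>
#include <stdexcept>

boost::timed_mutex api_mutex;  // hypothetical lock guarding shared API state

void do_guarded_work()
{
    // Try to acquire for at most 10 seconds instead of blocking indefinitely.
    boost::unique_lock<boost::timed_mutex> lock(api_mutex, boost::chrono::seconds(10));
    if (!lock.owns_lock())
        throw std::runtime_error("timed out acquiring api_mutex; possible deadlock");
    // ... critical section ...
}

int main()
{
    try { do_guarded_work(); }
    catch (const std::exception& e) { std::cerr << e.what() << '\n'; return 1; }
}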

@mvandeberg
Contributor

The stack trace was taken while the node was shutting down. We can add a timeout on the thread join, but that seems like a fairly minor issue: the deadlock is not occurring while the node is operational. If you don't believe the node should have been shutting down, there may be a different underlying issue, but this trace alone is not enough information to diagnose it.
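For reference, a bounded join could look roughly like this (illustrative names, not the actual webserver_plugin code):

// Rough sketch -- illustrative names, not the actual webserver_plugin code.
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
#include <iostream>

void stop_webserver_thread(boost::thread& webserver_thread)
{
    // Give the thread a grace period instead of an unbounded join().
    if (!webserver_thread.try_join_for(boost::chrono::seconds(30)))
    {
        std::cerr << "webserver thread did not stop within 30s; detaching\n";
        webserver_thread.detach();
    }
}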

@ScottSallinen
Contributor Author

Hm, yes; unfortunately I tried to shut it down before taking the stack trace. I should have done it the other way around.

If it occurs again I will take the trace while it's halted.

@ScottSallinen
Contributor Author

Attached is another log, from a different API node, taken while it was stuck.
gdb.txt

@mvandeberg
Contributor

mvandeberg commented Feb 11, 2020

I am not seeing any deadlock here...

The write thread:

Thread 524 (Thread 0x7f9963dff700 (LWP 19986)):
#0  0x000055bed7c17c60 in boost::multi_index::detail::ordered_index_node_impl<boost::multi_index::detail::null_augment_policy, boost::interprocess::allocator<char, ...> >::rebalance_for_erase(...) ()
#1  0x000055bed8191080 in boost::multi_index::detail::ordered_index_impl<...steem::chain::comment_object...>::erase_(...) ()
#2  0x000055bed83f68a7 in chainbase::generic_index<boost::multi_index::multi_index_container<steem::chain::comment_object, ...> >::undo() ()
#3  0x000055bed83f6bf9 in chainbase::session_impl<chainbase::generic_index<...>::session>::undo() ()
#4  0x000055bed80eac25 in steem::chain::database::clear_pending() ()
#5  0x000055bed812f314 in steem::chain::database::push_block(steem::protocol::signed_block const&, unsigned int) ()
#6  0x000055bed80d1c19 in steem::plugins::chain::detail::write_request_visitor::operator()(steem::protocol::signed_block const*) ()
#7  0x000055bed80d0971 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<steem::plugins::chain::detail::chain_plugin_impl::start_write_processing()::{lambda()#1}> > >::_M_run() ()
#8  0x000055bed8be6e10 in execute_native_thread_routine ()
#9  0x00007fbfdaccf182 in start_thread (arg=<optimized out>) at pthread_create.c:486
#10 0x00007fbfdaa99b1f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It is not waiting on a lock.

There are 54 API threads waiting on futures in broadcast_transaction and 21 in broadcast_transaction_synchronous. We recently added a timeout on the futures that broadcast_transaction_synchronous waits on (which might help the problem). We could add timeouts on the futures in accept_transaction as well. The broadcast_transaction calls should make progress as soon as the write thread is done pushing the current block.
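For illustration, a bounded wait on such a future might look like this (hypothetical names; the real code goes through the chain plugin's write queue):

// Rough sketch -- hypothetical names, not the actual plugin code.
#include <chrono>
#include <future>
#include <stdexcept>
#include <string>

std::string wait_for_broadcast_result(std::future<std::string>& result_future)
{
    // Block at most 10 seconds for the write thread to process the transaction,
    // instead of waiting on the future indefinitely.
    if (result_future.wait_for(std::chrono::seconds(10)) != std::future_status::ready)
        throw std::runtime_error("timed out waiting on the write queue");
    return result_future.get();
}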

The node might just be reaching capacity and beginning to slow down. If it continues happening, more stack traces would be useful.
