cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

bhalevy · 2021-02-15T06:32:06Z

Regression started at https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/721/artifact/logs-debug.2/1612888868753_cql_tracing_test.TestCqlTracing.tracing_startup_test/node2.log

INFO  2021-02-09 16:26:10,988 [shard 0] compaction_manager - Asked to stop
INFO  2021-02-09 16:26:10,989 [shard 0] compaction_manager - Stopping 0 ongoing compactions
INFO  2021-02-09 16:26:10,989 [shard 0] init - Signal received; shutting down
INFO  2021-02-09 16:26:10,989 [shard 0] init - Shutting down redis service
INFO  2021-02-09 16:26:10,989 [shard 0] init - Shutting down redis service was successful
INFO  2021-02-09 16:26:10,990 [shard 0] init - Shutting down view builder
INFO  2021-02-09 16:26:10,990 [shard 0] view - Stopping view builder
INFO  2021-02-09 16:26:11,007 [shard 1] compaction_manager - Asked to stop
INFO  2021-02-09 16:26:11,008 [shard 1] compaction_manager - Stopping 0 ongoing compactions
INFO  2021-02-09 16:26:11,028 [shard 0] compaction_manager - Stopped
INFO  2021-02-09 16:26:11,030 [shard 1] view - Stopping view builder
INFO  2021-02-09 16:26:11,087 [shard 1] compaction_manager - Stopped
INFO  2021-02-09 16:26:11,210 [shard 0] storage_service - Node 127.0.12.1 state jump to normal
INFO  2021-02-09 16:26:11,326 [shard 0] init - Shutting down view builder was successful
INFO  2021-02-09 16:26:11,326 [shard 0] init - Shutting down local storage
INFO  2021-02-09 16:26:11,329 [shard 0] storage_service - Drain on shutdown: starts
INFO  2021-02-09 16:26:11,331 [shard 0] storage_service - Stop transport: starts
INFO  2021-02-09 16:26:11,332 [shard 0] storage_service - Shutting down native transport
INFO  2021-02-09 16:26:11,817 [shard 0] cql_server_controller - CQL server stopped
INFO  2021-02-09 16:26:11,832 [shard 0] storage_service - Shutting down native transport was successful
INFO  2021-02-09 16:26:11,832 [shard 0] storage_service - Shutting down rpc server
INFO  2021-02-09 16:26:11,884 [shard 0] thrift_controller - Thrift server stopped
INFO  2021-02-09 16:26:11,936 [shard 0] storage_service - Shutting down rpc server was successful
INFO  2021-02-09 16:26:11,936 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
INFO  2021-02-09 16:26:11,938 [shard 0] gossip - My status = NORMAL
INFO  2021-02-09 16:26:11,938 [shard 0] gossip - Announcing shutdown
INFO  2021-02-09 16:26:12,027 [shard 0] storage_service - Node 127.0.12.2 state jump to normal
INFO  2021-02-09 16:26:14,386 [shard 0] gossip - Disable and wait for gossip loop started
INFO  2021-02-09 16:26:14,386 [shard 0] gossip - Disable and wait for gossip loop finished
INFO  2021-02-09 16:26:14,400 [shard 0] gossip - Gossip is now stopped
INFO  2021-02-09 16:26:14,401 [shard 0] storage_service - Stop transport: stop_gossiping done
INFO  2021-02-09 16:26:14,462 [shard 0] storage_service - messaging_service stopped
INFO  2021-02-09 16:26:14,462 [shard 0] storage_service - Stop transport: shutdown messaging_service done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - stream_manager stopped
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Stop transport: shutdown stream_manager done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Stop transport: done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Drain on shutdown: stop_transport done
INFO  2021-02-09 16:26:14,464 [shard 0] tracing - Asked to shut down
INFO  2021-02-09 16:26:14,470 [shard 1] tracing - Asked to shut down
INFO  2021-02-09 16:26:14,475 [shard 0] tracing - Tracing is down
INFO  2021-02-09 16:26:15,465 [shard 1] tracing - Tracing is down
INFO  2021-02-09 16:26:15,467 [shard 0] storage_service - Drain on shutdown: tracing is stopped
INFO  2021-02-09 16:26:15,468 [shard 0] storage_service - Drain on shutdown: system distributed keyspace stopped
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Stopped
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Stopped
INFO  2021-02-09 16:26:15,472 [shard 1] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,473 [shard 1] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,475 [shard 1] hints_manager - Stopped

Scylla version c16e4a0
Last passed at e1261d1
git log --oneline e1261d10f17c2d219f159c01ddf7fd2b122f9d6a..c16e4a04239798c41fa140c14eabb250b5422983
c16e4a0 migration_manager: Propagate schema changes with reads like we do on writes
4082f57 Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
af2d1fa Update abseil submodule
b9a5aff distributed_loader: drop execute_futures function
a05adb8 database: Remove global storage proxy reference
8490c9f transport: Remove global storage service reference
4498bb0 API: Fix aggregation in column_familiy
4be718e commitlog: Force earlier cycle/flush iff segment reserve is empty
be8c359 commitlog: Make segment allocation wait iff disk usage > max
ab55a1b commitlog: Do partial (memtable) flushing based on threshold
7c84b16 commitlog: Make flush threshold configurable
c3d9581 table: Add a flush RP mark to table, and shortcut if not above

The text was updated successfully, but these errors were encountered:

bhalevy · 2021-02-17T11:27:35Z

Looks like we hit it here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1.log

INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Stopped
INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Stopped
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Asked to stop
INFO  2021-02-16 10:31:23,190 [shard 1] hints_manager - Stopped

(forced) core dump of scylla here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1-scylla.28002.1613471598.core.gz

xemul · 2021-02-19T13:35:07Z

One of hints managers didn't stop. The hint-related tasks (obtained with task_historgram -a) are:

         2: 0x4e30b0 vtable for std::_Sp_counted_ptr_inplace<db::hints::directory_initializer::impl, std::allocator<db::hints::directory_initializer::impl>, (__gnu_cxx::_Lock_policy)2> + 16 
         1: 0x4e0250 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e0410 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e7088 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, db::hints::resource_manager::stop()::$_5&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, db::hints::resource_manager::stop()::$_5&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(db::hints::resource_manager::stop()::$_5&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5a18 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5bd0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6f70 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}, seastar::future<void>::then_impl_nrvo<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}, seastar::future<void> >(db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5890 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#2}, seastar::future<void> >(db::hints::space_watchdog::start()::$_8&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4df6b0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9, seastar::future<void>::then_impl_nrvo<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5ef0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, seastar::future<void> >(db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, auto:1&&)::{lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4debf0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dde30 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e07c0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6090 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false> >(seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, auto:1&&)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d8f10 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true> >(seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4df7d8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d9128 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false> >(seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d97d8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d9918 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false> >(seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, auto:1&&)::{lambda()#1}::operator()()::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dea38 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dd258 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e0618 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dd020 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false> >(seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6e30 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true> >(seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16

All tasks are here

xemul · 2021-02-19T14:47:05Z

I see end_points_hints_manager::sender::start tasks here, which makes me think that the sender::_stopped future is not resolved, so the sender::stop waits on it (the hints manager stop waits on the sender::stop in turn)

xemul · 2021-02-19T16:09:53Z

Here's the wait sequence that fits the list of tasks:

sender::stop()
sender::_stopped // set in sender::start()
sender::flush_maybe()
end_point_hints_manager::flush_current_hints()
cptr->shutdown()
commitlog::segment_manager::shutdown()
_reserve_replenisher = segment_manager::replenish_reserve()
segment_manager::allocate_segment()
_disk_deletions.get_shared_future()

the _disk_deletions is woken up by segment_manager::delete_file which may never happen in our case. The wait on _disk_deletions future was introduced by be8c359 commitlog: Make segment allocation wait iff disk usage > max from the "what has changed since last pass" list above.

@bhalevy

slivne · 2021-03-02T11:01:29Z

@elcallio can you please have a look

slivne · 2021-03-11T12:10:57Z

from Calle

If this is a hanging hints manager, it should be addressed by added53

lets see if this is fixed

bhalevy · 2021-03-21T16:12:21Z

This wasn't hit since added53 so let's close the issue and consider reopening if hit again.

amoskong · 2021-04-07T12:41:44Z

I saw a similar stop hung in https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/207/
The scylla-server stop of db-node-4 hung in hints_manager - Stopped, and it occurred after ENOSPC was triggered on target node (db-node-3), the nemesis released the space and tried to recover the db-node-3 by restart the scylla-server on it.

4.5.dev-0.20210330.6a2377a23

This version doesn't contain the fix (added53)

Note: hung for ~15 mins and reached the systemd stop timeout of scylla-server

2021-03-31T03:30:27+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 11] hints_manager - Stopped
2021-03-31T03:30:28+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 4] hints_manager - Stopped
2021-03-31T03:30:29+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 13] hints_manager - Stopped
2021-03-31T03:45:08+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !WARNING | systemd: scylla-server.service stop-sigterm timed out. Killing.

db logs: https://cloudius-jenkins-test.s3.amazonaws.com/7d48aadb-319c-4662-ba81-8d7a2480e53e/20210331_035036/db-cluster-7d48aadb.zip

kbr- · 2022-11-02T13:07:43Z

We're also hitting hints_manager stop hangs in test.py runs.
https://jenkins.scylladb.com/job/scylla-master/job/next/5385/artifact/testlog/x86_64/release/scylla-53.log

I'm reopening this issue.

bhalevy · 2023-05-11T11:39:14Z

@DoronArazii I suggest to remove the master/high label as this issue is hit rarely and is only blocking the shutdown path.
It is not ideal and needs to be fixed, but not really blocking.

bhalevy · 2023-05-11T11:42:18Z

@denesb since you're working on the hints manager for #11870, I suggest you look into this too.

kbr-scylla · 2023-05-23T13:08:13Z

I'm gonna try investigating the hangs as part of the quality effort.

kbr-scylla · 2023-05-24T13:41:10Z

Reproducer:
kbr-scylla@29c4267

This is what happens:

boot two node cluster
create keyspace with RF=2 + table
kill one of two nodes
on the remaining node, send CL=ONE write to the table with USING TIMEOUT 100ms
the write immediately succeeds because of CL=ONE but leaves a background write to the other node
after 100ms (thanks to USING TIMEOUT 100ms) the background write times out and creates a hint
we sleep for 10s (db::hints::manager::hints_flush_period)
hints_manager tries to send a hint to the other node (gossiper didn't yet mark it as DOWN, which takes ~20 seconds using default settings)
this write uses write_request_timeout_in_ms config value for timeout, which is set to a crazy amount of 300000 (i.e. 5 minutes) in our tests configuration (scylla_cluster.py)
because the other node is dead, the write hangs for entire timeout - 5 minutes
we try to shutdown the node, which waits for hints_manager writes to finish: db::hints::manager::end_point_hints_manager:sender::file_send_gate::close() waited on by db::hints::manager::end_point_hints_manager::sender::send_one_file waited on by db::hints::manager::end_point_hints_manager::sender::send_hints_maybe waited on by db::hints::manager::end_point_hints_manager::sender::_stopped waited ony by db::hints::manager::end_point_hints_manager::sender::stop ...
so hang for 5 minutes, causing test timeout (or sometimes not because the hang may resolve right before we would timeout the test, in which case we'd just get a test that hanged for 5 minutes in the middle for no good reason but passed anyway; I reduced the timeout to 1 minute in the reproducer branch)

This is pretty much what happens in the CI runs, except that the write is done internally to system_distributed.cdc_generation_timestamps and it uses 10 seconds timeout, set here (db/system_distributed_keyspace.cc):

static service::query_state& internal_distributed_query_state() {
    using namespace std::chrono_literals;
    const auto t = 10s;
    static timeout_config tc{ t, t, t, t, t, t, t };
    static thread_local service::client_state cs(service::client_state::internal_tag{}, tc);
    static thread_local service::query_state qs(cs, empty_service_permit());
    return qs;
};

unfortunately the hint write uses the crazy 5 minutes timeout from write_request_timeout_in_ms from our test configuration

possible solutions?

reduce the crazy 5 minute timeout (why so much?)
better: somehow cancel the hint writes on shutdown?

kostja · 2023-05-24T13:47:21Z

Cancelling hint writes in shutdown sounds like a good idea to me but how difficult is it ? Can we just abandon them, without cancelling ?

kbr-scylla · 2023-05-24T13:51:18Z

I don't see any other way than abandoning them - cancelling must mean abandoning in some cases since we may request cancel after the message is out, so we just cancel the waiting for response part.

storage_proxy::send_to_endpoint API used by storage_proxy::send_hint_to_endpoint doesn't provide a way to cancel/abort the write. We'd need to extend it. Or do something different, but I don't see what right now.

bhalevy · 2023-05-25T20:36:47Z

Please see eb79e45 and the following patches in https://github.com/scylladb/scylladb/pull/12296/commits

for a useful mechanism for canceling outstanding messages that also have a timeout

The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to 'retire' a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements retiring by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the retiring list, this is controlled by the `retirable` parameter passed to the constructor. For now we're only retiring view handlers as before. In following commits we'll also retire hint handlers.

Whether a write handler should be retirable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079

…r types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to 'retire' a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements retiring by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the retiring list, this is controlled by the `retirable` parameter passed to the constructor. For now we're only retiring view handlers as before. In following commits we'll also retire hint handlers.

Whether a write handler should be retirable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079

…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.

…from Kamil Braun The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See #3966, #4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Use this new functionality to also cancel hint write handlers when we shutdown. This fixes #8079. Closes #14047 * github.com:scylladb/scylladb: test: reproducer for hints manager shutdown hang test: pylib: ScyllaCluster: generalize config type for `server_add` test: pylib: scylla_cluster: add explicit timeout for graceful server stop service: storage_proxy: make hint write handlers cancellable service: storage_proxy: rename `view_update_handlers_list` service: storage_proxy: make it possible to cancel all write handler types

…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.

Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079

…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.

Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079

denesb · 2023-12-18T09:10:09Z

Seems like this issue mainly affects tests, so not backporting.

bhalevy assigned xemul Feb 15, 2021

bhalevy added the tests/dtest label Feb 15, 2021

slivne added this to the 4.5 milestone Feb 15, 2021

xemul mentioned this issue Mar 2, 2021

Scylla stuck during systemctl stop scylla-server.service (while running TerminateAndRemoveNodeMonkey) #8191

Closed

slivne assigned elcallio and unassigned xemul Mar 2, 2021

slivne added area/commitlog Issues related to the commit log. type/bug labels Mar 2, 2021

xemul mentioned this issue Mar 12, 2021

lwt_random_test.TestRandomPaxos.test_topology_add_decommission_reboot #8264

Closed

bhalevy closed this as completed Mar 21, 2021

xemul mentioned this issue May 10, 2021

scylla service restart takes 20 minutes after enospc nemesis #8559

Closed

kbr- reopened this Nov 2, 2022

kbr- mentioned this issue Nov 10, 2022

test.py: use internal id to manage servers #11747

Merged

alecco mentioned this issue Nov 10, 2022

test.py: don't stop cluster's site if not started #11939

Closed

DoronArazii modified the milestones: 4.5, 5.2, 5.x Nov 27, 2022

mykaul added the master/high label Jan 4, 2023

mykaul modified the milestones: 5.x, 5.2 Jan 4, 2023

bhalevy assigned denesb and unassigned elcallio May 11, 2023

DoronArazii removed the master/high label May 14, 2023

kbr-scylla assigned kostja and kbr-scylla May 23, 2023

kbr-scylla mentioned this issue May 26, 2023

service: storage_proxy: make hint write handlers cancellable #14047

Merged

scylladb-promoter closed this as completed in 0ef35ce May 30, 2023

scylladb-promoter added the Backport candidate label May 30, 2023

DoronArazii modified the milestones: 5.3, 5.4 May 30, 2023

denesb removed the Backport candidate label Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

bhalevy commented Feb 15, 2021

bhalevy commented Feb 17, 2021

xemul commented Feb 19, 2021

xemul commented Feb 19, 2021

xemul commented Feb 19, 2021 •

edited

slivne commented Mar 2, 2021

slivne commented Mar 11, 2021

bhalevy commented Mar 21, 2021

amoskong commented Apr 7, 2021

kbr- commented Nov 2, 2022

bhalevy commented May 11, 2023 •

edited

bhalevy commented May 11, 2023

kbr-scylla commented May 23, 2023

kbr-scylla commented May 24, 2023

kostja commented May 24, 2023

kbr-scylla commented May 24, 2023

bhalevy commented May 25, 2023 •

edited

denesb commented Dec 18, 2023

cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

Comments

bhalevy commented Feb 15, 2021

bhalevy commented Feb 17, 2021

xemul commented Feb 19, 2021

xemul commented Feb 19, 2021

xemul commented Feb 19, 2021 • edited

slivne commented Mar 2, 2021

slivne commented Mar 11, 2021

bhalevy commented Mar 21, 2021

amoskong commented Apr 7, 2021

kbr- commented Nov 2, 2022

bhalevy commented May 11, 2023 • edited

bhalevy commented May 11, 2023

kbr-scylla commented May 23, 2023

kbr-scylla commented May 24, 2023

kostja commented May 24, 2023

kbr-scylla commented May 24, 2023

bhalevy commented May 25, 2023 • edited

denesb commented Dec 18, 2023

xemul commented Feb 19, 2021 •

edited

bhalevy commented May 11, 2023 •

edited

bhalevy commented May 25, 2023 •

edited