
cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079

Closed
bhalevy opened this issue Feb 15, 2021 · 63 comments · Fixed by #14047
Labels: area/commitlog (Issues related to the commit log), P2 (High Priority), symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/dtest, type/bug

bhalevy commented Feb 15, 2021

Regression started at https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/721/artifact/logs-debug.2/1612888868753_cql_tracing_test.TestCqlTracing.tracing_startup_test/node2.log

INFO  2021-02-09 16:26:10,988 [shard 0] compaction_manager - Asked to stop
INFO  2021-02-09 16:26:10,989 [shard 0] compaction_manager - Stopping 0 ongoing compactions
INFO  2021-02-09 16:26:10,989 [shard 0] init - Signal received; shutting down
INFO  2021-02-09 16:26:10,989 [shard 0] init - Shutting down redis service
INFO  2021-02-09 16:26:10,989 [shard 0] init - Shutting down redis service was successful
INFO  2021-02-09 16:26:10,990 [shard 0] init - Shutting down view builder
INFO  2021-02-09 16:26:10,990 [shard 0] view - Stopping view builder
INFO  2021-02-09 16:26:11,007 [shard 1] compaction_manager - Asked to stop
INFO  2021-02-09 16:26:11,008 [shard 1] compaction_manager - Stopping 0 ongoing compactions
INFO  2021-02-09 16:26:11,028 [shard 0] compaction_manager - Stopped
INFO  2021-02-09 16:26:11,030 [shard 1] view - Stopping view builder
INFO  2021-02-09 16:26:11,087 [shard 1] compaction_manager - Stopped
INFO  2021-02-09 16:26:11,210 [shard 0] storage_service - Node 127.0.12.1 state jump to normal
INFO  2021-02-09 16:26:11,326 [shard 0] init - Shutting down view builder was successful
INFO  2021-02-09 16:26:11,326 [shard 0] init - Shutting down local storage
INFO  2021-02-09 16:26:11,329 [shard 0] storage_service - Drain on shutdown: starts
INFO  2021-02-09 16:26:11,331 [shard 0] storage_service - Stop transport: starts
INFO  2021-02-09 16:26:11,332 [shard 0] storage_service - Shutting down native transport
INFO  2021-02-09 16:26:11,817 [shard 0] cql_server_controller - CQL server stopped
INFO  2021-02-09 16:26:11,832 [shard 0] storage_service - Shutting down native transport was successful
INFO  2021-02-09 16:26:11,832 [shard 0] storage_service - Shutting down rpc server
INFO  2021-02-09 16:26:11,884 [shard 0] thrift_controller - Thrift server stopped
INFO  2021-02-09 16:26:11,936 [shard 0] storage_service - Shutting down rpc server was successful
INFO  2021-02-09 16:26:11,936 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
INFO  2021-02-09 16:26:11,938 [shard 0] gossip - My status = NORMAL
INFO  2021-02-09 16:26:11,938 [shard 0] gossip - Announcing shutdown
INFO  2021-02-09 16:26:12,027 [shard 0] storage_service - Node 127.0.12.2 state jump to normal
INFO  2021-02-09 16:26:14,386 [shard 0] gossip - Disable and wait for gossip loop started
INFO  2021-02-09 16:26:14,386 [shard 0] gossip - Disable and wait for gossip loop finished
INFO  2021-02-09 16:26:14,400 [shard 0] gossip - Gossip is now stopped
INFO  2021-02-09 16:26:14,401 [shard 0] storage_service - Stop transport: stop_gossiping done
INFO  2021-02-09 16:26:14,462 [shard 0] storage_service - messaging_service stopped
INFO  2021-02-09 16:26:14,462 [shard 0] storage_service - Stop transport: shutdown messaging_service done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - stream_manager stopped
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Stop transport: shutdown stream_manager done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Stop transport: done
INFO  2021-02-09 16:26:14,463 [shard 0] storage_service - Drain on shutdown: stop_transport done
INFO  2021-02-09 16:26:14,464 [shard 0] tracing - Asked to shut down
INFO  2021-02-09 16:26:14,470 [shard 1] tracing - Asked to shut down
INFO  2021-02-09 16:26:14,475 [shard 0] tracing - Tracing is down
INFO  2021-02-09 16:26:15,465 [shard 1] tracing - Tracing is down
INFO  2021-02-09 16:26:15,467 [shard 0] storage_service - Drain on shutdown: tracing is stopped
INFO  2021-02-09 16:26:15,468 [shard 0] storage_service - Drain on shutdown: system distributed keyspace stopped
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Stopped
INFO  2021-02-09 16:26:15,470 [shard 0] hints_manager - Stopped
INFO  2021-02-09 16:26:15,472 [shard 1] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,473 [shard 1] hints_manager - Asked to stop
INFO  2021-02-09 16:26:15,475 [shard 1] hints_manager - Stopped

Scylla version c16e4a0
Last passed at e1261d1
git log --oneline e1261d10f17c2d219f159c01ddf7fd2b122f9d6a..c16e4a04239798c41fa140c14eabb250b5422983
c16e4a0 migration_manager: Propagate schema changes with reads like we do on writes
4082f57 Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
af2d1fa Update abseil submodule
b9a5aff distributed_loader: drop execute_futures function
a05adb8 database: Remove global storage proxy reference
8490c9f transport: Remove global storage service reference
4498bb0 API: Fix aggregation in column_familiy
4be718e commitlog: Force earlier cycle/flush iff segment reserve is empty
be8c359 commitlog: Make segment allocation wait iff disk usage > max
ab55a1b commitlog: Do partial (memtable) flushing based on threshold
7c84b16 commitlog: Make flush threshold configurable
c3d9581 table: Add a flush RP mark to table, and shortcut if not above

slivne added this to the 4.5 milestone Feb 15, 2021

bhalevy commented Feb 17, 2021

Looks like we hit it here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1.log

INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Stopped
INFO  2021-02-16 10:31:13,590 [shard 0] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Asked to stop
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Stopped
INFO  2021-02-16 10:31:13,590 [shard 1] hints_manager - Asked to stop
INFO  2021-02-16 10:31:23,190 [shard 1] hints_manager - Stopped

(forced) core dump of scylla here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1-scylla.28002.1613471598.core.gz


xemul commented Feb 19, 2021

One of the hints managers didn't stop. The hint-related tasks (obtained with task_histogram -a) are:

         2: 0x4e30b0 vtable for std::_Sp_counted_ptr_inplace<db::hints::directory_initializer::impl, std::allocator<db::hints::directory_initializer::impl>, (__gnu_cxx::_Lock_policy)2> + 16 
         1: 0x4e0250 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e0410 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_32>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e7088 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, db::hints::resource_manager::stop()::$_5&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, db::hints::resource_manager::stop()::$_5&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(db::hints::resource_manager::stop()::$_5&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::named_semaphore_exception_factory, db::hints::resource_manager::stop()::$_5, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5a18 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5bd0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(db::hints::space_watchdog::start()::$_9&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception_type<db::hints::space_watchdog::start()::$_9>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6f70 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}, seastar::future<void>::then_impl_nrvo<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}, seastar::future<void> >(db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5890 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_8&&)::{lambda()#2}, seastar::future<void> >(db::hints::space_watchdog::start()::$_8&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::space_watchdog::start()::$_8>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4df6b0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9, seastar::future<void>::then_impl_nrvo<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_9&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e5ef0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, seastar::future<void> >(db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::with_lock<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(seastar::shared_mutex&, auto:1&&)::{lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4debf0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dde30 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5::operator()() const::{lambda(auto:1)#2}>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e07c0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, auto:1&&)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6090 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false> >(seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}&&)::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::with_file_update_mutex<db::hints::space_watchdog::on_timer()::$_3::operator()(std::filesystem::__cxx11::path, seastar::directory_entry) const::{lambda()#1}>(db::hints::manager::end_point_hints_manager&, auto:1&&)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d8f10 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true> >(seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#1}, true>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4df7d8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception<db::hints::manager::end_point_hints_manager::sender::flush_maybe()::$_31>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d9128 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false> >(seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr)#1}::operator()(utils::loading_shared_values<gms::inet_address, db::commitlog, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, utils::do_nothing_loading_shared_values_stats, 16ul>::entry_ptr) const::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d97d8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4d9918 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false> >(seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator()()::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_lock<db::hints::manager::end_point_hints_manager::flush_current_hints()::$_8::operator()() const::{lambda()#1}>(seastar::shared_mutex&, auto:1&&)::{lambda()#1}::operator()()::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dea38 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::stop(seastar::bool_class<db::hints::manager::drain_tag>)::$_5>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dd258 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2, true>::operator()(seastar::future<void>&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e0618 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda()#2}, seastar::future<void> >(db::hints::manager::end_point_hints_manager::sender::start()::$_33&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_33>(seastar::thread_attributes, auto:1&&)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4dd020 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false> >(seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::manager::stop()::$_2::operator()() const::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 
         1: 0x4e6e30 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true> >(seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<db::hints::resource_manager::stop()::$_5::operator()() const::{lambda()#1}, true>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 

All tasks are here


xemul commented Feb 19, 2021

I see end_point_hints_manager::sender::start tasks here, which makes me think that the sender::_stopped future is not resolved, so sender::stop waits on it (and the hints manager stop waits on sender::stop in turn).


xemul commented Feb 19, 2021

Here's the wait sequence that fits the list of tasks:

sender::stop()
sender::_stopped // set in sender::start()
sender::flush_maybe()
end_point_hints_manager::flush_current_hints()
cptr->shutdown()
commitlog::segment_manager::shutdown()
_reserve_replenisher = segment_manager::replenish_reserve()
segment_manager::allocate_segment()
_disk_deletions.get_shared_future()

_disk_deletions is woken up by segment_manager::delete_file, which may never happen in our case. The wait on the _disk_deletions future was introduced by be8c359 ("commitlog: Make segment allocation wait iff disk usage > max") from the "what has changed since the last pass" list above.
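
To make the shape of the hang concrete, here is a minimal plain-C++ analogy of that wait chain (illustrative names only, not the actual Seastar/ScyllaDB code): the stop path joins an allocation path that parks on a future which only delete_file() would resolve, and nothing on the shutdown path ever calls it.

#include <chrono>
#include <future>
#include <iostream>

int main() {
    // Analogue of segment_manager::_disk_deletions: resolved only by delete_file().
    std::promise<void> disk_deletions;
    std::shared_future<void> space_freed = disk_deletions.get_future().share();

    // Analogue of allocate_segment() when disk usage > max: park until space is freed.
    auto allocate_segment = [space_freed] { space_freed.wait(); };

    // Analogue of sender::stop() -> flush_current_hints() -> commitlog shutdown(),
    // which ends up joining the allocation path above.
    auto stop = std::async(std::launch::async, allocate_segment);

    if (stop.wait_for(std::chrono::milliseconds(200)) == std::future_status::timeout) {
        std::cout << "stop() is stuck: nothing on the shutdown path calls delete_file()\n";
    }
    disk_deletions.set_value(); // the wakeup that never arrives during this shutdown
    stop.get();
}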

@bhalevy


slivne commented Mar 2, 2021

@elcallio can you please have a look


slivne commented Mar 11, 2021

from Calle

If this is a hanging hints manager, it should be addressed by added53

  • let's see if this is fixed


bhalevy commented Mar 21, 2021

This hasn't been hit since added53, so let's close the issue and consider reopening it if it is hit again.

bhalevy closed this as completed Mar 21, 2021

amoskong commented Apr 7, 2021

I saw a similar stop hang in https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/207/
The scylla-server stop on db-node-4 hung at hints_manager - Stopped. It occurred after ENOSPC was triggered on the target node (db-node-3); the nemesis released the space and tried to recover db-node-3 by restarting scylla-server on it.

  • 4.5.dev-0.20210330.6a2377a23

This version doesn't contain the fix (added53)

Note: it hung for ~15 minutes and reached the systemd stop timeout of scylla-server.

2021-03-31T03:30:27+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 11] hints_manager - Stopped
2021-03-31T03:30:28+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 4] hints_manager - Stopped
2021-03-31T03:30:29+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !INFO    | scylla: [shard 13] hints_manager - Stopped
2021-03-31T03:45:08+00:00  longevity-cdc-100gb-4h-master-db-node-7d48aadb-3 !WARNING | systemd: scylla-server.service stop-sigterm timed out. Killing.

db logs: https://cloudius-jenkins-test.s3.amazonaws.com/7d48aadb-319c-4662-ba81-8d7a2480e53e/20210331_035036/db-cluster-7d48aadb.zip


kbr- commented Nov 2, 2022

We're also hitting hints_manager stop hangs in test.py runs.
https://jenkins.scylladb.com/job/scylla-master/job/next/5385/artifact/testlog/x86_64/release/scylla-53.log

I'm reopening this issue.


bhalevy commented May 11, 2023

@DoronArazii I suggest removing the master/high label, as this issue is hit rarely and only blocks the shutdown path.
It is not ideal and needs to be fixed, but it is not really blocking.


bhalevy commented May 11, 2023

@denesb since you're working on the hints manager for #11870, I suggest you look into this too.

@kbr-scylla

I'm gonna try investigating the hangs as part of the quality effort.

@kbr-scylla

Reproducer:
kbr-scylla@29c4267

This is what happens:

  • boot two node cluster
  • create keyspace with RF=2 + table
  • kill one of two nodes
  • on the remaining node, send CL=ONE write to the table with USING TIMEOUT 100ms
  • the write immediately succeeds because of CL=ONE but leaves a background write to the other node
  • after 100ms (thanks to USING TIMEOUT 100ms) the background write times out and creates a hint
  • we sleep for 10s (db::hints::manager::hints_flush_period)
  • hints_manager tries to send a hint to the other node (gossiper didn't yet mark it as DOWN, which takes ~20 seconds using default settings)
  • this write uses the write_request_timeout_in_ms config value for its timeout, which is set to a crazy 300000 ms (i.e. 5 minutes) in our test configuration (scylla_cluster.py)
  • because the other node is dead, the write hangs for the entire timeout - 5 minutes
  • we try to shut down the node, which waits for hints_manager writes to finish: db::hints::manager::end_point_hints_manager::sender::file_send_gate::close() waited on by db::hints::manager::end_point_hints_manager::sender::send_one_file waited on by db::hints::manager::end_point_hints_manager::sender::send_hints_maybe waited on by db::hints::manager::end_point_hints_manager::sender::_stopped waited on by db::hints::manager::end_point_hints_manager::sender::stop ...
  • so we hang for 5 minutes, causing a test timeout (or sometimes not, because the hang may resolve right before we would time out the test, in which case we'd just get a test that hung for 5 minutes in the middle for no good reason but passed anyway; I reduced the timeout to 1 minute in the reproducer branch)

This is pretty much what happens in the CI runs, except that the write is done internally to system_distributed.cdc_generation_timestamps and it uses a 10 second timeout, set here (db/system_distributed_keyspace.cc):

static service::query_state& internal_distributed_query_state() {
    using namespace std::chrono_literals;
    const auto t = 10s;
    // every timeout_config field (reads, writes, CAS, etc.) is set to 10s
    static timeout_config tc{ t, t, t, t, t, t, t };
    static thread_local service::client_state cs(service::client_state::internal_tag{}, tc);
    static thread_local service::query_state qs(cs, empty_service_permit());
    return qs;
};

Unfortunately, the hint write uses the crazy 5 minute timeout from write_request_timeout_in_ms in our test configuration.
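
For scale, here is a tiny arithmetic sketch of that timeline, using only the numbers quoted in this thread (the constants mirror the comment above, not actual ScyllaDB code):

#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;
    const auto cql_using_timeout      = milliseconds(100);    // USING TIMEOUT 100ms -> hint gets created
    const auto hints_flush_period     = seconds(10);          // db::hints::manager::hints_flush_period
    const auto gossip_marks_down      = seconds(20);          // ~20s with default settings
    const auto internal_write_timeout = seconds(10);          // internal_distributed_query_state()
    const auto hint_write_timeout     = milliseconds(300000); // write_request_timeout_in_ms in scylla_cluster.py

    std::cout << "hint replay can start ~"
              << duration_cast<seconds>(cql_using_timeout + hints_flush_period).count()
              << "s after the write, before gossip marks the peer DOWN at ~"
              << gossip_marks_down.count() << "s;\n"
              << "the replay write then waits up to "
              << duration_cast<minutes>(hint_write_timeout).count()
              << " minutes (vs. the " << internal_write_timeout.count()
              << "s internal timeout), and shutdown waits for it.\n";
}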

possible solutions?

  • reduce the crazy 5 minute timeout (why so much?)
  • better: somehow cancel the hint writes on shutdown?


kostja commented May 24, 2023

Cancelling hint writes on shutdown sounds like a good idea to me, but how difficult is it? Can we just abandon them, without cancelling?

@kbr-scylla

I don't see any other way than abandoning them - cancelling must mean abandoning in some cases, since we may request the cancel after the message is already out, so we just cancel the waiting-for-response part.

The storage_proxy::send_to_endpoint API used by storage_proxy::send_hint_to_endpoint doesn't provide a way to cancel/abort the write. We'd need to extend it, or do something different, but I don't see what right now.


bhalevy commented May 25, 2023

Please see eb79e45 and the following patches in https://github.com/scylladb/scylladb/pull/12296/commits for a useful mechanism for canceling outstanding messages that also have a timeout.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue May 26, 2023
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose: to
make it possible to 'retire' a handler for a view update write, which
means we stop waiting for a response to the write, timing out the
handler immediately. This was done to solve issue with node shutdown
hanging because it was waiting for a view update to finish; view updates
were configured with 5 minute timeout. See scylladb#3966, scylladb#4028.

Now we're having a similar problem with hint updates causing shutdown to
hang in tests (scylladb#8079).

`view_update_write_response_handler` implements retiring by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy` that
a node is down.

To make it possible to reuse this algorithm for other handlers, move the
functionality into `abstract_write_response_handler`. We inherit from
`bi::list_base_hook` so it introduces small memory overhead to each
write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large, the
overhead is small compared to their size.

Not all handlers are added to the retiring list, this is controlled by
the `retirable` parameter passed to the constructor. For now we're only
retiring view handlers as before. In following commits we'll also retire
hint handlers.
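
A hedged sketch of the mechanism this commit message describes, using Boost.Intrusive directly (heavily simplified, illustrative code; not the actual storage_proxy implementation): handlers that opt in register themselves on an intrusive list, and on shutdown (or when a node is marked down) we walk the list and time each one out.

#include <boost/intrusive/list.hpp>
#include <iostream>
#include <memory>
#include <vector>

namespace bi = boost::intrusive;

// Simplified stand-in for abstract_write_response_handler carrying the list hook.
struct write_response_handler : public bi::list_base_hook<bi::link_mode<bi::auto_unlink>> {
    bool retirable;
    bool timed_out = false;
    explicit write_response_handler(bool r) : retirable(r) {}
    void timeout_cb() {   // what "retiring" a handler boils down to
        timed_out = true;
        unlink();         // auto_unlink hook: drop ourselves from the list
    }
};

// auto_unlink hooks require a list without constant-time size.
using handler_list = bi::list<write_response_handler, bi::constant_time_size<false>>;

int main() {
    handler_list retirable_handlers;
    std::vector<std::unique_ptr<write_response_handler>> all;

    // A view-update handler and a hint handler opt in; a plain client write does not.
    all.push_back(std::make_unique<write_response_handler>(true));
    all.push_back(std::make_unique<write_response_handler>(true));
    all.push_back(std::make_unique<write_response_handler>(false));
    for (auto& h : all) {
        if (h->retirable) {
            retirable_handlers.push_back(*h);
        }
    }

    // On shutdown, or when gossiper reports a node down: time out everything on the list.
    while (!retirable_handlers.empty()) {
        retirable_handlers.front().timeout_cb();
    }
    for (auto& h : all) {
        std::cout << "retirable=" << h->retirable << " timed_out=" << h->timed_out << "\n";
    }
}
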
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue May 26, 2023
Whether a write handler should be retirable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint` which is called by hints manager.

This will cause hint write handlers to immediately timeout when we
shutdown or when a destination node is marked as dead.

Fixes scylladb#8079
avikivity added a commit that referenced this issue May 29, 2023
…from Kamil Braun

The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose:
to make it possible to cancel a handler for a view update write,
which means we stop waiting for a response to the write, timing out
the handler immediately. This was done to solve issue with node
shutdown hanging because it was waiting for a view update to finish;
view updates were configured with 5 minute timeout. See #3966, #4028.

Now we're having a similar problem with hint updates causing shutdown
to hang in tests (#8079).

`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy`
that a node is down.

To make it possible to reuse this algorithm for other handlers, move
the functionality into `abstract_write_response_handler`. We inherit
from `bi::list_base_hook` so it introduces small memory overhead to
each write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large, the
overhead is small compared to their size.

Use this new functionality to also cancel hint write handlers when we
shutdown. This fixes #8079.

Closes #14047

* github.com:scylladb/scylladb:
  test: reproducer for hints manager shutdown hang
  test: pylib: ScyllaCluster: generalize config type for `server_add`
  test: pylib: scylla_cluster: add explicit timeout for graceful server stop
  service: storage_proxy: make hint write handlers cancellable
  service: storage_proxy: rename `view_update_handlers_list`
  service: storage_proxy: make it possible to cancel all write handler types
DoronArazii modified the milestones: 5.3, 5.4 May 30, 2023
margdoc pushed a commit to margdoc/scylla that referenced this issue Jun 6, 2023
elcallio pushed a commit to elcallio/scylla that referenced this issue Jun 7, 2023

denesb commented Dec 18, 2023

Seems like this issue mainly affects tests, so not backporting.
