
~timer crashed during shutdown in simple_add_new_node_while_schema_changes_test: Assertion `!hook.is_linked()' failed. #5999

Closed
bhalevy opened this issue Mar 10, 2020 · 12 comments

bhalevy commented Mar 10, 2020

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/417/artifact/logs-release.2/1583810078409_update_cluster_layout_tests.TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test/node4.log

INFO  2020-03-10 03:04:29,395 [shard 0] storage_service - JOINING: Starting to bootstrap...
INFO  2020-03-10 03:04:29,397 [shard 0] repair - bootstrap_with_repair: started with keyspaces={ks, system_auth, system_traces, system_distributed}
INFO  2020-03-10 03:04:29,402 [shard 0] repair - bootstrap_with_repair: started with keyspace=ks, nr_ranges=256
...
INFO  2020-03-10 03:04:29,687 [shard 0] repair - repair id 1 on shard 0 failed: std::_Nested_exception<no_such_column_family> (Can't find a column family cf in keyspace ks): std::out_of_range ()
INFO  2020-03-10 03:04:29,687 [shard 0] repair - repair id 1 failed: std::runtime_error ({shard 0: std::_Nested_exception<no_such_column_family> (Can't find a column family cf in keyspace ks): std::out_of_range (), shard 1: std::_Nested_exception<no_such_column_family> (Can't find a column family cf in keyspace ks): std::out_of_range ()})
INFO  2020-03-10 03:04:29,687 [shard 0] repair - repair id 1 to sync data for keyspace=ks, status=failed: std::runtime_error ({shard 0: std::_Nested_exception<no_such_column_family> (Can't find a column family cf in keyspace ks): std::out_of_range (), shard 1: std::_Nested_exception<no_such_column_family> (Can't find a column family cf in keyspace ks): std::out_of_range ()})
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down gossiping
INFO  2020-03-10 03:04:29,688 [shard 0] gossip - My status = BOOT
WARN  2020-03-10 03:04:29,688 [shard 0] gossip - No local state or state is in silent shutdown, not announcing shutdown
INFO  2020-03-10 03:04:29,688 [shard 0] gossip - Disable and wait for gossip loop started
INFO  2020-03-10 03:04:29,688 [shard 0] gossip - Disable and wait for gossip loop finished
INFO  2020-03-10 03:04:29,688 [shard 0] gossip - Gossip is now stopped
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down gossiping was successful
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down storage service notifications
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down storage service notifications was successful
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down repair service
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down repair service was successful
INFO  2020-03-10 03:04:29,688 [shard 0] init - Shutting down cdc
INFO  2020-03-10 03:04:29,689 [shard 0] init - Shutting down cdc was successful
INFO  2020-03-10 03:04:29,689 [shard 0] init - Shutting down migration manager
INFO  2020-03-10 03:04:29,689 [shard 0] migration_manager - stopping migration service
INFO  2020-03-10 03:04:29,689 [shard 1] migration_manager - stopping migration service
INFO  2020-03-10 03:04:29,689 [shard 0] init - Shutting down migration manager was successful
INFO  2020-03-10 03:04:29,689 [shard 0] init - Shutting down database
INFO  2020-03-10 03:04:29,689 [shard 0] compaction_manager - Asked to stop
INFO  2020-03-10 03:04:29,689 [shard 0] compaction_manager - Stopped
INFO  2020-03-10 03:04:29,689 [shard 1] compaction_manager - Asked to stop
INFO  2020-03-10 03:04:29,689 [shard 1] compaction_manager - Stopped
WARN  2020-03-10 03:04:29,748 [shard 0] commitlog - Segment CommitLog-1-1525730617.log is dirty and is left on disk.
INFO  2020-03-10 03:04:29,749 [shard 0] init - Shutting down database: waiting for background jobs...
INFO  2020-03-10 03:04:29,749 [shard 0] init - Shutting down database was successful
INFO  2020-03-10 03:04:29,754 [shard 0] init - Shutting down migration manager notifier
INFO  2020-03-10 03:04:29,757 [shard 0] init - Shutting down migration manager notifier was successful
INFO  2020-03-10 03:04:29,757 [shard 0] init - Shutting down token metadata
scylla: /usr/include/boost/intrusive/detail/generic_hook.hpp:48: void boost::intrusive::detail::destructor_impl(Hook&, boost::intrusive::detail::link_dispatch<boost::intrusive::safe_link>) [with Hook = boost::intrusive::generic_hook<boost::intrusive::CircularListAlgorithms, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, boost::intrusive::safe_link, boost::intrusive::NoBaseHookId>]: Assertion `!hook.is_linked()' failed.
Aborting on shard 0.
Backtrace:
  0x0000000002e80e12
  0x0000000002e2447b
  0x0000000002e24775
  0x0000000002e24810
  0x00007f2dca668b1f
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/d895f5e1311118cc93e8c84384914a0a9bde2267/scylla//opt/scylladb/libreloc/libc.so.6+0x000000000003c624
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/d895f5e1311118cc93e8c84384914a0a9bde2267/scylla//opt/scylladb/libreloc/libc.so.6+0x00000000000258d8
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/d895f5e1311118cc93e8c84384914a0a9bde2267/scylla//opt/scylladb/libreloc/libc.so.6+0x00000000000257a8
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/d895f5e1311118cc93e8c84384914a0a9bde2267/scylla//opt/scylladb/libreloc/libc.so.6+0x0000000000034a65
  0x0000000002e8055a
  0x0000000000d0da2f
  0x0000000002e209f0
  0x0000000002e20bff
  0x0000000002e5b1ad
  0x0000000002dbe46f
  0x0000000002dbf59e
  0x0000000000c9c1d4
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/d895f5e1311118cc93e8c84384914a0a9bde2267/scylla//opt/scylladb/libreloc/libc.so.6+0x00000000000271a2
  0x0000000000bb602d

Decoded backtrace:

$ addr2line -Cfpie libexec/scylla 
  0x0000000002e8055a
  0x0000000000d0da2f
  0x0000000002e209f0
  0x0000000002e20bff
  0x0000000002e5b1ad
  0x0000000002dbe46f
  0x0000000002dbf59e
  0x0000000000c9c1d4
void boost::intrusive::detail::destructor_impl<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0> >(boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>&, boost::intrusive::detail::link_dispatch<(boost::intrusive::link_mode_type)1>) at /usr/include/boost/intrusive/detail/generic_hook.hpp:48
 (inlined by) boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>::~generic_hook() at /usr/include/boost/intrusive/detail/generic_hook.hpp:193
 (inlined by) boost::intrusive::list_member_hook<void, void, void>::~list_member_hook() at /usr/include/boost/intrusive/list_hook.hpp:207
 (inlined by) seastar::timer<seastar::lowres_clock>::~timer() at ./build/release/seastar/./seastar/src/core/reactor.cc:589
seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry::~entry() at ././seastar/include/seastar/core/expiring_fifo.hh:61
 (inlined by) std::default_delete<seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry>::operator()(seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry*) const at /usr/include/c++/9/bits/unique_ptr.h:81
 (inlined by) std::default_delete<seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry>::operator()(seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry*) const at /usr/include/c++/9/bits/unique_ptr.h:75
 (inlined by) std::unique_ptr<seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry, std::default_delete<seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry> >::reset(seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry*) at /usr/include/c++/9/bits/unique_ptr.h:394
 (inlined by) seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::pop_front() at ././seastar/include/seastar/core/expiring_fifo.hh:200
 (inlined by) seastar::shared_future<>::shared_state::resolve(seastar::future<>&&) at ././seastar/include/seastar/core/shared_future.hh:141
 (inlined by) seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}::operator()(seastar::future<>&&) at ././seastar/include/seastar/core/shared_future.hh:153
 (inlined by) seastar::future<> seastar::do_void_futurize_helper<void>::apply<seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}, seastar::future<> >(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&, seastar::future<>&&) at ././seastar/include/seastar/core/future.hh:1594
 (inlined by) seastar::future<> seastar::futurize<void>::apply<seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}, seastar::future<> >(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&, seastar::future<>&&) at ././seastar/include/seastar/core/future.hh:1643
 (inlined by) seastar::future<>::then_wrapped_common<false, void, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}>(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::future_state<>&&)#1}::operator()(seastar::future_state) at ././seastar/include/seastar/core/future.hh:1270
 (inlined by) seastar::continuation<seastar::future<>::then_wrapped_common<false, void, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}>(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::future_state<>&&)#1}>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:511
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2091
seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2506
seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2489
 (inlined by) seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2666
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:199
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:115
main at ./main.cc:482

Core dump here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/417/artifact/logs-release.2/1583810078409_update_cluster_layout_tests.TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test/node4-scylla.45929.1583809469.core.gz
Reloc package here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/501/artifact/scylla/build/release/scylla-package.tar.gz
And on s3: http://downloads.scylladb.com/relocatable/unstable/master/2020-03-09T21:59:53Z/scylla-package.tar.gz

xemul commented Mar 10, 2020

A timer is being destroyed without having been unlinked from the reactor's lists.
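
(For context: this is standard Boost.Intrusive safe-mode behavior. A hook in safe_link mode asserts in its destructor if the object is destroyed while still linked into a list. A minimal standalone reproduction of the assertion mechanism, not Scylla code:)

// Minimal illustration of the Boost.Intrusive safe_link assertion (not Scylla code).
// Build without -DNDEBUG so the assertion is active.
#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

struct element {
    // safe_link hooks assert !is_linked() in their destructor.
    bi::list_member_hook<bi::link_mode<bi::safe_link>> hook;
};

using element_list = bi::list<
    element,
    bi::member_hook<element,
                    bi::list_member_hook<bi::link_mode<bi::safe_link>>,
                    &element::hook>>;

int main() {
    element_list lst;
    {
        element e;
        lst.push_back(e);
        // `e` is destroyed here while still linked into `lst`, tripping the same
        // `!hook.is_linked()' assertion in generic_hook.hpp:48 seen above.
        // Unlinking first (lst.erase(lst.iterator_to(e))) would avoid it.
    }
}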

xemul commented Mar 10, 2020

The start was aborted: join_token_ring -> bootstrap threw an exception after set_mode(mode::JOINING, "Starting to bootstrap...", true);, and then the shutdown unroll began.

From logs:

storage_service - JOINING: Starting to bootstrap...
... <repair/compaction stuff>
init - Shutting down gossiping

xemul commented Mar 10, 2020

The timer is the one inside an expiring_fifo. The latter is used by reader_concurrency_semaphore and logalloc::region_group in Scylla, and by shared_future in Seastar.

xemul commented Mar 10, 2020

The crash is in shared_future::resolve -> expiring_fifo::pop_front -> unique_ptr::reset -> ... -> timer::~timer
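
(In outline, each expiring_fifo entry owns a per-entry expiration timer, and pop_front() destroys the front entry, and with it the timer, through a unique_ptr. A self-contained analogue of that ownership chain, with placeholder types rather than the actual Seastar source:)

// Placeholder analogue of the destruction chain in the backtrace (not Seastar code).
#include <deque>
#include <memory>
#include <optional>

struct fake_promise {};

struct fake_timer {
    // In the real seastar::timer this is a boost::intrusive safe_link hook whose
    // destructor asserts the timer is no longer linked into the reactor's list.
    bool linked = false;
};

struct fifo_entry {
    std::optional<fake_promise> payload;  // the queued waiter (a promise<> in this crash)
    fake_timer tr;                        // per-entry expiration timer
};

struct fifo_sketch {
    std::deque<std::unique_ptr<fifo_entry>> entries;

    void pop_front() {
        // unique_ptr::reset() runs ~fifo_entry() and then ~fake_timer(), mirroring
        // the entry::~entry -> timer::~timer frames in the decoded backtrace.
        entries.front().reset();
        entries.pop_front();
    }
};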

xemul commented Mar 11, 2020

#7 seastar::timer<seastar::lowres_clock>::~timer (this=0x6000000c83a8, __in_chrg=<optimized out>) at ./seastar/src/core/reactor.cc:589
(gdb) p (seastar::timer<seastar::lowres_clock>*)0x6000000c83a8
$1 = (seastar::timer<seastar::lowres_clock> *) 0x6000000c83a8
(gdb) p $1->_expired
$2 = false
(gdb) p $1->_queued
$3 = false
(gdb) p $1->_armed
$4 = false

xemul commented Mar 11, 2020

#17 seastar::future<>::then_wrapped_common<false, void, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}>(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::future_state<>&&)#1}::operator()(seastar::future_state) (state=..., this=0x6000019719b8) at ./seastar/include/seastar/core/future.hh:1270
(gdb) p (seastar::future<>*)0x6000019719b8
$13 = (seastar::future<> *) 0x6000019719b8
(gdb) p $13->_promise
$15 = (seastar::internal::promise_base *) 0x0

xemul commented Mar 11, 2020

The entry being deleted is the container of the timer:

#7  seastar::timer<seastar::lowres_clock>::~timer (this=0x6000000c83a8, __in_chrg=<optimized out>) at ./seastar/src/core/reactor.cc:589

timer is 0x6000000c83a8. Its offset in expiring_fifo::entry is

(gdb) print (int)(&(((seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry *)0)->tr))
$26 = 40

so the entry is

(gdb) p (seastar::expiring_fifo<seastar::promise<>, seastar::promise_expiry<>, seastar::lowres_clock>::entry*)(0x6000000c83a8-40)
(gdb) p $27->payload 
$30 = {<std::_Optional_base<seastar::promise<>, false, false>> = {<std::_Optional_base_impl<seastar::promise<>, std::_Optional_base<seastar::promise<>, false, false> >> = {<No data fields>}, _M_payload = {<std::_Optional_payload<seastar::promise<>, true, false, false>> = {<std::_Optional_payload_base<seastar::promise<> >> = {_M_payload = {_M_empty = {<No data fields>}, _M_value = {<seastar::internal::promise_base_with_type<>> = {<seastar::internal::promise_base> = {_future = 0x0, _state = 0x0, _task = 0x0}, static copy_noexcept = <error reading variable: Missing ELF symbol "seastar::internal::promise_base_with_type<>::copy_noexcept".>}, _local_state = {<seastar::future_state_base> = {_u = {st = seastar::future_state_base::state::future, ex = {_M_exception_object = 0x1}}}, <seastar::internal::uninitialized_wrapper<std::tuple<> >> = {<seastar::internal::uninitialized_wrapper_base<std::tuple<>, true>> = {<std::tuple<>> = {<No data fields>}, <No data fields>}, <No data fields>}, static copy_noexcept = <error reading variable: Missing ELF symbol "seastar::future_state<>::copy_noexcept".>, static has_trivial_move_and_destroy = true}}}, _M_engaged = true}, <No data fields>}, <No data fields>}}, <std::_Enable_copy_move<false, false, true, true, std::optional<seastar::promise<> > >> = {<No data fields>}, <No data fields>}
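
(The gdb session above recovers the enclosing entry from the timer pointer by hand; the same container-from-member arithmetic in C++, with placeholder types:)

// Recovering the enclosing object from a pointer to one of its members,
// the arithmetic done by hand in the gdb session above (placeholder types).
#include <cstddef>
#include <cstdint>

struct timer_placeholder { char opaque[48]; };

struct entry_placeholder {
    char before_timer[40];     // whatever precedes the timer; the dump shows offset 40
    timer_placeholder tr;
};

entry_placeholder* container_of_timer(timer_placeholder* t) {
    const std::size_t off = offsetof(entry_placeholder, tr);   // 40 here
    return reinterpret_cast<entry_placeholder*>(
        reinterpret_cast<std::uintptr_t>(t) - off);
}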

xemul commented Mar 11, 2020

The timer's list hook is

{next_ = 0x1, prev_ = 0x0}

This is why is_linked() returns true: an unlinked safe_link hook holds null pointers, so the 0x1 here makes the hook look linked.

xemul commented Mar 11, 2020

The migration manager (mm) notifier is already stopped, but there is still a pending task:

(gdb) scylla tasks
(task*) 0x6000013d11c0  vtable for seastar::continuation<seastar::future<>::then_impl<seastar::async<service::migration_notifier::drop_keyspace(seastar::basic_sstring<char, unsigned int, 15u, true> const&)::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<service::migration_notifier::drop_keyspace(seastar::basic_sstring<char, unsigned int, 15u, true> const&)::{lambda()#1}>::type&&)...)::{lambda(seastar::async<{lambda()#1}>(seastar::futurize<std::result_of<std::decay<auto:1>::type ()>::type>::type, seastar::thread_attributes, std::decay<auto:1>::type&&)::work&)#1}::operator()(seastar::async<{lambda()#1}>(seastar::futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, seastar::thread_attributes, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#2}, seastar::future<> >(seastar::thread_attributes)::{lambda()#1}::operator()() const::{lambda(seastar::future_state<>&&)#1}> + 16 

xemul commented Mar 11, 2020

@bhalevy, if my drilling is correct, I tend to think this is a dup of #5903 -- the migration manager's fiber is not stopped. The patch for it is on the mailing list.

xemul commented Mar 11, 2020

[PATCH v2] migration_manager: Run background schema merge in gate

tgrabiec pushed a commit that referenced this issue Mar 16, 2020
The call for merge_schema_from in some cases is run in the
background and thus is not aborted/waited on shutdown. This
may result in use-after-free one of which is

merge_schema_from
 -> read_schema_for_keyspace
     -> db::system_keyspace::query
         -> storage_proxy::query
             -> query_partition_key_range_concurrent

in the latter function the proxy._token_metadata is accessed,
while the respective object can be already free (unlike the
storage_proxy itself that's still leaked on shutdown).

Related bug: #5903, #5999 (cannot reproduce though)
Tests: unit(dev), manual start-stop
       dtest(consistency.TestConsistency, dev)
       dtest(schema_management, dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by: Pekka Enberg <penberg@scylladb.com>
Message-Id: <20200316150348.31118-1-xemul@scylladb.com>
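
(The pattern the patch applies, in rough outline: background work enters a seastar::gate, and shutdown closes the gate and waits before tearing down the objects that work touches. A minimal sketch with hypothetical names, not the actual migration_manager code:)

// Rough sketch of "run background work in a gate" (hypothetical service,
// not the actual migration_manager code).
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

class background_service {
    seastar::gate _background_gate;

    seastar::future<> do_merge() {            // stand-in for the real schema merge
        return seastar::make_ready_future<>();
    }

public:
    void kick_off_merge_in_background() {
        // with_gate() holds the gate for the task's duration; if the gate is already
        // closed (shutdown in progress) it refuses to start new work. A real caller
        // would also handle exceptions from the returned future.
        (void)seastar::with_gate(_background_gate, [this] {
            return do_merge();
        });
    }

    seastar::future<> stop() {
        // close() waits for every task still inside the gate, so the objects the
        // background work uses (storage_proxy, token_metadata, ...) are not destroyed
        // underneath it, which is the use-after-free described in the commit message.
        return _background_gate.close();
    }
};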

bhalevy commented Mar 17, 2020

14de126 is on master.
Closing this issue, will reopen if reproduced.

bhalevy closed this as completed Mar 17, 2020
avikivity added a commit to avikivity/scylladb that referenced this issue Dec 16, 2021
…erge

The definitions_update() verb captures a shared_ptr to storage_proxy
to keep it alive while the background task executes.

This was introduced in (2016!):

commit 1429213
Author: Pekka Enberg <penberg@scylladb.com>
Date:   Mon Mar 14 17:57:08 2016 +0200

    main: Defer migration manager RPC verb registration after commitlog replay

    Defer registering migration manager RPC verbs after commitlog has
    been replayed so that our own schema is fully loaded before other
    nodes start querying it or sending schema updates.
    Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>

when moving this code from storage_proxy.cc.

Later, better protection with a gate was added:

commit 14de126
Author: Pavel Emelyanov <xemul@scylladb.com>
Date:   Mon Mar 16 18:03:48 2020 +0300

    migration_manager: Run background schema merge in gate

    The call for merge_schema_from in some cases is run in the
    background and thus is not aborted/waited on shutdown. This
    may result in use-after-free one of which is

    merge_schema_from
     -> read_schema_for_keyspace
         -> db::system_keyspace::query
             -> storage_proxy::query
                 -> query_partition_key_range_concurrent

    in the latter function the proxy._token_metadata is accessed,
    while the respective object can be already free (unlike the
    storage_proxy itself that's still leaked on shutdown).

    Related bug: scylladb#5903, scylladb#5999 (cannot reproduce though)
    Tests: unit(dev), manual start-stop
           dtest(consistency.TestConsistency, dev)
           dtest(schema_management, dev)

    Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
    Reviewed-by: Pekka Enberg <penberg@scylladb.com>
    Message-Id: <20200316150348.31118-1-xemul@scylladb.com>

Now that the task execution is protected by the gate, and therefore
by the migration_manager's lifetime (which is contained within that of
storage_proxy, since the migration_manager is constructed afterwards),
capturing the shared_ptr is no longer needed. We therefore remove it,
as it relies on the deprecated global storage_proxy accessors.
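
(Schematically, the two lifetime-keeping patterns contrasted in this commit message, with placeholder names rather than the actual Scylla code:)

// Contrast of the two patterns discussed above (placeholder names, not Scylla code).
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/shared_ptr.hh>

struct proxy {};   // stands in for storage_proxy

// 2016 approach: capture a shared_ptr so the object outlives the background task.
seastar::future<> run_with_shared_ptr(seastar::shared_ptr<proxy> p) {
    return seastar::make_ready_future<>().then([p] {
        // capturing p keeps the proxy alive for the duration of this continuation
    });
}

// Gated approach: the task holds a gate; stop() closes it and waits, so the
// proxy cannot be torn down while the task runs and no shared_ptr is needed.
struct gated_runner {
    seastar::gate _gate;

    seastar::future<> run(proxy& p) {
        return seastar::with_gate(_gate, [&p] {
            // safe: stop() below does not complete until this task finishes
            (void)p;
        });
    }

    seastar::future<> stop() { return _gate.close(); }
};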