Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

fruch · 2024-01-07T09:48:05Z

Issue description

This issue is a regression.
It is unknown if this issue is a regression.

While decommissioning node-9, multiple nodes (node-3, node-5), had coredumps with the following error
and failing the whole load since 2 nodes were down, and also failing the decommission:

2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3      !ERR | scylla[6166]:  [shard  0: gms] token_metadata - endpoint for host_id 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is not found, at: 0x611fd1e 0x6120330 0x6120618 0x5bdffa7 0x3ef709a 0x4120f29 0x3f16cd1 0x13a30fa 0x5c1cf9f 0x5c1e287 0x5c1d5e8 0x5bafbc7 0x5baed7c 0x1336d79 0x13387d0 0x13353fc /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x1332ca4
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(auto:1&&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::coroutine::parallel_for_each<gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0>
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Aborting on shard 0.
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Backtrace:
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c0b3e8
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c41c91
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x3dbaf
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x8e883
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x3dafd
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x2687e
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5be0027
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x3ef709a
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x4120f29
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x3f16cd1
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13a30fa
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1cf9f
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1e287
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1d5e8
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5bafbc7
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5baed7c
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x1336d79
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13387d0
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13353fc
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x27b89
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x27c4a
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x1332ca4

2024-01-06 07:16:27.907 <2024-01-06 06:57:57.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=2425c925-c2c6-48f3-b447-0be1717d2876 node=Node longevity-tls-50gb-3d-master-db-node-5329f695-3 [3.249.178.182 | 10.4.9.246] (seed: True)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz
backtrace=           PID: 6166 (scylla)
UID: 112 (scylla)
GID: 118 (scylla)
Signal: 6 (ABRT)
Timestamp: Sat 2024-01-06 06:57:57 UTC (1min 48s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 25 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
Unit: scylla-server.service
Slice: scylla-server.slice
Boot ID: 5991a193a26741db95420f046b1e1093
Machine ID: 06b5d0f9b5a84c899486186dc641d921
Hostname: longevity-tls-50gb-3d-master-db-node-5329f695-3
Storage: /var/lib/systemd/coredump/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000 (present)
Disk Size: 114.5G
Message: Process 6166 (scylla) of user 112 dumped core.
...
Found module scylla with build-id: f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d
Stack trace of thread 6166:
#0  0x00007f012daea884 __pthread_kill_implementation (libc.so.6 + 0x8e884)
#1  0x00007f012da99afe raise (libc.so.6 + 0x3dafe)
#2  0x00007f012da8287f abort (libc.so.6 + 0x2687f)
#3  0x0000000005be0028 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla + 0x59e0028)
#4  0x0000000003ef709b _ZNK7locator14token_metadata24get_endpoint_for_host_idEN5utils11tagged_uuidINS_11host_id_tagEEE (scylla + 0x3cf709b)
#5  0x0000000004120f2a _ZN7seastar20noncopyable_functionIFSt8optionalIN7locator16endpoint_dc_rackEEN5utils11tagged_uuidINS2_11host_id_tagEEEEE17direct_vtable_forIZN7service15storage_service27update_topology_change_infoENS_13lw_shared_ptrINS2_14token_metadataEEENS_13basic_sstringIcjLj15ELb1EEEE3$_0E4callEPKSA_S8_ (scylla + 0x3f20f2a)
#6  0x0000000003f16cd2 _ZN7locator19token_metadata_impl27update_topology_change_infoERN7seastar20noncopyable_functionIFSt8optionalINS_16endpoint_dc_rackEEN5utils11tagged_uuidINS_11host_id_tagEEEEEE.resume (scylla + 0x3d16cd2)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c1d5e9 _ZN7seastar7reactor3runEv (scylla + 0x5a1d5e9)
#11 0x0000000005bafbc8 _ZN7seastar12app_template14run_deprecatedEiPPcOSt8functionIFvvEE (scylla + 0x59afbc8)
#12 0x0000000005baed7d _ZN7seastar12app_template3runEiPPcOSt8functionIFNS_6futureIiEEvEE (scylla + 0x59aed7d)
#13 0x0000000001336d7a _ZL11scylla_mainiPPc (scylla + 0x1136d7a)
#14 0x00000000013387d1 _ZNKSt8functionIFiiPPcEEclEiS1_ (scylla + 0x11387d1)
#15 0x00000000013353fd main (scylla + 0x11353fd)
#16 0x00007f012da83b8a __libc_start_call_main (libc.so.6 + 0x27b8a)
#17 0x00007f012da83c4b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27c4b)
#18 0x0000000001332ca5 _start (scylla + 0x1132ca5)
Stack trace of thread 6181:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6183:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6186:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6182:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6192:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6167:
#0  0x0000000005c6b560 _ZN7seastar8internal12io_geteventsEmllPNS0_9linux_abi8io_eventEPK8timespecb (scylla + 0x5a6b560)
#1  0x0000000005c66bdf _ZN7seastar19aio_storage_context16reap_completionsEb (scylla + 0x5a66bdf)
#2  0x0000000005c67a42 _ZN7seastar19reactor_backend_aio23reap_kernel_completionsEv (scylla + 0x5a67a42)
#3  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#4  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6190:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6188:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6184:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6168:
#0  0x0000000002a38fb2 _ZNSt8__detail9__variant17__gen_vtable_implINS0_12_Multi_arrayIPFNS0_21__deduce_visit_resultIbEEO18overloaded_functorIJZN4cql34expr13recurse_untilERKNS7_10expressionERKN7seastar20noncopyable_functionIFbSA_EEEE3$_0ZNS7_13recurse_untilESA_SG_E3$_1ZNS7_13recurse_untilESA_SG_E3$_2ZNS7_13recurse_untilESA_SG_E3$_3ZNS7_13recurse_untilESA_SG_E3$_4ZNS7_13recurse_untilESA_SG_E3$_5ZNS7_13recurse_untilESA_SG_E3$_6ZNS7_13recurse_untilESA_SG_E3$_7ZNS7_13recurse_untilESA_SG_E3$_8ZNS7_13recurse_untilESA_SG_E3$_9ZNS7_13recurse_untilESA_SG_E4$_10EERSt7variantIJNS7_11conjunctionENS7_15binary_operatorENS7_12column_valueENS7_21unresolved_identifierENS7_25column_mutation_attributeENS7_13function_callENS7_4castENS7_15field_selectionENS7_13bind_variableENS7_16untyped_constantENS7_8constantENS7_17tuple_constructorENS7_22collection_constructorENS7_20usertype_constructorENS7_9subscriptENS7_9temporaryEEEEJEEESt16integer_sequenceImJLm1EEEE14__visit_invokeEST_S1C_ (scylla + 0x2838fb2)
#1  0x0000000002a0623d _ZN4cql34expr13recurse_untilERKNS0_10expressionERKN7seastar20noncopyable_functionIFbS3_EEE (scylla + 0x280623d)
#2  0x0000000003451fdd _ZNK4cql312restrictions22statement_restrictions24get_partition_key_rangesERKNS_13query_optionsE (scylla + 0x3251fdd)
#3  0x0000000002c359f7 _ZNK4cql310statements16select_statement10do_executeERNS_15query_processorERN7service11query_stateERKNS_13query_optionsE (scylla + 0x2a359f7)
#4  0x0000000002cada48 _ZN7seastar20noncopyable_functionIFNS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE17direct_vtable_forISt7_Mem_fnIMSA_KFS7_SE_SH_SK_EEE4callEPKSM_SC_SE_SH_SK_ (scylla + 0x2aada48)
#5  0x0000000002cadfbd _ZN7seastar20noncopyable_functionIFNS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE17direct_vtable_forIZNS_35inheriting_concrete_execution_stageIS7_JSC_SE_SH_SK_EE20make_stage_for_groupENS_16scheduling_groupEEUlSC_SE_SH_SK_E_E4callEPKSM_SC_SE_SH_SK_ (scylla + 0x2aadfbd)
#6  0x0000000002cadc56 _ZN7seastar24concrete_execution_stageINS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEJPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE8do_flushEv (scylla + 0x2aadc56)
#7  0x0000000005bb7b43 _ZN7seastar11lambda_taskIZNS_15execution_stage5flushEvE3$_0E15run_and_disposeEv (scylla + 0x59b7b43)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6178:
#0  0x00007f012db66b4d syscall (libc.so.6 + 0x10ab4d)
#1  0x0000000005c6b7b1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x5a6b7b1)
#2  0x0000000005c675d3 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x5a675d3)
#3  0x0000000005c67cfd _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x5a67cfd)
#4  0x0000000005c1e57d _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e57d)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6180:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6179:
#0  0x0000000005c2a469 _ZN7seastar3smp11poll_queuesEv (scylla + 0x5a2a469)
#1  0x0000000005c60d1b _ZN7seastar7reactor10smp_pollfn4pollEv (scylla + 0x5a60d1b)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6187:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6177:
#0  0x0000000005bca29c _ZN7seastar6memory24drain_cross_cpu_freelistEv (scylla + 0x59ca29c)
#1  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#2  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#3  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#4  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#5  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#6  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6185:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6169:
#0  0x0000000002083ef1 _ZN8logalloc11region_impl5allocEPK15migrate_fn_typemm (scylla + 0x1e83ef1)
#1  0x0000000001e35833 _ZNK18compact_radix_tree4treeI13cell_and_hashjE9node_head5cloneIRZN3rowC1ERK6schema11column_kindRKS5_E3$_0EESt4pairIPS3_NSt15__exception_ptr13exception_ptrEEOT_j (scylla + 0x1c35833)
#2  0x0000000001dfaddf _ZN13deletable_rowC2ERK6schemaRKS_ (scylla + 0x1bfaddf)
#3  0x0000000001dab503 _ZN18mutation_partitionC1ERK6schemaRKS_ (scylla + 0x1bab503)
#4  0x0000000001efc7f3 _ZN15partition_entry5applyERN8logalloc6regionER16mutation_cleanerRK6schemaRK18mutation_partitionS7_R26mutation_application_stats (scylla + 0x1cfc7f3)
#5  0x0000000001cb6534 _ZN7replica8memtable5applyERK15frozen_mutationRKN7seastar13lw_shared_ptrIK6schemaEEON2db9rp_handleE (scylla + 0x1ab6534)
#6  0x0000000001b7b329 _ZN7replica5table5applyERK15frozen_mutationN7seastar13lw_shared_ptrIK6schemaEEON2db9rp_handleENSt6chrono10time_pointINS4_12lowres_clockENSC_8durationIlSt5ratioILl1ELl1000000000EEEEEE (scylla + 0x197b329)
#7  0x00000000019fd237 _ZN7replica8database8do_applyEN7seastar13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS1_12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000000000EEEEEENS1_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSK_24per_partition_rate_limit12account_onlyENSP_19account_and_enforceEEE (scylla + 0x17fd237)
#8  0x0000000001a93d22 _ZN7seastar20noncopyable_functionIFNS_6futureIvEEPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE17direct_vtable_forISt7_Mem_fnIMS4_FS2_S9_SC_SE_SM_SQ_SW_EEE4callEPKSY_S5_S9_SC_SE_SM_SQ_SW_ (scylla + 0x1893d22)
#9  0x0000000001ab1f5c _ZN7seastar20noncopyable_functionIFNS_6futureIvEEPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE17direct_vtable_forIZNS_35inheriting_concrete_execution_stageIS2_JS5_S9_SC_SE_SM_SQ_SW_EE20make_stage_for_groupENS_16scheduling_groupEEUlS5_S9_SC_SE_SM_SQ_SW_E_E4callEPKSY_S5_S9_SC_SE_SM_SQ_SW_ (scylla + 0x18b1f5c)
#10 0x0000000001ab1b29 _ZN7seastar24concrete_execution_stageINS_6futureIvEEJPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE8do_flushEv (scylla + 0x18b1b29)
#11 0x0000000005bb7b43 _ZN7seastar11lambda_taskIZNS_15execution_stage5flushEvE3$_0E15run_and_disposeEv (scylla + 0x59b7b43)
#12 0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#13 0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#14 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#15 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#16 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#17 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6170:
#0  0x0000000001fd9010 _ZN9row_cache15make_reader_optEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEPK18tombstone_gc_stateN7tracing15trace_state_ptrENS0_10bool_classIN17streamed_mutation14forwarding_tagEEENSL_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1dd9010)
#1  0x0000000001b20cca _ZNK7replica5table14make_reader_v2EN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1920cca)
#2  0x0000000001c04b23 _ZNSt17_Function_handlerIF23flat_mutation_reader_v2N7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEEEZNK7replica5table18as_mutation_sourceEvE3$_0E9_M_invokeERKSt9_Any_dataOS5_OS6_SC_SG_OSI_OSM_OSP_ (scylla + 0x1a04b23)
#3  0x0000000001b83f85 _ZN5query7querierC2ERK15mutation_sourceN7seastar13lw_shared_ptrIK6schemaEE13reader_permit20nonwrapping_intervalIN3dht13ring_positionEENS_15partition_sliceEN7tracing15trace_state_ptrENS_12querier_base14querier_configE (scylla + 0x1983f85)
#4  0x0000000001b8066f _ZN7replica5table5queryEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRKN5query12read_commandENS7_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISG_EEN7tracing15trace_state_ptrERNS7_21result_memory_limiterENSt6chrono10time_pointINS1_12lowres_clockENSP_8durationIlSt5ratioILl1ELl1000000000EEEEEEPSt8optionalINS7_7querierEE (scylla + 0x198066f)
#5  0x0000000001aa9039 _ZN7seastar20noncopyable_functionIFNS_6futureIvEE13reader_permitEE19indirect_vtable_forIZN7replica8database5queryENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandENSD_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISM_EEN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENST_8durationIlSt5ratioILl1ELl1000000000EEEEEESt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENS14_19account_and_enforceEEEE3$_0E4callEPKS5_S3_ (scylla + 0x18a9039)
#6  0x0000000004694fe1 _ZN28reader_concurrency_semaphore14execution_loopEv.resume (scylla + 0x4494fe1)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6176:
#0  0x0000000006003c95 _ZN7seastar8io_queue13poll_io_queueEv (scylla + 0x5e03c95)
#1  0x0000000005c60f29 _ZN7seastar7reactor26io_queue_submission_pollfn4pollEv (scylla + 0x5a60f29)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6193:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6189:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6175:
#0  0x00007ffe309b46e8 n/a (linux-vdso.so.1 + 0x6e8)
#1  0x00007ffe309b480a n/a (linux-vdso.so.1 + 0x80a)
#2  0x00007f012db322fd clock_gettime@@GLIBC_2.17 (libc.so.6 + 0xd62fd)
#3  0x00007f012dd37565 _ZNSt6chrono3_V212steady_clock3nowEv (libstdc++.so.6 + 0xd9565)
#4  0x0000000005c1e2a1 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2a1)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6173:
#0  0x00007f012db66b4d syscall (libc.so.6 + 0x10ab4d)
#1  0x0000000005c67ac6 _ZN7seastar19reactor_backend_aio18kernel_submit_workEv (scylla + 0x5a67ac6)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6174:
#0  0x000000000600a395 _ZN7seastar10fair_queue17dispatch_requestsESt8functionIFvRNS_16fair_queue_entryEEE (scylla + 0x5e0a395)
#1  0x0000000006003ca9 _ZN7seastar8io_queue13poll_io_queueEv (scylla + 0x5e03ca9)
#2  0x0000000005c60f29 _ZN7seastar7reactor26io_queue_submission_pollfn4pollEv (scylla + 0x5a60f29)
#3  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#4  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6191:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6172:
#0  0x0000000001fd9284 _ZN9row_cache15make_reader_optEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEPK18tombstone_gc_stateN7tracing15trace_state_ptrENS0_10bool_classIN17streamed_mutation14forwarding_tagEEENSL_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1dd9284)
#1  0x0000000001b20cca _ZNK7replica5table14make_reader_v2EN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1920cca)
#2  0x0000000001c04b23 _ZNSt17_Function_handlerIF23flat_mutation_reader_v2N7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEEEZNK7replica5table18as_mutation_sourceEvE3$_0E9_M_invokeERKSt9_Any_dataOS5_OS6_SC_SG_OSI_OSM_OSP_ (scylla + 0x1a04b23)
#3  0x0000000001b83f85 _ZN5query7querierC2ERK15mutation_sourceN7seastar13lw_shared_ptrIK6schemaEE13reader_permit20nonwrapping_intervalIN3dht13ring_positionEENS_15partition_sliceEN7tracing15trace_state_ptrENS_12querier_base14querier_configE (scylla + 0x1983f85)
#4  0x0000000001b8066f _ZN7replica5table5queryEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRKN5query12read_commandENS7_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISG_EEN7tracing15trace_state_ptrERNS7_21result_memory_limiterENSt6chrono10time_pointINS1_12lowres_clockENSP_8durationIlSt5ratioILl1ELl1000000000EEEEEEPSt8optionalINS7_7querierEE (scylla + 0x198066f)
#5  0x0000000001aa9039 _ZN7seastar20noncopyable_functionIFNS_6futureIvEE13reader_permitEE19indirect_vtable_forIZN7replica8database5queryENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandENSD_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISM_EEN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENST_8durationIlSt5ratioILl1ELl1000000000EEEEEESt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENS14_19account_and_enforceEEEE3$_0E4callEPKS5_S3_ (scylla + 0x18a9039)
#6  0x0000000004694fe1 _ZN28reader_concurrency_semaphore14execution_loopEv.resume (scylla + 0x4494fe1)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6171:
#0  0x0000000001757c20 _ZN7seastar6futureIvE16handle_exceptionIZZZZZNS_3rpc11recv_helperIN4netw10serializerESt8functionIFNS0_INS3_12no_wait_typeEEERKNS3_11client_infoENS3_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESI_jmNS3_8optionalISt8optionalIN7tracing10trace_infoEEEENSK_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEENSK_IN7service13fencing_tokenEEEEES9_JSE_SJ_SI_jmSP_SX_S10_ENS3_19do_want_client_infoENS3_18do_want_time_pointEEEDaNS3_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS3_6server10connectionEEESL_INSt6chrono10time_pointINS_12lowres_clockENS1J_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS3_7rcv_bufEE_clES1I_S1R_lS1S_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1L_EEEEDaS1U_ENUlvE_clEvENUlS9_E_clES9_EUlNSt15__exception_ptr13exception_ptrEE_EES1_OS1U_ (scylla + 0x1557c20)
#1  0x000000000175739f _ZZZZZN7seastar3rpc11recv_helperIN4netw10serializerESt8functionIFNS_6futureINS0_12no_wait_typeEEERKNS0_11client_infoENS0_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESG_jmNS0_8optionalISt8optionalIN7tracing10trace_infoEEEENSI_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENSR_19account_and_enforceEEEEENSI_IN7service13fencing_tokenEEEEES7_JSC_SH_SG_jmSN_SV_SY_ENS0_19do_want_client_infoENS0_18do_want_time_pointEEEDaNS0_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS0_6server10connectionEEESJ_INSt6chrono10time_pointINS_12lowres_clockENS1H_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS0_7rcv_bufEE_clES1G_S1P_lS1Q_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1J_EEEEDaS1S_ENUlvE_clEvENUlS7_E_clES7_ (scylla + 0x155739f)
#2  0x0000000001759638 _ZN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZZZZNS_3rpc11recv_helperIN4netw10serializerESt8functionIFNS_6futureINS4_12no_wait_typeEEERKNS4_11client_infoENS4_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESK_jmNS4_8optionalISt8optionalIN7tracing10trace_infoEEEENSM_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENSV_19account_and_enforceEEEEENSM_IN7service13fencing_tokenEEEEESB_JSG_SL_SK_jmSR_SZ_S12_ENS4_19do_want_client_infoENS4_18do_want_time_pointEEEDaNS4_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS4_6server10connectionEEESN_INSt6chrono10time_pointINS_12lowres_clockENS1L_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS4_7rcv_bufEE_clES1K_S1T_lS1U_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1N_EEEEDaS1W_ENUlvE_clEvEUlSB_E_ZNSB_17then_wrapped_nrvoINS9_IvEES23_EENS_8futurizeIS1W_E4typeES1E_EUlOS3_RS23_ONS_12future_stateISA_EEE_SA_E15run_and_disposeEv (scylla + 0x1559638)
#3  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#4  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 5.5.0~dev-20240105.7e84e03f5231 with build-id f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

longevity-tls-50gb-3d-master-db-node-5329f695-9 (54.155.147.60 | 10.4.10.205) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-8 (3.250.158.2 | 10.4.11.205) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-7 (18.200.236.161 | 10.4.9.127) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-6 (34.245.194.88 | 10.4.9.178) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-5 (3.255.183.111 | 10.4.9.98) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-4 (18.201.16.110 | 10.4.9.5) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-3 (3.249.178.182 | 10.4.9.246) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-2 (18.201.13.191 | 10.4.11.217) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-10 (3.253.72.9 | 10.4.9.41) (shards: 14)
longevity-tls-50gb-3d-master-db-node-5329f695-1 (3.254.197.193 | 10.4.8.89) (shards: 14)

OS / Image: ami-077f3a25a749656b7 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 5329f695-3131-4153-a22e-a2bce1a8af32
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):

longevity-50GB-3days-authorization-and-tls-ssl.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 5329f695-3131-4153-a22e-a2bce1a8af32
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 5329f695-3131-4153-a22e-a2bce1a8af32

Logs:

db-cluster-5329f695.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/5329f695-3131-4153-a22e-a2bce1a8af32/20240106_073250/db-cluster-5329f695.tar.gz
sct-runner-events-5329f695.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/5329f695-3131-4153-a22e-a2bce1a8af32/20240106_073250/sct-runner-events-5329f695.tar.gz
sct-5329f695.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/5329f695-3131-4153-a22e-a2bce1a8af32/20240106_073250/sct-5329f695.log.tar.gz
loader-set-5329f695.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/5329f695-3131-4153-a22e-a2bce1a8af32/20240106_073250/loader-set-5329f695.tar.gz
monitor-set-5329f695.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/5329f695-3131-4153-a22e-a2bce1a8af32/20240106_073250/monitor-set-5329f695.tar.gz

Jenkins job URL
Argus

The text was updated successfully, but these errors were encountered:

fruch · 2024-01-07T09:50:28Z

last week the test was passing that nemesis with success,

those are the changes merged in scylla since:

🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy

Sound it might be related to 8ba0dec, but I'll let @bhalevy comment on that

mykaul · 2024-01-07T11:55:56Z

Decoded:

2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Backtrace:2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:826
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:856
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:868
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4062
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4038
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:4034
/data/scylla-s3-reloc.cache/by-build-id/f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7026fe8c129a523e07856d7c96306663ceab6e24, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
locator::token_metadata_impl::get_endpoint_for_host_id(utils::tagged_uuid<locator::host_id_tag>) const at ./locator/token_metadata.cc:550
 (inlined by) locator::token_metadata::get_endpoint_for_host_id(utils::tagged_uuid<locator::host_id_tag>) const at ./locator/token_metadata.cc:975
operator() at ./service/storage_service.cc:6305
 (inlined by) seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>::direct_vtable_for<service::storage_service::update_topology_change_info(seastar::lw_shared_ptr<locator::token_metadata>, seastar::basic_sstring<char, unsigned int, 15u, true>)::$_0>::call(seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)> const*, utils::tagged_uuid<locator::host_id_tag>) at ././seastar/include/seastar/util/noncopyable_function.hh:129
seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>::operator()(utils::tagged_uuid<locator::host_id_tag>) const at ././seastar/include/seastar/util/noncopyable_function.hh:215
 (inlined by) locator::token_metadata_impl::update_topology_change_info(seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>&) at ./locator/token_metadata.cc:753
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2666
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3129
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3305
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3188
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:670
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2081
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

Which reminds me #14974 ?

fruch · 2024-01-07T12:03:47Z

what happened to https://backtrace.scylladb.com/ ?

bhalevy · 2024-01-07T12:08:41Z

what happened to https://backtrace.scylladb.com/ ?

it doesn't support https, only http.

bhalevy · 2024-01-07T12:19:21Z

last week the test was passing that nemesis with success,

those are the changes merged in scylla since:

🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy

Sound it might be related to 8ba0dec, but I'll let @bhalevy comment on that

The internal error was added by @gusev-p in 5a1418f
as part of 26cbd28.

@gusev-p, with _raft_topology_change_enabled, we handle the case where the host id is not found in

scylladb/service/storage_service.cc

Lines 6295 to 6302 in 7e84e03

    
           const auto* node = _topology_state_machine._topology.find(server_id); 
        
           if (node) { 
        
               return locator::endpoint_dc_rack { 
        
                   .dc = node->second.datacenter, 
        
                   .rack = node->second.rack, 
        
               }; 
        
           } 
        
           return std::nullopt;

so without _raft_topology_change_enabled, shouldn't we use get_endpoint_for_host_id_if_known instead of get_endpoint_for_host_id here?

scylladb/service/storage_service.cc

Line 6305 in 7e84e03

return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));

gusev-p · 2024-01-07T13:05:44Z

last week the test was passing that nemesis with success,
those are the changes merged in scylla since:
🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy
Sound it might be related to 8ba0dec, but I'll let @bhalevy comment on that
The internal error was added by @gusev-p in 5a1418f as part of 26cbd28.

@gusev-p, with _raft_topology_change_enabled, we handle the case where the host id is not found in

scylladb/service/storage_service.cc

Lines 6295 to 6302 in 7e84e03

const auto* node = _topology_state_machine._topology.find(server_id);

if (node) {

return locator::endpoint_dc_rack {

.dc = node->second.datacenter,

.rack = node->second.rack,

};

}

return std::nullopt;

so without _raft_topology_change_enabled, shouldn't we use get_endpoint_for_host_id_if_known instead of get_endpoint_for_host_id here?

scylladb/service/storage_service.cc

Line 6305 in 7e84e03

return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));

That's depressing that we get the real CI feedback only from longevity tests month after the changes were merged( Neither test.py, nor dtests caught that.

Regarding the code, we discussed this particular line in PR review, but in the utter crap that this github UI is I now barely can find anything. The upshot: we relied on the token_metadata to know the IPs of the nodes it manages in gossiper topology mode. Before the changes the IPs themselves were used to identify the nodes, meaning we knew the IPs each time update_topology_change_info was called. And the refactoring itself strived to maintain the same workflow for gossiper mode, meaning each time update_topology_change_info is called IPs should be known. Obviously, there is a flaw in this reasoning. We relied on dtests to check wether is's true or not, and it doesn't work in this way(

with _raft_topology_change_enabled, we handle the case where the host id is not found in

It's not exactly the same case. In _raft_topology_change_enabled mode, we check whether the entire node exists in _topology_state_machine._topology or not, and in gossiper mode we ask the token_metadata itself for an IP for the given id. The equivalent of the raft if is in get_dc_rack_for function - it returns std::nullopt if it can't find IP in gossiper.

Probably I should dig into the scenario of this longevity test and figure out what exactly was broken by my refactoring.

bhalevy · 2024-01-07T13:31:53Z

How about the following fix that passes a node& from to the dc_rack_fn so it can use either the host_id or endpoint,
as the latter should be node in the token_metadata topology:

diff --git a/locator/token_metadata.cc b/locator/token_metadata.cc
index 9f72708e12..1a629886ed 100644
--- a/locator/token_metadata.cc
+++ b/locator/token_metadata.cc
@@ -750,7 +750,8 @@ future<> token_metadata_impl::update_topology_change_info(dc_rack_fn& get_dc_rac
         }
         // apply new_normal_tokens
         for (auto& [endpoint, tokens]: new_normal_tokens) {
-            target_token_metadata->update_topology(endpoint, get_dc_rack(endpoint), node::state::normal);
+            auto* node = _topology.find_node(endpoint);
+            target_token_metadata->update_topology(endpoint, get_dc_rack(*node), node::state::normal);
             co_await target_token_metadata->update_normal_tokens(std::move(tokens), endpoint);
         }
         // apply leaving endpoints
diff --git a/locator/token_metadata.hh b/locator/token_metadata.hh
index b798b47ab0..5982718f57 100644
--- a/locator/token_metadata.hh
+++ b/locator/token_metadata.hh
@@ -74,6 +74,8 @@ struct host_id_or_endpoint {
 class token_metadata_impl;
 struct topology_change_info;
 
+using dc_rack_fn = seastar::noncopyable_function<std::optional<endpoint_dc_rack>(const locator::node&)>;
+
 class token_metadata final {
     std::unique_ptr<token_metadata_impl> _impl;
 private:
diff --git a/locator/types.hh b/locator/types.hh
index 3f2783f3fe..ceb672b8f2 100644
--- a/locator/types.hh
+++ b/locator/types.hh
@@ -31,6 +31,4 @@ struct endpoint_dc_rack {
     bool operator==(const endpoint_dc_rack&) const = default;
 };
 
-using dc_rack_fn = seastar::noncopyable_function<std::optional<endpoint_dc_rack>(host_id)>;
-
 } // namespace locator
diff --git a/service/storage_service.cc b/service/storage_service.cc
index 076c458ce3..5b205ce162 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -6289,9 +6289,9 @@ future<> storage_service::update_topology_change_info(mutable_token_metadata_ptr
     assert(this_shard_id() == 0);
 
     try {
-        locator::dc_rack_fn get_dc_rack_by_host_id([this, &tm = *tmptr] (locator::host_id host_id) -> std::optional<locator::endpoint_dc_rack> {
+        locator::dc_rack_fn get_dc_rack_by_host_id([this] (const locator::node& n) -> std::optional<locator::endpoint_dc_rack> {
             if (_raft_topology_change_enabled) {
-                const auto server_id = raft::server_id(host_id.uuid());
+                const auto server_id = raft::server_id(n.host_id().uuid());
                 const auto* node = _topology_state_machine._topology.find(server_id);
                 if (node) {
                     return locator::endpoint_dc_rack {
@@ -6302,7 +6302,7 @@ future<> storage_service::update_topology_change_info(mutable_token_metadata_ptr
                 return std::nullopt;
             }
 
-            return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));
+            return get_dc_rack_for(n.endpoint());
         });
         co_await tmptr->update_topology_change_info(get_dc_rack_by_host_id);
     } catch (...) {
diff --git a/test/boost/token_metadata_test.cc b/test/boost/token_metadata_test.cc
index 29317ae07d..71f36c987d 100644
--- a/test/boost/token_metadata_test.cc
+++ b/test/boost/token_metadata_test.cc
@@ -21,13 +21,17 @@ namespace {
         return host_id{utils::UUID(0, id)};
     }
 
-    endpoint_dc_rack get_dc_rack(host_id) {
+    endpoint_dc_rack unknown_dc_rack() {
         return {
             .dc = "unk-dc",
             .rack = "unk-rack"
         };
     }
 
+    endpoint_dc_rack get_dc_rack(locator::host_id) {
+        return unknown_dc_rack();
+    }
+
     mutable_token_metadata_ptr create_token_metadata(host_id this_host_id) {
         return make_lw_shared<token_metadata>(token_metadata::config {
             topology::config {
@@ -39,7 +43,9 @@ namespace {
 
     template <typename Strategy>
     mutable_vnode_erm_ptr create_erm(mutable_token_metadata_ptr tmptr, replication_strategy_config_options opts = {}) {
-        dc_rack_fn get_dc_rack_fn = get_dc_rack;
+        dc_rack_fn get_dc_rack_fn = [] (const locator::node&) {
+            return unknown_dc_rack();
+        };
         tmptr->update_topology_change_info(get_dc_rack_fn).get();
         auto strategy = seastar::make_shared<Strategy>(replication_strategy_params(opts, std::nullopt));
         return calculate_effective_replication_map(std::move(strategy), tmptr).get0();

mykaul · 2024-01-07T14:03:39Z

that's depressing that we get the real CI feedback only from longevity tests month after the changes were merged( Neither test.py, nor dtests caught that.

@gusev-p - that's a legitimate feedback - can you follow up on how missed this in either/both test suites?

gusev-p · 2024-01-07T16:15:11Z

How about the following fix that passes a node& from to the dc_rack_fn so it can use either the host_id or endpoint,
as the latter should be node in the token_metadata topology:

This won't help much, in our case endpoint() will be empty and the effect is the same as get_endpoint_for_host_id_if_known(host_id).value_or(inet_address{}). We'll discuss on a daily tomorrow how to handle this.

kbr-scylla · 2024-01-08T11:46:00Z

Neither test.py, nor dtests caught that.

test.py are running mostly in raft-topology mode now (except only a few specific test cases).

dtests dunno. The issue is most likely a timing race (as most with gossiper are) and perhaps the larger a cluster is, the easier it is to reproduce; and in dtests we don't test such large clusters. Or (more likely I think) it's because longevity test is running on a real distributed cluster (multiple machines) and network latencies are needed to reproduce this. Hmm... could it be that nodes are not getting gossip messages in time?

kbr-scylla · 2024-01-08T11:50:55Z

Regarding the code, we discussed this particular line in #15903, but in the utter crap that this github UI is I now barely can find anything.

You probably mean this
#15903 (comment)

kbr-scylla · 2024-01-08T13:19:57Z

The node that was decommissioning was actually node-10.

The crashes happened while the node was announcing that it left the ring

Jan 06 06:57:56.455247 longevity-tls-50gb-3d-master-db-node-5329f695-10 scylla[6090]:  [shard  0:strm] storage_service - Announcing that I have left the ring for 30000ms
...
Jan 06 06:58:26.455296 longevity-tls-50gb-3d-master-db-node-5329f695-10 scylla[6090]:  [shard  0:strm] storage_service - decommission[e7d77f31-f33f-4d42-bd03-0aa4f2f11818]: left token ring

the aborts happened in this time period, e.g. node-9:

Jan 06 06:57:58.547864 longevity-tls-50gb-3d-master-db-node-5329f695-9 scylla[6086]:  [shard  0: gms] token_metadata - endpoint for host_id 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is not found, at: 0x611fd1e 0x6120330 0x6120618 0x5bdffa7 0x3ef709a 0x4120f29 0x3f16cd1 0x13a30fa 0x5c1cf9f 0x5c1e287 0x5c1d5e8 0x5bafbc7 0x5baed7c 0x1336d79 0x13387d0 0x13353fc /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x1332ca4

the host ID they're trying to map (5fa31aad-4354-48ff-a3ae-ccdafa5f92ff) is of the decommissioning node.

kbr-scylla · 2024-01-08T13:56:16Z

Hmm

        // apply new_normal_tokens
        for (auto& [endpoint, tokens]: new_normal_tokens) {
            target_token_metadata->update_topology(endpoint, get_dc_rack(endpoint), node::state::normal);
            co_await target_token_metadata->update_normal_tokens(std::move(tokens), endpoint);
        }
        // apply leaving endpoints
        for (const auto& endpoint: _leaving_endpoints) {
            target_token_metadata->remove_endpoint(endpoint);
        }

The crash is happening when trying to map IPs of endpoints in new_normal_tokens.

Curiously, this also includes leaving endpoints if there are any -- those are being removed in the lines below, after we attempted to map their IPs.

IIUC we could modify this code so we don't need mappings for leaving endpoints -- after all we're adding them and immediately removing them from target_token_metadata, so we're just introducing redundant intermediate stage which seems to be the only place to use the mappings.

Still the question remains, why do we sometimes have the mappings and sometimes not.

kbr-scylla · 2024-01-08T13:58:59Z

Wait, I might've misunderstood. Leaving endpoints should not be part of new_normal_tokens

kbr-scylla · 2024-01-08T14:03:08Z

I think this is a clue. Our decommissioning node has replaced another node before:

Jan 06 06:56:21.981466 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] storage_service - handle_state_normal: Nodes 10.4.9.127 and 10.4.9.41 have the same token -1916910330658437065. Ignoring 10.4.9.127
Jan 06 06:56:21.981505 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] storage_service - handle_state_normal: endpoints_to_remove endpoint=10.4.9.127
Jan 06 06:56:22.125801 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] gossip - Removed endpoint 10.4.9.127

(this log is from a node which did not crash)

Perhaps on nodes where the crash happened, the state of old node was still lingering and somehow messed everything up.

BTW. on node-4 which did not crash, we can see 10.4.9.127 state being removed over and over again. This looks like #14991

kbr-scylla · 2024-01-08T14:06:03Z

So... our culprit is still inside _replacing_endpoints? And that's why it is included in new_normal_tokens?

Just gossiper things.

gusev-p · 2024-01-08T19:35:07Z

10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is the node which first replaced some other node and then was decommissioned.

failed node logs (node 3):

Jan 06 06:29:46.003906 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0: gms] gossip - InetAddress 10.4.9.41 is now UP, status = UNKNOWN

    Jan 06 06:31:35.729212 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Added replacing_node=10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff to replace existing_node=10.4.9.127/94ecd347-1d25-433e-84e1-cb1333917bca, coordinator=10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff

    Jan 06 06:31:35.729231 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] token_metadata - Added node 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as pending replacing endpoint which replaces existing node 94ecd347-1d25-433e-84e1-cb1333917bca

    Jan 06 06:33:46.797641 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Marked ops done from coordinator=10.4.9.41
        at exactly the same time the same 'Marked ops done' happens on a healthy node,
but there it's immediately followed by 'handle_state_normal', which is missing here
for some reason.

healthy node (node2):

Jan 06 06:29:46.006827 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0: gms] gossip - InetAddress 10.4.9.41 is now UP, status = UNKNOWN

    Jan 06 06:31:35.731959 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0:strm] token_metadata - Added node 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as pending replacing endpoint which replaces existing node 94ecd347-1d25-433e-84e1-cb1333917bca

    Jan 06 06:33:46.797671 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Marked ops done from coordinator=10.4.9.41
Jan 06 06:33:48.243623 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0: gms] storage_service - handle_state_normal: remove endpoint=10.4.9.127 token=-1916910330658437065
     we see the 'handle_state_normal' which removes the old node (since its tokens
are now owned by the new node), 'remove_endpoint' is called for the replaced node,
and this also removes the mapping to the new node from _replacing_endpoints in token_metadata

so handle_state_normal was not called on node 3, there remained a mapping in _replacing_endpoints containing 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as a value, then decommission came for 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff, storage_service::excise was called for it, tmptr->remove_endpoint(*host_id); was called, which removed the node from token_metadata.topology, but not from _replacing_endpoints. Then update_topology_change_info caused the crash.

This sequence of events is similar to this one in that token_metadata_impl::remove_endpoint removes from _replacing_endpoints only by key.

So, the upshot:

I think we need to make token_metadata_impl::remove_endpoint to remove _replacing_endpoints by both the key and the value;
It's unclear why handle_state_normal wasn't called after Marked ops done on the failed nodes. Maybe this code in gossiper placed it into quarantine?

// check for dead state removal
auto expire_time = get_expire_time_for_endpoint(endpoint);
const auto host_id = get_host_id(endpoint);
if (!is_alive && (now > expire_time)
    && (!get_token_metadata_ptr()->is_normal_token_owner(host_id))) {
    logger.debug("time is expiring for endpoint : {} ({})", endpoint, expire_time.time_since_epoch().count());
    co_await evict_from_membership(endpoint, pid);
}

@kbr-scylla @bhalevy

kbr-scylla · 2024-01-09T14:31:55Z

SCT actually reported a bunch of errors pointing to the discrepancies, for example:

2024-01-06 06:42:12.150: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=1e74743f-689f-494f-91ec-7dcfe98251da during_nemesis=RunUniqueSequence: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). The node Node longevity-tls-50gb-3d-master-db-node-5329f695-10 [3.253.72.9 | 10.4.9.41] (seed: True) exists in the gossip but doesn't exist in the nodetool.status
2024-01-06 06:42:12.213: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=ce1b79c8-f0ba-47fa-98d3-3bbc2031a3ae during_nemesis=RunUniqueSequence: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Wrong node status. Node Node longevity-tls-50gb-3d-master-db-node-5329f695-2 [18.201.13.191 | 10.4.11.217] (seed: True) status in nodetool.status is UN, but status in gossip shutdown
2024-01-06 06:42:12.231: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=4131652e-5301-4ff1-bfa4-543b5536293e during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Node Node longevity-tls-50gb-3d-master-db-node-5329f695-1 [3.254.197.193 | 10.4.8.89] (seed: True) (not target node) exists in the nodetool.status but missed in gossip.
2024-01-06 06:42:12.253: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=ad214989-862b-42ab-8289-82ea5e4f8a2c during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Node Node longevity-tls-50gb-3d-master-db-node-5329f695-10 [3.253.72.9 | 10.4.9.41] (seed: True) (not target node) exists in the gossip but missed in SYSTEM.PEERS.
2024-01-06 06:42:12.276: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=7ac69ffa-ee55-4110-bc7a-e9676c9b5105 during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Nodes 10.4.8.89 exists in the SYSTEM.PEERS but missed in gossip.

(this is from sct-runner-events, events.log)

kbr-scylla · 2024-01-09T15:38:04Z

Attempts at creating a fast local reproducer failed.

It looks like we'll have to run longevity with custom builds with more logging, or perhaps enable more logging on master.

The logging in gossiper/storage_service and handle_state_normal is very inconsistent. The decisions which logs to put on INFO level and which on DEBUG seemed to be done randomly, and most cases are not covered by a single INFO log.

Apparently SCT provides nodetool gossipinfo output between operations. We have output from between the replace and decommission.
Node 2 (healthy):

< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG > /10.4.9.41
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   generation:1704522584
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   heartbeat:1391
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X9:org.apache.cassandra.locator.Ec2Snitch
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.9.41
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   HOST_ID:5fa31aad-4354-48ff-a3ae-ccdafa5f92ff
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   SCHEMA:a6725fa6-ac53-11ee-780b-aa82ae73c005
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X3:3
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   LOAD:10326500191
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   STATUS:NORMAL,5583842826136721579
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X4:1
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X8:v2;1704520650893;707d8d86-a7d5-4182-bb48-d94139848208
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X6:14
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X7:12
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X2:mview.users:0.947625;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_permissions:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;system_traces.events:0.000000;mview.users_by_first_name:0.130302;system_traces.node_slow_log_time_idx:0.000000;system_auth.roles:0.999807;system_traces.node_slow_log:0.000000;mview.users_by_last_name:0.155384;system_distributed.service_levels:1.000000;system_distributed.view_build_status:0.000000;keyspace1.standard1:0.092051;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X5:1076:877238681:1704523174210

node 3 (crashed):

< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG > /10.4.9.41
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   generation:1704522584
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   heartbeat:1433
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.9.41
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   SCHEMA:a6725fa6-ac53-11ee-780b-aa82ae73c005
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   HOST_ID:5fa31aad-4354-48ff-a3ae-ccdafa5f92ff
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   LOAD:10326500191
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X8:v2;1704520650893;707d8d86-a7d5-4182-bb48-d94139848208
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X5:522:877238681:1704523188211
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X2:mview.users:0.973837;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_permissions:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;system_traces.events:0.000000;mview.users_by_first_name:0.206648;system_traces.node_slow_log_time_idx:0.000000;system_auth.roles:0.999974;system_traces.node_slow_log:0.000000;mview.users_by_last_name:0.243549;system_distributed.service_levels:1.000000;system_distributed.view_build_status:0.000000;keyspace1.standard1:0.114186;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X6:14
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X9:org.apache.cassandra.locator.Ec2Snitch
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X4:1
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X3:3
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X7:12
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8

We see that node 3 is getting gossip updates regularly -- the heartbeat is newer than heartbeat in node 2's output (because nodetool gossipinfo was called on node 3 after node 2). And we see that status of node 10 is NORMAL. So node 3 did get the new status -- but for some reason

either handle_state_normal wasn't called -- maybe because gossiper is deadlocked somewhere?
or it was called, but entered one of the other branches which didn't result in any INFO log

Note that gossiper deadlocking and not calling handlers wouldn't be the first time...

I need to send a PR with more logging:

to cover each case of handle_state_normal with at least one INFO level log (this is easy)
and somehow to detect gossiper deadlock if it happened (this is hard)

In a longevity test reported in scylladb#16668 we observed that NORMAL state is not being properly handled for a node that replaced another node. Either handle_state_normal is not being called, or it is but getting stuck in the middle. Which is the case couldn't be determined from the logs, and attempts at creating a local reproducer failed. Improve the INFO level logging in handle_state_normal to aid debugging in the future. The amount of logs is still constant per-node. Even though some log messages report all tokens owned by a node, handle_state_normal calls are still rare. The most "spammy" situation is when a node starts and calls handle_state_normal for every other node in the cluster, but it is a once-per-startup event.

In a longevity test reported in scylladb#16668 we observed that NORMAL state is not being properly handled for a node that replaced another node. Either handle_state_normal is not being called, or it is but getting stuck in the middle. Which is the case couldn't be determined from the logs, and attempts at creating a local reproducer failed. One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt with gossiper deadlocks in the past (e.g. scylladb#7127). Modify the code so it reports an error if `lock_endpoint` waits for the lock for more than a minute. When the issue reproduces again in longevity, we will see if `lock_endpoint` got stuck.

bhalevy · 2024-01-19T07:28:52Z

I think we need to make token_metadata_impl::remove_endpoint to remove _replacing_endpoints by both the key and the value;

I agree.
@gusev-p See #16731

kbr-scylla · 2024-01-24T13:47:18Z

@bhalevy I'm worried that #16731 will prevent this failure from reproducing, masking the root cause of the issue.

The root cause here is that nodes in the cluster never learned that the replacing node transitioned to NORMAL.

We don't have any other known test to catch that problem, except this longevity one.

That's why we should aim to reproduce it first, with more logs, try to find the root cause, before we merge #16731.

bhalevy · 2024-01-24T20:41:02Z

@bhalevy I'm worried that #16731 will prevent this failure from reproducing, masking the root cause of the issue.

The root cause here is that nodes in the cluster never learned that the replacing node transitioned to NORMAL.

We don't have any other known test to catch that problem, except this longevity one.

That's why we should aim to reproduce it first, with more logs, try to find the root cause, before we merge #16731.

ok. makes sense

kbr-scylla · 2024-02-02T15:47:37Z

And we see that status of node 10 is NORMAL. So node 3 did get the new status -- but for some reason

Argh how did I not see this -- STATUS for node 10 is missing from gossipinfo on the node which crashed (node 3)... even though node 3's endpoint_state is newer in this output compared to healthy node 2's status (the generation, heartbeat pair is greater)

bhalevy · 2024-03-12T14:57:08Z

Seen in https://argus.scylladb.com/test/da141da4-1427-469f-a136-a7db8a47d5fa/runs?additionalRuns[]=22943c8d-3b16-40a6-b047-c1dd133fc26f

kbr-scylla · 2024-03-12T15:08:43Z

Seen in https://argus.scylladb.com/test/da141da4-1427-469f-a136-a7db8a47d5fa/runs?additionalRuns[]=22943c8d-3b16-40a6-b047-c1dd133fc26f

That was before fd32e2e

Start time: 2024-02-04 18:21:52
End time: 2024-02-04 20:02:24
Scylla version: 5.5.0~dev-20240202.52e6398ad64d
Build id: 59019e20ed174f607214637907e85c7e293b26af

mykaul · 2024-04-03T08:46:34Z

What's the latest on this one?

kbr-scylla · 2024-04-03T09:08:12Z

Missing STATUS=NORMAL update was recently spotted on CI
#18118 (comment)

Which means the issue is likely still present, they have the same root cause.

kbr-scylla · 2024-04-04T11:40:55Z

Found the root cause.
Will send fix soon.

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (now including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent ~100% reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: scylladb#15393 Fixes: scylladb#15602 Fixes: scylladb#16668 Fixes: scylladb#16902 Fixes: scylladb#17493 Fixes: scylladb#18118 Fixes: scylladb/scylla-enterprise#3720

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (now including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: scylladb#15393 Fixes: scylladb#15602 Fixes: scylladb#16668 Fixes: scylladb#16902 Fixes: scylladb#17493 Fixes: scylladb#18118 Fixes: scylladb/scylla-enterprise#3720

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: scylladb#15393 Fixes: scylladb#15602 Fixes: scylladb#16668 Fixes: scylladb#16902 Fixes: scylladb#17493 Fixes: scylladb#18118 Fixes: scylladb/scylla-enterprise#3720

…amil Braun In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in #16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in #15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. The PR also adds a regression test. Fixes: #15393 Fixes: #15602 Fixes: #16668 Fixes: #16902 Fixes: #17493 Fixes: #18118 Ref: scylladb/scylla-enterprise#3720 Closes #18184 * github.com:scylladb/scylladb: test: reproducer for missing gossiper updates gossiper: lock local endpoint when updating heart_beat

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in #16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in #15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: #15393 Fixes: #15602 Fixes: #16668 Fixes: #16902 Fixes: #17493 Fixes: #18118 Ref: scylladb/scylla-enterprise#3720 (cherry picked from commit a0b331b)

…rt_beat' from ScyllaDB In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in #16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in #15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. The PR also adds a regression test. Fixes: #15393 Fixes: #15602 Fixes: #16668 Fixes: #16902 Fixes: #17493 Fixes: #18118 Ref: scylladb/scylla-enterprise#3720 (cherry picked from commit a0b331b) (cherry picked from commit 7295509) Refs #18184 Closes #18245 * github.com:scylladb/scylladb: test: reproducer for missing gossiper updates gossiper: lock local endpoint when updating heart_beat

In a longevity test reported in scylladb#16668 we observed that NORMAL state is not being properly handled for a node that replaced another node. Either handle_state_normal is not being called, or it is but getting stuck in the middle. Which is the case couldn't be determined from the logs, and attempts at creating a local reproducer failed. Improve the INFO level logging in handle_state_normal to aid debugging in the future. The amount of logs is still constant per-node. Even though some log messages report all tokens owned by a node, handle_state_normal calls are still rare. The most "spammy" situation is when a node starts and calls handle_state_normal for every other node in the cluster, but it is a once-per-startup event.

In a longevity test reported in scylladb#16668 we observed that NORMAL state is not being properly handled for a node that replaced another node. Either handle_state_normal is not being called, or it is but getting stuck in the middle. Which is the case couldn't be determined from the logs, and attempts at creating a local reproducer failed. One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt with gossiper deadlocks in the past (e.g. scylladb#7127). Modify the code so it reports an error if `lock_endpoint` waits for the lock for more than a minute. When the issue reproduces again in longevity, we will see if `lock_endpoint` got stuck.

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = *ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. *During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in 38c2347, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: scylladb#15393 Fixes: scylladb#15602 Fixes: scylladb#16668 Fixes: scylladb#16902 Fixes: scylladb#17493 Fixes: scylladb#18118 Ref: scylladb/scylla-enterprise#3720

fruch added the triage/master Looking for assignee label Jan 7, 2024

mykaul added this to the 6.0 milestone Jan 7, 2024

mykaul changed the title ~~Multiple node core dump during decommission operation of other node~~ Multiple node core dump during decommission operation of other node (conversion to host ID related) Jan 7, 2024

mykaul assigned gusev-p Jan 8, 2024

mykaul added status/release blocker Preventing from a release to be promoted and removed triage/master Looking for assignee labels Jan 8, 2024

kbr-scylla assigned kostja and kbr-scylla Jan 8, 2024

kbr-scylla mentioned this issue Jan 11, 2024

Add more logging for gossiper::lock_endpoint and storage_service::handle_state_normal #16733

Merged

kbr-scylla mentioned this issue Feb 2, 2024

New added node remained in UJ status on the part of nodes. Host ID is NULL #16902

Closed

2 tasks

mykaul mentioned this issue Feb 7, 2024

node decommission after ip was changed failed with coredump #17199

Closed

This was referenced Feb 22, 2024

Tests / Sanity Tests / replace_address_test.TestReplaceAddress.test_replace_shutdown_node[use_host_id-rbo_disabled] #15602

Closed

Missing gossiper updates #17493

Closed

kbr-scylla unassigned gusev-p Mar 29, 2024

kbr-scylla mentioned this issue Apr 4, 2024

gossiper: lock local endpoint when updating heart_beat #18184

Merged

scylladb-promoter closed this as completed in a0b331b Apr 16, 2024

scylladb-promoter closed this as completed in #18184 Apr 16, 2024

scylladb-promoter added the Backport candidate label Apr 16, 2024

mergify bot mentioned this issue Apr 16, 2024

[Backport 5.4] gossiper: lock local endpoint when updating heart_beat #18245

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

fruch commented Jan 7, 2024

Logs:

fruch commented Jan 7, 2024

mykaul commented Jan 7, 2024 •

edited

fruch commented Jan 7, 2024

bhalevy commented Jan 7, 2024

bhalevy commented Jan 7, 2024 •

edited

gusev-p commented Jan 7, 2024

bhalevy commented Jan 7, 2024

mykaul commented Jan 7, 2024

gusev-p commented Jan 7, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

gusev-p commented Jan 8, 2024 •

edited

kbr-scylla commented Jan 9, 2024

kbr-scylla commented Jan 9, 2024

bhalevy commented Jan 19, 2024

kbr-scylla commented Jan 24, 2024

bhalevy commented Jan 24, 2024

kbr-scylla commented Feb 2, 2024

bhalevy commented Mar 12, 2024

kbr-scylla commented Mar 12, 2024

mykaul commented Apr 3, 2024

kbr-scylla commented Apr 3, 2024

kbr-scylla commented Apr 4, 2024

Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

Comments

fruch commented Jan 7, 2024

Issue description

Impact

How frequently does it reproduce?

Installation details

Logs:

fruch commented Jan 7, 2024

mykaul commented Jan 7, 2024 • edited

fruch commented Jan 7, 2024

bhalevy commented Jan 7, 2024

bhalevy commented Jan 7, 2024 • edited

gusev-p commented Jan 7, 2024

bhalevy commented Jan 7, 2024

mykaul commented Jan 7, 2024

gusev-p commented Jan 7, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

kbr-scylla commented Jan 8, 2024

gusev-p commented Jan 8, 2024 • edited

kbr-scylla commented Jan 9, 2024

kbr-scylla commented Jan 9, 2024

bhalevy commented Jan 19, 2024

kbr-scylla commented Jan 24, 2024

bhalevy commented Jan 24, 2024

kbr-scylla commented Feb 2, 2024

bhalevy commented Mar 12, 2024

kbr-scylla commented Mar 12, 2024

mykaul commented Apr 3, 2024

kbr-scylla commented Apr 3, 2024

kbr-scylla commented Apr 4, 2024

mykaul commented Jan 7, 2024 •

edited

bhalevy commented Jan 7, 2024 •

edited

gusev-p commented Jan 8, 2024 •

edited