Multiple node core dump during decommission operation of other node (conversion to host ID related) #16668

Closed
1 of 2 tasks
Tracked by #17493
fruch opened this issue Jan 7, 2024 · 32 comments · Fixed by #18184
Labels: Backport candidate, status/release blocker (preventing a release from being promoted)


fruch commented Jan 7, 2024

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While decommissioning node-9, multiple nodes (node-3, node-5) dumped core with the following error,
failing the whole load (since 2 nodes were down) and also failing the decommission:

2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3      !ERR | scylla[6166]:  [shard  0: gms] token_metadata - endpoint for host_id 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is not found, at: 0x611fd1e 0x6120330 0x6120618 0x5bdffa7 0x3ef709a 0x4120f29 0x3f16cd1 0x13a30fa 0x5c1cf9f 0x5c1e287 0x5c1d5e8 0x5bafbc7 0x5baed7c 0x1336d79 0x13387d0 0x13353fc /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x1332ca4
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(auto:1&&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::coroutine::parallel_for_each<gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0>
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Aborting on shard 0.
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Backtrace:
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c0b3e8
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c41c91
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x3dbaf
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x8e883
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x3dafd
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x2687e
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5be0027
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x3ef709a
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x4120f29
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x3f16cd1
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13a30fa
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1cf9f
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1e287
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5c1d5e8
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5bafbc7
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x5baed7c
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x1336d79
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13387d0
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x13353fc
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x27b89
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   /opt/scylladb/libreloc/libc.so.6+0x27c4a
2024-01-06T06:57:57.862+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:   0x1332ca4
2024-01-06 07:16:27.907 <2024-01-06 06:57:57.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=2425c925-c2c6-48f3-b447-0be1717d2876 node=Node longevity-tls-50gb-3d-master-db-node-5329f695-3 [3.249.178.182 | 10.4.9.246] (seed: True)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz
backtrace=           PID: 6166 (scylla)
UID: 112 (scylla)
GID: 118 (scylla)
Signal: 6 (ABRT)
Timestamp: Sat 2024-01-06 06:57:57 UTC (1min 48s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 25 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
Unit: scylla-server.service
Slice: scylla-server.slice
Boot ID: 5991a193a26741db95420f046b1e1093
Machine ID: 06b5d0f9b5a84c899486186dc641d921
Hostname: longevity-tls-50gb-3d-master-db-node-5329f695-3
Storage: /var/lib/systemd/coredump/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000 (present)
Disk Size: 114.5G
Message: Process 6166 (scylla) of user 112 dumped core.
...
Found module scylla with build-id: f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d
Stack trace of thread 6166:
#0  0x00007f012daea884 __pthread_kill_implementation (libc.so.6 + 0x8e884)
#1  0x00007f012da99afe raise (libc.so.6 + 0x3dafe)
#2  0x00007f012da8287f abort (libc.so.6 + 0x2687f)
#3  0x0000000005be0028 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla + 0x59e0028)
#4  0x0000000003ef709b _ZNK7locator14token_metadata24get_endpoint_for_host_idEN5utils11tagged_uuidINS_11host_id_tagEEE (scylla + 0x3cf709b)
#5  0x0000000004120f2a _ZN7seastar20noncopyable_functionIFSt8optionalIN7locator16endpoint_dc_rackEEN5utils11tagged_uuidINS2_11host_id_tagEEEEE17direct_vtable_forIZN7service15storage_service27update_topology_change_infoENS_13lw_shared_ptrINS2_14token_metadataEEENS_13basic_sstringIcjLj15ELb1EEEE3$_0E4callEPKSA_S8_ (scylla + 0x3f20f2a)
#6  0x0000000003f16cd2 _ZN7locator19token_metadata_impl27update_topology_change_infoERN7seastar20noncopyable_functionIFSt8optionalINS_16endpoint_dc_rackEEN5utils11tagged_uuidINS_11host_id_tagEEEEEE.resume (scylla + 0x3d16cd2)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c1d5e9 _ZN7seastar7reactor3runEv (scylla + 0x5a1d5e9)
#11 0x0000000005bafbc8 _ZN7seastar12app_template14run_deprecatedEiPPcOSt8functionIFvvEE (scylla + 0x59afbc8)
#12 0x0000000005baed7d _ZN7seastar12app_template3runEiPPcOSt8functionIFNS_6futureIiEEvEE (scylla + 0x59aed7d)
#13 0x0000000001336d7a _ZL11scylla_mainiPPc (scylla + 0x1136d7a)
#14 0x00000000013387d1 _ZNKSt8functionIFiiPPcEEclEiS1_ (scylla + 0x11387d1)
#15 0x00000000013353fd main (scylla + 0x11353fd)
#16 0x00007f012da83b8a __libc_start_call_main (libc.so.6 + 0x27b8a)
#17 0x00007f012da83c4b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27c4b)
#18 0x0000000001332ca5 _start (scylla + 0x1132ca5)
Stack trace of thread 6181:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6183:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6186:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6182:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6192:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6167:
#0  0x0000000005c6b560 _ZN7seastar8internal12io_geteventsEmllPNS0_9linux_abi8io_eventEPK8timespecb (scylla + 0x5a6b560)
#1  0x0000000005c66bdf _ZN7seastar19aio_storage_context16reap_completionsEb (scylla + 0x5a66bdf)
#2  0x0000000005c67a42 _ZN7seastar19reactor_backend_aio23reap_kernel_completionsEv (scylla + 0x5a67a42)
#3  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#4  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6190:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6188:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6184:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6168:
#0  0x0000000002a38fb2 _ZNSt8__detail9__variant17__gen_vtable_implINS0_12_Multi_arrayIPFNS0_21__deduce_visit_resultIbEEO18overloaded_functorIJZN4cql34expr13recurse_untilERKNS7_10expressionERKN7seastar20noncopyable_functionIFbSA_EEEE3$_0ZNS7_13recurse_untilESA_SG_E3$_1ZNS7_13recurse_untilESA_SG_E3$_2ZNS7_13recurse_untilESA_SG_E3$_3ZNS7_13recurse_untilESA_SG_E3$_4ZNS7_13recurse_untilESA_SG_E3$_5ZNS7_13recurse_untilESA_SG_E3$_6ZNS7_13recurse_untilESA_SG_E3$_7ZNS7_13recurse_untilESA_SG_E3$_8ZNS7_13recurse_untilESA_SG_E3$_9ZNS7_13recurse_untilESA_SG_E4$_10EERSt7variantIJNS7_11conjunctionENS7_15binary_operatorENS7_12column_valueENS7_21unresolved_identifierENS7_25column_mutation_attributeENS7_13function_callENS7_4castENS7_15field_selectionENS7_13bind_variableENS7_16untyped_constantENS7_8constantENS7_17tuple_constructorENS7_22collection_constructorENS7_20usertype_constructorENS7_9subscriptENS7_9temporaryEEEEJEEESt16integer_sequenceImJLm1EEEE14__visit_invokeEST_S1C_ (scylla + 0x2838fb2)
#1  0x0000000002a0623d _ZN4cql34expr13recurse_untilERKNS0_10expressionERKN7seastar20noncopyable_functionIFbS3_EEE (scylla + 0x280623d)
#2  0x0000000003451fdd _ZNK4cql312restrictions22statement_restrictions24get_partition_key_rangesERKNS_13query_optionsE (scylla + 0x3251fdd)
#3  0x0000000002c359f7 _ZNK4cql310statements16select_statement10do_executeERNS_15query_processorERN7service11query_stateERKNS_13query_optionsE (scylla + 0x2a359f7)
#4  0x0000000002cada48 _ZN7seastar20noncopyable_functionIFNS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE17direct_vtable_forISt7_Mem_fnIMSA_KFS7_SE_SH_SK_EEE4callEPKSM_SC_SE_SH_SK_ (scylla + 0x2aada48)
#5  0x0000000002cadfbd _ZN7seastar20noncopyable_functionIFNS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE17direct_vtable_forIZNS_35inheriting_concrete_execution_stageIS7_JSC_SE_SH_SK_EE20make_stage_for_groupENS_16scheduling_groupEEUlSC_SE_SH_SK_E_E4callEPKSM_SC_SE_SH_SK_ (scylla + 0x2aadfbd)
#6  0x0000000002cadc56 _ZN7seastar24concrete_execution_stageINS_6futureINS_10shared_ptrIN13cql_transport8messages14result_messageEEEEEJPKN4cql310statements16select_statementERNS8_15query_processorERN7service11query_stateERKNS8_13query_optionsEEE8do_flushEv (scylla + 0x2aadc56)
#7  0x0000000005bb7b43 _ZN7seastar11lambda_taskIZNS_15execution_stage5flushEvE3$_0E15run_and_disposeEv (scylla + 0x59b7b43)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6178:
#0  0x00007f012db66b4d syscall (libc.so.6 + 0x10ab4d)
#1  0x0000000005c6b7b1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x5a6b7b1)
#2  0x0000000005c675d3 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x5a675d3)
#3  0x0000000005c67cfd _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x5a67cfd)
#4  0x0000000005c1e57d _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e57d)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6180:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6179:
#0  0x0000000005c2a469 _ZN7seastar3smp11poll_queuesEv (scylla + 0x5a2a469)
#1  0x0000000005c60d1b _ZN7seastar7reactor10smp_pollfn4pollEv (scylla + 0x5a60d1b)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6187:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6177:
#0  0x0000000005bca29c _ZN7seastar6memory24drain_cross_cpu_freelistEv (scylla + 0x59ca29c)
#1  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#2  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#3  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#4  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#5  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#6  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6185:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6169:
#0  0x0000000002083ef1 _ZN8logalloc11region_impl5allocEPK15migrate_fn_typemm (scylla + 0x1e83ef1)
#1  0x0000000001e35833 _ZNK18compact_radix_tree4treeI13cell_and_hashjE9node_head5cloneIRZN3rowC1ERK6schema11column_kindRKS5_E3$_0EESt4pairIPS3_NSt15__exception_ptr13exception_ptrEEOT_j (scylla + 0x1c35833)
#2  0x0000000001dfaddf _ZN13deletable_rowC2ERK6schemaRKS_ (scylla + 0x1bfaddf)
#3  0x0000000001dab503 _ZN18mutation_partitionC1ERK6schemaRKS_ (scylla + 0x1bab503)
#4  0x0000000001efc7f3 _ZN15partition_entry5applyERN8logalloc6regionER16mutation_cleanerRK6schemaRK18mutation_partitionS7_R26mutation_application_stats (scylla + 0x1cfc7f3)
#5  0x0000000001cb6534 _ZN7replica8memtable5applyERK15frozen_mutationRKN7seastar13lw_shared_ptrIK6schemaEEON2db9rp_handleE (scylla + 0x1ab6534)
#6  0x0000000001b7b329 _ZN7replica5table5applyERK15frozen_mutationN7seastar13lw_shared_ptrIK6schemaEEON2db9rp_handleENSt6chrono10time_pointINS4_12lowres_clockENSC_8durationIlSt5ratioILl1ELl1000000000EEEEEE (scylla + 0x197b329)
#7  0x00000000019fd237 _ZN7replica8database8do_applyEN7seastar13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS1_12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000000000EEEEEENS1_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSK_24per_partition_rate_limit12account_onlyENSP_19account_and_enforceEEE (scylla + 0x17fd237)
#8  0x0000000001a93d22 _ZN7seastar20noncopyable_functionIFNS_6futureIvEEPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE17direct_vtable_forISt7_Mem_fnIMS4_FS2_S9_SC_SE_SM_SQ_SW_EEE4callEPKSY_S5_S9_SC_SE_SM_SQ_SW_ (scylla + 0x1893d22)
#9  0x0000000001ab1f5c _ZN7seastar20noncopyable_functionIFNS_6futureIvEEPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE17direct_vtable_forIZNS_35inheriting_concrete_execution_stageIS2_JS5_S9_SC_SE_SM_SQ_SW_EE20make_stage_for_groupENS_16scheduling_groupEEUlS5_S9_SC_SE_SM_SQ_SW_E_E4callEPKSY_S5_S9_SC_SE_SM_SQ_SW_ (scylla + 0x18b1f5c)
#10 0x0000000001ab1b29 _ZN7seastar24concrete_execution_stageINS_6futureIvEEJPN7replica8databaseENS_13lw_shared_ptrIK6schemaEERK15frozen_mutationN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENSF_8durationIlSt5ratioILl1ELl1000000000EEEEEENS_10bool_classIN2db14force_sync_tagEEESt7variantIJSt9monostateNSO_24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEE8do_flushEv (scylla + 0x18b1b29)
#11 0x0000000005bb7b43 _ZN7seastar11lambda_taskIZNS_15execution_stage5flushEvE3$_0E15run_and_disposeEv (scylla + 0x59b7b43)
#12 0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#13 0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#14 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#15 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#16 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#17 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6170:
#0  0x0000000001fd9010 _ZN9row_cache15make_reader_optEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEPK18tombstone_gc_stateN7tracing15trace_state_ptrENS0_10bool_classIN17streamed_mutation14forwarding_tagEEENSL_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1dd9010)
#1  0x0000000001b20cca _ZNK7replica5table14make_reader_v2EN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1920cca)
#2  0x0000000001c04b23 _ZNSt17_Function_handlerIF23flat_mutation_reader_v2N7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEEEZNK7replica5table18as_mutation_sourceEvE3$_0E9_M_invokeERKSt9_Any_dataOS5_OS6_SC_SG_OSI_OSM_OSP_ (scylla + 0x1a04b23)
#3  0x0000000001b83f85 _ZN5query7querierC2ERK15mutation_sourceN7seastar13lw_shared_ptrIK6schemaEE13reader_permit20nonwrapping_intervalIN3dht13ring_positionEENS_15partition_sliceEN7tracing15trace_state_ptrENS_12querier_base14querier_configE (scylla + 0x1983f85)
#4  0x0000000001b8066f _ZN7replica5table5queryEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRKN5query12read_commandENS7_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISG_EEN7tracing15trace_state_ptrERNS7_21result_memory_limiterENSt6chrono10time_pointINS1_12lowres_clockENSP_8durationIlSt5ratioILl1ELl1000000000EEEEEEPSt8optionalINS7_7querierEE (scylla + 0x198066f)
#5  0x0000000001aa9039 _ZN7seastar20noncopyable_functionIFNS_6futureIvEE13reader_permitEE19indirect_vtable_forIZN7replica8database5queryENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandENSD_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISM_EEN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENST_8durationIlSt5ratioILl1ELl1000000000EEEEEESt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENS14_19account_and_enforceEEEE3$_0E4callEPKS5_S3_ (scylla + 0x18a9039)
#6  0x0000000004694fe1 _ZN28reader_concurrency_semaphore14execution_loopEv.resume (scylla + 0x4494fe1)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6176:
#0  0x0000000006003c95 _ZN7seastar8io_queue13poll_io_queueEv (scylla + 0x5e03c95)
#1  0x0000000005c60f29 _ZN7seastar7reactor26io_queue_submission_pollfn4pollEv (scylla + 0x5a60f29)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6193:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6189:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6175:
#0  0x00007ffe309b46e8 n/a (linux-vdso.so.1 + 0x6e8)
#1  0x00007ffe309b480a n/a (linux-vdso.so.1 + 0x80a)
#2  0x00007f012db322fd clock_gettime@@GLIBC_2.17 (libc.so.6 + 0xd62fd)
#3  0x00007f012dd37565 _ZNSt6chrono3_V212steady_clock3nowEv (libstdc++.so.6 + 0xd9565)
#4  0x0000000005c1e2a1 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2a1)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6173:
#0  0x00007f012db66b4d syscall (libc.so.6 + 0x10ab4d)
#1  0x0000000005c67ac6 _ZN7seastar19reactor_backend_aio18kernel_submit_workEv (scylla + 0x5a67ac6)
#2  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#3  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#4  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#5  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#6  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#7  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6174:
#0  0x000000000600a395 _ZN7seastar10fair_queue17dispatch_requestsESt8functionIFvRNS_16fair_queue_entryEEE (scylla + 0x5e0a395)
#1  0x0000000006003ca9 _ZN7seastar8io_queue13poll_io_queueEv (scylla + 0x5e03ca9)
#2  0x0000000005c60f29 _ZN7seastar7reactor26io_queue_submission_pollfn4pollEv (scylla + 0x5a60f29)
#3  0x0000000005c41269 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor6do_runEvE3$_5E9_M_invokeERKSt9_Any_data (scylla + 0x5a41269)
#4  0x0000000005c1e2c6 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e2c6)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6191:
#0  0x00007f012db5d0ea read (libc.so.6 + 0x1010ea)
#1  0x0000000005c659c5 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x5a659c5)
#2  0x0000000005c65cd3 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1ERNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a65cd3)
#3  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#4  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#5  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6172:
#0  0x0000000001fd9284 _ZN9row_cache15make_reader_optEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEPK18tombstone_gc_stateN7tracing15trace_state_ptrENS0_10bool_classIN17streamed_mutation14forwarding_tagEEENSL_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1dd9284)
#1  0x0000000001b20cca _ZNK7replica5table14make_reader_v2EN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEE (scylla + 0x1920cca)
#2  0x0000000001c04b23 _ZNSt17_Function_handlerIF23flat_mutation_reader_v2N7seastar13lw_shared_ptrIK6schemaEE13reader_permitRK20nonwrapping_intervalIN3dht13ring_positionEERKN5query15partition_sliceEN7tracing15trace_state_ptrENS1_10bool_classIN17streamed_mutation14forwarding_tagEEENSJ_IN15mutation_reader30partition_range_forwarding_tagEEEEZNK7replica5table18as_mutation_sourceEvE3$_0E9_M_invokeERKSt9_Any_dataOS5_OS6_SC_SG_OSI_OSM_OSP_ (scylla + 0x1a04b23)
#3  0x0000000001b83f85 _ZN5query7querierC2ERK15mutation_sourceN7seastar13lw_shared_ptrIK6schemaEE13reader_permit20nonwrapping_intervalIN3dht13ring_positionEENS_15partition_sliceEN7tracing15trace_state_ptrENS_12querier_base14querier_configE (scylla + 0x1983f85)
#4  0x0000000001b8066f _ZN7replica5table5queryEN7seastar13lw_shared_ptrIK6schemaEE13reader_permitRKN5query12read_commandENS7_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISG_EEN7tracing15trace_state_ptrERNS7_21result_memory_limiterENSt6chrono10time_pointINS1_12lowres_clockENSP_8durationIlSt5ratioILl1ELl1000000000EEEEEEPSt8optionalINS7_7querierEE (scylla + 0x198066f)
#5  0x0000000001aa9039 _ZN7seastar20noncopyable_functionIFNS_6futureIvEE13reader_permitEE19indirect_vtable_forIZN7replica8database5queryENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandENSD_14result_optionsERKSt6vectorI20nonwrapping_intervalIN3dht13ring_positionEESaISM_EEN7tracing15trace_state_ptrENSt6chrono10time_pointINS_12lowres_clockENST_8durationIlSt5ratioILl1ELl1000000000EEEEEESt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENS14_19account_and_enforceEEEE3$_0E4callEPKS5_S3_ (scylla + 0x18a9039)
#6  0x0000000004694fe1 _ZN28reader_concurrency_semaphore14execution_loopEv.resume (scylla + 0x4494fe1)
#7  0x00000000013a30fb _ZN7seastar8internal21coroutine_traits_baseIvE12promise_type15run_and_disposeEv (scylla + 0x11a30fb)
#8  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#9  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#10 0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#11 0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#12 0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#13 0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
Stack trace of thread 6171:
#0  0x0000000001757c20 _ZN7seastar6futureIvE16handle_exceptionIZZZZZNS_3rpc11recv_helperIN4netw10serializerESt8functionIFNS0_INS3_12no_wait_typeEEERKNS3_11client_infoENS3_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESI_jmNS3_8optionalISt8optionalIN7tracing10trace_infoEEEENSK_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENST_19account_and_enforceEEEEENSK_IN7service13fencing_tokenEEEEES9_JSE_SJ_SI_jmSP_SX_S10_ENS3_19do_want_client_infoENS3_18do_want_time_pointEEEDaNS3_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS3_6server10connectionEEESL_INSt6chrono10time_pointINS_12lowres_clockENS1J_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS3_7rcv_bufEE_clES1I_S1R_lS1S_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1L_EEEEDaS1U_ENUlvE_clEvENUlS9_E_clES9_EUlNSt15__exception_ptr13exception_ptrEE_EES1_OS1U_ (scylla + 0x1557c20)
#1  0x000000000175739f _ZZZZZN7seastar3rpc11recv_helperIN4netw10serializerESt8functionIFNS_6futureINS0_12no_wait_typeEEERKNS0_11client_infoENS0_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESG_jmNS0_8optionalISt8optionalIN7tracing10trace_infoEEEENSI_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENSR_19account_and_enforceEEEEENSI_IN7service13fencing_tokenEEEEES7_JSC_SH_SG_jmSN_SV_SY_ENS0_19do_want_client_infoENS0_18do_want_time_pointEEEDaNS0_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS0_6server10connectionEEESJ_INSt6chrono10time_pointINS_12lowres_clockENS1H_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS0_7rcv_bufEE_clES1G_S1P_lS1Q_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1J_EEEEDaS1S_ENUlvE_clEvENUlS7_E_clES7_ (scylla + 0x155739f)
#2  0x0000000001759638 _ZN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZZZZNS_3rpc11recv_helperIN4netw10serializerESt8functionIFNS_6futureINS4_12no_wait_typeEEERKNS4_11client_infoENS4_14opt_time_pointE15frozen_mutationN5utils12small_vectorIN3gms12inet_addressELm3EEESK_jmNS4_8optionalISt8optionalIN7tracing10trace_infoEEEENSM_ISt7variantIJSt9monostateN2db24per_partition_rate_limit12account_onlyENSV_19account_and_enforceEEEEENSM_IN7service13fencing_tokenEEEEESB_JSG_SL_SK_jmSR_SZ_S12_ENS4_19do_want_client_infoENS4_18do_want_time_pointEEEDaNS4_9signatureIFT1_DpT2_EEEOT0_T3_T4_ENUlNS_10shared_ptrINS4_6server10connectionEEESN_INSt6chrono10time_pointINS_12lowres_clockENS1L_8durationIlSt5ratioILl1ELl1000000000EEEEEEElNS4_7rcv_bufEE_clES1K_S1T_lS1U_ENUlT_E_clINS_15semaphore_unitsINS_35semaphore_default_exception_factoryES1N_EEEEDaS1W_ENUlvE_clEvEUlSB_E_ZNSB_17then_wrapped_nrvoINS9_IvEES23_EENS_8futurizeIS1W_E4typeES1E_EUlOS3_RS23_ONS_12future_stateISA_EEE_SA_E15run_and_disposeEv (scylla + 0x1559638)
#3  0x0000000005c1cfa0 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x5a1cfa0)
#4  0x0000000005c1e288 _ZN7seastar7reactor6do_runEv (scylla + 0x5a1e288)
#5  0x0000000005c42284 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x5a42284)
#6  0x0000000005be0cdb _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x59e0cdb)
#7  0x00007f012dae8947 start_thread (libc.so.6 + 0x8c947)
#8  0x00007f012db6e860 __clone3 (libc.so.6 + 0x112860)
download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.112.5991a193a26741db95420f046b1e1093.6166.1704524277000000.gz


Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 5.5.0~dev-20240105.7e84e03f5231 with build-id f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-5329f695-9 (54.155.147.60 | 10.4.10.205) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-8 (3.250.158.2 | 10.4.11.205) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-7 (18.200.236.161 | 10.4.9.127) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-6 (34.245.194.88 | 10.4.9.178) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-5 (3.255.183.111 | 10.4.9.98) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-4 (18.201.16.110 | 10.4.9.5) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-3 (3.249.178.182 | 10.4.9.246) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-2 (18.201.13.191 | 10.4.11.217) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-10 (3.253.72.9 | 10.4.9.41) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-5329f695-1 (3.254.197.193 | 10.4.8.89) (shards: 14)

OS / Image: ami-077f3a25a749656b7 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 5329f695-3131-4153-a22e-a2bce1a8af32
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5329f695-3131-4153-a22e-a2bce1a8af32
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5329f695-3131-4153-a22e-a2bce1a8af32

Logs:

Jenkins job URL
Argus

fruch added the triage/master (Looking for assignee) label on Jan 7, 2024

fruch commented Jan 7, 2024

Last week the test passed this nemesis successfully.

These are the changes merged into scylla since then:

🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy

Sounds like it might be related to 8ba0dec, but I'll let @bhalevy comment on that


mykaul commented Jan 7, 2024

Decoded:

2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]: Backtrace:
2024-01-06T06:57:57.861+00:00 longevity-tls-50gb-3d-master-db-node-5329f695-3     !INFO | scylla[6166]:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:826
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:856
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:868
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4062
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4038
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:4034
/data/scylla-s3-reloc.cache/by-build-id/f21e4548b69223a75d01fd3bb9d4c9c2b1b71a6d/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7026fe8c129a523e07856d7c96306663ceab6e24, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
locator::token_metadata_impl::get_endpoint_for_host_id(utils::tagged_uuid<locator::host_id_tag>) const at ./locator/token_metadata.cc:550
 (inlined by) locator::token_metadata::get_endpoint_for_host_id(utils::tagged_uuid<locator::host_id_tag>) const at ./locator/token_metadata.cc:975
operator() at ./service/storage_service.cc:6305
 (inlined by) seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>::direct_vtable_for<service::storage_service::update_topology_change_info(seastar::lw_shared_ptr<locator::token_metadata>, seastar::basic_sstring<char, unsigned int, 15u, true>)::$_0>::call(seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)> const*, utils::tagged_uuid<locator::host_id_tag>) at ././seastar/include/seastar/util/noncopyable_function.hh:129
seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>::operator()(utils::tagged_uuid<locator::host_id_tag>) const at ././seastar/include/seastar/util/noncopyable_function.hh:215
 (inlined by) locator::token_metadata_impl::update_topology_change_info(seastar::noncopyable_function<std::optional<locator::endpoint_dc_rack> (utils::tagged_uuid<locator::host_id_tag>)>&) at ./locator/token_metadata.cc:753
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2666
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3129
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3305
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3188
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:670
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2081
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
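
Per the decoded frames, the abort originates in the internal-error check in locator::token_metadata_impl::get_endpoint_for_host_id (locator/token_metadata.cc:550). A minimal sketch of the shape of that check, reconstructed from the frames and the log message above rather than copied from the source (the logger name, the optional-returning helper, and the exact signature are assumptions):

inet_address token_metadata_impl::get_endpoint_for_host_id(host_id id) const {
    // sketch only - approximates token_metadata.cc:550, not the real source
    auto endpoint = get_endpoint_for_host_id_if_known(id); // assumed optional-returning lookup
    if (!endpoint) {
        // with --abort-on-internal-error 1 (see the command line above), this
        // logs "endpoint for host_id ... is not found" and aborts the node
        on_internal_error(tlogger, format("endpoint for host_id {} is not found", id));
    }
    return *endpoint;
}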

Which reminds me of #14974?


fruch commented Jan 7, 2024

what happened to https://backtrace.scylladb.com/ ?


bhalevy commented Jan 7, 2024

what happened to https://backtrace.scylladb.com/ ?

It doesn't support HTTPS, only HTTP.


bhalevy commented Jan 7, 2024

Last week the test passed this nemesis successfully.

These are the changes merged into scylla since then:

🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy

Sounds like it might be related to 8ba0dec, but I'll let @bhalevy comment on that

The internal error was added by @gusev-p in 5a1418f as part of 26cbd28.

@gusev-p, with _raft_topology_change_enabled, we handle the case where the host id is not found in:

const auto* node = _topology_state_machine._topology.find(server_id);
if (node) {
    return locator::endpoint_dc_rack {
        .dc = node->second.datacenter,
        .rack = node->second.rack,
    };
}
return std::nullopt;

so without _raft_topology_change_enabled, shouldn't we use get_endpoint_for_host_id_if_known instead of get_endpoint_for_host_id here?
return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));
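
For concreteness, a minimal sketch of that suggestion, assuming get_endpoint_for_host_id_if_known returns a std::optional and get_dc_rack_for stays as-is (a hypothetical shape, not a tested patch):

if (auto ep = tm.get_endpoint_for_host_id_if_known(host_id)) {
    return get_dc_rack_for(*ep);
}
// unknown host_id: report "no dc/rack" instead of tripping on_internal_error
return std::nullopt;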


gusev-p commented Jan 7, 2024

Last week the test passed this nemesis successfully.
These are the changes merged into scylla since then:

🟢 ❯ git log --oneline 331d9ce788e2..7e84e03f5231 | grep Merge
bf068dd023 Merge `handle error in cdc generation propagation during bootstrap` from Gleb
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
20531872a7 Merge 'test: randomized_nemesis_test: add formatter for append_entry' from Kefu Chai
715e062d4a Merge 'table, memtable: share log structured allocator statistics across all tablets in a table' from Avi Kivity
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
7f6955b883 Merge 'test: make use of concurrent bootstrap' from Patryk Jędrzejczak
8ba0decda5 Merge 'System.peers: enforce host_id' from Benny Halevy

Sounds like it might be related to 8ba0dec, but I'll let @bhalevy comment on that

The internal error was added by @gusev-p in 5a1418f as part of 26cbd28.

@gusev-p, with _raft_topology_change_enabled, we handle the case where the host id is not found in:

const auto* node = _topology_state_machine._topology.find(server_id);
if (node) {
    return locator::endpoint_dc_rack {
        .dc = node->second.datacenter,
        .rack = node->second.rack,
    };
}
return std::nullopt;

so without _raft_topology_change_enabled, shouldn't we use get_endpoint_for_host_id_if_known instead of get_endpoint_for_host_id here?

return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));

It's depressing that we get the real CI feedback only from longevity tests a month after the changes were merged. Neither test.py nor dtests caught this.

Regarding the code, we discussed this particular line in the PR review, but in the utter crap that is this github UI I can now barely find anything. The upshot: we relied on the token_metadata knowing the IPs of the nodes it manages in gossiper topology mode. Before the changes, the IPs themselves were used to identify the nodes, meaning we knew the IPs each time update_topology_change_info was called. The refactoring itself strived to maintain the same workflow for gossiper mode, meaning the IPs should be known each time update_topology_change_info is called. Obviously, there is a flaw in this reasoning. We relied on dtests to check whether that's true, and it doesn't work that way.

with _raft_topology_change_enabled, we handle the case where the host id is not found in

It's not exactly the same case. In _raft_topology_change_enabled mode, we check whether the entire node exists in _topology_state_machine._topology, while in gossiper mode we ask the token_metadata itself for an IP for the given id. The equivalent of the raft if is in the get_dc_rack_for function - it returns std::nullopt if it can't find the IP in the gossiper.
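
To make that concrete, here is a hedged sketch of the nullopt-on-unknown shape just described; the gossiper accessor and the dc/rack getters are hypothetical stand-ins, and only the control flow is the point:

std::optional<locator::endpoint_dc_rack>
storage_service::get_dc_rack_for(gms::inet_address endpoint) {
    auto* eps = _gossiper.get_endpoint_state_ptr(endpoint); // hypothetical accessor
    if (!eps) {
        // the gossiper-mode equivalent of the raft "if (node)" check:
        // an IP unknown to gossip yields nullopt, not an internal error
        return std::nullopt;
    }
    return locator::endpoint_dc_rack{
        .dc = eps->get_dc(),     // hypothetical getters over the gossiped
        .rack = eps->get_rack(), // DC/RACK application state
    };
}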

I should probably dig into the scenario of this longevity test and figure out what exactly my refactoring broke.


bhalevy commented Jan 7, 2024

How about the following fix, which passes a node& to the dc_rack_fn so it can use either the host_id or the endpoint, as the latter should be a node in the token_metadata topology:

diff --git a/locator/token_metadata.cc b/locator/token_metadata.cc
index 9f72708e12..1a629886ed 100644
--- a/locator/token_metadata.cc
+++ b/locator/token_metadata.cc
@@ -750,7 +750,8 @@ future<> token_metadata_impl::update_topology_change_info(dc_rack_fn& get_dc_rac
         }
         // apply new_normal_tokens
         for (auto& [endpoint, tokens]: new_normal_tokens) {
-            target_token_metadata->update_topology(endpoint, get_dc_rack(endpoint), node::state::normal);
+            auto* node = _topology.find_node(endpoint);
+            target_token_metadata->update_topology(endpoint, get_dc_rack(*node), node::state::normal);
             co_await target_token_metadata->update_normal_tokens(std::move(tokens), endpoint);
         }
         // apply leaving endpoints
diff --git a/locator/token_metadata.hh b/locator/token_metadata.hh
index b798b47ab0..5982718f57 100644
--- a/locator/token_metadata.hh
+++ b/locator/token_metadata.hh
@@ -74,6 +74,8 @@ struct host_id_or_endpoint {
 class token_metadata_impl;
 struct topology_change_info;
 
+using dc_rack_fn = seastar::noncopyable_function<std::optional<endpoint_dc_rack>(const locator::node&)>;
+
 class token_metadata final {
     std::unique_ptr<token_metadata_impl> _impl;
 private:
diff --git a/locator/types.hh b/locator/types.hh
index 3f2783f3fe..ceb672b8f2 100644
--- a/locator/types.hh
+++ b/locator/types.hh
@@ -31,6 +31,4 @@ struct endpoint_dc_rack {
     bool operator==(const endpoint_dc_rack&) const = default;
 };
 
-using dc_rack_fn = seastar::noncopyable_function<std::optional<endpoint_dc_rack>(host_id)>;
-
 } // namespace locator
diff --git a/service/storage_service.cc b/service/storage_service.cc
index 076c458ce3..5b205ce162 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -6289,9 +6289,9 @@ future<> storage_service::update_topology_change_info(mutable_token_metadata_ptr
     assert(this_shard_id() == 0);
 
     try {
-        locator::dc_rack_fn get_dc_rack_by_host_id([this, &tm = *tmptr] (locator::host_id host_id) -> std::optional<locator::endpoint_dc_rack> {
+        locator::dc_rack_fn get_dc_rack_by_host_id([this] (const locator::node& n) -> std::optional<locator::endpoint_dc_rack> {
             if (_raft_topology_change_enabled) {
-                const auto server_id = raft::server_id(host_id.uuid());
+                const auto server_id = raft::server_id(n.host_id().uuid());
                 const auto* node = _topology_state_machine._topology.find(server_id);
                 if (node) {
                     return locator::endpoint_dc_rack {
@@ -6302,7 +6302,7 @@ future<> storage_service::update_topology_change_info(mutable_token_metadata_ptr
                 return std::nullopt;
             }
 
-            return get_dc_rack_for(tm.get_endpoint_for_host_id(host_id));
+            return get_dc_rack_for(n.endpoint());
         });
         co_await tmptr->update_topology_change_info(get_dc_rack_by_host_id);
     } catch (...) {
diff --git a/test/boost/token_metadata_test.cc b/test/boost/token_metadata_test.cc
index 29317ae07d..71f36c987d 100644
--- a/test/boost/token_metadata_test.cc
+++ b/test/boost/token_metadata_test.cc
@@ -21,13 +21,17 @@ namespace {
         return host_id{utils::UUID(0, id)};
     }
 
-    endpoint_dc_rack get_dc_rack(host_id) {
+    endpoint_dc_rack unknown_dc_rack() {
         return {
             .dc = "unk-dc",
             .rack = "unk-rack"
         };
     }
 
+    endpoint_dc_rack get_dc_rack(locator::host_id) {
+        return unknown_dc_rack();
+    }
+
     mutable_token_metadata_ptr create_token_metadata(host_id this_host_id) {
         return make_lw_shared<token_metadata>(token_metadata::config {
             topology::config {
@@ -39,7 +43,9 @@ namespace {
 
     template <typename Strategy>
     mutable_vnode_erm_ptr create_erm(mutable_token_metadata_ptr tmptr, replication_strategy_config_options opts = {}) {
-        dc_rack_fn get_dc_rack_fn = get_dc_rack;
+        dc_rack_fn get_dc_rack_fn = [] (const locator::node&) {
+            return unknown_dc_rack();
+        };
         tmptr->update_topology_change_info(get_dc_rack_fn).get();
         auto strategy = seastar::make_shared<Strategy>(replication_strategy_params(opts, std::nullopt));
         return calculate_effective_replication_map(std::move(strategy), tmptr).get0();

@mykaul
Copy link
Contributor

mykaul commented Jan 7, 2024

It's depressing that we get the real CI feedback only from longevity tests, months after the changes were merged. Neither test.py nor dtests caught that.

@gusev-p - that's legitimate feedback - can you follow up on how we missed this in either/both test suites?

@mykaul mykaul added this to the 6.0 milestone Jan 7, 2024
@mykaul mykaul changed the title Multiple node core dump during decommission operation of other node Multiple node core dump during decommission operation of other node (conversion to host ID related) Jan 7, 2024
@gusev-p
Copy link

gusev-p commented Jan 7, 2024

How about the following fix, which passes a node& to the dc_rack_fn so it can use either the host_id or the endpoint,
as the node should be present in the token_metadata topology:

This won't help much: in our case endpoint() will be empty, and the effect is the same as get_endpoint_for_host_id_if_known(host_id).value_or(inet_address{}). We'll discuss at tomorrow's daily how to handle this.

@mykaul mykaul added status/release blocker Preventing from a release to be promoted and removed triage/master Looking for assignee labels Jan 8, 2024
@kbr-scylla
Copy link
Contributor

Neither test.py nor dtests caught that.

test.py tests are mostly running in raft-topology mode now (except for a few specific test cases).

As for dtests, I don't know. The issue is most likely a timing race (as most gossiper issues are), and perhaps the larger the cluster, the easier it is to reproduce; in dtests we don't test such large clusters. Or (more likely, I think) it's because the longevity test runs on a real distributed cluster (multiple machines) and network latencies are needed to reproduce this. Hmm... could it be that nodes are not getting gossip messages in time?

@kbr-scylla
Copy link
Contributor

Regarding the code, we discussed this particular line in #15903, but in the utter crap that this GitHub UI is, I can now barely find anything.

You probably mean this
#15903 (comment)

@kbr-scylla
Copy link
Contributor

The node that was decommissioning was actually node-10.

The crashes happened while the node was announcing that it had left the ring:

Jan 06 06:57:56.455247 longevity-tls-50gb-3d-master-db-node-5329f695-10 scylla[6090]:  [shard  0:strm] storage_service - Announcing that I have left the ring for 30000ms
...
Jan 06 06:58:26.455296 longevity-tls-50gb-3d-master-db-node-5329f695-10 scylla[6090]:  [shard  0:strm] storage_service - decommission[e7d77f31-f33f-4d42-bd03-0aa4f2f11818]: left token ring

The aborts happened in this time period, e.g. on node-9:

Jan 06 06:57:58.547864 longevity-tls-50gb-3d-master-db-node-5329f695-9 scylla[6086]:  [shard  0: gms] token_metadata - endpoint for host_id 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is not found, at: 0x611fd1e 0x6120330 0x6120618 0x5bdffa7 0x3ef709a 0x4120f29 0x3f16cd1 0x13a30fa 0x5c1cf9f 0x5c1e287 0x5c1d5e8 0x5bafbc7 0x5baed7c 0x1336d79 0x13387d0 0x13353fc /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x1332ca4

The host ID they're trying to map (5fa31aad-4354-48ff-a3ae-ccdafa5f92ff) is that of the decommissioning node.

@kbr-scylla
Copy link
Contributor

Hmm

        // apply new_normal_tokens
        for (auto& [endpoint, tokens]: new_normal_tokens) {
            target_token_metadata->update_topology(endpoint, get_dc_rack(endpoint), node::state::normal);
            co_await target_token_metadata->update_normal_tokens(std::move(tokens), endpoint);
        }
        // apply leaving endpoints
        for (const auto& endpoint: _leaving_endpoints) {
            target_token_metadata->remove_endpoint(endpoint);
        }

The crash is happening when trying to map IPs of endpoints in new_normal_tokens.

Curiously, this also includes leaving endpoints if there are any -- those are being removed in the lines below, after we attempted to map their IPs.

IIUC we could modify this code so we don't need mappings for leaving endpoints -- after all, we're adding them and immediately removing them from target_token_metadata, so we're just introducing a redundant intermediate stage, which seems to be the only place that uses the mappings.
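
A rough sketch of that idea (illustrative only; it assumes _leaving_endpoints is visible here and holds the same key type as new_normal_tokens):

// apply new_normal_tokens, skipping nodes that are leaving anyway
for (auto& [endpoint, tokens]: new_normal_tokens) {
    if (_leaving_endpoints.contains(endpoint)) {
        continue; // removed just below -- no need to map its IP at all
    }
    target_token_metadata->update_topology(endpoint, get_dc_rack(endpoint), node::state::normal);
    co_await target_token_metadata->update_normal_tokens(std::move(tokens), endpoint);
}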

Still the question remains, why do we sometimes have the mappings and sometimes not.

@kbr-scylla
Copy link
Contributor

Wait, I might've misunderstood. Leaving endpoints should not be part of new_normal_tokens.

@kbr-scylla
Copy link
Contributor

I think this is a clue. Our decommissioning node has replaced another node before:

Jan 06 06:56:21.981466 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] storage_service - handle_state_normal: Nodes 10.4.9.127 and 10.4.9.41 have the same token -1916910330658437065. Ignoring 10.4.9.127
Jan 06 06:56:21.981505 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] storage_service - handle_state_normal: endpoints_to_remove endpoint=10.4.9.127
Jan 06 06:56:22.125801 longevity-tls-50gb-3d-master-db-node-5329f695-4 scylla[6200]:  [shard  0: gms] gossip - Removed endpoint 10.4.9.127

(this log is from a node which did not crash)

Perhaps on nodes where the crash happened, the state of the old node was still lingering and somehow messed everything up.

BTW, on node-4, which did not crash, we can see the 10.4.9.127 state being removed over and over again. This looks like #14991

@kbr-scylla
Copy link
Contributor

So... our culprit is still inside _replacing_endpoints? And that's why it is included in new_normal_tokens?

Just gossiper things.

@gusev-p
Copy link

gusev-p commented Jan 8, 2024

10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff is the node which first replaced some other node and then was decommissioned.

failed node logs (node 3):

Jan 06 06:29:46.003906 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0: gms] gossip - InetAddress 10.4.9.41 is now UP, status = UNKNOWN

    Jan 06 06:31:35.729212 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Added replacing_node=10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff to replace existing_node=10.4.9.127/94ecd347-1d25-433e-84e1-cb1333917bca, coordinator=10.4.9.41/5fa31aad-4354-48ff-a3ae-ccdafa5f92ff

    Jan 06 06:31:35.729231 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] token_metadata - Added node 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as pending replacing endpoint which replaces existing node 94ecd347-1d25-433e-84e1-cb1333917bca

    Jan 06 06:33:46.797641 longevity-tls-50gb-3d-master-db-node-5329f695-3 scylla[6166]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Marked ops done from coordinator=10.4.9.41
At exactly the same time, the same 'Marked ops done' happens on a healthy node, but there it's immediately followed by 'handle_state_normal', which is missing here for some reason.

healthy node (node2):

Jan 06 06:29:46.006827 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0: gms] gossip - InetAddress 10.4.9.41 is now UP, status = UNKNOWN

    Jan 06 06:31:35.731959 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0:strm] token_metadata - Added node 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as pending replacing endpoint which replaces existing node 94ecd347-1d25-433e-84e1-cb1333917bca

    Jan 06 06:33:46.797671 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0:strm] storage_service - replace[349238da-585f-4011-bb9a-050db4cfbbbd]: Marked ops done from coordinator=10.4.9.41
Jan 06 06:33:48.243623 longevity-tls-50gb-3d-master-db-node-5329f695-2 scylla[6144]:  [shard  0: gms] storage_service - handle_state_normal: remove endpoint=10.4.9.127 token=-1916910330658437065
Here we see the 'handle_state_normal' which removes the old node (since its tokens are now owned by the new node): 'remove_endpoint' is called for the replaced node, and this also removes the mapping to the new node from _replacing_endpoints in token_metadata.

So handle_state_normal was not called on node 3, and a mapping with 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff as a value remained in _replacing_endpoints. Then decommission came for 5fa31aad-4354-48ff-a3ae-ccdafa5f92ff: storage_service::excise was called for it, which called tmptr->remove_endpoint(*host_id); this removed the node from token_metadata.topology, but not from _replacing_endpoints. Then update_topology_change_info caused the crash.

This sequence of events is similar to this one in that token_metadata_impl::remove_endpoint removes from _replacing_endpoints only by key.

So, the upshot:

  • I think we need to make token_metadata_impl::remove_endpoint remove entries from _replacing_endpoints by both key and value (a rough sketch follows the code below);
  • It's unclear why handle_state_normal wasn't called after 'Marked ops done' on the failed nodes. Maybe this code in gossiper placed it into quarantine?
// check for dead state removal
auto expire_time = get_expire_time_for_endpoint(endpoint);
const auto host_id = get_host_id(endpoint);
if (!is_alive && (now > expire_time)
    && (!get_token_metadata_ptr()->is_normal_token_owner(host_id))) {
    logger.debug("time is expiring for endpoint : {} ({})", endpoint, expire_time.time_since_epoch().count());
    co_await evict_from_membership(endpoint, pid);
}
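
A rough sketch of the first bullet (hypothetical; it assumes _replacing_endpoints maps the replaced node's host_id to the replacing node's host_id):

void token_metadata_impl::remove_endpoint(locator::host_id host) {
    // ... existing cleanup of topology, tokens, leaving endpoints, etc. ...
    _replacing_endpoints.erase(host);      // host was the node being replaced (key)
    std::erase_if(_replacing_endpoints, [&] (const auto& entry) {
        return entry.second == host;       // host was the replacing node (value)
    });
}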

@kbr-scylla @bhalevy

@kbr-scylla
Copy link
Contributor

SCT actually reported a bunch of errors pointing to the discrepancies, for example:

2024-01-06 06:42:12.150: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=1e74743f-689f-494f-91ec-7dcfe98251da during_nemesis=RunUniqueSequence: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). The node Node longevity-tls-50gb-3d-master-db-node-5329f695-10 [3.253.72.9 | 10.4.9.41] (seed: True) exists in the gossip but doesn't exist in the nodetool.status
2024-01-06 06:42:12.213: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=ce1b79c8-f0ba-47fa-98d3-3bbc2031a3ae during_nemesis=RunUniqueSequence: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Wrong node status. Node Node longevity-tls-50gb-3d-master-db-node-5329f695-2 [18.201.13.191 | 10.4.11.217] (seed: True) status in nodetool.status is UN, but status in gossip shutdown
2024-01-06 06:42:12.231: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=4131652e-5301-4ff1-bfa4-543b5536293e during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Node Node longevity-tls-50gb-3d-master-db-node-5329f695-1 [3.254.197.193 | 10.4.8.89] (seed: True) (not target node) exists in the nodetool.status but missed in gossip.
2024-01-06 06:42:12.253: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=ad214989-862b-42ab-8289-82ea5e4f8a2c during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Node Node longevity-tls-50gb-3d-master-db-node-5329f695-10 [3.253.72.9 | 10.4.9.41] (seed: True) (not target node) exists in the gossip but missed in SYSTEM.PEERS.
2024-01-06 06:42:12.276: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=7ac69ffa-ee55-4110-bc7a-e9676c9b5105 during_nemesis=RunUniqueSequence: type=NodeSchemaVersion node=longevity-tls-50gb-3d-master-db-node-5329f695-5 error=Current node Node longevity-tls-50gb-3d-master-db-node-5329f695-5 [3.255.183.111 | 10.4.9.98] (seed: True). Nodes 10.4.8.89 exists in the SYSTEM.PEERS but missed in gossip.

(this is from sct-runner-events, events.log)

@kbr-scylla
Copy link
Contributor

Attempts at creating a fast local reproducer failed.

It looks like we'll have to run longevity with custom builds with more logging, or perhaps enable more logging on master.

The logging in gossiper/storage_service and handle_state_normal is very inconsistent. The decisions about which logs to put on INFO level and which on DEBUG seem to have been made randomly, and most cases are not covered by a single INFO log.

Apparently SCT provides nodetool gossipinfo output between operations. We have the output from between the replace and the decommission.
Node 2 (healthy):

< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG > /10.4.9.41
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   generation:1704522584
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   heartbeat:1391
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X9:org.apache.cassandra.locator.Ec2Snitch
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.9.41
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   HOST_ID:5fa31aad-4354-48ff-a3ae-ccdafa5f92ff
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   SCHEMA:a6725fa6-ac53-11ee-780b-aa82ae73c005
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X3:3
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   LOAD:10326500191
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   STATUS:NORMAL,5583842826136721579
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X4:1
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X8:v2;1704520650893;707d8d86-a7d5-4182-bb48-d94139848208
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X6:14
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X7:12
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X2:mview.users:0.947625;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_permissions:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;system_traces.events:0.000000;mview.users_by_first_name:0.130302;system_traces.node_slow_log_time_idx:0.000000;system_auth.roles:0.999807;system_traces.node_slow_log:0.000000;mview.users_by_last_name:0.155384;system_distributed.service_levels:1.000000;system_distributed.view_build_status:0.000000;keyspace1.standard1:0.092051;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;
< t:2024-01-06 06:39:35,610 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X5:1076:877238681:1704523174210

node 3 (crashed):

< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG > /10.4.9.41
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   generation:1704522584
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   heartbeat:1433
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.9.41
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   SCHEMA:a6725fa6-ac53-11ee-780b-aa82ae73c005
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   HOST_ID:5fa31aad-4354-48ff-a3ae-ccdafa5f92ff
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   LOAD:10326500191
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X8:v2;1704520650893;707d8d86-a7d5-4182-bb48-d94139848208
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X5:522:877238681:1704523188211
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X2:mview.users:0.973837;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_permissions:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;system_traces.events:0.000000;mview.users_by_first_name:0.206648;system_traces.node_slow_log_time_idx:0.000000;system_auth.roles:0.999974;system_traces.node_slow_log:0.000000;mview.users_by_last_name:0.243549;system_distributed.service_levels:1.000000;system_distributed.view_build_status:0.000000;keyspace1.standard1:0.114186;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X6:14
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X9:org.apache.cassandra.locator.Ec2Snitch
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X4:1
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X3:3
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   X7:12
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-01-06 06:39:49,805 f:cluster.py      l:2653 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8

We see that node 3 is getting gossip updates regularly -- the heartbeat is newer than the one in node 2's output (because nodetool gossipinfo was called on node 3 after node 2). And we see that the status of node 10 is NORMAL. So node 3 did get the new status -- but for some reason

  • either handle_state_normal wasn't called -- maybe because gossiper is deadlocked somewhere?
  • or it was called, but entered one of the other branches which didn't result in any INFO log

Note that this wouldn't be the first time the gossiper deadlocked and stopped calling handlers...

I need to send a PR with more logging:

  • to cover each case of handle_state_normal with at least one INFO level log (this is easy)
  • and somehow to detect a gossiper deadlock if it happens (this is hard; one rough idea is sketched below)
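
One rough idea for the second point, as a sketch under assumed names (_endpoint_locks, logger, and the signature are illustrative, not actual gossiper code): arm a timer before waiting on the per-endpoint lock and log loudly if it fires.

future<semaphore_units<>> gossiper::lock_endpoint(inet_address ep) {
    auto start = seastar::lowres_clock::now();
    // Watchdog: fires only if we wait for the lock longer than a minute.
    seastar::timer<seastar::lowres_clock> watchdog([ep, start] {
        logger.error("lock_endpoint: still waiting for lock on {} after {}s -- possible deadlock",
                     ep, std::chrono::duration_cast<std::chrono::seconds>(
                             seastar::lowres_clock::now() - start).count());
    });
    watchdog.arm(std::chrono::minutes(1));
    auto units = co_await seastar::get_units(_endpoint_locks[ep], 1);
    watchdog.cancel(); // lock acquired in time, disarm the watchdog
    co_return units;
}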

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jan 11, 2024
In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but gets stuck in the middle. Which of the two is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.

Improve the INFO level logging in handle_state_normal to aid debugging
in the future.

The amount of logs is still constant per-node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but it is
a once-per-startup event.
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jan 11, 2024
In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but gets stuck in the middle. Which of the two is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
@bhalevy
Copy link
Member

bhalevy commented Jan 19, 2024

  • I think we need to make token_metadata_impl::remove_endpoint remove entries from _replacing_endpoints by both key and value;

I agree.
@gusev-p See #16731

@kbr-scylla
Copy link
Contributor

@bhalevy I'm worried that #16731 will prevent this failure from reproducing, masking the root cause of the issue.

The root cause here is that nodes in the cluster never learned that the replacing node transitioned to NORMAL.

We don't have any other known test to catch that problem, except this longevity one.

That's why we should aim to reproduce it first, with more logs, try to find the root cause, before we merge #16731.

@bhalevy
Copy link
Member

bhalevy commented Jan 24, 2024

@bhalevy I'm worried that #16731 will prevent this failure from reproducing, masking the root cause of the issue.

The root cause here is that nodes in the cluster never learned that the replacing node transitioned to NORMAL.

We don't have any other known test to catch that problem, except this longevity one.

That's why we should aim to reproduce it first, with more logs, try to find the root cause, before we merge #16731.

ok. makes sense

@kbr-scylla
Copy link
Contributor

And we see that status of node 10 is NORMAL. So node 3 did get the new status -- but for some reason

Argh, how did I not see this -- STATUS for node 10 is missing from gossipinfo on the node which crashed (node 3)... even though node 3's endpoint_state is newer in this output compared to healthy node 2's (the (generation, heartbeat) pair is greater).

@kbr-scylla
Copy link
Contributor

Seen in https://argus.scylladb.com/test/da141da4-1427-469f-a136-a7db8a47d5fa/runs?additionalRuns[]=22943c8d-3b16-40a6-b047-c1dd133fc26f

That was before fd32e2e

Start time: 2024-02-04 18:21:52
End time: 2024-02-04 20:02:24
Scylla version: 5.5.0~dev-20240202.52e6398ad64d
Build id: 59019e20ed174f607214637907e85c7e293b26af

@mykaul
Copy link
Contributor

mykaul commented Apr 3, 2024

What's the latest on this one?

@kbr-scylla
Copy link
Contributor

A missing STATUS=NORMAL update was recently spotted in CI:
#18118 (comment)

This means the issue is likely still present; they have the same root cause.

@kbr-scylla
Copy link
Contributor

Found the root cause.
Will send a fix soon.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new
  version X. Note that X is larger than the versions used by the
  application_states added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but a bumped heart_beat == X from the `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites the
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is
  that it saves updated application_states that use versions smaller
  than X.
- now when node B sends the updated application_states in an ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0; there could be no `heart_beat` bump in between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent ~100%
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before the second phase of replicate, to
increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based
on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
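
A minimal sketch of the shape of the fix (illustrative only -- the names
are approximate, the real patch is in scylladb#18184):
```
// Bump the heartbeat only while holding the same per-endpoint lock that
// replicate() takes. A SYN/ACK handler can then never ship a heart_beat
// version that is newer than the application states replicate() is
// about to publish.
future<> gossiper::bump_local_heart_beat() {
    auto lock = co_await lock_endpoint(get_broadcast_address());
    my_endpoint_state().get_heart_beat_state().update_heart_beat();
}
```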

Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
denesb added a commit that referenced this issue Apr 16, 2024
…amil Braun

The PR also adds a regression test.

Fixes: #15393
Fixes: #15602
Fixes: #16668
Fixes: #16902
Fixes: #17493
Fixes: #18118
Ref: scylladb/scylla-enterprise#3720

Closes #18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
mergify bot pushed a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 16, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in #16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in #15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up and overwrites the
  endpoint_state with the one it saved in phase 1. In particular, it
  reverts heart_beat back to a smaller value, but the larger problem is
  that it saves the updated application_states, which use versions smaller
  than X.
- now when node B sends the updated application_states in an ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe the new states of node B, as
  the simulation sketch below illustrates.
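
To make the interleaving concrete, here is a minimal single-threaded
simulation of the scenario above. It is plain standard C++ with
hypothetical names, modeling only the version bookkeeping, not the actual
Seastar code:

```
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

// Models gms/version_generator: a process-wide monotonic counter.
static long next_version_counter = 0;
static long new_version() { return ++next_version_counter; }

struct endpoint_state {
    long heart_beat = 0;
    std::map<std::string, long> app_states;  // state name -> version
    long max_version() const {
        long m = heart_beat;
        for (const auto& kv : app_states) m = std::max(m, kv.second);
        return m;
    }
};

int main() {
    endpoint_state live;                 // node B's live gossiper state
    live.heart_beat = new_version();

    // add_local_application_state: copy the state, bump versions in the copy.
    endpoint_state copy = live;
    copy.app_states["STATUS"] = new_version();  // e.g. the new NORMAL status

    // replicate() phase 1 done; the cross-shard copies force a yield here.
    // During the yield the gossiper loop runs and bumps the live heart_beat:
    live.heart_beat = new_version();
    long x = live.max_version();         // node A learns "B is at version X"

    // replicate() phase 2 wakes up and overwrites the live state with the
    // copy, installing application states whose versions are below X.
    live = copy;

    // From now on node A only asks for (and applies) versions > X, so the
    // STATUS update is never delivered:
    bool delivered = live.app_states["STATUS"] > x;
    std::printf("X=%ld, STATUS version=%ld, delivered=%s\n",
                x, live.app_states["STATUS"], delivered ? "yes" : "no");
    return 0;
}
```

Running this prints `delivered=no`: the STATUS update ends up with a
version below X and is silently lost.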

If I understand correctly, this is a regression introduced in
38c2347, which added a yield in
`replicate`. Before that, the updated state was saved atomically on
shard 0, so there could be no `heart_beat` bump between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a reliable
reproducer for the problem: introduce a longer sleep in
`add_local_application_state` before the second phase of `replicate`, to
increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based on
that.

The fix is to bump the heart_beat under the local endpoint lock, which is
also taken by `replicate`; a sketch of this ordering follows.
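
A minimal sketch of that ordering, reusing `endpoint_state` and
`new_version` from the simulation above and a plain `std::mutex` in place
of the gossiper's per-endpoint lock (the names are hypothetical, not the
actual Scylla code):

```
#include <mutex>

std::mutex local_endpoint_lock;  // stands in for the per-endpoint lock

// gossiper::run() bumps heart_beat only while holding the same lock that
// replicate() holds across both of its phases, so the bump can no longer
// interleave with a pending overwrite of the endpoint state.
void bump_heart_beat(endpoint_state& live) {
    std::lock_guard<std::mutex> guard(local_endpoint_lock);
    live.heart_beat = new_version();
}
```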

Fixes: #15393
Fixes: #15602
Fixes: #16668
Fixes: #16902
Fixes: #17493
Fixes: #18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 17, 2024
…rt_beat' from ScyllaDB

The PR also adds a regression test.

Fixes: #15393
Fixes: #15602
Fixes: #16668
Fixes: #16902
Fixes: #17493
Fixes: #18118
Ref: scylladb/scylla-enterprise#3720

(cherry picked from commit a0b331b)

(cherry picked from commit 7295509)

Refs #18184

Closes #18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
In a longevity test reported in scylladb#16668 we observed that the
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
called but gets stuck in the middle. Which of these is the case could not
be determined from the logs, and attempts at creating a local reproducer
failed.

Improve the INFO level logging in handle_state_normal to aid debugging
in the future.

The volume of logging is still constant per node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but that is
a once-per-startup event.
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
In a longevity test reported in scylladb#16668 we observed that the
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
called but gets stuck in the middle. Which of these is the case could not
be determined from the logs, and attempts at creating a local reproducer
failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see whether `lock_endpoint` got stuck; a sketch of such
a watchdog follows.
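
As a rough sketch of that idea, using `std::timed_mutex` in place of the
gossiper's per-endpoint lock (the names are hypothetical; the real change
uses Seastar primitives and the error-level logger):

```
#include <chrono>
#include <cstdio>
#include <mutex>

std::timed_mutex endpoint_lock;  // stands in for the per-endpoint lock

void lock_endpoint_with_watchdog(const char* endpoint) {
    using namespace std::chrono_literals;
    auto start = std::chrono::steady_clock::now();
    // Keep trying to take the lock, reporting once per minute of waiting.
    while (!endpoint_lock.try_lock_for(1min)) {
        auto waited_s = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now() - start).count();
        std::fprintf(stderr,
                     "lock_endpoint: still waiting for %s after %lld s\n",
                     endpoint, static_cast<long long>(waited_s));
    }
    // ... critical section ...
    endpoint_lock.unlock();
}
```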
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Ref: scylladb/scylla-enterprise#3720