coredump during creating instance on azure on single node #10494

Closed
aleksbykov opened this issue May 5, 2022 · 12 comments
Assignees
Labels
cloud/azure Azure related issues type/bug
Milestone

Comments

@aleksbykov
Contributor

Installation details

Kernel Version: 5.13.0-1022-azure
Scylla version (or git commit hash): 5.1.dev-20220504.b26a3da584cc with build-id ab2a33a30756c1513f4c516cd272291e75acec0e
Cluster size: 6 nodes (Standard_L8s_v2)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-835fbc85-eastus-8 (20.127.8.251 | 10.0.0.14) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-7 (20.120.98.177 | 10.0.0.14) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-6 (20.121.13.143 | 10.0.0.10) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-5 (20.121.13.124 | 10.0.0.9) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-4 (20.119.62.231 | 10.0.0.8) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-3 (20.119.59.43 | 10.0.0.7) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-2 (20.25.96.108 | 10.0.0.6) (shards: 8)
  • longevity-10gb-3h-master-db-node-835fbc85-eastus-1 (20.232.111.57 | 10.0.0.5) (shards: 8)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/ScyllaDB-5.1.dev-0.20220504.b26a3da584cc-1-build-148 (azure: eastus)

Test: longevity-10gb-3h-azure-test
Test id: 835fbc85-2bdf-46aa-a87d-04348bbbc1f8
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Issue description

A coredump happened while creating node 2, even before cluster configuration:

2022-05-04 07:24:32.864 <2022-05-04 07:13:27.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=1fe38754-d16a-423e-9b40-9b0b73e0cf60 node=Node longevity-10gb-3h-master-db-node-835fbc85-eastus-2 [20.25.96.108 | 10.0.0.6] (seed: False)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000.gz
backtrace=           PID: 5560 (scylla)
UID: 113 (scylla)
GID: 121 (scylla)
Signal: 11 (SEGV)
Timestamp: Wed 2022-05-04 07:13:27 UTC (9min ago)
Command Line: /usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-7 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
Unit: scylla-server.service
Slice: scylla-server.slice
Boot ID: e9ea3c0c29724c5c8ff102ee668da941
Machine ID: 00776e0e9e334a95b9dd4a446cd6fbcd
Hostname: longevity-10gb-3h-master-db-node-eastus-2
Storage: /var/lib/systemd/coredump/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000
Message: Process 5560 (scylla) of user 113 dumped core.
Stack trace of thread 5567:
#0  0x00007f3e08dbc815 __memmove_avx_unaligned_erms (libc.so.6 + 0x163815)
#1  0x0000000002fe8ef4 _ZN7locator22production_snitch_base9set_my_dcERKN7seastar13basic_sstringIcjLj15ELb1EEE (scylla + 0x2de8ef4)
#2  0x0000000002f3282c _ZNSt17_Function_handlerIFN7seastar6futureIvEERN7locator10snitch_ptrEEZNS0_7shardedIS4_E13invoke_on_allIZNS3_12azure_snitch11load_configEvE3$_0JEEES2_NS0_21smp_submit_to_optionsET_DpT0_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_ (scylla + 0x2d3282c)
#3  0x000000000110803b _ZN7seastar17smp_message_queue15async_work_itemIZZNS_7shardedIN7locator10snitch_ptrEE13invoke_on_allENS_21smp_submit_to_optionsESt8functionIFNS_6futureIvEERS4_EEENKUljE_clEjEUlvE_E15run_and_disposeEv (scylla + 0xf0803b)
#4  0x000000000466bd05 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x446bd05)
#5  0x000000000466d0e8 _ZN7seastar7reactor6do_runEv (scylla + 0x446d0e8)
#6  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#7  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#8  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#9  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5568:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5575:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5571:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5570:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae247 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae247)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5565:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5564:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5561:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5572:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5560:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000466c33d _ZN7seastar7reactor3runEv (scylla + 0x446c33d)
#6  0x0000000004613349 _ZN7seastar12app_template14run_deprecatedEiPPcOSt8functionIFvvEE (scylla + 0x4413349)
#7  0x0000000004612822 _ZN7seastar12app_template3runEiPPcOSt8functionIFNS_6futureIiEEvEE (scylla + 0x4412822)
#8  0x000000000105ef70 _ZL11scylla_mainiPPc (scylla + 0xe5ef70)
#9  0x000000000105c79b main (scylla + 0xe5c79b)
#10 0x00007f3e08c80b75 __libc_start_main (libc.so.6 + 0x27b75)
#11 0x000000000105b72e _start (scylla + 0xe5b72e)
Stack trace of thread 5574:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5573:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5562:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5563:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5569:
#0  0x00007f3e0985794c read (libpthread.so.0 + 0x1294c)
#1  0x00000000046ae285 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla + 0x44ae285)
#2  0x00000000046ae5c0 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla + 0x44ae5c0)
#3  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#4  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#5  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
Stack trace of thread 5566:
#0  0x00007f3e08d53ddd syscall (libc.so.6 + 0xfaddd)
#1  0x00000000046b43f1 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla + 0x44b43f1)
#2  0x00000000046afff5 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla + 0x44afff5)
#3  0x00000000046b0764 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla + 0x44b0764)
#4  0x000000000466d46d _ZN7seastar7reactor6do_runEv (scylla + 0x446d46d)
#5  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#6  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#7  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#8  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.113.e9ea3c0c29724c5c8ff102ee668da941.5560.1651648407000000000000.gz

  • Restore Monitor Stack command: $ hydra investigate show-monitor 835fbc85-2bdf-46aa-a87d-04348bbbc1f8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 835fbc85-2bdf-46aa-a87d-04348bbbc1f8

Logs:

Jenkins job URL

@slivne slivne added cloud/azure Azure related issues type/bug labels May 15, 2022
@slivne slivne added this to the 5.0 milestone May 15, 2022
@asias
Contributor

asias commented May 17, 2022

The crash is from azure snitch.

#0  0x00007f3e08dbc815 __memmove_avx_unaligned_erms (libc.so.6 + 0x163815)
#1  0x0000000002fe8ef4 _ZN7locator22production_snitch_base9set_my_dcERKN7seastar13basic_sstringIcjLj15ELb1EEE (scylla + 0x2de8ef4)
#2  0x0000000002f3282c _ZNSt17_Function_handlerIFN7seastar6futureIvEERN7locator10snitch_ptrEEZNS0_7shardedIS4_E13invoke_on_allIZNS3_12azure_snitch11load_configEvE3$_0JEEES2_NS0_21smp_submit_to_optionsET_DpT0_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_ (scylla + 0x2d3282c)
#3  0x000000000110803b _ZN7seastar17smp_message_queue15async_work_itemIZZNS_7shardedIN7locator10snitch_ptrEE13invoke_on_allENS_21smp_submit_to_optionsESt8functionIFNS_6futureIvEERS4_EEENKUljE_clEjEUlvE_E15run_and_disposeEv (scylla + 0xf0803b)
#4  0x000000000466bd05 _ZN7seastar7reactor14run_some_tasksEv (scylla + 0x446bd05)
#5  0x000000000466d0e8 _ZN7seastar7reactor6do_runEv (scylla + 0x446d0e8)
#6  0x000000000468bc96 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureERKNS1_11smp_optionsERKNS1_15reactor_optionsEE4$_89E9_M_invokeERKSt9_Any_data (scylla + 0x448bc96)
#7  0x000000000463fa8b _ZN7seastar12posix_thread13start_routineEPv (scylla + 0x443fa8b)
#8  0x00007f3e0984e2a5 start_thread (libpthread.so.0 + 0x92a5)
#9  0x00007f3e08d59323 __clone (libc.so.6 + 0x100323)
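
Frames #1 and #2 above are easier to read once the symbol names are demangled. The following is a minimal sketch, assuming a GCC or Clang toolchain where `abi::__cxa_demangle` from `<cxxabi.h>` is available; `demangle` is a hypothetical helper name, not something from the Scylla tree.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Demangle an Itanium C++ ABI symbol name (GCC/Clang toolchains).
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so release it with free.
    std::unique_ptr<char, decltype(&std::free)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        &std::free);
    return (status == 0 && out) ? std::string(out.get()) : mangled;
}
```

Passing the symbol from frame #1 to this helper yields a name inside `locator::production_snitch_base::set_my_dc`, which is the setter where the `memmove` in frame #0 faulted.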

The regression was probably introduced by the following commits. @xemul, can you please check?

commit 633746b87d02a214fe86d8ee3507258119ed2496
Author: Pavel Emelyanov <xemul@scylladb.com>
Date:   Thu Apr 7 16:14:22 2022 +0300

    snitch: Make config-based construction of all drivers
    
    Currently snitch drivers register themselves in class-registry with all
    sorts of construction options possible. All those different constuctors
    are in fact "config options".
    
    When later snitch will declare its dependencies (gossiper and system
    keyspace), it will require patching all this registrations, which's very
    inconvenient.
    
    This patch introduces the snitch_config struct and replaces all the
    snitch constructors with the snitch_driver(snitch_config cfg) one.
    
    Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

commit 552a08ecd060aba2697b455b1a699c2efac3d2e8
Author: Pavel Emelyanov <xemul@scylladb.com>
Date:   Thu Apr 7 19:41:20 2022 +0300

    snitch: Introduce container() method
    
    Some snitch drivers want the peering_sharded_service::container()
    functionality, but they can't directly use it, because the driver
    class is in fact the pimplification behind the sharded<snitch_ptr>
    service. To overcome this there's a _my_distributed pointer on the
    driver base class that points back to sharded<snitch_ptr> object.
    
    This patch replaces the direct _my_distributed usage with the
    container() method that does it and also asserts that the pointer
    in question is initialized (some drivers already do it, some don't).
    
    Other than making the code more peering_sharded_service-like, this
    patch allows changing _my_distributed into _backreference that
    points to this shard's snitch_ptr, see next patch.
    
    Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

@xemul
Contributor

xemul commented May 17, 2022

I cannot find logs that show how the system behaved right before the crash. All I can see from them is that the snitch initialized successfully. @aleksbykov, please point me to the correct logs.

@aleksbykov
Contributor Author

The coredump happens even before we start collecting any significant log data.

@xemul
Contributor

xemul commented May 17, 2022

But Scylla itself logs somewhere on start.

@xemul
Contributor

xemul commented May 17, 2022

Actually, the azure snitch code looks weird from the very beginning (e44fa8d):

future<> azure_snitch::load_config() {
    ...
    _my_rack = azure_zone;
    _my_dc = azure_region;

    co_return co_await _my_distributed->invoke_on_all([this] (snitch_ptr& local_s) {
        if (this_shard_id() != io_cpu_id()) {
            local_s->set_my_dc(_my_dc);
            local_s->set_my_rack(_my_rack);
        }
    });
}

future<> azure_snitch::start() {
    return load_config().then(...);
}

It sets _my_rack and _my_dc on each shard, then each shard goes and re-sets the same fields on the other shards (set_my_... re-sets the corresponding _my_... field) with values read from its own this. No wonder it crashes.

@xemul
Contributor

xemul commented May 17, 2022

In other words:

shard-1                shard-2                shard-3
_my_dc = foo;          _my_dc = foo;          _my_dc = foo;
invoke_on_all                                 invoke_on_all
            \--------> [my_dc = 1._my_dc]     |
1._my_dc = 3._my_dc <-------------------------/
old._my_dc.~sstring()  ...
                       2._my_dc = [my_dc]   <- the my_dc is already dead
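
The interleaving in the diagram can be replayed deterministically on one thread. This is only a model, not Scylla code: `shard_state` and `borrowed_buffer_is_dead` are hypothetical names, and a `std::shared_ptr` stands in for the real string field purely so that the model can observe when the old buffer is destroyed without actually touching freed memory.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for one shard's snitch state. The shared_ptr is a
// modelling device that lets us detect destruction of the old buffer.
struct shard_state {
    std::shared_ptr<const std::string> my_dc;
};

// Replays the interleaving from the diagram on one thread.
// Returns true if the buffer borrowed by the task queued for shard-2 is
// already destroyed by the time that task runs.
bool borrowed_buffer_is_dead() {
    std::vector<shard_state> shards(3);
    for (auto& s : shards) {
        s.my_dc = std::make_shared<const std::string>("foo");
    }

    // shard-1 queues a task for shard-2 that refers to shard-1's _my_dc
    // without owning it (the [this] capture in the original lambda).
    std::weak_ptr<const std::string> borrowed = shards[0].my_dc;

    // shard-3's task runs on shard-1 first and replaces shard-1's _my_dc,
    // which destroys the old buffer.
    shards[0].my_dc = std::make_shared<const std::string>(*shards[2].my_dc);

    // shard-2 finally runs shard-1's task: the borrowed buffer is gone.
    return borrowed.expired();
}
```

In the model the function returns true. In the real process the equivalent read is a copy out of freed memory, which matches a fault inside `memmove` called from `set_my_dc`.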

@slivne
Copy link
Contributor

slivne commented May 19, 2022

@xemul, are you sending a patch to fix this?

If we run a machine and give you access to it, would that help?

@xemul
Contributor

xemul commented May 19, 2022

@slivne, I can prepare a patch, yes. I don't think I need a machine: it's a race plus a use-after-free, which from my point of view would be hard to trigger intentionally.

@xemul xemul assigned xemul and unassigned asias May 19, 2022
xemul added a commit to xemul/scylla that referenced this issue May 19, 2022
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All, but azure really do so.
The azure one gets dc/rack on all shards, which's excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: scylladb#10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
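
The pattern the commit message describes (one shard obtains dc/rack, then replicates it to the others) can be sketched as follows. This is not the actual patch: `shard_snitch`, `invoke_on_all`, and `load_and_replicate` are hypothetical stand-ins, and the plain loop only imitates what Seastar would run on separate reactor threads.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-shard snitch state (not the real Scylla classes).
struct shard_snitch {
    std::string my_dc;
    std::string my_rack;
};

// Simplified stand-in for sharded<T>::invoke_on_all(): runs fn once per
// shard. Here it is a plain loop; in Seastar each call would run on its own
// shard's reactor thread.
void invoke_on_all(std::vector<shard_snitch>& shards,
                   const std::function<void(shard_snitch&)>& fn) {
    for (auto& s : shards) {
        fn(s);
    }
}

// Only the coordinator shard learns dc/rack, then pushes copies to the rest.
void load_and_replicate(std::vector<shard_snitch>& shards,
                        std::size_t coordinator,
                        const std::string& region,
                        const std::string& zone) {
    assert(coordinator < shards.size());
    shards[coordinator].my_dc = region;
    shards[coordinator].my_rack = zone;

    // Capture copies by value so the task never reads another shard's fields.
    std::string dc = shards[coordinator].my_dc;
    std::string rack = shards[coordinator].my_rack;
    invoke_on_all(shards, [dc, rack](shard_snitch& local) {
        local.my_dc = dc;
        local.my_rack = rack;
    });
}
```

Because the lambda owns its own copies of the two strings, no task dereferences state that another shard may overwrite, and there is a single broadcaster instead of every shard writing to every other shard.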
@slivne
Contributor

slivne commented May 25, 2022

@xemul ping

@xemul
Contributor

xemul commented May 25, 2022

[PATCH 0/3] Fix snitching on Azure sent 20.05.2022

avikivity pushed a commit that referenced this issue Jul 17, 2022
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All, but azure really do so.
The azure one gets dc/rack on all shards, which's excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc8)
avikivity pushed a commit that referenced this issue Jul 17, 2022
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All, but azure really do so.
The azure one gets dc/rack on all shards, which's excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc8)
@avikivity
Member

Backported to 4.6, 5.0.

@avikivity
Member

Strangely, no "Backport candidate" label to remove.
