Repair task from manager failed due to coredump on one of the node #8059

aleksbykov · 2021-02-10T08:26:59Z

Installation details
Scylla version (or git commit hash): 4.5.dev-0.20210204.7f3083739 with build-id 4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0edd71dfeca9e0df2 (aws: eu-north-1)

Test id: 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Test: longevity-50gb-3days
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

[longevity-50GB-3days-authorization-and-tls-ssl.yaml] (https://github.com/scylladb/scylla-cluster-tests/blob/ecdd3f558c42c5b938e0f2a2c8800004151d3173/test-cases/longevity/longevity-50GB-3days-authorization-and-tls-ssl.yaml)

Issue description

====================================
The job failed after 22 hours. The following nemesis actions ran before the failure:

ModifyTablePropertiesDefaultTimeToLive
ServerSslHotReloadingNemesis
ServerSslHotReloadingNemesis
NoCorruptRepair
Enospc
ShowTopPartitions
StopStartService
AddDropColumnMonkey
GrowCluster
ShrinkCluster
DecommissionStreamingErr
RestartNodeWithResharding
MajorCompaction
MemoryStress
SoftRebootNode
MajorCompaction
Drainer
DecommissionStreamingErr
ManagementBackupWithSpecificKeyspaces
StopStartNetworkInterfaces
Enospc
Decommission
TerminateAndRemoveNodeMonkey
RebuildStreamingErr
RejectNodeExporterNetwork
StopStartService
ServerSslHotReloadingNemesis
DecommissionStreamingErr
StopStartService
SoftRebootNode
SnapshotOperations
RepairStreamingErr
AddDropColumnMonkey
MajorCompaction
NodetoolCleanupMonkey
NodetoolCleanupMonkey
NodetoolCleanupMonkey
NodetoolCleanupMonkey
NodetoolCleanupMonkey
NodetoolCleanupMonkey
GrowCluster
ShrinkCluster

After that, the next nemesis, ManagementRepair, started. A repair task was created and the repair started on the cluster.
< t:2021-02-06 00:39:32,711 f:cli.py l:495 c:sdcm.mgmt.cli p:DEBUG > Created task id is: repair/0d092ce1-dd22-4ca9-863f-09aef4130160

During the repair, the following error and coredump occurred on node 2:

2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !ERR     | scylla: [shard 3] mutation_reader - maybe_validate_partition_start(): validation failed, expected partition with key that falls into current range ({5601512449687021828, pk{000a4d50383237324e4c4b30}}, {5601512879385889061, start}], but got {key: pk{000a4e384b4d4b3330303630}, token:5605976446640575917}, at: 0x3ccd24e 0x3ccd6f0 0x3ccda98 0x390dccd 0x13d9800 0x141ec32 0x3942d7f 0x3943f67 0x3962518 0x390e6ca /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25, seastar::future<void>::then_impl_nrvo<evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25, seastar::future<void> 
>(evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26, seastar::future<void>::then_impl_nrvo<evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26, seastar::future<void> >(evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_27::operator()(flat_mutation_reader&)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, seastar::future>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::internal::do_with_state<std::tuple<flat_mutation_reader>, seastar::future<void> 
>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<(anonymous namespace)::remote_fill_buffer_result>, (anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33::operator()()::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, (anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33::operator()()::{lambda()#1}<(anonymous namespace)::remote_fill_buffer_result> >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<(anonymous namespace)::remote_fill_buffer_result>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<(anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33>::run_and_dispose()::{lambda(auto:1)#1}, seastar::future<(anonymous namespace)::remote_fill_buffer_result>::then_wrapped_nrvo<void, {lambda(auto:1)#1}>({lambda(auto:1)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1)#1}&, seastar::future_state<seastar::future>&&)#1}, seastar::future>
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: Aborting on shard 3.
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: Backtrace:
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3930048
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3961fe2
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x7fd12be791df
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x3d9d4
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x268a3
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x390dcef
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x13d9800
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x141ec32
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3942d7f
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3943f67
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3962518
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x390e6ca
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libpthread.so.0+0x93f8
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x101902

Decoded backtrace

[scyllaadm@longevity-tls-50gb-3d-master-db-node-2b6831c3-1 ~]$ addr2line -Cpife /usr/lib/debug/opt/scylladb/libexec/scylla-4.5.dev-0.20210204.7f3083739.x86_64.debug 0x3930048 0x3961fe2 0x7fd12be791df /opt/scylladb/libreloc/libc.so.6+0x3d9d4 /opt/scylladb/libreloc/libc.so.6+0x268a3 0x390dcef 0x13d9800 0x141ec32 0x3942d7f 0x3943f67 0x3962518 0x390e6ca /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:753
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:783
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:795
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:3578
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3560
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:3556
?? ??:0
?? ??:0
?? ??:0
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:39
void require<char [31], nonwrapping_interval<dht::ring_position>, dht::decorated_key>(bool, char const*, char const (&) [31], nonwrapping_interval<dht::ring_position> const&, dht::decorated_key const&) at ./mutation_reader.cc:1224
 (inlined by) evictable_reader::maybe_validate_partition_start(seastar::circular_buffer<mutation_fragment, tracking_allocator<mutation_fragment> > const&) at ./mutation_reader.cc:1258
operator() at ./mutation_reader.cc:1343
 (inlined by) void std::__invoke_impl<void, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(std::__invoke_other, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::__invoke_result<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>::type std::__invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:95
 (inlined by) std::invoke_result<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>::type std::invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/functional:88
 (inlined by) auto seastar::internal::future_invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::internal::monostate>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::internal::monostate&&) at ././seastar/include/seastar/core/future.hh:1209
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1582
 (inlined by) void seastar::futurize<void>::satisfy_with_result_of<seastar::future<void> seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&) at ././seastar/include/seastar/core/future.hh:2117
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1575
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2247
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2656
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2815
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4007
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(std::__invoke_other, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>, void>::type std::__invoke_r<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:110
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:291
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:622
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
?? ??:0
?? ??:0
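
The `addr2line` invocation above takes its frame addresses from the `Backtrace:` lines in syslog. As a rough sketch of how those addresses could be collected automatically, the snippet below pulls them out of a log excerpt and assembles the same command. The helper names (`extract_frames`, `addr2line_command`) and the regular expression are illustrative assumptions, not part of any Scylla or SCT tooling.

```python
import re
import shlex

# Three lines copied from the syslog excerpt above; a real run would read the full log.
SAMPLE_LOG = """\
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: Backtrace:
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3930048
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x3d9d4
"""

# Debug symbols path, as used in the addr2line command above.
DEBUG_FILE = "/usr/lib/debug/opt/scylladb/libexec/scylla-4.5.dev-0.20210204.7f3083739.x86_64.debug"

# Hypothetical pattern: a bare hex address, or a library path followed by +0xOFFSET.
FRAME_RE = re.compile(r"scylla: (0x[0-9a-fA-F]+|/\S+\+0x[0-9a-fA-F]+)\s*$")


def extract_frames(log_text):
    """Return the backtrace frames found in the log, in order of appearance."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.group(1))
    return frames


def addr2line_command(frames, debug_file=DEBUG_FILE):
    """Build a shell-quoted addr2line command line for the given frames."""
    return shlex.join(["addr2line", "-Cpife", debug_file, *frames])


frames = extract_frames(SAMPLE_LOG)
print(addr2line_command(frames))
```

Lines such as `scylla: Backtrace:` or `scylla: Aborting on shard 3.` do not match the pattern, so only the frame lines contribute to the command.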

Coredump

2021-02-06 01:48:09.818: (CoreDumpEvent Severity.ERROR) node=Node longevity-tls-50gb-3d-master-db-node-2b6831c3-2 [13.49.223.11 | 10.0.1.26] (seed: False)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz
backtrace=           PID: 97555 (scylla)
           UID: 997 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Sat 2021-02-06 01:16:38 UTC (30min ago)
  Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
    Executable: /opt/scylladb/libexec/scylla
 Control Group: /
       Boot ID: 8eab59a8978e4506963dded3f5e56fdc
    Machine ID: cc2c86fe566741e6a2ff6d399c5d5daa
      Hostname: longevity-tls-50gb-3d-master-db-node-2b6831c3-2
      Coredump: /var/lib/systemd/coredump/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000
       Message: Process 97555 (scylla) of user 997 dumped core.
                
                Stack trace of thread 97558:
                #0  0x00007fd12b15b9d5 raise (libc.so.6)
                #1  0x00007fd12b144954 abort (libc.so.6)
                #2  0x000000000390dcf0 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
                #3  0x00000000013d9801 _ZN16evictable_reader30maybe_validate_partition_startERKN7seastar15circular_bufferI17mutation_fragment18tracking_allocatorIS2_EEE (scylla)
                #4  0x000000000141ec33 _ZN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZN16evictable_reader14do_fill_bufferER20flat_mutation_readerNSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEE4$_23ZNS_6futureIvE14then_impl_nrvoISF_SH_EET0_OT_EUlOS3_RSF_ONS_12future_stateINS1_9monostateEEEE_vE15run_and_disposeEv (scylla)
                #5  0x0000000003942d80 _ZN7seastar7reactor14run_some_tasksEv (scylla)
                #6  0x0000000003943f68 _ZN7seastar7reactor3runEv (scylla)
                #7  0x0000000003962519 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureEN5boost15program_options13variables_mapENS1_14reactor_configEE4$_97E9_M_invokeERKSt9_Any_data (scylla)
                #8  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #9  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #10 0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97571:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97570:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97573:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97572:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97569:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97557:
                #0  0x00000000010f3c12 _ZNK6schema21get_column_definitionERKN7seastar13basic_sstringIajLj31ELb0EEE (scylla)
                #1  0x00000000024dac9f _ZNK5boost9iterators6detail20iterator_facade_baseINS0_18transform_iteratorINS_12range_detail38d

download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz

This caused the repair task to fail:
2021-02-06 01:24:10.746: (DisruptionEvent Severity.ERROR): type=ManagementRepair subtype=end node=Node longevity-tls-50gb-3d-master-db-node-2b6831c3-7 [13.48.13.190 | 10.0.3.254] (seed: False) duration=2704 error=Task: repair/0d092ce1-dd22-4ca9-863f-09aef4130160 final status is: ERROR.

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Show all stored logs command: $ hydra investigate show-logs 2b6831c3-23d5-45bd-a7d3-10544ec41d9a

Jenkins job URL


denesb · 2021-02-15T08:03:43Z

@bhalevy scylla-s3-reloc is not giving me anything for the build id (4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d). I checked the core and it is the correct build id.

bhalevy · 2021-02-15T08:15:47Z

> @bhalevy scylla-s3-reloc is not giving me anything for the build id (4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d). I checked the core and it is the correct build id.

@denesb apparently the artifacts path has changed.
Here it is: http://downloads.scylladb.com.s3.amazonaws.com/unstable/scylla/master/relocatable/2021-02-04T17%3A36%3A39Z/scylla-package.tar.gz

/cc @hagitsegev

denesb · 2021-02-15T08:37:01Z

Thanks @bhalevy.

denesb · 2021-02-15T10:10:04Z

One immediately strange observation is that the end bound of _range_override doesn't coincide with the end bound of _pr. Only the start bound of the former should be moved as the reader is recreated somewhere mid-stream.

(gdb) p $10->_range_override._M_payload._M_payload._M_value
$17 = ({5601512449687021828, 000a4d50383237324e4c4b30}, {5601512879385889061, -1}]
(gdb) p *$10->_pr
$18 = [{5605694793325590236, -1}, {5606016479013259557, -1}]

On second look _range_override's end bound is smaller than _pr's start bound. Even stranger.
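
To make the mismatch concrete, here is a small token-only check using the values from the log line and the gdb output above. It is a simplified sketch: it compares tokens alone and ignores the key and weight components of the bounds, which is enough here because no token coincides with a bound. The function name `token_in_range` is made up for illustration.

```python
# Token values copied from the log line and the gdb output in this issue.
RANGE_OVERRIDE = (5601512449687021828, 5601512879385889061)  # _range_override
PR = (5605694793325590236, 5606016479013259557)              # _pr
PARTITION_TOKEN = 5605976446640575917                        # token of the rejected partition


def token_in_range(token, token_range):
    """Token-only containment check; ignores key/weight tie-breaking at the bounds."""
    start, end = token_range
    return start <= token <= end


in_override = token_in_range(PARTITION_TOKEN, RANGE_OVERRIDE)
in_pr = token_in_range(PARTITION_TOKEN, PR)
ranges_disjoint = RANGE_OVERRIDE[1] < PR[0]

print(f"inside _range_override: {in_override}")
print(f"inside _pr:             {in_pr}")
print(f"_range_override ends before _pr starts: {ranges_disjoint}")
```

The emitted partition lies inside `_pr` but outside `_range_override`, and the two ranges do not overlap at all.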

denesb · 2021-02-15T10:17:03Z

_last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.

asias · 2021-02-18T06:18:54Z

> _last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.

Looking at c3b4c3f, just to be sure, so the issue is a false-positive validation not a real issue.

denesb · 2021-02-18T06:26:17Z

> > _last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.
>
> Looking at c3b4c3f, just to be sure, so the issue is a false-positive validation not a real issue.

Yes, the validation is using the wrong range to validate the emitted partition, so it triggers a false-positive validation failure.
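
The sequence described in this thread can be reproduced with a toy model. The sketch below is not Scylla code: `ReaderModel` and its methods are hypothetical stand-ins, ranges are reduced to token pairs, and the initial range start (`5601000000000000000`) is invented for the example. It assumes the fix amounts to clearing the override when fast-forwarding, as the diagnosis above suggests.

```python
from typing import Optional, Tuple

TokenRange = Tuple[int, int]


class ReaderModel:
    """Toy stand-in for the reader state discussed above (not Scylla code)."""

    def __init__(self, pr: TokenRange, reset_override_on_ff: bool):
        self.pr = pr
        self.range_override: Optional[TokenRange] = None
        self.reset_override_on_ff = reset_override_on_ff

    def recreate_mid_stream(self, last_token: int) -> None:
        # Recreating mid-stream narrows the range to skip what was already read.
        self.range_override = (last_token, self.pr[1])

    def fast_forward_to(self, pr: TokenRange) -> None:
        self.pr = pr
        if self.reset_override_on_ff:
            # Assumed fix: the pre-fast-forward override no longer applies.
            self.range_override = None

    def validate_partition_start(self, token: int) -> bool:
        # When engaged, the override takes precedence over pr.
        start, end = self.range_override if self.range_override is not None else self.pr
        return start <= token <= end


def run(reset_override_on_ff: bool) -> bool:
    # The initial range start is a made-up value; the other tokens come from this issue.
    reader = ReaderModel((5601000000000000000, 5601512879385889061), reset_override_on_ff)
    reader.recreate_mid_stream(last_token=5601512449687021828)
    reader.fast_forward_to((5605694793325590236, 5606016479013259557))
    return reader.validate_partition_start(5605976446640575917)


print("validation passes without reset:", run(False))
print("validation passes with reset:   ", run(True))
```

Without the reset, the stale override rejects a partition that is valid for the fast-forwarded range; with the reset, validation uses `pr` and the same partition passes.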

slivne · 2021-02-21T12:07:06Z

@denesb do we need to backport this to 4.4 / 4.3 ?

denesb · 2021-02-22T07:17:24Z

Yes, 4.3 already has the position self validation.

bhalevy · 2021-07-28T07:26:04Z

@avikivity please backport.
We're seeing maybe_validate_partition_start tripping in the field
plus repair crashes that might be caused by this issue (although I don't have proof for this).

In any case that would be a prerequisite for backporting the fix for #8923, #8893

slivne · 2021-07-28T12:07:36Z

@avikivity ping

`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading; however, it currently doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)

avikivity · 2021-10-13T12:40:58Z

Backported to 4.4 (was already in 4.5).

`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading; however, it currently doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: scylladb#8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)

Conflicts:
- Decoroutinize evictable_reader::fast_forward_to
- test_reader::next_partition: return void.

`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading; however, it currently doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f)

slivne assigned bhalevy and denesb Feb 11, 2021

slivne added bug showstopper labels Feb 11, 2021

slivne added this to the 4.5 milestone Feb 11, 2021

slivne mentioned this issue Feb 14, 2021

Unexpected node restart during adding new DC #7982

Closed

scylladb-promoter closed this as completed in c3b4c3f Feb 17, 2021

scylladb-promoter added the Backport candidate label Feb 17, 2021

roydahan changed the title ~~Repair task from manager failed due to coredumpt on one of the node~~ Repair task from manager failed due to coredump on one of the node Oct 12, 2021

avikivity removed the Backport candidate label Oct 13, 2021
