
Repair task from manager failed due to coredump on one of the node #8059

Closed
aleksbykov opened this issue Feb 10, 2021 · 12 comments

@aleksbykov
Contributor

Installation details
Scylla version (or git commit hash): 4.5.dev-0.20210204.7f3083739 with build-id 4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0edd71dfeca9e0df2 (aws: eu-north-1)

Test id: 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Test: longevity-50gb-3days
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

====================================
The job failed after 22 hours. The following nemesis actions had run:

  • ModifyTablePropertiesDefaultTimeToLive
  • ServerSslHotReloadingNemesis
  • ServerSslHotReloadingNemesis
  • NoCorruptRepair
  • Enospc
  • ShowTopPartitions
  • StopStartService
  • AddDropColumnMonkey
  • GrowCluster
  • ShrinkCluster
  • DecommissionStreamingErr
  • RestartNodeWithResharding
  • MajorCompaction
  • MemoryStress
  • SoftRebootNode
  • MajorCompaction
  • Drainer
  • DecommissionStreamingErr
  • ManagementBackupWithSpecificKeyspaces
  • StopStartNetworkInterfaces
  • Enospc
  • Decommission
  • TerminateAndRemoveNodeMonkey
  • RebuildStreamingErr
  • RejectNodeExporterNetwork
  • StopStartService
  • ServerSslHotReloadingNemesis
  • DecommissionStreamingErr
  • StopStartService
  • SoftRebootNode
  • SnapshotOperations
  • RepairStreamingErr
  • AddDropColumnMonkey
  • MajorCompaction
  • NodetoolCleanupMonkey
  • NodetoolCleanupMonkey
  • NodetoolCleanupMonkey
  • NodetoolCleanupMonkey
  • NodetoolCleanupMonkey
  • NodetoolCleanupMonkey
  • GrowCluster
  • ShrinkCluster

After that, the next nemesis, ManagementRepair, started. A repair task was created and repair started on the cluster:
< t:2021-02-06 00:39:32,711 f:cli.py l:495 c:sdcm.mgmt.cli p:DEBUG > Created task id is: repair/0d092ce1-dd22-4ca9-863f-09aef4130160

During the repair, the following error and coredump occurred on node2:

2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !ERR     | scylla: [shard 3] mutation_reader - maybe_validate_partition_start(): validation failed, expected partition with key that falls into current range ({5601512449687021828, pk{000a4d50383237324e4c4b30}}, {5601512879385889061, start}], but got {key: pk{000a4e384b4d4b3330303630}, token:5605976446640575917}, at: 0x3ccd24e 0x3ccd6f0 0x3ccda98 0x390dccd 0x13d9800 0x141ec32 0x3942d7f 0x3943f67 0x3962518 0x390e6ca /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25, seastar::future<void>::then_impl_nrvo<evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25, seastar::future<void> >(evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_25&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26, seastar::future<void>::then_impl_nrvo<evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26, seastar::future<void> >(evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_26&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, 
std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_27::operator()(flat_mutation_reader&)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, seastar::future>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::internal::do_with_state<std::tuple<flat_mutation_reader>, seastar::future<void> >#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<(anonymous namespace)::remote_fill_buffer_result>, (anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33::operator()()::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, (anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33::operator()()::{lambda()#1}<(anonymous namespace)::remote_fill_buffer_result> >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<(anonymous namespace)::remote_fill_buffer_result>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<(anonymous namespace)::shard_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_33>::run_and_dispose()::{lambda(auto:1)#1}, seastar::future<(anonymous namespace)::remote_fill_buffer_result>::then_wrapped_nrvo<void, {lambda(auto:1)#1}>({lambda(auto:1)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1)#1}&, seastar::future_state<seastar::future>&&)#1}, seastar::future>
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: Aborting on shard 3.
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: Backtrace:
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3930048
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3961fe2
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x7fd12be791df
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x3d9d4
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x268a3
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x390dcef
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x13d9800
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x141ec32
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3942d7f
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3943f67
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x3962518
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: 0x390e6ca
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libpthread.so.0+0x93f8
2021-02-06T01:16:38+00:00  longevity-tls-50gb-3d-master-db-node-2b6831c3-2 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x101902

Decoded backtrace

[scyllaadm@longevity-tls-50gb-3d-master-db-node-2b6831c3-1 ~]$ addr2line -Cpife /usr/lib/debug/opt/scylladb/libexec/scylla-4.5.dev-0.20210204.7f3083739.x86_64.debug 0x3930048 0x3961fe2 0x7fd12be791df /opt/scylladb/libreloc/libc.so.6+0x3d9d4 /opt/scylladb/libreloc/libc.so.6+0x268a3 0x390dcef 0x13d9800 0x141ec32 0x3942d7f 0x3943f67 0x3962518 0x390e6ca /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:753
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:783
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:795
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:3578
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3560
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:3556
?? ??:0
?? ??:0
?? ??:0
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:39
void require<char [31], nonwrapping_interval<dht::ring_position>, dht::decorated_key>(bool, char const*, char const (&) [31], nonwrapping_interval<dht::ring_position> const&, dht::decorated_key const&) at ./mutation_reader.cc:1224
 (inlined by) evictable_reader::maybe_validate_partition_start(seastar::circular_buffer<mutation_fragment, tracking_allocator<mutation_fragment> > const&) at ./mutation_reader.cc:1258
operator() at ./mutation_reader.cc:1343
 (inlined by) void std::__invoke_impl<void, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(std::__invoke_other, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::__invoke_result<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>::type std::__invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:95
 (inlined by) std::invoke_result<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>::type std::invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/functional:88
 (inlined by) auto seastar::internal::future_invoke<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::internal::monostate>(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::internal::monostate&&) at ././seastar/include/seastar/core/future.hh:1209
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1582
 (inlined by) void seastar::futurize<void>::satisfy_with_result_of<seastar::future<void> seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&) at ././seastar/include/seastar/core/future.hh:2117
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1575
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> seastar::future<void>::then_impl_nrvo<evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23, seastar::future<void> >(evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, evictable_reader::do_fill_buffer(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_23&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2247
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2656
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2815
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4007
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(std::__invoke_other, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>, void>::type std::__invoke_r<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:110
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:291
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:622
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
?? ??:0
?? ??:0

Coredump

2021-02-06 01:48:09.818: (CoreDumpEvent Severity.ERROR) node=Node longevity-tls-50gb-3d-master-db-node-2b6831c3-2 [13.49.223.11 | 10.0.1.26] (seed: False)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz
backtrace=           PID: 97555 (scylla)
           UID: 997 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Sat 2021-02-06 01:16:38 UTC (30min ago)
  Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
    Executable: /opt/scylladb/libexec/scylla
 Control Group: /
       Boot ID: 8eab59a8978e4506963dded3f5e56fdc
    Machine ID: cc2c86fe566741e6a2ff6d399c5d5daa
      Hostname: longevity-tls-50gb-3d-master-db-node-2b6831c3-2
      Coredump: /var/lib/systemd/coredump/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000
       Message: Process 97555 (scylla) of user 997 dumped core.
                
                Stack trace of thread 97558:
                #0  0x00007fd12b15b9d5 raise (libc.so.6)
                #1  0x00007fd12b144954 abort (libc.so.6)
                #2  0x000000000390dcf0 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
                #3  0x00000000013d9801 _ZN16evictable_reader30maybe_validate_partition_startERKN7seastar15circular_bufferI17mutation_fragment18tracking_allocatorIS2_EEE (scylla)
                #4  0x000000000141ec33 _ZN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZN16evictable_reader14do_fill_bufferER20flat_mutation_readerNSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEE4$_23ZNS_6futureIvE14then_impl_nrvoISF_SH_EET0_OT_EUlOS3_RSF_ONS_12future_stateINS1_9monostateEEEE_vE15run_and_disposeEv (scylla)
                #5  0x0000000003942d80 _ZN7seastar7reactor14run_some_tasksEv (scylla)
                #6  0x0000000003943f68 _ZN7seastar7reactor3runEv (scylla)
                #7  0x0000000003962519 _ZNSt17_Function_handlerIFvvEZN7seastar3smp9configureEN5boost15program_options13variables_mapENS1_14reactor_configEE4$_97E9_M_invokeERKSt9_Any_data (scylla)
                #8  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #9  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #10 0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97571:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97570:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97573:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97572:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97569:
                #0  0x00007fd12be780fc read (libpthread.so.0)
                #1  0x0000000003984a76 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000003985130 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x000000000390e6cb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007fd12be6e3f9 start_thread (libpthread.so.0)
                #5  0x00007fd12b21f903 __clone (libc.so.6)
                
                Stack trace of thread 97557:
                #0  0x00000000010f3c12 _ZNK6schema21get_column_definitionERKN7seastar13basic_sstringIajLj31ELb0EEE (scylla)
                #1  0x00000000024dac9f _ZNK5boost9iterators6detail20iterator_facade_baseINS0_18transform_iteratorINS_12range_detail38d

download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.997.8eab59a8978e4506963dded3f5e56fdc.97555.1612574198000000.gz

This caused the repair task to fail:
2021-02-06 01:24:10.746: (DisruptionEvent Severity.ERROR): type=ManagementRepair subtype=end node=Node longevity-tls-50gb-3d-master-db-node-2b6831c3-7 [13.48.13.190 | 10.0.3.254] (seed: False) duration=2704 error=Task: repair/0d092ce1-dd22-4ca9-863f-09aef4130160 final status is: ERROR.

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Show all stored logs command: $ hydra investigate show-logs 2b6831c3-23d5-45bd-a7d3-10544ec41d9a

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_021033/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210206_021420-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_021033/grafana-screenshot-overview-20210206_021034-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022121/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210206_022439-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022121/grafana-screenshot-overview-20210206_022121-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/db-cluster-2b6831c3.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/loader-set-2b6831c3.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/monitor-set-2b6831c3.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/sct-runner-2b6831c3.zip

Jenkins job URL

@denesb
Contributor

denesb commented Feb 15, 2021

@bhalevy scylla-s3-reloc is not giving me anything for the build id (4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d). I checked the core and it is the correct build id.

@bhalevy
Member

bhalevy commented Feb 15, 2021

@bhalevy scylla-s3-reloc is not giving me anything for the build id (4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d). I checked the core and it is the correct build id.

@denesb apparently the artifacts path has changed.
Here it is: http://downloads.scylladb.com.s3.amazonaws.com/unstable/scylla/master/relocatable/2021-02-04T17%3A36%3A39Z/scylla-package.tar.gz

/cc @hagitsegev

@denesb
Contributor

denesb commented Feb 15, 2021

Thanks @bhalevy.

@denesb
Contributor

denesb commented Feb 15, 2021

One immediately strange observation is that the end bound of _range_override doesn't coincide with the end bound of _pr. Only the start bound of the former should move when the reader is recreated somewhere mid-stream.

(gdb) p $10->_range_override._M_payload._M_payload._M_value
$17 = ({5601512449687021828, 000a4d50383237324e4c4b30}, {5601512879385889061, -1}]
(gdb) p *$10->_pr
$18 = [{5605694793325590236, -1}, {5606016479013259557, -1}]

On second look, _range_override's end bound is smaller than _pr's start bound. Even stranger.
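
For illustration (stand-in types and bare tokens instead of ring positions, not the real Scylla code): a sketch of the assumed relationship between the two ranges, where recreating the reader only advances the start bound of _pr past the last emitted key, which is what makes the state above look wrong.

// Illustrative sketch only -- simplified stand-ins, not the actual Scylla types.
#include <cassert>
#include <cstdint>

struct token_range {
    int64_t start;   // exclusive
    int64_t end;     // inclusive
};

// When the reader is recreated mid-range, only the start bound is expected to
// advance past the last partition key already emitted; the end bound should
// stay identical to the original read range (_pr).
token_range narrow_after_recreation(const token_range& pr, int64_t last_emitted) {
    return token_range{last_emitted, pr.end};
}

int main() {
    token_range pr{100, 500};
    token_range override_range = narrow_after_recreation(pr, 250);
    assert(override_range.end == pr.end);   // the invariant the gdb output violates
}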

@denesb
Contributor

denesb commented Feb 15, 2021

_last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.
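
A minimal sketch of the fix implied here, using made-up stand-in types rather than the actual evictable_reader code: fast_forward_to() replaces _pr, so it must also clear the now-stale _range_override.

// Illustrative sketch only -- not the actual evictable_reader implementation.
#include <cstdint>
#include <iostream>
#include <optional>

struct token_range { int64_t start; int64_t end; };   // (start, end]

struct reader_state {
    token_range pr;                              // nominal read range (_pr)
    std::optional<token_range> range_override;   // engaged after a mid-range recreation

    // The range that recreation/validation should treat as definitive.
    const token_range& effective_range() const {
        return range_override ? *range_override : pr;
    }

    void fast_forward_to(token_range new_range) {
        pr = new_range;
        // The fix: fast-forwarding supersedes any earlier mid-range recreation,
        // so the stale override must be dropped.
        range_override.reset();
    }
};

int main() {
    reader_state r{{0, 100}, token_range{40, 100}};   // override left over from a recreation
    r.fast_forward_to({200, 300});
    const auto& eff = r.effective_range();
    std::cout << "(" << eff.start << ", " << eff.end << "]\n";   // prints (200, 300]
}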

@asias
Contributor

asias commented Feb 18, 2021

_last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.

Looking at c3b4c3f, just to be sure: so the issue is a false-positive validation failure, not a real issue?

@denesb
Contributor

denesb commented Feb 18, 2021

_last_pkey is disengaged so we are recreating the reader after fast-forwarding, which explains why _pr is a completely different range than _range_override. The bug is in fast_forward_to() which doesn't reset _range_override.

Looking at c3b4c3f, just to be sure: so the issue is a false-positive validation failure, not a real issue?

Yes, the validation is using the wrong range to validate the emitted partition, so it triggers a false-positive validation failure.
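
To make the false positive concrete, a hedged, simplified stand-in for the validation step (illustrative names and bare tokens; the real check is evictable_reader::maybe_validate_partition_start and works on ring positions): with a stale override engaged, the first partition of the fast-forwarded-to range fails the check even though it lies inside _pr.

// Illustrative sketch only -- a simplified stand-in for the validation step.
#include <cstdint>
#include <iostream>
#include <optional>
#include <stdexcept>

struct token_range { int64_t start; int64_t end; };   // (start, end]

bool contains(const token_range& r, int64_t token) {
    return token > r.start && token <= r.end;
}

// Validates the first partition produced after the reader was recreated against
// the range it is supposed to be reading. With a stale _range_override left over
// from before a fast-forward, a partition that is valid for _pr fails the check.
void validate_partition_start(const token_range& pr,
                              const std::optional<token_range>& range_override,
                              int64_t first_partition_token) {
    const token_range& effective = range_override ? *range_override : pr;
    if (!contains(effective, first_partition_token)) {
        throw std::runtime_error("expected partition with key that falls into current range");
    }
}

int main() {
    token_range pr{200, 300};                                  // fast-forwarded-to range
    std::optional<token_range> stale_override{token_range{40, 100}};
    try {
        validate_partition_start(pr, stale_override, 250);     // 250 is inside _pr
    } catch (const std::exception& e) {
        std::cout << "false positive: " << e.what() << '\n';
    }
}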

@slivne
Contributor

slivne commented Feb 21, 2021

@denesb do we need to backport this to 4.4 / 4.3?

@denesb
Contributor

denesb commented Feb 22, 2021

Yes, 4.3 already has the position self-validation.

@bhalevy
Member

bhalevy commented Jul 28, 2021

@avikivity please backport.
We're seeing maybe_validate_partition_start tripping in the field,
plus repair crashes that might be caused by this issue (although I don't have proof of this).

In any case, this would be a prerequisite for backporting the fixes for #8923 and #8893.

@slivne
Contributor

slivne commented Jul 28, 2021

@avikivity ping

@roydahan roydahan changed the title Repair task from manager failed due to coredumpt on one of the node Repair task from manager failed due to coredump on one of the node Oct 12, 2021
avikivity pushed a commit that referenced this issue Oct 13, 2021
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)
@avikivity
Member

Backported to 4.4 (was already in 4.5).

denesb added a commit to denesb/scylla that referenced this issue Oct 20, 2021
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: scylladb#8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)

Conflicts:
- Decoroutinize evictable_reader::fast_forward_to
- test_reader::next_partition: return void.
avikivity pushed a commit that referenced this issue Oct 27, 2021
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f)
avikivity pushed a commit that referenced this issue Oct 28, 2021
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f)
avikivity pushed a commit that referenced this issue Oct 28, 2021
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f)