Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coredumps during repair on large partitions ('!_snapshot->is_locked()' failed) #5327

Closed
juliayakovlev opened this issue Nov 21, 2019 · 3 comments
Closed
Assignees
Milestone

Comments

@juliayakovlev
Copy link

@juliayakovlev juliayakovlev commented Nov 21, 2019

Installation details
Scylla version (or git commit hash): 3.1.1-0.20191024.f32ec885c
Cluster size: 4 nodes (c5d.2xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-086808f78d81dc1e0 (eu-west-1)
Detected by: Staging-unclosed-partition (based on longevity-large-partition)

Test creates 10 large partitions (~ 48G each partition, 10M rows) using scylla-bench:

"scylla-bench -workload=sequential -mode=write -max-rate=300 -replication-factor=3 -partition-count=10 -clustering-row-count=10000000 -clustering-row-size=5120 -concurrency=7 -rows-per-request=10",
"scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=10 -clustering-row-count=10000000 -clustering-row-size=5120 -rows-per-request=10 -concurrency=7 -max-rate=32000 -duration=5760m"                    

In parallel runs repair (not corrupt) nemesis.

When repair was run in the 6th time (it was on the node2), node1 got aborting on shard with coredump

Aborting:

2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: [shard 6] compaction - Compacted 3 sstables to [/var/lib/scylla/data/scylla_bench/test-77d7b6800ba011eaadae000000000003/mc-854-big-Data.db:level=0, ]. 328111666 bytes to 31296
2309 (~95% of original) in 29068ms = 10.27MB/s. ~384 total partitions merged to 2.
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: [shard 6] compaction - Compacting [/var/lib/scylla/data/scylla_bench/test-77d7b6800ba011eaadae000000000003/mc-766-big-Data.db:level=0, /var/lib/scylla/data/scylla_bench/test-7
7d7b6800ba011eaadae000000000003/mc-854-big-Data.db:level=0, /var/lib/scylla/data/scylla_bench/test-77d7b6800ba011eaadae000000000003/mc-846-big-Data.db:level=0, ]
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 1] storage_proxy - Failed to apply mutation from 10.0.248.198#1: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 1] storage_proxy - Failed to apply mutation from 10.0.83.3#1: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !ERR     | scylla: [shard 2] storage_proxy - exception during mutation write to 10.0.109.137: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 7] storage_proxy - Failed to apply mutation from 10.0.250.109#7: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 0] storage_proxy - Failed to apply mutation from 10.0.248.198#0: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 6] storage_proxy - Failed to apply mutation from 10.0.248.198#6: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 2] storage_proxy - Failed to apply mutation from 10.0.83.3#2: std::bad_alloc (std::bad_alloc)
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.0.250.109#4: std::bad_alloc (std::bad_alloc)

2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: scylla: partition_version.cc:706: void partition_entry::evict(mutation_cleaner&): Assertion `!_snapshot->is_locked()' failed.
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: Aborting on shard 6.
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: Backtrace:
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000419fed2
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000408d1e5
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000408d4e5
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000408d593
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x00007f9ef156402f
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x000000000003853e
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x0000000000022894
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x0000000000022768
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x00000000000309f5
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x0000000000d80235
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x0000000000da0d21
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x0000000000da0eb5
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000045d05e
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x000000000045f942
2019-11-20T14:44:00+00:00  ip-10-0-109-137 !INFO    | scylla: 0x0000000000839561

Decoded backtrace:

void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /usr/include/boost/program_options/variables_map.hpp:146
seastar::backtrace_buffer::append_backtrace() at /usr/include/boost/program_options/variables_map.hpp:146
 (inlined by) print_with_backtrace at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/build/release/../../src/core/reactor.cc:1104
seastar::print_with_backtrace(char const*) at /usr/include/boost/program_options/variables_map.hpp:146
sigabrt_action at /usr/include/boost/program_options/variables_map.hpp:146
 (inlined by) operator() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/build/release/../../src/core/reactor.cc:5015
 (inlined by) _FUN at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/build/release/../../src/core/reactor.cc:5011
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
partition_entry::evict(mutation_cleaner&) at crtstuff.c:?
row_cache::invalidate_locked(dht::decorated_key const&) at crtstuff.c:?
 (inlined by) operator()<cache_entry*> at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:1092
 (inlined by) erase_and_dispose<row_cache::invalidate_locked(const dht::decorated_key&)::<lambda(auto:107&&)> > at /usr/include/boost/intrusive/bstree.hpp:1500
 (inlined by) row_cache::invalidate_locked(dht::decorated_key const&) at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:1094
row_cache::invalidate_sync(memtable&) at crtstuff.c:?
 (inlined by) with_linearized_managed_bytes<row_cache::invalidate_sync(memtable&)::<lambda()>::<lambda(memtable_entry*)>::<lambda()> > at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/./utils/managed_bytes.hh:428
 (inlined by) operator() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:888
 (inlined by) operator() at /usr/include/boost/intrusive/detail/node_cloner_disposer.hpp:96
 (inlined by) dispose_subtree<boost::intrusive::detail::node_disposer<row_cache::invalidate_sync(memtable&)::<lambda()>::<lambda(memtable_entry*)>, boost::intrusive::mhtraits<memtable_entry, boost::intrusive::set_member_hook<>, &memtable_entry::_link>, (boost::intrusive::algo_types)5> > at /usr/include/boost/intrusive/bstree_algorithms.hpp:2017
 (inlined by) clear_and_dispose<boost::intrusive::detail::node_disposer<row_cache::invalidate_sync(memtable&)::<lambda()>::<lambda(memtable_entry*)>, boost::intrusive::mhtraits<memtable_entry, boost::intrusive::set_member_hook<>, &memtable_entry::_link>, (boost::intrusive::algo_types)5> > at /usr/include/boost/intrusive/bstree_algorithms.hpp:615
 (inlined by) clear_and_dispose<row_cache::invalidate_sync(memtable&)::<lambda()>::<lambda(memtable_entry*)> > at /usr/include/boost/intrusive/bstree.hpp:1603
 (inlined by) operator() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:887
 (inlined by) with_allocator<row_cache::invalidate_sync(memtable&)::<lambda()> > at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/./utils/allocation_strategy.hh:307
 (inlined by) row_cache::invalidate_sync(memtable&) at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:882
seastar::future<> row_cache::do_update<row_cache::update(seastar::noncopyable_function<void ()>, memtable&)::{lambda(logalloc::allocating_section&, boost::intrusive::tree_iterator<boost::intrusive::mhtraits<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2>, void, void, void>, &cache_entry::_cache_link>, false>, memtable_entry&, std::function<partition_presence_checker_result (dht::decorated_key const&)>&, real_dirty_memory_accounter&)#1}>(seastar::noncopyable_function<void ()>, memtable&, row_cache::update(seastar::noncopyable_function<void ()>, memtable&)::{lambda(logalloc::allocating_section&, boost::intrusive::tree_iterator<boost::intrusive::mhtraits<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2>, void, void, void>, &cache_entry::_cache_link>, false>, memtable_entry&, std::function<partition_presence_checker_result (dht::decorated_key const&)>&, real_dirty_memory_accounter&)#1})::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#1}::operator()() const [clone .isra.1116] at row_cache.cc:?
seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::result_of<std::decay<seastar::future<> >::type ()>::type>::type seastar::async<seastar::future<> row_cache::do_update<row_cache::update(seastar::noncopyable_function<void ()>, memtable&)::{lambda(logalloc::allocating_section&, boost::intrusive::tree_iterator<boost::intrusive::mhtraits<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2>, void, void, void>, &cache_entry::_cache_link>, false>, memtable_entry&, std::function<partition_presence_checker_result (dht::decorated_key const&)>&, real_dirty_memory_accounter&)#1}>(seastar::noncopyable_function<void ()>, memtable&, row_cache::update(seastar::noncopyable_function<void ()>, memtable&)::{lambda(logalloc::allocating_section&, boost::intrusive::tree_iterator<boost::intrusive::mhtraits<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2>, void, void, void>, &cache_entry::_cache_link>, false>, memtable_entry&, std::function<partition_presence_checker_result (dht::decorated_key const&)>&, real_dirty_memory_accounter&)#1})::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, seastar::future<>&&, (std::decay&&)...)::{lambda(std::result_of<std::decay<seastar::future<> >::type ()> seastar::async<{lambda()#1}>(std::result_of<std::decay<seastar::future<> >::type ()>::type, seastar::futurize<std::result_of<std::decay<seastar::future<> >::type ()>::type>, seastar::thread_attributes)::work&)#1}::operator()(std::decay&&)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) [clone .cold.1633] at row_cache.cc:?
 (inlined by) ~deferred_action at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/util/defer.hh:44
 (inlined by) operator() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/row_cache.cc:931
 (inlined by) apply at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/core/apply.hh:35
 (inlined by) apply<row_cache::do_update(row_cache::external_updater, memtable&, Updater) [with Updater = row_cache::update(row_cache::external_updater, memtable&)::<lambda(logalloc::allocating_section&, boost::intrusive::set<cache_entry, boost::intrusive::member_hook<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2> >, &cache_entry::_cache_link>, boost::intrusive::constant_time_size<false>, boost::intrusive::compare<cache_entry::compare> >::iterator, memtable_entry&, partition_presence_checker&, real_dirty_memory_accounter&)>]::<lambda()>::<lambda()> > at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/core/apply.hh:43
 (inlined by) apply_tuple<row_cache::do_update(row_cache::external_updater, memtable&, Updater) [with Updater = row_cache::update(row_cache::external_updater, memtable&)::<lambda(logalloc::allocating_section&, boost::intrusive::set<cache_entry, boost::intrusive::member_hook<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2> >, &cache_entry::_cache_link>, boost::intrusive::constant_time_size<false>, boost::intrusive::compare<cache_entry::compare> >::iterator, memtable_entry&, partition_presence_checker&, real_dirty_memory_accounter&)>]::<lambda()>::<lambda()> > at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/core/future.hh:1346
 (inlined by) apply<row_cache::do_update(row_cache::external_updater, memtable&, Updater) [with Updater = row_cache::update(row_cache::external_updater, memtable&)::<lambda(logalloc::allocating_section&, boost::intrusive::set<cache_entry, boost::intrusive::member_hook<cache_entry, boost::intrusive::set_member_hook<boost::intrusive::link_mode<(boost::intrusive::link_mode_type)2> >, &cache_entry::_cache_link>, boost::intrusive::constant_time_size<false>, boost::intrusive::compare<cache_entry::compare> >::iterator, memtable_entry&, partition_presence_checker&, real_dirty_memory_accounter&)>]::<lambda()>::<lambda()> > at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/core/future.hh:1380
 (inlined by) operator() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/core/thread.hh:324
 (inlined by) call at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/include/seastar/util/noncopyable_function.hh:71
seastar::noncopyable_function<void ()>::operator()() const at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/build/release/../../include/seastar/util/noncopyable_function.hh:145
 (inlined by) seastar::thread_context::main() at /jenkins/workspace/scylla-3.1/relocatable-pkg/scylla/seastar/build/release/../../src/core/thread.cc:317
?? ??:0

Coredump download instruction:

gsutil cp gs://upload.scylladb.com/core.scylla.996.8e1db1b599dd4fbbb1e6da07b8d9e8dc.2671.1574261040000000/core.scylla.996.8e1db1b599dd4fbbb1e6da07b8d9e8dc.2671.1574261040000000.gz .gsutil 

After a few time same coredump got again on the node1, when repair was run on node3
Coredump download instruction:

cp gs://upload.scylladb.com/core.scylla.996.8e1db1b599dd4fbbb1e6da07b8d9e8dc.15335.1574263815000000/core.scylla.996.8e1db1b599dd4fbbb1e6da07b8d9e8dc.15335.1574263815000000.gz .

node1 log:
database.log

Restore Monitor Stack command: hydra investigate show-monitor 16537e28-9587-4f50-a0a9-c4fc26688bfd

Show all stored logs command: hydra investigate show-logs 16537e28-9587-4f50-a0a9-c4fc26688bfd

@juliayakovlev

This comment has been minimized.

Copy link
Author

@juliayakovlev juliayakovlev commented Nov 21, 2019

Issue recreated with version 3.2.rc0-0.20191118.4dae72b2cd3

A lot of coredumps on 3 from 4 nodes.
The cluster is alive.

Aborting and seastar::broken_promise backtraces:

Nov 21 15:19:17 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 2] compaction - Compacting [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-226-big-Data.db:level=0, /var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-146-big-Data.db:level=0, ]
Nov 21 15:19:17 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] large_data - Writing a partition with too many rows [scylla_bench/test:6] (142560 rows)
Nov 21 15:19:17 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] compaction - Compacting [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-358-big-Data.db:level=0, /var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-342-big-Data.db:level=0, /var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-334-big-Data.db:level=0, ]
Nov 21 15:19:25 ip-10-0-255-133.eu-west-1.compute.internal sudo[7722]:   centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/coredumpctl --no-pager --no-legend
Nov 21 15:19:25 ip-10-0-255-133.eu-west-1.compute.internal sudo[7722]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 21 15:19:25 ip-10-0-255-133.eu-west-1.compute.internal sudo[7722]: pam_unix(sudo:session): session closed for user root
Nov 21 15:19:31 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] compaction - Compacted 2 sstables to [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-310-big-Data.db:level=0, ]. 1721135218 bytes to 1466689131 (~85% of original) in 215512ms = 6.49MB/s. ~256 total partitions merged to 2.
*********************************************************************
*********************************************************************
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.159.93#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.225.192#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.116.116#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.116.116#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.225.192#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - exception during mutation write to 10.0.255.133: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.225.192#3: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] storage_proxy - Failed to apply mutation from 10.0.225.192#6: seastar::broken_promise (broken promise)
Nov 21 15:19:43 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] seastar - Exceptional future ignored: seastar::broken_promise (broken promise), backtrace:   0x2cbd18d
                                                                           0x2a20354
                                                                           0x17dcb76
                                                                           0x2a1db21
                                                                           0x2a1dd2f
                                                                           0x2a8603d
                                                                           0x2a9a66c
                                                                           0x2a0072d
                                                                           /opt/scylladb/libreloc/libpthread.so.0+0x85a1
                                                                           /opt/scylladb/libreloc/libc.so.6+0xfb302
Nov 21 15:19:50 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] compaction - Compacted 2 sstables to [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-235-big-Data.db:level=0, ]. 403736174 by
tes to 304151786 (~75% of original) in 36567ms = 7.93MB/s. ~256 total partitions merged to 1.
Nov 21 15:19:50 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 1] compaction - Compacting [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-225-big-Data.db:level=0, /var/lib/scylla/data/scylla_
bench/test-cc31aee00c7011eab011000000000004/mc-233-big-Data.db:level=0, ]
Nov 21 15:19:53 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 0] compaction - Compacted 2 sstables to [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-232-big-Data.db:level=0, ]. 405690742 by
tes to 299162313 (~73% of original) in 41205ms = 6.92MB/s. ~256 total partitions merged to 1.
Nov 21 15:19:55 ip-10-0-255-133.eu-west-1.compute.internal sudo[7807]:   centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/coredumpctl --no-pager --no-legend
Nov 21 15:19:55 ip-10-0-255-133.eu-west-1.compute.internal sudo[7807]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 21 15:19:55 ip-10-0-255-133.eu-west-1.compute.internal sudo[7807]: pam_unix(sudo:session): session closed for user root
Nov 21 15:19:55 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 2] compaction - Compacted 2 sstables to [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-242-big-Data.db:level=0, ]. 411348772 by
tes to 293298429 (~71% of original) in 41854ms = 6.68MB/s. ~256 total partitions merged to 1.
Nov 21 15:20:08 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 6] compaction - Compacted 2 sstables to [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-350-big-Data.db:level=0, ]. 827713197 by
tes to 767928218 (~92% of original) in 158639ms = 4.62MB/s. ~256 total partitions merged to 2.
Nov 21 15:20:09 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]:  [shard 3] compaction - Compacting [/var/lib/scylla/data/scylla_bench/test-cc31aee00c7011eab011000000000004/mc-251-big-Data.db:level=0, /var/lib/scylla/data/scylla_
bench/test-cc31aee00c7011eab011000000000004/mc-235-big-Data.db:level=0, ]
Nov 21 15:20:10 ip-10-0-255-133.eu-west-1.compute.internal sudo[7847]:   centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/coredumpctl --no-pager --no-legend
Nov 21 15:20:10 ip-10-0-255-133.eu-west-1.compute.internal sudo[7847]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 21 15:20:10 ip-10-0-255-133.eu-west-1.compute.internal sudo[7847]: pam_unix(sudo:session): session closed for user root
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: scylla: /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:299: void seastar::future_state_base::set_exception(std::__exception_ptr::
exception_ptr&&): Assertion `_u.st == state::future' failed.
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: Aborting on shard 6.
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: Backtrace:
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002aa6a42
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a213eb
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a216e5
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a21780
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x00007f2f9ef63e7f
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libc.so.6+0x0000000000037e74
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libc.so.6+0x0000000000022894
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libc.so.6+0x0000000000022768
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libc.so.6+0x0000000000030565
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a206c5
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x00000000017eceac
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x00000000017b48c1
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x00000000017c6df4
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x00000000017c9116
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a1db21
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a1dd2f
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a8603d
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a9a66c
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: 0x0000000002a0072d
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libpthread.so.0+0x00000000000085a1
Nov 21 15:20:11 ip-10-0-255-133.eu-west-1.compute.internal scylla[2970]: /opt/scylladb/libreloc/libc.so.6+0x00000000000fb302
Nov 21 15:20:14 ip-10-0-255-133.eu-west-1.compute.internal dhclient[1671]: PRC: Renewing lease on eth0.
Nov 21 15:20:14 ip-10-0-255-133.eu-west-1.compute.internal dhclient[1671]: XMT: Renew on eth0, interval 9770ms.
Nov 21 15:20:14 ip-10-0-255-133.eu-west-1.compute.internal dhclient[1671]: RCV: Reply message on eth0 from fe80::8ad:ecff:fe4b:7644.

Aborting backtrace with decoding:

0x0000000002aa6a42
0x0000000002a213eb
0x0000000002a216e5
0x0000000002a21780
0x00007f2f9ef63e7f
/opt/scylladb/libreloc/libc.so.6+0x0000000000037e74
/opt/scylladb/libreloc/libc.so.6+0x0000000000022894
/opt/scylladb/libreloc/libc.so.6+0x0000000000022768
/opt/scylladb/libreloc/libc.so.6+0x0000000000030565
0x0000000002a206c5
0x00000000017eceac
0x00000000017b48c1
0x00000000017c6df4
0x00000000017c9116
0x0000000002a1db21
0x0000000002a1dd2f
0x0000000002a8603d
0x0000000002a9a66c
0x0000000002a0072d
/opt/scylladb/libreloc/libpthread.so.0+0x00000000000085a1
/opt/scylladb/libreloc/libc.so.6+0x00000000000fb302


void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::print_with_backtrace(seastar::backtrace_buffer&) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::print_with_backtrace(char const*) at /usr/include/c++/9/ext/new_allocator.h:105
void seastar::install_oneshot_signal_handler<6, &seastar::sigabrt_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /usr/include/c++/9/ext/new_allocator.h:105
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
seastar::future_state_base::set_to_broken_promise() at /usr/include/c++/9/ext/new_allocator.h:105
seastar::shared_future<seastar::with_clock<seastar::lowres_clock> >::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::repeat_until_value_type_helper<seastar::futurize<std::result_of<db::commitlog::segment_manager::active_segment(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1} ()>::type>::type>::future_type seastar::repeat_until_value<db::commitlog::segment_manager::active_segment(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(std::result_of) at /usr/include/c++/9/ext/new_allocator.h:105
addr2line: Dwarf Error: Could not find abbrev number 88.
_ZN7seastar8futurizeINS_6futureIJN2db9rp_handleEEEEE5applyIZNS2_9commitlog15segment_manager22allocate_when_possibleERKN5utils4UUIDENS_10shared_ptrINS7_12entry_writerEEENSt6chrono10time_pointINS_12lowres_clockENSG_8durationIlSt5ratioILl1ELl1000EEEEEEEUlT_E_JNS_15semaphore_unitsINS_33default_timeout_exception_factoryESI_EEEEES4_OSO_OSt5tupleIJDpT0_EE at /usr/include/c++/9/ext/new_allocator.h:105
_ZN7seastar12continuationIZZNS_6futureIJNS_15semaphore_unitsINS_33default_timeout_exception_factoryENS_12lowres_clockEEEEE9then_implIZN2db9commitlog15segment_manager22allocate_when_possibleERKN5utils4UUIDENS_10shared_ptrINS9_12entry_writerEEENSt6chrono10time_pointIS4_NSI_8durationIlSt5ratioILl1ELl1000EEEEEEEUlT_E_NS1_IJNS8_9rp_handleEEEEEET0_OSP_ENKUlvE_clEvEUlONS_12future_stateIJS5_EEEE_JS5_EE15run_and_disposeEv at /usr/include/c++/9/ext/new_allocator.h:105
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::reactor::run_some_tasks() [clone .part.0] at /usr/include/c++/9/ext/new_allocator.h:105
seastar::reactor::run() at /usr/include/c++/9/ext/new_allocator.h:105
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::{lambda()#3}::operator()() const at /usr/include/c++/9/ext/new_allocator.h:105
seastar::posix_thread::start_routine(void*) at /usr/include/c++/9/bits/std_function.h:690

broken_promise backtrace with decoding:

0x2cbd18d
0x2a20354
0x17dcb76
0x2a1db21
0x2a1dd2f
0x2a8603d
0x2a9a66c
0x2a0072d
/opt/scylladb/libreloc/libpthread.so.0+0x85a1
/opt/scylladb/libreloc/libc.so.6+0xfb302

seastar::current_backtrace() at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:299
seastar::report_failed_future(std::__exception_ptr::exception_ptr const&) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::continuation<seastar::future<> seastar::future<>::then_wrapped_impl<seastar::shared_future<seastar::with_clock<seastar::lowres_clock> >::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}, seastar::future<> >(seastar::shared_future<seastar::with_clock<seastar::lowres_clock> >::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(seastar::future<>&&)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::future_state<>&&)#1}>::run_and_dispose() at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:942
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /usr/include/c++/9/ext/new_allocator.h:105
seastar::reactor::run_some_tasks() [clone .part.0] at /usr/include/c++/9/ext/new_allocator.h:105
seastar::reactor::run() at /usr/include/c++/9/ext/new_allocator.h:105
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::{lambda()#3}::operator()() const at /usr/include/c++/9/ext/new_allocator.h:105
seastar::posix_thread::start_routine(void*) at /usr/include/c++/9/bits/std_function.h:690
?? ??:0
?? ??:0

One of the coredumps:

           PID: 2970 (scylla)
           UID: 996 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Thu 2019-11-21 15:20:11 UTC (1min 44s ago)
  Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --developer-mode 1 --abort-on-internal-error true --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --cpuset 0-7
    Executable: /opt/scylladb/libexec/scylla
 Control Group: /
       Boot ID: 86918290d86840baab34609724aee6a6
    Machine ID: df877a200226bc47d06f26dae0736ec9
      Hostname: ip-10-0-255-133.eu-west-1.compute.internal
      Coredump: /var/lib/systemd/coredump/core.scylla.996.86918290d86840baab34609724aee6a6.2970.1574349611000000
       Message: Process 2970 (scylla) of user 996 dumped core.
                
                Stack trace of thread 2977:
                #0  0x00007f2f9e57fe75 raise (libc.so.6)
                #1  0x00007f2f9e56a95e abort (libc.so.6)
                #2  0x00007f2f9e56a769 __assert_fail_base.cold (libc.so.6)
                #3  0x00007f2f9e578566 __assert_fail (libc.so.6)
                #4  0x0000000002a206c6 _ZN7seastar17future_state_base13set_exceptionEONSt15__exception_ptr13exception_ptrE (scylla)
                #5  0x00000000017ecead _ZN7seastar13shared_futureIJNS_10with_clockINS_12lowres_clockEEEEE12shared_state10get_futureENSt6chrono10time_pointIS2_NS6_8durationIlSt5ratioILl1ELl1000EEEEEE (scylla)
                #6  0x00000000017b48c2 _ZN7seastar18repeat_until_valueIZN2db9commitlog15segment_manager14active_segmentENSt6chrono10time_pointINS_12lowres_clockENS4_8durationIlSt5ratioILl1ELl1000EEEEEEEUlvE_EENS_30repeat_until_value_type_helperINS_8futurizeINSt9result_ofIFT_vEE4typeEE4typeEE11future_typeESG_ (scylla)
                #7  0x00000000017c6df5 _ZN7seastar8futurizeINS_6futureIJN2db9rp_handleEEEEE5applyIZNS2_9commitlog15segment_manager22allocate_when_possibleERKN5utils4UUIDENS_10shared_ptrINS7_12entry_writerEEENSt6chrono10time_pointINS_12lowres_clockENSG_8durationIlSt5ratioILl1ELl1000EEEEEEEUlT_E_JNS_15semaphore_unitsINS_33default_timeout_exception_factoryESI_EEEEES4_OSO_OSt5tupleIJDpT0_EE (scylla)
                #8  0x00000000017c9117 _ZN7seastar12continuationIZZNS_6futureIJNS_15semaphore_unitsINS_33default_timeout_exception_factoryENS_12lowres_clockEEEEE9then_implIZN2db9commitlog15segment_manager22allocate_when_possibleERKN5utils4UUIDENS_10shared_ptrINS9_12entry_writerEEENSt6chrono10time_pointIS4_NSI_8durationIlSt5ratioILl1ELl1000EEEEEEEUlT_E_NS1_IJNS8_9rp_handleEEEEEET0_OSP_ENKUlvE_clEvEUlONS_12future_stateIJS5_EEEE_JS5_EE15run_and_disposeEv (scylla)
                #9  0x0000000002a1db22 _ZN7seastar7reactor9run_tasksERNS0_10task_queueE (scylla)
                #10 0x0000000002a1dd30 _ZN7seastar7reactor14run_some_tasksEv (scylla)
                #11 0x0000000002a8603e _ZN7seastar7reactor14run_some_tasksEv (scylla)
                #12 0x0000000002a9a66d _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
                #13 0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #14 0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #15 0x00007f2f9e643303 __clone (libc.so.6)
                
                Stack trace of thread 2981:
                #0  0x00007f2f9ef62bec read (libpthread.so.0)
                #1  0x0000000002e5cc36 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000002e5cf22 operator() (scylla)
                #3  0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #4  0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #5  0x00007f2f9e643303 __clone (libc.so.6)
                
                Stack trace of thread 2980:
                #0  0x00007f2f9ef62bec read (libpthread.so.0)
                #1  0x0000000002e5cc36 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000002e5cf22 operator() (scylla)
                #3  0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #4  0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #5  0x00007f2f9e643303 __clone (libc.so.6)
                
                Stack trace of thread 2982:
                #0  0x00007f2f9ef62bec read (libpthread.so.0)
                #1  0x0000000002e5cc36 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000002e5cf22 operator() (scylla)
                #3  0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #4  0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #5  0x00007f2f9e643303 __clone (libc.so.6)
                
                Stack trace of thread 2984:
                #0  0x00007f2f9ef62bec read (libpthread.so.0)
                #1  0x0000000002e5cc36 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000002e5cf22 operator() (scylla)
                #3  0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #4  0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #5  0x00007f2f9e643303 __clone (libc.so.6)
                
                Stack trace of thread 2979:
                #0  0x00007f2f9ef62bec read (libpthread.so.0)
                #1  0x0000000002e5cc36 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x0000000002e5cf22 operator() (scylla)
                #3  0x0000000002a0072e _ZNKSt8functionIFvvEEclEv (scylla)
                #4  0x00007f2f9ef595a2 start_thread (libpthread.so.0)
                #5  0x00007f2f9e643303 __clone (libc.so.6)

Coredump download instruction:

gsutil cp gs://upload.scylladb.com/core.scylla.996.86918290d86840baab34609724aee6a6.21791.1574350530000000/core.scylla.996.86918290d86840baab34609724aee6a6.21791.1574350530000000.gz .

gunzip /var/lib/systemd/coredump/core.scylla.996.86918290d86840baab34609724aee6a6.21791.1574350530000000.gz
@slivne slivne added bug repair labels Nov 21, 2019
@slivne slivne added this to the 3.2 milestone Nov 21, 2019
@slivne slivne added the Backport 3.1 label Nov 21, 2019
@tgrabiec tgrabiec assigned tgrabiec and unassigned asias and bhalevy Nov 21, 2019
@tgrabiec

This comment has been minimized.

Copy link
Contributor

@tgrabiec tgrabiec commented Nov 21, 2019

Introduced in 90d6c0b. Happens when cache update fails on std::bad_alloc, because we try to invalidate the cache before releasing the partition snapshot. Fix is on the way

avikivity added a commit that referenced this issue Nov 24, 2019
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d)
avikivity added a commit that referenced this issue Nov 24, 2019
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d)
avikivity added a commit that referenced this issue Nov 24, 2019
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d)
@avikivity

This comment has been minimized.

Copy link
Contributor

@avikivity avikivity commented Dec 4, 2019

Already backported, removing backport candidate label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.