coredump during c-s read performance test #2021

Closed
roydahan opened this issue Jan 18, 2017 · 28 comments
@roydahan

roydahan commented Jan 18, 2017

Installation details
Scylla version (or git commit hash): 1.5.0
Cluster size: 3 nodes, i2.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI (ami-0d73641a)

During ami-perf-regression testing, a coredump was generated in the test-read scenario.

The test uses 4 loaders and runs c-s read for 50 minutes; however, the coredump occurred during the write phase of the test, after about 1 hour.

Test Details:
First phase - Write:
cassandra-stress write no-warmup cl=QUORUM n=30000000 -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop seq=1..30000000

Second phase - Read:
cassandra-stress read no-warmup cl=QUORUM duration=50m -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop 'dist=gauss(1..30000000,15000000,1500000)'

Coredump location:
Coredump size is 56GB, zipped with pigz and split into 4GB chunks:
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xaa
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xab
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xac
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xad

Relevant section from the log:

2017-01-18 15:02:14,091 cluster          L0646 INFO | Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None): To download it, you may use 'curl --user-agent [user-agent] scylladb-users-upload.s3.amazonaws.com/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000 > core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000'
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            PID: 2486 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            UID: 994 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            GID: 1001 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):         Signal: 6 (ABRT)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):      Timestamp: Wed 2017-01-18 15:00:59 UTC (1min 14s ago)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):   Command Line: /usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --collectd-address=127.0.0.1:25826 --collectd=1 --collectd-poll-period 3000 --network-stack posix --num-io-queues 8 --max-io-requests 64 --cpuset 0,1,2,3,4,5,6,7
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):     Executable: /usr/bin/scylla
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):  Control Group: /system.slice/scylla-server.service
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):           Unit: scylla-server.service
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):          Slice: system.slice
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):        Boot ID: 4275204cab84413fae9599105c319a83
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):     Machine ID: 8ea3e6ea03b64d018847fcdf669935d4
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):       Hostname: ip-172-30-0-226
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):       Coredump: /var/lib/systemd/coredump/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):        Message: Process 2486 (scylla) of user 994 dumped core.
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 Stack trace of thread 2489:
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #0  0x00007fec73a975f7 raise (libc.so.6)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #1  0x00007fec73a98e28 abort (libc.so.6)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #2  0x00000000018702ad _ZN9__gnu_cxx27__verbose_terminate_handlerEv (scylla)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #3  0x00000000017da696 _ZN10__cxxabiv111__terminateEPFvvE (scylla)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #4  0x000000000186fff9 __cxa_call_terminate (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #5  0x00000000017d9fe5 __gxx_personality_v0 (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #6  0x00007fec75eeb903 n/a (libgcc_s.so.1)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #7  0x00007fec75eebe37 _Unwind_Resume (libgcc_s.so.1)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #8  0x00000000004f1e66 _ZN12future_stateIJEED4Ev (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #9  0x000000000097a2d5 operator() (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #10 0x0000000000987b65 _ZZN8logalloc12region_group6updateElENUlT_E_clIPS0_EEDaS1_ (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #11 0x0000000000988015 _ZN8logalloc12region_group18do_for_each_parentIZNS0_6updateElEUlT_E_EEPS0_S4_OS2_ (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #12 0x000000000098a97b _ZN8logalloc11region_impl7compactEPNS_7segmentE (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #13 0x000000000098214f _ZN8logalloc11region_impl7compactEv (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #14 0x00000000009839c6 _ZN8logalloc7tracker4impl7reclaimEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #15 0x0000000000983a47 _ZN8logalloc7tracker7reclaimEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #16 0x000000000058b336 _ZNKSt8functionIFN6memory17reclaiming_resultEvEEclEv (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #17 0x000000000058b39d _ZN6memory9cpu_pages31find_and_unlink_span_reclaimingEj (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #18 0x000000000058bb03 allocate_large_and_trim<memory::cpu_pages::allocate_large(unsigned int)::<lambda(unsigned int, unsigned int)> > (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #19 0x000000000058c27a _ZN6memory9cpu_pages14allocate_largeEj (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #20 0x000000000058cbed _ZN6memory8allocateEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #21 0x000000000058cd35 _Znam (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #22 0x000000000054f964 _ZN9__gnu_cxx13new_allocatorISt10unique_ptrI4taskSt14default_deleteIS2_EEE8allocateEmPKv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #23 0x000000000054faad _ZN15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS1_EESaIS4_EE12maybe_expandEm (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #24 0x0000000000554235 _ZN7promiseIJEE10make_readyILNS0_6urgentE1EEEvv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #25 0x000000000097dd69 operator()<future_state<> > (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #26 0x00000000004e487f _ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #27 0x000000000052af0b _ZN7reactor3runEv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #28 0x000000000053caf1 _ZZN3smp9configureEN5boost15program_options13variables_mapEENKUlvE1_clEv.constprop.3780 (scylla)


@roydahan
Author

System is still alive and marked as keep=alive in case someone wants to drill down.

@avikivity
Member

avikivity commented Jan 18, 2017 via email

@benoit-canet

benoit-canet commented Jan 18, 2017 via email

@benoit-canet

benoit-canet commented Jan 18, 2017 via email

@roydahan
Author

Already uploaded it.
Please see the coredump location in the issue description.

@tgrabiec
Contributor

@roydahan How do I extract the core file from the fragments? I joined the 4 fragments using cat but pigz refuses to decompress the result:

$ pigz -d za.xz
pigz: skipping: za.xz does not have compressed suffix

@tgrabiec
Contributor

Better backtrace:

_ZN12future_stateIJEED4Ev at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:314
 (inlined by) _ZN7promiseIJEED4Ev at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:429
 (inlined by) later() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:3739
operator() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:2096
 (inlined by) with_gate<logalloc::region_group::release_requests()::<lambda()> > at /usr/src/debug/scylla-1.5.0/seastar/core/gate.hh:120
 (inlined by) logalloc::region_group::release_requests() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:2119
_ZZN8logalloc12region_group6updateElENUlT_E_clIPS0_EEDaS1_ at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:262
_ZN8logalloc12region_group18do_for_each_parentIZNS0_6updateElEUlT_E_EEPS0_S4_OS2_ at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:416
 (inlined by) logalloc::region_group::update(long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:270
 (inlined by) logalloc::region_group::decrease_usage(boost::heap::detail::node_handle<boost::heap::detail::parent_pointing_heap_node<logalloc::region_impl*>*, boost::heap::detail::make_binomial_heap_base<logalloc::region_impl*, boost::parameter::aux::arg_list<boost::heap::constant_time_size<false>, boost::parameter::aux::arg_list<boost::heap::allocator<std::allocator<logalloc::region_impl*> >, boost::parameter::aux::arg_list<boost::heap::compare<logalloc::region_group::region_evictable_occupancy_ascending_less_comparator>, boost::parameter::aux::empty_arg_list> > > >::type, logalloc::region_impl*&>&, long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:290
logalloc::region_impl::compact(logalloc::segment*) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1228
 (inlined by) logalloc::region_impl::compact_single_segment_locked() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1472
logalloc::region_impl::compact() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1463
 (inlined by) logalloc::tracker::impl::compact_and_evict(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1901
logalloc::tracker::impl::reclaim(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1829
logalloc::tracker::reclaim(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:114
 (inlined by) logalloc::tracker::reclaim() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1600
 (inlined by) operator() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:107
 (inlined by) _M_invoke at /opt/scylladb/include/c++/5.3.1/functional:1857
std::function<memory::reclaiming_result ()>::operator()() const at /opt/scylladb/include/c++/5.3.1/functional:2271
 (inlined by) memory::reclaimer::do_reclaim() at /usr/src/debug/scylla-1.5.0/seastar/core/memory.hh:110
 (inlined by) memory::cpu_pages::run_reclaimers(memory::reclaimer_scope) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1010
memory::cpu_pages::find_and_unlink_span_reclaiming(unsigned int) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:564
allocate_large_and_trim<memory::cpu_pages::allocate_large(unsigned int)::<lambda(unsigned int, unsigned int)> > at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:578
memory::cpu_pages::allocate_large(unsigned int) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:623
 (inlined by) memory::allocate_large(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1173
memory::allocate(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1201
operator new[](unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1511
__gnu_cxx::new_allocator<std::unique_ptr<task, std::default_delete<task> > >::allocate(unsigned long, void const*) at /opt/scylladb/include/c++/5.3.1/ext/new_allocator.h:104
 (inlined by) circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::expand(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:262
circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::maybe_expand(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:292
 (inlined by) circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::push_back(std::unique_ptr<task, std::default_delete<task> >&&) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:341
void promise<>::make_ready<(promise<>::urgent)1>() at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:1133
 (inlined by) void promise<>::do_set_value<(promise<>::urgent)1>(std::tuple<>) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:495
 (inlined by) promise<>::set_urgent_value(std::tuple<>&&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:503
 (inlined by) future_state<>::forward_to(promise<>&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:1111
 (inlined by) future<>::forward_to(promise<>&&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:934
operator()<future_state<> > at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:872
 (inlined by) run at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:390
reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&) at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:1971
reactor::run() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:2404
operator() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:3344

Looks like later() throws an exception, probably because task allocation fails. We abort because region_group::release_requests() is noexcept. It has to be noexcept since it's called on the free path, in this case from a reclaimer.
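
To illustrate the failure mode, here is a minimal stand-alone sketch (not the actual Scylla code; schedule_wakeup_task() is a hypothetical stand-in for later()): an exception that escapes a noexcept function goes straight to std::terminate(), which aborts, matching the raise/abort/terminate frames at the top of the original trace.

#include <new>

void schedule_wakeup_task() {
    // Stand-in for later(): allocating the continuation task can throw
    // std::bad_alloc when memory is exhausted and the reclaimer cannot
    // be re-entered to free more.
    throw std::bad_alloc();
}

void release_requests() noexcept {
    // Must be noexcept: it runs on the free path, here from a reclaimer.
    // The exception thrown below cannot cross this boundary, so the
    // runtime calls std::terminate(), which aborts (SIGABRT, core dump).
    schedule_wakeup_task();
}

int main() {
    release_requests();
}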

@tgrabiec
Contributor

So, we're likely low on memory when this happens. Do we have logs/metrics for this run?

@tgrabiec
Contributor

tgrabiec commented Jan 20, 2017

Allocating from the reclaimer is dangerous, since there is no guarantee that the allocator will be able to satisfy the allocation without reclaiming (and we can't re-enter the reclaimer). This can happen, for example, if the small pool for the allocation is out of spare objects and needs to do a large allocation to repopulate.

I think the code calling later() should be refactored to use alloc-free waking instead.
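
A rough sketch of what alloc-free waking could look like (hypothetical names, not Seastar's actual API): instead of allocating a new task from inside the reclaimer, the reclaimer only flips a flag that was set up in advance, and the event loop checks it between tasks, so the wake-up path never touches the allocator and cannot throw. Within a single Seastar shard a plain bool would do; the atomic here just keeps the example self-contained.

#include <atomic>

struct wake_source {
    std::atomic<bool> armed{false};

    // Safe to call from a reclaimer: no allocation, no exceptions.
    void wake() noexcept { armed.store(true, std::memory_order_release); }

    // Polled by the event loop; returns true at most once per wake().
    bool take() noexcept { return armed.exchange(false, std::memory_order_acquire); }
};

// Event-loop side (shape only):
//   while (running) {
//       run_ready_tasks();
//       if (pressure_relief.take()) {
//           release_blocked_requests();   // hypothetical helper
//       }
//   }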

@roydahan
Author

We should have the Prometheus DB in the Jenkins job. I can upload it later if you need it.

@tgrabiec
Contributor

@roydahan Please do, or just tell me where to find it.

@roydahan
Author

Prometheus data attached below:
perf-regression-master-scylla-monitor-node-cccf1559-1.zip

@roydahan
Author

BTW, two more observations:

  1. The coredump doesn't happen on 1.4.1 and 1.6.rc1, but it did happen on an earlier 1.6 build.
  2. This also happened when we ran the above cassandra-stress with much less load (rate threads=100 instead of 500).

@tgrabiec
Contributor

@roydahan Are those metrics from the same run? They end at 14:24, whereas from the log excerpt you pasted it looks like the crash was at 15:02. I also do not see the IP from the log (54.173.233.146) in the instance set from the metrics recording:

[screenshot from 2017-01-23 11-59-47]

@roydahan
Author

Sorry @tgrabiec, you are correct, it's not from the same run but from one run earlier.
Unfortunately, I don't have the data from the same run, because I chose to keep the setup alive.

@slivne slivne assigned tgrabiec and unassigned slivne Jan 23, 2017
@tgrabiec
Contributor

I finally got the core dump.

The bug is directly triggered by the fact that the pending task buffer needs to grow above 64K elements while memory reclamation is needed, and that reclamation is satisfied by compacting memtable segments. The bug already exists in 1.3, but it could be that 1.5 made those conditions more likely to occur for some reason.

Because the reclaimer tries to schedule a task using later(), it is bound to fail, since circular_buffer::maybe_expand() is re-entered. There is otherwise enough free memory (200MB), but only in <= 256K chunks.

Action items:

  • reclaimer needs to be changed to not try to schedule tasks or allocate memory
  • _pending_tasks should be switched to chunked_fifo so that it doesn't stress the memory allocator in case we're flooded by a large number of tasks (see the sketch below)
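
A back-of-the-envelope sketch of why the expansion fails and what chunking buys. The element and growth sizes below are assumptions drawn from the description above, not measured from the core:

#include <cstddef>
#include <cstdio>

int main() {
    // circular_buffer stores task pointers contiguously and (assuming the
    // usual doubling growth) must allocate one new buffer for the whole
    // queue when it overflows.
    constexpr std::size_t elements = 64 * 1024;                 // queue size at crash time
    constexpr std::size_t ptr_size = sizeof(void*);             // 8 bytes per task pointer
    constexpr std::size_t doubled  = 2 * elements * ptr_size;   // new contiguous buffer

    // A chunked FIFO grows by fixed-size chunks instead, so no single
    // allocation ever exceeds the chunk size.
    constexpr std::size_t chunk_entries = 128;
    constexpr std::size_t chunk_bytes   = chunk_entries * ptr_size;

    std::printf("doubling needs %zu KiB in one contiguous allocation (free chunks were <= 256 KiB)\n",
                doubled / 1024);
    std::printf("chunked growth needs only %zu B per step\n", chunk_bytes);
}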

@tgrabiec
Contributor

There could be another bug lurking here, causing a large number of tasks to be scheduled in the first place. Will dig a bit more.

@tgrabiec
Contributor

Pending task queue looks like this:

     16243: 0x18904f0 vtable for lambda_task<later()::{lambda()#1}> + 16 
     16091: 0x197fc60 _ZTV12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZZN8database8do_applyE13lw_shared_ptrIK6schemaERK15frozen_mutationENKUlT_E_clIN2db15replay_positionEEEDaSC_EUlSC_E_EES1_OSC_EUlSI_E_S1_EET0_SI_EUlSI_E_JEE + 16 
     16090: 0x19bab50 _ZTV12continuationIZN6futureIJEE12then_wrappedINS1_12finally_bodyIZN7seastar9with_gateIZN8logalloc12region_group16release_requestsEvEUlvE_EEDaRNS4_4gateEOT_EUlvE_Lb0EEES1_EET0_SC_EUlSC_E_JEE + 16 
     14280: 0x1b36940 _ZTV12continuationIZN6futureIJEE12then_wrappedIZN17smp_message_queue15async_work_itemIZN7seastar7shardedI8databaseE9invoke_onIZN7service13storage_proxy14mutate_locallyERK13lw_shared_ptrIK6schemaERK15frozen_mutationEUlRS7_E_S1_EET0_jOT_EUlvE_E7processEvEUlSP_E_S1_EESN_SP_EUlSP_E_JEE + 16 
      1663: 0x1b3f4b8 _ZTV12continuationIZN6futureIJEE12then_wrappedINS1_12finally_bodyIZN3smp9submit_toIZN7seastar7shardedI8databaseE9invoke_onIZN7service13storage_proxy14mutate_locallyERK13lw_shared_ptrIK6schemaERK15frozen_mutationEUlRS8_E_S1_EET0_jOT_EUlvE_EEN8futurizeINSt9result_ofIFSP_vEE4typeEE4typeEjSQ_EUlvE_Lb0EEES1_EESO_SQ_EUlSQ_E_JEE + 16 
       357: 0x1b3efe0 _ZTV12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmNSt6chrono10time_pointINS6_3_V212steady_clockENS6_8durationIlSt5ratioILl1ELl1000000000EEEEEEEUlNSt15__exception_ptr13exception_ptrEE1_EES1_OT_EUlSJ_E_S1_EET0_SJ_EUlSJ_E_JEE + 16 
       271: 0x19bab28 _ZTV12continuationIZN6futureIJEE4thenIZZN8logalloc12region_group16release_requestsEvENUlvE_clEvEUlvE_S1_EET0_OT_EUlS9_E_JEE + 16 
       267: 0x197ef60 _ZTV12continuationIZN6futureIJEE4thenIZN8database5applyE13lw_shared_ptrIK6schemaERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSE_E_JEE + 16 
       243: 0x18913d8 _ZTV12continuationIZN6futureIJEE4thenIZZN17smp_message_queue16process_incomingEvENKUlPNS3_9work_itemEE_clES5_EUlvE_S1_EET0_OT_EUlSA_E_JEE + 16 
        13: 0x1b3ec30 vtable for continuation<future<> future<>::then<service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2}, future<> >(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2}&&)::{lambda(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2})#1}> + 16 
         4: 0x1b3eaf0 _ZTV12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmNSt6chrono10time_pointINS5_3_V212steady_clockENS5_8durationIlSt5ratioILl1ELl1000000000EEEEEEENKUl13lw_shared_ptrIK15frozen_mutationEE_clESH_EUlvE_S1_EET0_OT_EUlSM_E_JEE + 16 
         2: 0x1b3ec80 vtable for continuation<future<rpc::no_wait_type> future<std::tuple<future<>, future<> > >::then_wrapped<service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optiona---T---Type <return> to continue, or q <return> to quit---
l<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda(future<std::tuple<future<>, future<> > >&&)#5}, future<rpc::no_wait_type> >(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda(future<std::tuple<future<>, future<> > >&&)#5}&&)::{lambda(future<rpc::no_wait_type>)#1}, std::tuple<future<>, future<> > > + 16 
         2: 0x1ccc7b8 _ZTV12continuationIZN6futureIJEE12then_wrappedIZN3net12send_messageIN3rpc12no_wait_typeEJjmEEEDaPNS3_17messaging_serviceENS3_14messaging_verbENS3_8msg_addrEDpOT0_EUlOT_E_S1_EET0_SF_EUlSF_E_JEE + 16 
         2: 0x1ccda00 vtable for continuation<future<> future<>::then_wrapped<future<>::finally_body<auto seastar::with_gate<auto rpc::recv_helper<net::serializer, net::messaging_verb, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, future<rpc::no_wait_type>, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >, rpc::do_want_client_info, rpc::do_want_time_point>(rpc::signature<future<rpc::no_wait_type> (frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>&&, rpc::do_want_client_info, rpc::do_want_time_point)::{lambda(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)#1}::operator()(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)::{lambda()#1}::operator()()::{lambda()#1}>(seastar::gate&, auto rpc::recv_helper<net::serializer, net::messaging_verb, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, future<rpc::no_wait_type>, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >, rpc::do_want_client_info, rpc::do_want_time_point>(rpc::signature<future<rpc::no_wait_type> (frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>&&, rpc::do_want_client_info, rpc::do_want_time_point)::{lambda(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 
1000000000l> > > >, long, rpc::rcv_buf)#1}::operator()(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)::{lambda()#1}::operator()()::{lambda()#1}&&)::{lambda()#1}, false>, future<> >(seastar::gate)::{lambda(seastar::gate)#1}> + 16 
         2: 0x1cce2c0 vtable for continuation<future<> future<>::then<void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}, future<> >(void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda(void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1})#1}> + 16 
         1: 0x1ccc010 _ZTV12continuationIZN6futureIJ16temporary_bufferIcEEE4thenIZN3rpc12read_rcv_bufER12input_streamIcEjEUlS2_E_S0_IJNS5_7rcv_bufEEEEET0_OT_EUlSE_E_JS2_EE + 16 
         1: 0x1b19020 _ZTV12continuationIZN6futureIJEE4thenIZZN7service20get_schema_for_writeEN5utils4UUIDEN3net8msg_addrEENKUl13lw_shared_ptrIK6schemaEE_clESB_EUlvE_S0_IJSB_EEEET0_OT_EUlSH_E_JEE + 16 
         1: 0x1ca9d08 _ZTV12continuationIZN6futureIJEE12then_wrappedIZ18do_until_continuedIZZN3rpc8protocolIN3net10serializerENS6_14messaging_verbEE6server10connection7processEvENUlvE_clEvEUlvE0_RZZNSB_7processEvENSC_clEvEUlvE_EvOT0_OT_7promiseIJEEEUlS1_E_S1_EESG_SJ_EUlSJ_E_JEE + 16 
         1: 0x1896bf8 vtable for continuation<future<>::wait()::{lambda(future_state<>&&)#1}> + 16 
         1: 0x18b62b0 _ZTV12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE + 16 
         1: 0x18969d8 _ZTV12continuationIZN6futureIJEE4thenIZN13output_streamIcE10poll_flushEvEUlvE_S1_EET0_OT_EUlS8_E_JEE + 16 

One problem is that there are a lot of tasks associated with release_requests() (32K). The later() tasks are also associated with release_requests():

(gdb) x/10ag (task*) 0x60018fac2208
0x60018fac2208:	0x18904f0 <_ZTV11lambda_taskIZ5latervEUlvE_E+16>	0x0
0x60018fac2218:	0x1	0x6001a25bcbd8
0x60018fac2228:	0x6001a25bcbd0	0x18904f0 <_ZTV11lambda_taskIZ5latervEUlvE_E+16>
0x60018fac2238:	0x0	0x1
0x60018fac2248:	0x600171897190	0x600171897188
(gdb) x/10ag 0x6001a25bcbd0
0x6001a25bcbd0:	0x19bab28 <_ZTV12continuationIZN6futureIJEE4thenIZZN8logalloc12region_group16release_requestsEvENUlvE_clEvEUlvE_S1_EET0_OT_EUlS9_E_JEE+16>	0x1
0x6001a25bcbe0:	0x0	0x1
0x6001a25bcbf0:	0x6001a3dadfc8	0x6001a3dadfc0
0x6001a25bcc00:	0x6000002efcb0	0x6001a25bc348
0x6001a25bcc10:	0x0	0x0

We shouldn't have as many as 16K pending writes at a time given the loader concurrency, so I presume that the writes started to time out and accumulate. The fix for this (5ea235e) was not yet in the 1.5 branch.

@roydahan Do we have Prometheus metrics for this run?

Another potential problem is that we call release_requests() from the region group size update on every size change, as long as execution is permitted and there are still some blocked tasks. This causes even more tasks to accumulate. I'm going to rework this so that there is at most one notification in flight.

@tgrabiec
Contributor

There are traces of write timeouts on this node:

(gdb) p service::_the_storage_proxy._instances[0].service._p->_stats.write_timeouts._count
$47 = 1252
(gdb) p service::_the_storage_proxy._instances[1].service._p->_stats.write_timeouts._count
$48 = 27686
(gdb) p service::_the_storage_proxy._instances[2].service._p->_stats.write_timeouts._count
$49 = 10691
(gdb) p service::_the_storage_proxy._instances[3].service._p->_stats.write_timeouts._count
$50 = 27519
(gdb) p service::_the_storage_proxy._instances[4].service._p->_stats.write_timeouts._count
$51 = 36537

@tgrabiec
Contributor

Metrics from "perf-regression-master-scylla-monitor-node-cccf1559-1.zip" seem to belong to a successful run, cause both write and read phases completed, with clients disconnecting before each node goes down:

[screenshot from 2017-01-26 16-32-56]

[screenshot from 2017-01-26 16-32-42]

@avikivity
Member

32k writes, even if each write is 1k, should still not overwhelm a shard. It's just 32M.

We need to improve in this area, but I don't think this is the root cause here.

@avikivity
Member

@tgrabiec a possible improvement for the task queue is to allocate a fixed-size vector, and also support a singly-linked list as part of the task structure. Try to enqueue to the fixed ring buffer; if it fails, use the linked list. This will give good performance when not stressed, and allocation-free operation when stressed.
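
A rough sketch of that hybrid queue (illustrative only; the types are hypothetical and not the actual Seastar implementation): push into a fixed-capacity ring on the fast path, and once it is full, or an overflow list already exists (to keep FIFO order), chain tasks through an intrusive next pointer embedded in the task itself, so the overflow path performs no allocation.

#include <array>
#include <cstddef>

struct task {
    task* next = nullptr;              // intrusive hook for the overflow list
    virtual void run() = 0;
    virtual ~task() = default;
};

class task_queue {
    std::array<task*, 1024> _ring{};   // fixed-size, allocation-free fast path
    std::size_t _head = 0, _tail = 0, _count = 0;
    task* _overflow_head = nullptr;    // intrusive singly-linked list
    task* _overflow_tail = nullptr;
public:
    void push(task* t) noexcept {      // never allocates, never throws
        if (!_overflow_head && _count < _ring.size()) {
            _ring[_tail] = t;
            _tail = (_tail + 1) % _ring.size();
            ++_count;
        } else if (_overflow_tail) {   // spill: append to the intrusive list
            _overflow_tail->next = t;
            _overflow_tail = t;
        } else {
            _overflow_head = _overflow_tail = t;
        }
    }
    task* pop() noexcept {             // ring entries first, then the overflow list
        if (_count) {
            task* t = _ring[_head];
            _head = (_head + 1) % _ring.size();
            --_count;
            return t;
        }
        if (task* t = _overflow_head) {
            _overflow_head = t->next;
            if (!_overflow_head) _overflow_tail = nullptr;
            t->next = nullptr;
            return t;
        }
        return nullptr;
    }
};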

@tgrabiec
Contributor

@avikivity The fact that we have 32K tasks means that writes are timing out, as the client concurrency is way lower than that. When we reach 32K tasks and there are writes blocked on dirty, we hit the reclaimer bug, resulting in a crash. But this bug has been there since 1.3, so this suggests that on earlier versions we had less blocking of writes (and/or also fewer pending tasks due to fewer timeouts).

I'm still investigating why writes are blocking so much. So far the problem seems to be related to CPU being saturated.

@slivne
Contributor

slivne commented Jan 29, 2017 via email

tgrabiec added a commit that referenced this issue Jan 30, 2017
Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single thread of execution.

However, if the requests themselves change the size of the group, then each
such change schedules yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when a group contains sparse segments). Compaction may start many request
releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed-out requests stay in the queue. This is less likely on 1.6, where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and the
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or when there are no
more blocked requests. It may take a while to reach the pressure condition
after a memtable flush if it brings virtual dirty much below the threshold,
which is currently the case for workloads with overwrites producing sparse
regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs.
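
A simplified, synchronous sketch of the shape of this fix (hypothetical names, not the real logalloc code; the actual releasing fiber is asynchronous): relief only ever wakes a single releaser per region group, so repeated size changes can no longer pile up extra releasing tasks.

#include <deque>
#include <functional>
#include <utility>

class region_group {
    std::deque<std::function<void()>> _blocked;   // requests waiting on dirty memory
    bool _under_pressure = false;
    bool _releaser_running = false;               // at most one releasing "thread"
public:
    void add_blocked(std::function<void()> req) { _blocked.push_back(std::move(req)); }

    void set_pressure(bool p) {
        _under_pressure = p;
        if (!p) {
            notify_relief();                      // called on the broadest relieved group
        }
    }
private:
    // No-op if the releaser is already running, so every additional size
    // change while it runs is absorbed instead of scheduling more work.
    void notify_relief() {
        if (_releaser_running) {
            return;
        }
        _releaser_running = true;
        while (!_under_pressure && !_blocked.empty()) {
            auto req = std::move(_blocked.front());
            _blocked.pop_front();
            req();                                // may dirty memory and re-raise pressure
        }
        _releaser_running = false;
    }
};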
avikivity pushed a commit that referenced this issue Feb 2, 2017
Before, the logic for releasing writes blocked on dirty worked like
this:

  1) When region group size changes and it is not under pressure and
     there are some requests blocked, then schedule request releasing
     task

  2) request releasing task, if no pressure, runs one request and if
     there are still blocked requests, schedules next request
     releasing task

If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most 1; there is a
single thread of execution.

However, if the requests themselves change the size of the group,
then each such change schedules yet another request releasing
thread, growing the task queue size by one.

The group size can also change when memory is reclaimed from the
groups (e.g. when a group contains sparse segments). Compaction may
start many request releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed-out requests stay in the queue. This is less
likely on 1.6, where they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in
the system. When the number of scheduled tasks reaches 1000, polling
stops and the server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or when there are no more blocked requests. It may take
a while to reach the pressure condition after a memtable flush if it
brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.

The logic for notification across the hierarchy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
avikivity added a commit that referenced this issue Feb 2, 2017
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single releasing thread.

However, if the requests themselves change the size of the group, then each
such change schedules yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when a group contains sparse segments). Compaction may start many request
releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed-out requests stay in the queue. This is less likely on 1.6, where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and the
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or when there are no
more blocked requests. It may take a while to reach the pressure condition
after a memtable flush if it brings virtual dirty much below the threshold,
which is currently the case for workloads with overwrites producing sparse
regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()
tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)

tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)

tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)
@tzach
Contributor

tzach commented Feb 8, 2017

@tgrabiec can we close this issue?

@tzach tzach added this to the 1.5 milestone Feb 8, 2017
@tgrabiec
Contributor

tgrabiec commented Feb 9, 2017

@tzach I think so, but @roydahan needs to confirm by rerunning the test (a couple of times).

@roydahan
Author

I'll verify that when running 1.7 perf testing.

@roydahan
Author

Verified on 1.7rc2, seems like the issue is fixed.
