coredump during c-s read performance test #2021

Closed
roydahan opened this issue Jan 18, 2017 · 28 comments
@roydahan

roydahan commented Jan 18, 2017

Installation details
Scylla version (or git commit hash): 1.5.0
Cluster size: 3 nodes, i2.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI (ami-0d73641a)

During ami-perf-regression testing, a coredump was generated in the test-read scenario.

The test uses 4 loaders and runs c-s read for 50 minutes; however, the coredump occurred during the write phase of the test, after about 1 hour.

Test Details:
First phase - Write:
cassandra-stress write no-warmup cl=QUORUM n=30000000 -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop seq=1..30000000

Second phase - Read:
cassandra-stress read no-warmup cl=QUORUM duration=50m -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop 'dist=gauss(1..30000000,15000000,1500000)'

Coredump location:
Coredump size is 56GB, zipped with pigz and split into 4GB chunks:
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xaa
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xab
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xac
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xad

Relevant section from the log:

2017-01-18 15:02:14,091 cluster          L0646 INFO | Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None): To download it, you may use 'curl --user-agent [user-agent] scylladb-users-upload.s3.amazonaws.com/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000 > core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000'
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            PID: 2486 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            UID: 994 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):            GID: 1001 (scylla)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):         Signal: 6 (ABRT)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):      Timestamp: Wed 2017-01-18 15:00:59 UTC (1min 14s ago)
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):   Command Line: /usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --collectd-address=127.0.0.1:25826 --collectd=1 --collectd-poll-period 3000 --network-stack posix --num-io-queues 8 --max-io-requests 64 --cpuset 0,1,2,3,4,5,6,7
2017-01-18 15:02:14,092 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):     Executable: /usr/bin/scylla
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):  Control Group: /system.slice/scylla-server.service
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):           Unit: scylla-server.service
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):          Slice: system.slice
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):        Boot ID: 4275204cab84413fae9599105c319a83
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):     Machine ID: 8ea3e6ea03b64d018847fcdf669935d4
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):       Hostname: ip-172-30-0-226
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):       Coredump: /var/lib/systemd/coredump/core.scylla.994.4275204cab84413fae9599105c319a83.2486.1484751659000000
2017-01-18 15:02:14,093 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):        Message: Process 2486 (scylla) of user 994 dumped core.
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 Stack trace of thread 2489:
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #0  0x00007fec73a975f7 raise (libc.so.6)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #1  0x00007fec73a98e28 abort (libc.so.6)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #2  0x00000000018702ad _ZN9__gnu_cxx27__verbose_terminate_handlerEv (scylla)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #3  0x00000000017da696 _ZN10__cxxabiv111__terminateEPFvvE (scylla)
2017-01-18 15:02:14,094 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #4  0x000000000186fff9 __cxa_call_terminate (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #5  0x00000000017d9fe5 __gxx_personality_v0 (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #6  0x00007fec75eeb903 n/a (libgcc_s.so.1)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #7  0x00007fec75eebe37 _Unwind_Resume (libgcc_s.so.1)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #8  0x00000000004f1e66 _ZN12future_stateIJEED4Ev (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #9  0x000000000097a2d5 operator() (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #10 0x0000000000987b65 _ZZN8logalloc12region_group6updateElENUlT_E_clIPS0_EEDaS1_ (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #11 0x0000000000988015 _ZN8logalloc12region_group18do_for_each_parentIZNS0_6updateElEUlT_E_EEPS0_S4_OS2_ (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #12 0x000000000098a97b _ZN8logalloc11region_impl7compactEPNS_7segmentE (scylla)
2017-01-18 15:02:14,095 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #13 0x000000000098214f _ZN8logalloc11region_impl7compactEv (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #14 0x00000000009839c6 _ZN8logalloc7tracker4impl7reclaimEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #15 0x0000000000983a47 _ZN8logalloc7tracker7reclaimEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #16 0x000000000058b336 _ZNKSt8functionIFN6memory17reclaiming_resultEvEEclEv (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #17 0x000000000058b39d _ZN6memory9cpu_pages31find_and_unlink_span_reclaimingEj (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #18 0x000000000058bb03 allocate_large_and_trim<memory::cpu_pages::allocate_large(unsigned int)::<lambda(unsigned int, unsigned int)> > (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #19 0x000000000058c27a _ZN6memory9cpu_pages14allocate_largeEj (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #20 0x000000000058cbed _ZN6memory8allocateEm (scylla)
2017-01-18 15:02:14,096 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #21 0x000000000058cd35 _Znam (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #22 0x000000000054f964 _ZN9__gnu_cxx13new_allocatorISt10unique_ptrI4taskSt14default_deleteIS2_EEE8allocateEmPKv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #23 0x000000000054faad _ZN15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS1_EESaIS4_EE12maybe_expandEm (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #24 0x0000000000554235 _ZN7promiseIJEE10make_readyILNS0_6urgentE1EEEvv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #25 0x000000000097dd69 operator()<future_state<> > (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #26 0x00000000004e487f _ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #27 0x000000000052af0b _ZN7reactor3runEv (scylla)
2017-01-18 15:02:14,097 cluster          L0650 ERROR| Node perf-regression-master-scylla-db-node-281f77b2-2 [54.173.233.146 | 172.30.0.226] (seed: None):                 #28 0x000000000053caf1 _ZZN3smp9configureEN5boost15program_options13variables_mapEENKUlvE1_clEv.constprop.3780 (scylla)


@roydahan
Author

System is still alive and marked as keep=alive in case someone wants to drill down.

@avikivity
Member

avikivity commented Jan 18, 2017 via email

@benoit-canet

benoit-canet commented Jan 18, 2017 via email

@benoit-canet

benoit-canet commented Jan 18, 2017 via email

@roydahan
Author

Already uploaded it.
Please see the coredump location in the issue description.

@tgrabiec
Contributor

@roydahan How do I extract the core file from the fragments? I joined the 4 fragments using cat but pigz refuses to decompress the result:

$ pigz -d za.xz
pigz: skipping: za.xz does not have compressed suffix

@tgrabiec
Contributor

Better backtrace:

_ZN12future_stateIJEED4Ev at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:314
 (inlined by) _ZN7promiseIJEED4Ev at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:429
 (inlined by) later() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:3739
operator() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:2096
 (inlined by) with_gate<logalloc::region_group::release_requests()::<lambda()> > at /usr/src/debug/scylla-1.5.0/seastar/core/gate.hh:120
 (inlined by) logalloc::region_group::release_requests() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:2119
_ZZN8logalloc12region_group6updateElENUlT_E_clIPS0_EEDaS1_ at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:262
_ZN8logalloc12region_group18do_for_each_parentIZNS0_6updateElEUlT_E_EEPS0_S4_OS2_ at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:416
 (inlined by) logalloc::region_group::update(long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:270
 (inlined by) logalloc::region_group::decrease_usage(boost::heap::detail::node_handle<boost::heap::detail::parent_pointing_heap_node<logalloc::region_impl*>*, boost::heap::detail::make_binomial_heap_base<logalloc::region_impl*, boost::parameter::aux::arg_list<boost::heap::constant_time_size<false>, boost::parameter::aux::arg_list<boost::heap::allocator<std::allocator<logalloc::region_impl*> >, boost::parameter::aux::arg_list<boost::heap::compare<logalloc::region_group::region_evictable_occupancy_ascending_less_comparator>, boost::parameter::aux::empty_arg_list> > > >::type, logalloc::region_impl*&>&, long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.hh:290
logalloc::region_impl::compact(logalloc::segment*) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1228
 (inlined by) logalloc::region_impl::compact_single_segment_locked() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1472
logalloc::region_impl::compact() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1463
 (inlined by) logalloc::tracker::impl::compact_and_evict(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1901
logalloc::tracker::impl::reclaim(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1829
logalloc::tracker::reclaim(unsigned long) at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:114
 (inlined by) logalloc::tracker::reclaim() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:1600
 (inlined by) operator() at /usr/src/debug/scylla-1.5.0/utils/logalloc.cc:107
 (inlined by) _M_invoke at /opt/scylladb/include/c++/5.3.1/functional:1857
std::function<memory::reclaiming_result ()>::operator()() const at /opt/scylladb/include/c++/5.3.1/functional:2271
 (inlined by) memory::reclaimer::do_reclaim() at /usr/src/debug/scylla-1.5.0/seastar/core/memory.hh:110
 (inlined by) memory::cpu_pages::run_reclaimers(memory::reclaimer_scope) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1010
memory::cpu_pages::find_and_unlink_span_reclaiming(unsigned int) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:564
allocate_large_and_trim<memory::cpu_pages::allocate_large(unsigned int)::<lambda(unsigned int, unsigned int)> > at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:578
memory::cpu_pages::allocate_large(unsigned int) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:623
 (inlined by) memory::allocate_large(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1173
memory::allocate(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1201
operator new[](unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/memory.cc:1511
__gnu_cxx::new_allocator<std::unique_ptr<task, std::default_delete<task> > >::allocate(unsigned long, void const*) at /opt/scylladb/include/c++/5.3.1/ext/new_allocator.h:104
 (inlined by) circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::expand(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:262
circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::maybe_expand(unsigned long) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:292
 (inlined by) circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >::push_back(std::unique_ptr<task, std::default_delete<task> >&&) at /usr/src/debug/scylla-1.5.0/seastar/core/circular_buffer.hh:341
void promise<>::make_ready<(promise<>::urgent)1>() at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:1133
 (inlined by) void promise<>::do_set_value<(promise<>::urgent)1>(std::tuple<>) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:495
 (inlined by) promise<>::set_urgent_value(std::tuple<>&&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:503
 (inlined by) future_state<>::forward_to(promise<>&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:1111
 (inlined by) future<>::forward_to(promise<>&&) at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:934
operator()<future_state<> > at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:872
 (inlined by) run at /usr/src/debug/scylla-1.5.0/seastar/core/future.hh:390
reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&) at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:1971
reactor::run() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:2404
operator() at /usr/src/debug/scylla-1.5.0/seastar/core/reactor.cc:3344

Looks like later() throws an exception, probably because task allocation fails. We abort because region_group::release_requests() is noexcept. It has to be noexcept since it's called on the free path, in this case from a reclaimer.
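
To illustrate the failure mode, here is a minimal stand-alone sketch (not the actual Scylla code; schedule_wakeup_task() is a hypothetical stand-in for later()): an exception that escapes a noexcept function goes straight to std::terminate(), which aborts, matching the raise/abort/terminate frames at the top of the original trace.

#include <new>

void schedule_wakeup_task() {
    // Stand-in for later(): allocating the continuation task can throw
    // std::bad_alloc when memory is exhausted and the reclaimer cannot
    // be re-entered to free more.
    throw std::bad_alloc();
}

void release_requests() noexcept {
    // Must be noexcept: it runs on the free path, here from a reclaimer.
    // The exception thrown below cannot cross this boundary, so the
    // runtime calls std::terminate(), which aborts (SIGABRT, core dump).
    schedule_wakeup_task();
}

int main() {
    release_requests();
}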

@tgrabiec
Contributor

So, we're likely low on memory when this happens. Do we have logs/metrics for this run?

@tgrabiec
Contributor

tgrabiec commented Jan 20, 2017

Allocating from the reclaimer is dangerous, since there is no guarantee that the allocator will be able to satisfy the allocation without reclaiming (and we can't re-enter the reclaimer). This can happen, for example, if the small pool for the allocation is out of spare objects and needs to do a large allocation to repopulate.

I think the code calling later() should be refactored to use alloc-free waking instead.
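
A rough sketch of what alloc-free waking could look like (hypothetical names, not Seastar's actual API): instead of allocating a new task from inside the reclaimer, the reclaimer only flips a flag that was set up in advance, and the event loop checks it between tasks, so the wake-up path never touches the allocator and cannot throw. Within a single Seastar shard a plain bool would do; the atomic here just keeps the example self-contained.

#include <atomic>

struct wake_source {
    std::atomic<bool> armed{false};

    // Safe to call from a reclaimer: no allocation, no exceptions.
    void wake() noexcept { armed.store(true, std::memory_order_release); }

    // Polled by the event loop; returns true at most once per wake().
    bool take() noexcept { return armed.exchange(false, std::memory_order_acquire); }
};

// Event-loop side (shape only):
//   while (running) {
//       run_ready_tasks();
//       if (pressure_relief.take()) {
//           release_blocked_requests();   // hypothetical helper
//       }
//   }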

@roydahan
Author

We should have the Prometheus DB in the Jenkins job. I can upload it later if you need it.

@tgrabiec
Contributor

@roydahan Please do, or just tell me where to find it.

@roydahan
Author

Prometheus data attached below:
perf-regression-master-scylla-monitor-node-cccf1559-1.zip

@roydahan
Author

BTW, two more observations:

  1. The coredump doesn't happen on 1.4.1 and 1.6.rc1, but it did happen on an earlier 1.6 build.
  2. This also happened when we ran the above cassandra-stress with much less load (rate threads=100 instead of 500).

@tgrabiec
Contributor

@roydahan Are those metrics from the same run? They end at 14:24, whereas from the log excerpt you pasted it looks like the crash was at 15:02. I also do not see the IP from the log (54.173.233.146) in the instance set from the metrics recording:

[screenshot from 2017-01-23 11-59-47]

@roydahan
Author

Sorry @tgrabiec, you are correct, it's not from the same run but from one run earlier.
Unfortunately, I don't have the data from the same run, because I chose to keep the setup alive.

@slivne slivne assigned tgrabiec and unassigned slivne Jan 23, 2017
@tgrabiec
Contributor

I finally got the core dump.

The bug is directly triggered by the fact that the pending task buffer needs to grow above 64K elements while memory reclamation is needed, and that reclamation is satisfied by compacting memtable segments. The bug already exists in 1.3, but it could be that 1.5 made those conditions more likely to occur for some reason.

Because the reclaimer tries to schedule a task using later(), it is bound to fail, since circular_buffer::maybe_expand() is re-entered. There is otherwise enough free memory (200MB), but only in <= 256K chunks.

Action items:

  • reclaimer needs to be changed to not try to schedule tasks or allocate memory
  • _pending_tasks should be switched to chunked_fifo so that it doesn't stress the memory allocator in case we're flooded by a large number of tasks (see the sketch below)
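
A back-of-the-envelope sketch of why the expansion fails and what chunking buys. The element and growth sizes below are assumptions drawn from the description above, not measured from the core:

#include <cstddef>
#include <cstdio>

int main() {
    // circular_buffer stores task pointers contiguously and (assuming the
    // usual doubling growth) must allocate one new buffer for the whole
    // queue when it overflows.
    constexpr std::size_t elements = 64 * 1024;                 // queue size at crash time
    constexpr std::size_t ptr_size = sizeof(void*);             // 8 bytes per task pointer
    constexpr std::size_t doubled  = 2 * elements * ptr_size;   // new contiguous buffer

    // A chunked FIFO grows by fixed-size chunks instead, so no single
    // allocation ever exceeds the chunk size.
    constexpr std::size_t chunk_entries = 128;
    constexpr std::size_t chunk_bytes   = chunk_entries * ptr_size;

    std::printf("doubling needs %zu KiB in one contiguous allocation (free chunks were <= 256 KiB)\n",
                doubled / 1024);
    std::printf("chunked growth needs only %zu B per step\n", chunk_bytes);
}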

@tgrabiec
Contributor

There could be another bug lurking here, causing a large number of tasks to be scheduled in the first place. Will dig a bit more.

@tgrabiec
Contributor

Pending task queue looks like this:

     16243: 0x18904f0 vtable for lambda_task<later()::{lambda()#1}> + 16 
     16091: 0x197fc60 _ZTV12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZZN8database8do_applyE13lw_shared_ptrIK6schemaERK15frozen_mutationENKUlT_E_clIN2db15replay_positionEEEDaSC_EUlSC_E_EES1_OSC_EUlSI_E_S1_EET0_SI_EUlSI_E_JEE + 16 
     16090: 0x19bab50 _ZTV12continuationIZN6futureIJEE12then_wrappedINS1_12finally_bodyIZN7seastar9with_gateIZN8logalloc12region_group16release_requestsEvEUlvE_EEDaRNS4_4gateEOT_EUlvE_Lb0EEES1_EET0_SC_EUlSC_E_JEE + 16 
     14280: 0x1b36940 _ZTV12continuationIZN6futureIJEE12then_wrappedIZN17smp_message_queue15async_work_itemIZN7seastar7shardedI8databaseE9invoke_onIZN7service13storage_proxy14mutate_locallyERK13lw_shared_ptrIK6schemaERK15frozen_mutationEUlRS7_E_S1_EET0_jOT_EUlvE_E7processEvEUlSP_E_S1_EESN_SP_EUlSP_E_JEE + 16 
      1663: 0x1b3f4b8 _ZTV12continuationIZN6futureIJEE12then_wrappedINS1_12finally_bodyIZN3smp9submit_toIZN7seastar7shardedI8databaseE9invoke_onIZN7service13storage_proxy14mutate_locallyERK13lw_shared_ptrIK6schemaERK15frozen_mutationEUlRS8_E_S1_EET0_jOT_EUlvE_EEN8futurizeINSt9result_ofIFSP_vEE4typeEE4typeEjSQ_EUlvE_Lb0EEES1_EESO_SQ_EUlSQ_E_JEE + 16 
       357: 0x1b3efe0 _ZTV12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmNSt6chrono10time_pointINS6_3_V212steady_clockENS6_8durationIlSt5ratioILl1ELl1000000000EEEEEEEUlNSt15__exception_ptr13exception_ptrEE1_EES1_OT_EUlSJ_E_S1_EET0_SJ_EUlSJ_E_JEE + 16 
       271: 0x19bab28 _ZTV12continuationIZN6futureIJEE4thenIZZN8logalloc12region_group16release_requestsEvENUlvE_clEvEUlvE_S1_EET0_OT_EUlS9_E_JEE + 16 
       267: 0x197ef60 _ZTV12continuationIZN6futureIJEE4thenIZN8database5applyE13lw_shared_ptrIK6schemaERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSE_E_JEE + 16 
       243: 0x18913d8 _ZTV12continuationIZN6futureIJEE4thenIZZN17smp_message_queue16process_incomingEvENKUlPNS3_9work_itemEE_clES5_EUlvE_S1_EET0_OT_EUlSA_E_JEE + 16 
        13: 0x1b3ec30 vtable for continuation<future<> future<>::then<service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2}, future<> >(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2}&&)::{lambda(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda()#2})#1}> + 16 
         4: 0x1b3eaf0 _ZTV12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmNSt6chrono10time_pointINS5_3_V212steady_clockENS5_8durationIlSt5ratioILl1ELl1000000000EEEEEEENKUl13lw_shared_ptrIK15frozen_mutationEE_clESH_EUlvE_S1_EET0_OT_EUlSM_E_JEE + 16 
         2: 0x1b3ec80 vtable for continuation<future<rpc::no_wait_type> future<std::tuple<future<>, future<> > >::then_wrapped<service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optiona---T---Type <return> to continue, or q <return> to quit---
l<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda(future<std::tuple<future<>, future<> > >&&)#5}, future<rpc::no_wait_type> >(service::storage_proxy::init_messaging_service()::{lambda(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)#1}::operator()(rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >) const::{lambda(frozen_mutation const&, shared_ptr<service::storage_proxy>&)#1}::operator()(frozen_mutation const, shared_ptr<service::storage_proxy>)::{lambda(future<std::tuple<future<>, future<> > >&&)#5}&&)::{lambda(future<rpc::no_wait_type>)#1}, std::tuple<future<>, future<> > > + 16 
         2: 0x1ccc7b8 _ZTV12continuationIZN6futureIJEE12then_wrappedIZN3net12send_messageIN3rpc12no_wait_typeEJjmEEEDaPNS3_17messaging_serviceENS3_14messaging_verbENS3_8msg_addrEDpOT0_EUlOT_E_S1_EET0_SF_EUlSF_E_JEE + 16 
         2: 0x1ccda00 vtable for continuation<future<> future<>::then_wrapped<future<>::finally_body<auto seastar::with_gate<auto rpc::recv_helper<net::serializer, net::messaging_verb, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, future<rpc::no_wait_type>, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >, rpc::do_want_client_info, rpc::do_want_time_point>(rpc::signature<future<rpc::no_wait_type> (frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>&&, rpc::do_want_client_info, rpc::do_want_time_point)::{lambda(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)#1}::operator()(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)::{lambda()#1}::operator()()::{lambda()#1}>(seastar::gate&, auto rpc::recv_helper<net::serializer, net::messaging_verb, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, future<rpc::no_wait_type>, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >, rpc::do_want_client_info, rpc::do_want_time_point>(rpc::signature<future<rpc::no_wait_type> (frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>, std::function<future<rpc::no_wait_type> (rpc::client_info const&, rpc::opt_time_point, frozen_mutation, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, gms::inet_address, unsigned int, unsigned long, rpc::optional<std::experimental::fundamentals_v1::optional<tracing::trace_info> >)>&&, rpc::do_want_client_info, rpc::do_want_time_point)::{lambda(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 
1000000000l> > > >, long, rpc::rcv_buf)#1}::operator()(lw_shared_ptr<rpc::protocol<net::serializer, net::messaging_verb>::server::connection>, std::experimental::fundamentals_v1::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >, long, rpc::rcv_buf)::{lambda()#1}::operator()()::{lambda()#1}&&)::{lambda()#1}, false>, future<> >(seastar::gate)::{lambda(seastar::gate)#1}> + 16 
         2: 0x1cce2c0 vtable for continuation<future<> future<>::then<void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}, future<> >(void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda(void rpc::protocol<net::serializer, net::messaging_verb>::connection::send_loop<(rpc::protocol<net::serializer, net::messaging_verb>::connection::outgoing_queue_type)0>()::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1})#1}> + 16 
         1: 0x1ccc010 _ZTV12continuationIZN6futureIJ16temporary_bufferIcEEE4thenIZN3rpc12read_rcv_bufER12input_streamIcEjEUlS2_E_S0_IJNS5_7rcv_bufEEEEET0_OT_EUlSE_E_JS2_EE + 16 
         1: 0x1b19020 _ZTV12continuationIZN6futureIJEE4thenIZZN7service20get_schema_for_writeEN5utils4UUIDEN3net8msg_addrEENKUl13lw_shared_ptrIK6schemaEE_clESB_EUlvE_S0_IJSB_EEEET0_OT_EUlSH_E_JEE + 16 
         1: 0x1ca9d08 _ZTV12continuationIZN6futureIJEE12then_wrappedIZ18do_until_continuedIZZN3rpc8protocolIN3net10serializerENS6_14messaging_verbEE6server10connection7processEvENUlvE_clEvEUlvE0_RZZNSB_7processEvENSC_clEvEUlvE_EvOT0_OT_7promiseIJEEEUlS1_E_S1_EESG_SJ_EUlSJ_E_JEE + 16 
         1: 0x1896bf8 vtable for continuation<future<>::wait()::{lambda(future_state<>&&)#1}> + 16 
         1: 0x18b62b0 _ZTV12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE + 16 
         1: 0x18969d8 _ZTV12continuationIZN6futureIJEE4thenIZN13output_streamIcE10poll_flushEvEUlvE_S1_EET0_OT_EUlS8_E_JEE + 16 

One problem is that there are a lot of tasks associated with release_requests() (32K). The later() tasks are also associated with release_requests():

(gdb) x/10ag (task*) 0x60018fac2208
0x60018fac2208:	0x18904f0 <_ZTV11lambda_taskIZ5latervEUlvE_E+16>	0x0
0x60018fac2218:	0x1	0x6001a25bcbd8
0x60018fac2228:	0x6001a25bcbd0	0x18904f0 <_ZTV11lambda_taskIZ5latervEUlvE_E+16>
0x60018fac2238:	0x0	0x1
0x60018fac2248:	0x600171897190	0x600171897188
(gdb) x/10ag 0x6001a25bcbd0
0x6001a25bcbd0:	0x19bab28 <_ZTV12continuationIZN6futureIJEE4thenIZZN8logalloc12region_group16release_requestsEvENUlvE_clEvEUlvE_S1_EET0_OT_EUlS9_E_JEE+16>	0x1
0x6001a25bcbe0:	0x0	0x1
0x6001a25bcbf0:	0x6001a3dadfc8	0x6001a3dadfc0
0x6001a25bcc00:	0x6000002efcb0	0x6001a25bc348
0x6001a25bcc10:	0x0	0x0

We shouldn't have as many as 16K pending writes at a time given the loader concurrency, so I presume that the writes started to time out and accumulate. The fix for this (5ea235e) was not yet in the 1.5 branch.

@roydahan Do we have Prometheus metrics for this run?

Another potential problem is that we call release_requests() from the region group size update on every size change, as long as execution is permitted and there are still some blocked tasks. This causes even more tasks to accumulate. I'm going to rework this so that there is at most one notification in flight.

@tgrabiec
Contributor

There are traces of write timeouts on this node:

(gdb) p service::_the_storage_proxy._instances[0].service._p->_stats.write_timeouts._count
$47 = 1252
(gdb) p service::_the_storage_proxy._instances[1].service._p->_stats.write_timeouts._count
$48 = 27686
(gdb) p service::_the_storage_proxy._instances[2].service._p->_stats.write_timeouts._count
$49 = 10691
(gdb) p service::_the_storage_proxy._instances[3].service._p->_stats.write_timeouts._count
$50 = 27519
(gdb) p service::_the_storage_proxy._instances[4].service._p->_stats.write_timeouts._count
$51 = 36537

@tgrabiec
Contributor

Metrics from "perf-regression-master-scylla-monitor-node-cccf1559-1.zip" seem to belong to a successful run, cause both write and read phases completed, with clients disconnecting before each node goes down:

[screenshot from 2017-01-26 16-32-56]

[screenshot from 2017-01-26 16-32-42]

@avikivity
Member

32k writes, even if each write is 1k, should still not overwhelm a shard. It's just 32M.

We need to improve in this area, but I don't think this is the root cause here.

@avikivity
Member

@tgrabiec a possible improvement for the task queue is to allocate a fixed-size vector, and also support a singly-linked list as part of the task structure. Try to enqueue to the fixed ring buffer; if it fails, use the linked list. This will give good performance when not stressed, and allocation-free operation when stressed.
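
A rough sketch of that hybrid queue (illustrative only; the types are hypothetical and not the actual Seastar implementation): push into a fixed-capacity ring on the fast path, and once it is full, or an overflow list already exists (to keep FIFO order), chain tasks through an intrusive next pointer embedded in the task itself, so the overflow path performs no allocation.

#include <array>
#include <cstddef>

struct task {
    task* next = nullptr;              // intrusive hook for the overflow list
    virtual void run() = 0;
    virtual ~task() = default;
};

class task_queue {
    std::array<task*, 1024> _ring{};   // fixed-size, allocation-free fast path
    std::size_t _head = 0, _tail = 0, _count = 0;
    task* _overflow_head = nullptr;    // intrusive singly-linked list
    task* _overflow_tail = nullptr;
public:
    void push(task* t) noexcept {      // never allocates, never throws
        if (!_overflow_head && _count < _ring.size()) {
            _ring[_tail] = t;
            _tail = (_tail + 1) % _ring.size();
            ++_count;
        } else if (_overflow_tail) {   // spill: append to the intrusive list
            _overflow_tail->next = t;
            _overflow_tail = t;
        } else {
            _overflow_head = _overflow_tail = t;
        }
    }
    task* pop() noexcept {             // ring entries first, then the overflow list
        if (_count) {
            task* t = _ring[_head];
            _head = (_head + 1) % _ring.size();
            --_count;
            return t;
        }
        if (task* t = _overflow_head) {
            _overflow_head = t->next;
            if (!_overflow_head) _overflow_tail = nullptr;
            t->next = nullptr;
            return t;
        }
        return nullptr;
    }
};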

@tgrabiec
Contributor

@avikivity The fact that we have 32K tasks means that writes are timing out, as the client concurrency is way lower than that. When we reach 32K tasks and there are writes blocked on dirty, we hit the reclaimer bug, resulting in a crash. But this bug has been there since 1.3, so this suggests that on earlier versions we had less blocking of writes (and/or also fewer pending tasks due to fewer timeouts).

I'm still investigating why writes are blocking so much. So far the problem seems to be related to CPU being saturated.

@slivne
Contributor

slivne commented Jan 29, 2017 via email

tgrabiec added a commit that referenced this issue Jan 30, 2017
Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single thread of execution.

However, if the requests themselves change the size of the group, then each
such change schedules yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when a group contains sparse segments). Compaction may start many request
releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed-out requests stay in the queue. This is less likely on 1.6, where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and the
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or when there are no
more blocked requests. It may take a while to reach the pressure condition
after a memtable flush if it brings virtual dirty much below the threshold,
which is currently the case for workloads with overwrites producing sparse
regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs.
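
A simplified, synchronous sketch of the shape of this fix (hypothetical names, not the real logalloc code; the actual releasing fiber is asynchronous): relief only ever wakes a single releaser per region group, so repeated size changes can no longer pile up extra releasing tasks.

#include <deque>
#include <functional>
#include <utility>

class region_group {
    std::deque<std::function<void()>> _blocked;   // requests waiting on dirty memory
    bool _under_pressure = false;
    bool _releaser_running = false;               // at most one releasing "thread"
public:
    void add_blocked(std::function<void()> req) { _blocked.push_back(std::move(req)); }

    void set_pressure(bool p) {
        _under_pressure = p;
        if (!p) {
            notify_relief();                      // called on the broadest relieved group
        }
    }
private:
    // No-op if the releaser is already running, so every additional size
    // change while it runs is absorbed instead of scheduling more work.
    void notify_relief() {
        if (_releaser_running) {
            return;
        }
        _releaser_running = true;
        while (!_under_pressure && !_blocked.empty()) {
            auto req = std::move(_blocked.front());
            _blocked.pop_front();
            req();                                // may dirty memory and re-raise pressure
        }
        _releaser_running = false;
    }
};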
avikivity pushed a commit that referenced this issue Feb 2, 2017
Before, the logic for releasing writes blocked on dirty worked like
this:

  1) When region group size changes and it is not under pressure and
     there are some requests blocked, then schedule request releasing
     task

  2) request releasing task, if no pressure, runs one request and if
     there are still blocked requests, schedules next request
     releasing task

If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most 1; there is a
single thread of execution.

However, if the requests themselves change the size of the group,
then each such change schedules yet another request releasing
thread, growing the task queue size by one.

The group size can also change when memory is reclaimed from the
groups (e.g. when a group contains sparse segments). Compaction may
start many request releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed-out requests stay in the queue. This is less
likely on 1.6, where they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in
the system. When the number of scheduled tasks reaches 1000, polling
stops and the server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or when there are no more blocked requests. It may take
a while to reach the pressure condition after a memtable flush if it
brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.

The logic for notification across the hierarchy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
avikivity added a commit that referenced this issue Feb 2, 2017
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single releasing thread.

However, if the requests themselves change the size of the group, then each
such change schedules yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when a group contains sparse segments). Compaction may start many request
releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed-out requests stay in the queue. This is less likely on 1.6, where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and the
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or when there are no
more blocked requests. It may take a while to reach the pressure condition
after a memtable flush if it brings virtual dirty much below the threshold,
which is currently the case for workloads with overwrites producing sparse
regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()
tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)

tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)

tgrabiec pushed a commit that referenced this issue Feb 3, 2017
(cherry picked from commit 7a00dd6)
@tzach
Contributor

tzach commented Feb 8, 2017

@tgrabiec can we close this issue?

@tzach tzach added this to the 1.5 milestone Feb 8, 2017
@tgrabiec
Contributor

tgrabiec commented Feb 9, 2017

@tzach I think so, but @roydahan needs to confirm by rerunning the test (a couple of times).

@roydahan
Author

I'll verify that when running 1.7 perf testing.

@roydahan
Author

Verified on 1.7rc2, seems like the issue is fixed.
